
[Paper Notes] MacBERT: Revisiting Pre-trained Models for Chinese Natural Language Processing

iioSnail

21 Oct, 2024

Table of Contents

Related Information
Abstract
1. Introduction
2. Related Work
3. Chinese Pre-trained Language Models

3.1 BERT-wwm & RoBERTa-wwm
3.2 MacBERT

4. Experiment Setups

4.1 Setups for Pre-Trained Language Models
4.2 Setups for Fine-tuning Tasks

5. Results
6. Discussion
7. Conclusion

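Related Information

Paper date: April 2020

Paper: https://arxiv.org/pdf/2004.13922.pdf

Code (official): https://github.com/ymcui/MacBERT

Models (Hugging Face): hfl/chinese-macbert-base ; hfl/chinese-macbert-large

Prerequisites: familiarity with BERT and the background it builds on

One-sentence summary: the authors rework BERT's original MLM task. Instead of using [MASK] as the mask token, they replace the selected words with similar words, find that performance improves, and give the result a new name, MacBERT.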

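Abstract

The authors propose a Chinese BERT model named MacBERT.

The model uses a masking strategy proposed by the authors: MLM as correction (Mac).

The authors evaluate MacBERT on 8 NLP tasks, and it reaches state-of-the-art results on most of them.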

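1. Introduction

The authors' contribution: they propose the new MacBERT model, which narrows the gap between the pre-training and fine-tuning stages. The [MASK] token never appears during fine-tuning, so instead of inserting [MASK], the words selected for masking are replaced with similar words.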

2. Related Work

The comparison table in this section of the paper is a good summary. The rest is skipped.

3. Chinese Pre-trained Language Models

3.1 BERT-wwm & RoBERTa-wwm

Skipped (this is also related work).

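3.2 MacBERT

MacBERT is trained with two tasks: MLM and SOP (sentence-order prediction).

The MLM task is similar to BERT's, with the following modifications:

The authors use N-gram masking to select the tokens to mask, choosing 1-gram through 4-gram spans with probabilities of 40%, 30%, 20%, and 10%, respectively.
Whereas BERT replaces tokens with [MASK], the authors replace them with similar words, obtained with the Synonyms toolkit.
15% of the input words are selected for masking. Of these, 80% are replaced with a similar word, 10% with a random word, and the remaining 10% keep the original word.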
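The masking procedure described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the `SIMILAR` dictionary is a hypothetical stand-in for the Synonyms toolkit, the function names `pick_spans` and `corrupt` are made up for this example, the input is assumed to be already segmented into words, and falling back to a random word when no similar word is available is an assumption of the sketch.

```python
import random

# Hypothetical stand-in for the Synonyms toolkit: word -> similar words.
SIMILAR = {"预测": ["预见"], "概率": ["几率"], "模型": ["模子"]}
NGRAM_LENGTHS = [1, 2, 3, 4]
NGRAM_WEIGHTS = [0.4, 0.3, 0.2, 0.1]  # 1-gram to 4-gram, as listed above


def pick_spans(num_words, mask_ratio=0.15, rng=random):
    """Pick non-overlapping N-gram spans covering about mask_ratio of the words."""
    budget = max(1, round(num_words * mask_ratio))
    chosen = set()
    spans = []
    attempts = 0
    while len(chosen) < budget and attempts < 100 * num_words:
        attempts += 1
        n = rng.choices(NGRAM_LENGTHS, weights=NGRAM_WEIGHTS)[0]
        # Never exceed the remaining budget or the sentence length.
        n = min(n, budget - len(chosen), num_words)
        start = rng.randrange(0, num_words - n + 1)
        span = range(start, start + n)
        if any(i in chosen for i in span):
            continue  # overlaps an earlier span, try again
        chosen.update(span)
        spans.append((start, start + n))
    return sorted(spans)


def corrupt(words, vocab, rng=random):
    """Return (corrupted words, labels); labels are None where nothing is predicted."""
    corrupted = list(words)
    labels = [None] * len(words)
    for start, end in pick_spans(len(words), rng=rng):
        for i in range(start, end):
            labels[i] = words[i]  # the model must recover the original word
            roll = rng.random()
            similar = SIMILAR.get(words[i])
            if roll < 0.8 and similar:
                corrupted[i] = rng.choice(similar)  # 80%: similar word
            elif roll < 0.9:
                # 10%: random word. Also used when no similar word exists
                # (an assumption of this sketch).
                corrupted[i] = rng.choice(vocab)
            # Otherwise (10%): keep the original word unchanged.
    return corrupted, labels


words = ["使用", "语言", "模型", "来", "预测", "下", "一个", "词", "的", "概率"]
corrupted, labels = corrupt(words, vocab=sorted(set(words)))
```

Note that `[MASK]` never appears in `corrupted`, which is the point of the approach: the pre-training input looks like ordinary text containing a few wrong words, much closer to what the model sees during fine-tuning.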

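The paper uses the term "word" here, and at first I was not sure whether it means a single character or a segmented word. Chinese BERT models generally tokenize text character by character, but since the paper uses LTP for word segmentation (see 4.1), "word" most likely refers to a segmented word as the unit of masking, while the input itself is still split into characters.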

For the SOP task, a negative sample is created by swapping the order of two consecutive sentences.
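A minimal sketch of how such SOP examples could be built is shown below. The function name `make_sop_example` and the 50/50 split between positive and negative samples are assumptions made for illustration; the notes above only state how a negative sample is formed.

```python
import random


def make_sop_example(sent_a, sent_b, rng=random):
    """Build one SOP example from two consecutive sentences.

    Label 1: original order (positive sample).
    Label 0: swapped order (negative sample).
    """
    # The 50/50 ratio is an assumption of this sketch.
    if rng.random() < 0.5:
        return (sent_a, sent_b), 1
    return (sent_b, sent_a), 0


pair, label = make_sop_example("今天天气很好。", "我们去公园散步。")
```

Because both sentences come from the same document in both cases, the model cannot solve the task from topic cues alone and has to judge the order of the two sentences.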

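4. Experiment Setups

4.1 Setups for Pre-Trained Language Models

Datasets: ① Chinese Wikipedia, 0.4B words; ② extended data (encyclopedia, news, and question-answering websites), 5.4B words

Word segmentation tool: LTP (Language Technology Platform), 4.2k stars, deep-learning based; it covers word segmentation, part-of-speech tagging, syntactic parsing, and more

Training: ① the base model continues training from Chinese BERT-base; ② the large model is trained from scratch.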

Other settings:

Maximum sequence length: 512
Optimizer: Adam with weight decay
Large-batch optimizer: LAMB
MacBERT-large: 2M steps, batch size 512, learning rate 1e-4
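To get a feel for the scale of these settings, the MacBERT-large numbers can be collected into a small config object. The class and field names below are hypothetical and exist only for this illustration; the values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """Hypothetical container for the pre-training settings listed above."""
    max_seq_length: int
    train_steps: int
    batch_size: int
    learning_rate: float

    def sequences_seen(self) -> int:
        # One step processes one batch of sequences.
        return self.train_steps * self.batch_size

    def max_tokens_seen(self) -> int:
        # Upper bound: assumes every sequence is padded-free and full length.
        return self.sequences_seen() * self.max_seq_length


MACBERT_LARGE = PretrainConfig(
    max_seq_length=512,
    train_steps=2_000_000,
    batch_size=512,
    learning_rate=1e-4,
)

print(MACBERT_LARGE.sequences_seen())   # 1024000000
print(MACBERT_LARGE.max_tokens_seen())  # 524288000000
```

So the large model sees roughly one billion training sequences, which helps explain why it is the expensive part of the setup.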

The paper summarizes the training details in a table.

4.2 Setups for Fine-tuning Tasks

This section covers the settings for the downstream tasks. Skipped.

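5. Results

This chapter presents the experimental results on each downstream task. Here is a brief summary table:

Task	Level	MacBERT result
Machine Reading Comprehension	document-level	Best
Single Sentence Classification	sentence-level	Average; little difference from the other models
Sentence Pair Classification	sentence-level	Slightly better; on average, marginally ahead of the other models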

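6. Discussion

The authors ran ablation experiments and reached the following conclusions:

MacBERT's performance gains come mainly from two mechanisms: N-gram masking and similar word replacement.
The SOP (sentence-order prediction) task also improves performance, but only marginally.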

7. Conclusion

Skipped.
