BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Date: 11 Oct 2018

Citation: https://doi.org/10.48550/arXiv.1810.04805

Introduction

  • The paper introduces a new language representation model called BERT (Bidirectional Encoder Representations from Transformers). It pre-trains deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers.

  • This BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, including question answering and language inference, without substantial task-specific architecture modifications.

  • This model is highly performant, obtaining state-of-the-art results on eleven natural language processing tasks, including

    • A GLUE score of 80.5% (a 7.7 percentage point absolute improvement)

    • MultiNLI accuracy of 86.7% (a 4.6% absolute improvement)

    • SQuAD v1.1 question answering Test F1 of 93.2 (a 1.5 point absolute improvement)

    • SQuAD v2.0 Test F1 of 83.1 (a 5.1 point absolute improvement).

BERT Model Architecture

Figure 1: BERT’s bi-directional transformer

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder.

  • The number of layers is denoted as L, the hidden size as H, and the number of self-attention heads as A.

  • Two model sizes are evaluated in the paper (a minimal configuration sketch follows the list):

    1. BERT-Base (L=12, H=768, A=12, total parameters = 110M)

    2. BERT-Large (L=24, H=1024, A=16, total parameters = 340M).
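
The two configurations can be captured in a minimal sketch, here as a plain Python dataclass; the field names are illustrative and not taken from any particular library.

```python
# A minimal sketch of the two published configurations as a plain dataclass;
# the field names are illustrative, not taken from any particular library.
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int           # L: number of Transformer blocks
    hidden_size: int          # H: hidden dimension
    num_attention_heads: int  # A: self-attention heads
    vocab_size: int = 30000   # WordPiece vocabulary size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)
```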

Input Representation

  • To allow BERT to handle a variety of down-stream tasks, the input representation can represent both a single sentence and a pair of sentences (e.g. ⟨Question, Answer⟩) in one token sequence.

  • WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary are used.

  • The first token of every sequence is always a special classification token, [CLS].

  • Sentence pairs are packed into a single sequence and separated with a special separator token, [SEP] (see the sketch after Figure 2).


Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
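
The packing of a sentence pair and the summation in Figure 2 can be illustrated with a short PyTorch sketch; the module below is a stand-in for illustration, not the authors' implementation, and its dimensions simply follow the BERT-Base configuration.

```python
# A minimal sketch (not the authors' implementation) of how BERT's input
# embeddings are formed: token + segment + position, as in Figure 2.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden_size=768, max_position=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.segment_emb = nn.Embedding(2, hidden_size)   # sentence A vs. sentence B
        self.position_emb = nn.Embedding(max_position, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# A sentence pair is packed into one sequence, e.g.
#   tokens:      [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
#   segment ids:   0    0   0  0   0    0    1    1     1    1     1
```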

Pre-training BERT

Masked LM (MLM)

  • Some percentage of the input tokens is masked at random, and then those masked tokens are predicted.

  • 15% of all WordPiece tokens are masked at random in each sequence.

  • Only the masked tokens are predicted, rather than reconstructing the entire input (a simplified example-construction sketch follows).
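
Below is a simplified sketch of how a masked-LM training example could be constructed; the [MASK] id and label convention are assumptions for illustration. The paper additionally replaces a selected token with [MASK] only 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time.

```python
# A simplified sketch of masked-LM example creation: select ~15% of WordPiece
# positions, replace them with an assumed [MASK] id, and predict only those
# positions. Special tokens such as [CLS]/[SEP] would be excluded in practice.
import random

MASK_ID = 103        # assumed id of the [MASK] token (illustrative)
IGNORE_INDEX = -100  # label for positions that are not predicted

def create_mlm_example(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok      # the model is trained to predict the original token
            inputs[i] = MASK_ID
    return inputs, labels
```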

Next Sentence Prediction (NSP)

  • Downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) depend on understanding the relationship between two sentences, which is not directly captured by language modelling. To capture this, BERT is additionally pre-trained on a binarised next sentence prediction task whose examples can be trivially generated from any monolingual corpus (a generation sketch follows).
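
A sketch of how such pairs might be generated is shown below, following the paper's 50/50 split between the true next sentence (IsNext) and a random sentence (NotNext); the helper assumes each document is a list of at least two sentences.

```python
# A sketch of next-sentence-prediction pair generation from a monolingual
# corpus; `documents` is assumed to be a list of documents, each a list of at
# least two sentences.
import random

def create_nsp_pair(documents):
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)   # position with a following sentence
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], "IsNext"
    other_doc = random.choice(documents)   # for a sketch we ignore the small
    sentence_b = random.choice(other_doc)  # chance of picking the same document
    return sentence_a, sentence_b, "NotNext"
```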

Pre-training Dataset

  • For the pre-training corpus, the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) are used.

  • For Wikipedia, only the text passages are extracted and lists, tables, and headers are ignored.

Fine-tuning BERT

Figure 3: Illustrations of fine-tuning BERT on different tasks

  • At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-Ø pair in text classification or sequence tagging.

  • At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification tasks, such as entailment or sentiment analysis (a minimal classification head is sketched after this list).

  • Compared to pre-training, fine-tuning is relatively inexpensive.
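
As a concrete illustration of the "one additional output layer", the sketch below puts a linear classifier over the final-layer [CLS] representation; `bert_encoder` is a hypothetical stand-in for a pre-trained BERT encoder returning per-token hidden states of size H.

```python
# A minimal sketch of fine-tuning for classification: one linear output layer
# over the final-layer [CLS] representation. `bert_encoder` is a hypothetical
# stand-in for a pre-trained BERT encoder.
import torch.nn as nn

class BertForClassification(nn.Module):
    def __init__(self, bert_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # (batch, seq_len, H)
        cls_repr = hidden_states[:, 0]                     # [CLS] is the first token
        return self.classifier(cls_repr)                   # logits for the task labels
```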

Summary

  • BERT employs masked language models to enable pre-trained deep bidirectional representations. This contrasts with unidirectional language models and with shallow concatenations of independently trained left-to-right and right-to-left LMs.

  • The authors demonstrate that pre-trained representations reduce the need for heavily-engineered task-specific architectures: BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
