BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Date: 11 Oct 2018

Citation: https://doi.org/10.48550/arXiv.1810.04805

Introduction

  • The paper introduces a new language representation model called BERT (Bidirectional Encoder Representations from Transformers). It pre-trains deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers.

  • This BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, including question answering and language inference, without substantial task-specific architecture modifications.

  • This model is highly performant, obtaining state-of-the-art results on eleven natural language processing tasks, including

    • A GLUE score of 80.5% (a 7.7 percentage point absolute improvement)

    • MultiNLI accuracy of 86.7% (a 4.6% absolute improvement)

    • SQuAD v1.1 question answering Test F1 of 93.2 (a 1.5 point absolute improvement)

    • SQuAD v2.0 Test F1 of 83.1 (a 5.1 point absolute improvement).

BERT Model Architecture

Figure 1: BERT’s bi-directional transformer

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder.

  • The number of layers is denoted as L, the hidden size as H, and the number of self-attention heads as A.

  • Two model sizes are evaluated in the paper (a minimal configuration sketch follows the list):

    1. BERT-Base (L=12, H=768, A=12, total parameters = 110M)

    2. BERT-Large (L=24, H=1024, A=16, total parameters = 340M).
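
The two configurations can be captured in a minimal sketch, here as a plain Python dataclass; the field names are illustrative and not taken from any particular library.

```python
# A minimal sketch of the two published configurations as a plain dataclass;
# the field names are illustrative, not taken from any particular library.
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int           # L: number of Transformer blocks
    hidden_size: int          # H: hidden dimension
    num_attention_heads: int  # A: self-attention heads
    vocab_size: int = 30000   # WordPiece vocabulary size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_attention_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_attention_heads=16)
```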

Input Representation

  • To allow BERT to handle a variety of down-stream tasks, the input representation can represent both a single sentence and a pair of sentences (e.g. ⟨Question, Answer⟩) in one token sequence.

  • WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary are used.

  • The first token of every sequence is always a special classification token, [CLS].

  • Sentence pairs are packed into a single sequence and separated with a special separator token, [SEP] (see the sketch after Figure 2).


Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
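
The packing of a sentence pair and the summation in Figure 2 can be illustrated with a short PyTorch sketch; the module below is a stand-in for illustration, not the authors' implementation, and its dimensions simply follow the BERT-Base configuration.

```python
# A minimal sketch (not the authors' implementation) of how BERT's input
# embeddings are formed: token + segment + position, as in Figure 2.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30000, hidden_size=768, max_position=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.segment_emb = nn.Embedding(2, hidden_size)   # sentence A vs. sentence B
        self.position_emb = nn.Embedding(max_position, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))

# A sentence pair is packed into one sequence, e.g.
#   tokens:      [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
#   segment ids:   0    0   0  0   0    0    1    1     1    1     1
```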

Pre-training BERT

Masked LM (MLM)

  • Some percentage of the input tokens is masked at random, and then those masked tokens are predicted.

  • 15% of all WordPiece tokens are masked at random in each sequence.

  • Only the masked tokens are predicted, rather than reconstructing the entire input (a simplified example-construction sketch follows).
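
Below is a simplified sketch of how a masked-LM training example could be constructed; the [MASK] id and label convention are assumptions for illustration. The paper additionally replaces a selected token with [MASK] only 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time.

```python
# A simplified sketch of masked-LM example creation: select ~15% of WordPiece
# positions, replace them with an assumed [MASK] id, and predict only those
# positions. Special tokens such as [CLS]/[SEP] would be excluded in practice.
import random

MASK_ID = 103        # assumed id of the [MASK] token (illustrative)
IGNORE_INDEX = -100  # label for positions that are not predicted

def create_mlm_example(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok      # the model is trained to predict the original token
            inputs[i] = MASK_ID
    return inputs, labels
```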

Next Sentence Prediction (NSP)

  • Downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) depend on understanding the relationship between two sentences, which is not directly captured by language modelling. To capture this, BERT is additionally pre-trained on a binarised next sentence prediction task whose examples can be trivially generated from any monolingual corpus (a generation sketch follows).
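
A sketch of how such pairs might be generated is shown below, following the paper's 50/50 split between the true next sentence (IsNext) and a random sentence (NotNext); the helper assumes each document is a list of at least two sentences.

```python
# A sketch of next-sentence-prediction pair generation from a monolingual
# corpus; `documents` is assumed to be a list of documents, each a list of at
# least two sentences.
import random

def create_nsp_pair(documents):
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)   # position with a following sentence
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], "IsNext"
    other_doc = random.choice(documents)   # for a sketch we ignore the small
    sentence_b = random.choice(other_doc)  # chance of picking the same document
    return sentence_a, sentence_b, "NotNext"
```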

Pre-training Dataset

  • For the pre-training corpus, the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) are used.

  • For Wikipedia, only the text passages are extracted and lists, tables, and headers are ignored.

Fine-tuning BERT

Figure 3: Illustrations of fine-tuning BERT on different tasks

  • At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-Ø pair in text classification or sequence tagging.

  • At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification tasks, such as entailment or sentiment analysis (a minimal classification head is sketched after this list).

  • Compared to pre-training, fine-tuning is relatively inexpensive.
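
As a concrete illustration of the "one additional output layer", the sketch below puts a linear classifier over the final-layer [CLS] representation; `bert_encoder` is a hypothetical stand-in for a pre-trained BERT encoder returning per-token hidden states of size H.

```python
# A minimal sketch of fine-tuning for classification: one linear output layer
# over the final-layer [CLS] representation. `bert_encoder` is a hypothetical
# stand-in for a pre-trained BERT encoder.
import torch.nn as nn

class BertForClassification(nn.Module):
    def __init__(self, bert_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.bert = bert_encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.bert(token_ids, segment_ids)  # (batch, seq_len, H)
        cls_repr = hidden_states[:, 0]                     # [CLS] is the first token
        return self.classifier(cls_repr)                   # logits for the task labels
```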

Summary

  • BERT employs masked language models to enable pre-trained deep bidirectional representations. This contrasts with unidirectional language models and with shallow concatenations of independently trained left-to-right and right-to-left LMs.

  • The authors demonstrate that pre-trained representations reduce the need for heavily-engineered task-specific architectures: BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.
