Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Submitted: 12 Jun 2017
Citation: arXiv:1706.03762

Introduction

  • The ground-breaking paper "Attention Is All You Need" was published in 2017 by Vaswani et al. at the Neural Information Processing Systems (NeurIPS) conference. The paper introduced the "Transformer" architecture, a new neural network model for natural language processing (NLP) tasks that relies primarily on attention mechanisms to process input sequences. It is one of the most influential papers in NLP and deep learning, with over 85,000 citations as of 2023.

  • The paper has had a significant impact on NLP and deep learning, contributing to the current explosion in the development of Large Language Models (LLMs).

Usefulness of Attention Mechanism in NLP

  • The paper demonstrates how effective attention mechanisms are in NLP tasks. Attention mechanisms allow neural networks to focus selectively on specific parts of an input sequence, enabling the model to capture long-term dependencies and contextual relationships between words in a sentence. This is particularly important for NLP tasks, where the meaning of a sentence is influenced by adjoining words and the surrounding context.

  • Historically, neural network models for NLP tasks, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), relied on fixed-length representations of the input sequence. These models therefore often struggled to capture long-term dependencies and were computationally expensive to train, particularly for longer sequences.

  • The authors of the paper argue that attention-based approaches are better suited to NLP tasks than these traditional approaches, as they allow the model to selectively attend to different parts of the input sequence and thereby capture the contextual relationships between words in text. Transformer architectures, which rely on attention mechanisms, are very effective at capturing these relationships and have become one of the most widely used architectures in NLP.

Transformer Models

  • The Transformer architecture introduced in this paper is a neural network that relies on attention mechanisms to process input sequences. The model is composed of an encoder and a decoder, both of which are built from multiple layers of self-attention and feedforward neural networks.

  • The self-attention mechanism mentioned previously allows the model to attend to different parts of the input sequence and generate context-aware representations of each word in the sequence. These representations can then be used by the model to generate a context-aware output sequence.

  • This paper also introduces the concept of multi-head attention, which allows the model to attend to different parts of the sequence simultaneously. This is useful for capturing different kinds of information about the input sequence, including syntactic relationships (e.g. word order and sentence composition) and semantic relationships (e.g. the meanings of words).

  • This self-attention mechanism is particularly good at capturing long-term relationships between words in a sentence and requires less training time than traditional neural network models to achieve state-of-the-art performance.

Architecture

  • The Transformer model is built from an encoder and a decoder. Both the encoder and the decoder are made up of multiple layers, each of which contains two sublayers: a self-attention layer and a feedforward neural network layer (decoder layers additionally include a third sublayer that attends over the encoder output).

  • The self-attention layer computes a weighted sum of the input sequence, where the weights are determined by a learned attention mechanism that assigns higher weights to more relevant parts of the input sequence. This allows the model to focus on different parts of the input sequence at different times and to capture long-range dependencies between words in the sequence.
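
As an illustration of this weighted sum, here is a minimal NumPy sketch of scaled dot-product attention; the toy shapes and random inputs are assumptions made for the example, not values from the paper.

```python
# A minimal sketch of scaled dot-product attention: each output is a
# softmax-weighted sum of the value vectors, with weights derived from
# query-key similarity.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                               # weighted sum of values

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```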

  • The feedforward neural network layer applies a non-linear transformation to the output of the self-attention layer, allowing the model to capture complex relationships between words in the sequence.
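
A minimal NumPy sketch of this position-wise feedforward sublayer follows: two linear transformations with a ReLU in between, applied to each position independently. The dimensions used here are illustrative rather than the paper's 512/2048.

```python
# A sketch of the position-wise feedforward sublayer applied to the
# self-attention output: linear -> ReLU -> linear, per position.
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # illustrative sizes
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
```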

  • The encoder takes an input sequence and generates a sequence of hidden representations, which are then used as input to the decoder. The decoder combines these encoder representations with the output tokens generated so far and produces its own sequence of hidden representations, which are transformed into the final output sequence by an output layer.

  • The Transformer model uses a technique called multi-head attention, where the self-attention layer is computed multiple times in parallel with different learned weights. This allows the model to capture different aspects of the input sequence simultaneously and to learn more complex relationships between words in the sequence.
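
To make the parallel-heads idea concrete, below is a minimal NumPy sketch of multi-head attention; the head count, dimensions, and random weights are illustrative assumptions rather than the paper's exact configuration.

```python
# A sketch of multi-head attention: project the inputs into several
# lower-dimensional heads, run scaled dot-product attention in each head
# in parallel, then concatenate the heads and project back to d_model.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(W):
        # Project, then reshape to (num_heads, seq_len, d_head)
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                        # attention per head
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                 # final output projection

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
x = rng.normal(size=(4, d_model))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (4, 8)
```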

  • The model also uses layer normalization and residual connections to improve training stability and gradient flow. Layer normalization normalizes the output of each sublayer before passing it to the next sublayer, while residual connections allow gradients to flow more easily through the model.
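
As a rough illustration of how each sublayer is wrapped, the sketch below implements LayerNorm(x + Sublayer(x)) in NumPy; the learnable gain and bias of full layer normalization are omitted for brevity, which is a simplification.

```python
# A sketch of the residual connection plus layer normalization that
# wraps each sublayer: add the sublayer output back onto its input,
# then normalize each position to zero mean and unit variance.
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = add_and_norm(x, lambda t: np.maximum(0.0, t))  # toy sublayer (ReLU)
print(out.shape)  # (4, 8)
```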


Addressing Computational Complexity

One of the primary limitations of Transformer models is their computational complexity, which is O(n^2) with respect to the sequence length. This means that as the length of the input sequence increases, the computational cost of training grows quadratically. To address this, the authors suggest a number of techniques for reducing the computational cost of training. These include:

  1. Scaling the model dimensions: reducing the model dimensions can reduce the number of parameters in the model and improve the efficiency of computation.

  2. Using a fixed-window approach: this limits the model's attention to a fixed window of input tokens instead of the entire input sequence (see the sketch after this list). This is particularly useful for tasks where long-range context beyond the surrounding tokens is less important.

  3. Relative position representations: this allows the model to capture the relative positions of words in the input sequence without requiring explicit position encodings. This technique can significantly reduce the computational cost of the model and has been shown to be effective for capturing long-term dependencies in input sequences.
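
As a rough illustration of the fixed-window idea in item 2, the NumPy sketch below masks attention scores outside a local window before the softmax; the window size and masking scheme are illustrative assumptions, not a prescription from the paper.

```python
# A sketch of fixed-window (local) attention: each position may only
# attend to neighbours within `window` tokens on either side.
import numpy as np

def local_attention_weights(scores, window):
    seq_len = scores.shape[0]
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -1e9)   # mask out-of-window positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))
weights = local_attention_weights(scores, window=1)
print(np.round(weights, 2))  # non-zero only on the diagonal band
```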

Applications of the Transformer Model

  1. Machine translation

  2. Language modelling

  3. Question answering

  4. Text summarisation

Conclusion

  • "Attention Is All You Need" is a ground-breaking paper that introduced the Transformer architecture, a neural network model for NLP tasks that relies on attention mechanisms to process input sequences. The paper's contributions have had a significant impact on the field of deep learning and have inspired further research and advancements in the field.

  • The Transformer model has become one of the most widely used models in NLP and has been applied to a wide range of tasks, including machine translation, language modelling, question answering, and text summarisation. The model's ability to capture long-term dependencies and contextual relationships between words makes it well-suited for many NLP tasks and has enabled significant improvements in performance on these tasks.

  • The paper also introduced several techniques for reducing the computational complexity of the model, which have made it more feasible to use the model for longer input sequences and larger datasets.

  • Overall, the "Attention Is All You Need" paper represents a significant milestone in the development of neural network models for NLP tasks and has paved the way for further advancements in the field.
