Neural Machine Translation by Jointly Learning to Align and Translate
Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Date: 1 Sep 2014
Citation: https://doi.org/10.48550/arXiv.1409.0473
Introduction
The paper “Neural Machine Translation by Jointly Learning to Align and Translate” introduces attention, a mechanism for enhancing encoder-decoder architectures. It argues that traditional encoder-decoder architectures are bottlenecked in performance by compressing the entire source sentence into a single fixed-length vector.
They propose improving this by allowing the model to automatically (soft-)search for the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. Neural Machine Translation (NMT) serves as the demonstration task.
The New Approach
In the new approach, the input sequence is first encoded into multiple vectors (one per input element), and the attention mechanism learns to combine or choose from these when producing each element of the output sequence.
In practice this means that the decoder computes an individual fixed-width representation (a distinct context vector) for each output word, rather than reusing one vector for the whole sentence.
These per-word contexts are jointly learned in this sequence-to-sequence (seq2seq) task and are built from two components:

- Information about the elements surrounding each input element *i* (called the annotation of element *i*)
- Information about how strongly each input element should influence the current output token (i.e. the attention weights; see the sketch after this list)
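To make the two components concrete, here is a minimal numpy sketch of one decoder step of the paper's additive alignment model, e_ij = vᵀ tanh(W·s_{i-1} + U·h_j), followed by a softmax over the scores and a weighted sum of the annotations. The parameter names (W, U, v) and all shapes are illustrative placeholders; in the paper these parameters are trained jointly with the rest of the network, so this is a sketch of the computation, not the authors' reference implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(s_prev, H, W, U, v):
    """One decoder step of additive (Bahdanau-style) attention.

    s_prev -- previous decoder hidden state s_{i-1}, shape (d,)
    H      -- annotations h_1..h_T, one row per input element, shape (T, d)
    W, U, v -- alignment-model parameters (learned in the paper;
               random placeholders in this sketch)
    """
    # Alignment scores: e_ij = v^T tanh(W s_{i-1} + U h_j)
    scores = np.tanh(s_prev @ W + H @ U) @ v   # shape (T,)
    # Attention weights: how strongly each input element should
    # influence the current output token.
    alpha = softmax(scores)                    # shape (T,)
    # Context vector: weighted sum of the annotations.
    context = alpha @ H                        # shape (d,)
    return context, alpha

# Toy usage with random annotations and parameters (all hypothetical).
rng = np.random.default_rng(0)
T, d = 5, 8                                # input length, hidden size
H = rng.normal(size=(T, d))                # annotations of the input
s_prev = rng.normal(size=d)                # previous decoder state
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
v = rng.normal(size=d)
context, alpha = attention_context(s_prev, H, W, U, v)
print(alpha.round(3))                      # weights sum to 1
print(context.shape)                       # fixed width, per output word
```

Note that the weights are recomputed for every output word, so each target prediction gets its own context vector; this is what frees the model from the single fixed-length bottleneck described above.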
Summary:
- The proposed method outperforms conventional RNN-based encoder-decoders
- The proposed method achieves performance comparable to a conventional phrase-based system
- The performance of conventional RNN-based encoder-decoders drops significantly on longer sentences, while the proposed method's does not