Notes on: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., … (2017): Attention Is All You Need

  • Proposes a new attention-based architecture, dropping convolutions and recurrent connections entirely
  • Unlike recurrent architectures, allows parallelization within a training sample
  • Main take-away: the "Transformer" architecture, which uses "Scaled Dot-Product Attention" and "Multi-Head Attention" to transform each token in the sequence
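The scaled dot-product attention the paper names can be sketched in a few lines; a minimal NumPy version (shapes and variable names are my own, not from the paper) of Attention(Q, K, V) = softmax(QKᵀ/√d_k)V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    # Similarity scores, scaled by sqrt(d_k) to keep softmax gradients stable
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (subtract max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a convex combination of the value rows
    return weights @ V
```

Multi-head attention then runs h copies of this in parallel on learned linear projections of Q, K, V and concatenates the results.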


  • Recurrent models generate hidden states sequentially, which precludes parallelization within training examples; this becomes critical at longer sequence lengths, where memory constraints limit batching across examples