Responsible AI: The Future of AI Security and Privacy

Intel Innovation 2022: Jason Martin, Principal Engineer leading Intel Labs Secure Intelligence Team


Intel Labs Introduces SPEAR: An Open-Source Photorealistic Simulator for Embodied AI

Intel Labs collaborated with the Computer Vision Center in Spain, Kujiale in China, and the Technical University of Munich to develop the Simulator for Photorealistic Embodied AI Research (SPEAR).


End-to-End Azure Machine Learning on-premises with Intel Xeon Platforms

Azure Machine Learning on-premises with Intel Xeon Platforms and Kubernetes


Intel Labs Presents Natural Language Processing Research at EMNLP 2022

Intel Labs presents its latest research in natural language processing at EMNLP 2022.


Writing training scripts that can run either on Gaudi, GPU, or CPU

In this tutorial, we will learn how to write code that automatically detects which type of AI accelerator is installed on the machine (Gaudi, GPU, or CPU) and makes the changes needed to run the code smoothly.
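As a starting point, accelerator detection can be sketched without any hardware-specific code at all. The snippet below is a minimal, hedged example: it only checks which support packages (`habana_frameworks` for Gaudi, `torch` with CUDA for GPU) are importable, which is a rough proxy for what the full tutorial covers.

```python
import importlib.util

def detect_accelerator() -> str:
    """Return a best-guess device string: 'hpu' (Gaudi), 'cuda' (GPU), or 'cpu'.

    A simplified sketch: it checks which support packages are installed,
    not whether the hardware is actually present and healthy.
    """
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"   # Gaudi, via the SynapseAI PyTorch bridge
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"

print(f"Running on: {detect_accelerator()}")
```

A training script would then pass the returned string to `model.to(device)` and pick device-appropriate settings (e.g., mixed-precision flags) from it.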


Intel Labs Introduces the Open MatSci ML Toolkit

Intel Labs has launched several efforts to further the development and application of advanced artificial intelligence technologies to scientific challenges, particularly in the field of materials science.


Intel Co-Sponsors New Phase of MIT DSAIL Program for Instance-Optimized Data Systems Research

Nesime Tatbul is a senior research scientist in the Parallel Computing Lab at Intel Labs and acts as Intel’s lead PI for DSAIL. 

 

Highlights: 

Intel co-sponsors a new phase of the Data Systems and Artificial Intelligence Lab (DSAIL) university research program at the Massachusetts Institute of Technology (MIT). 
Over the next four years, DSAIL will generalize the vision of instance optimization to a wide variety of data systems and applications.  

 

A new phase of our Data Systems and Artificial Intelligence Lab (DSAIL) university research program at the Massachusetts Institute of Technology (MIT) officially kicked off on October 20-21, 2022, during an annual meeting in Cambridge, MA. Established in 2018, the program pioneered machine learning (ML) for data systems research, exploring the use of modern ML techniques in improving the design and performance of large-scale data systems and applications. This includes enhancing or replacing key components of traditional data systems (e.g., index structures, scheduling algorithms, query optimizers) with their learned counterparts to allow them to adjust automatically to changing data distributions and query workloads. These learned components have been applied in novel use cases through joint projects with Intel, including ML-enhanced DNA sequence search and query optimization. Furthermore, the team built SageDB, an “instance-optimized” accelerator for the open-source PostgreSQL database, showing how these learned components can be integrated together in an end-to-end system that outperforms expert-tuned databases on analytical database workloads.
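The idea of replacing an index structure with a learned counterpart can be illustrated with a toy "learned index": a model predicts a key's position in a sorted array, and a bounded local search corrects the prediction. This is a hypothetical sketch of the general technique, not DSAIL's or SageDB's actual implementation.

```python
import bisect

class LearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a
    sorted array; a bounded local search fixes the prediction error."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # "Train" a linear model pos ~ slope * key + intercept by fitting
        # the endpoints (a real system would fit small models per segment).
        lo, hi = self.keys[0], self.keys[-1]
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Record the worst-case prediction error so lookups stay correct.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key in the sorted array, or -1 if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1
```

Because the model adapts to the observed key distribution, a lookup searches only a small error window instead of the whole structure, which is the core appeal of learned components.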

“Through close collaboration with Intel and our corporate sponsors, we have been able to show that ML can be used to develop novel data systems that successfully adapt to the data, workloads, and hardware environments in which they operate, and have successfully integrated those systems into a number of real-world applications,” said Sam Madden, DSAIL Co-Director and MIT College of Computing Distinguished Professor.

 

Research Agenda 

One of the major thrusts of DSAIL’s continued research agenda is to build instance-optimized systems. These systems self-adjust to handle a workload with near-optimal performance under a given set of operating conditions, as if built from scratch for that specific use case. Instance optimization is motivated by growing trends in the variety of data-intensive applications and the heterogeneity of hardware/software platforms where they are being deployed. While specialized solutions can lead to better performance, manually developing and tuning them for each individual use case is not economically feasible. The team’s work to date has shown promise in leveraging ML to overcome this challenge.

In recent years, there have been more endeavors to apply machine learning to algorithmic and system problems, many of which are driven by DSAIL. These works include ML applications ranging from video processing to storage layouts to log-structured merge trees and many other data management tasks. However, so far, most research has been focused on improving individual components. In this second phase of DSAIL, a key goal will be to investigate how learned components can be combined to build an entire, holistically instance-optimized system that does not require administrator intervention. In collaboration with co-sponsors Amazon, Google, and Intel, DSAIL will also generalize the vision of instance optimization to a wide variety of data systems and applications through novel designs across edge-to-cloud deployment settings. Examples include hybrid transactional/analytical processing (HTAP) systems, key-value stores, data lakes, and visual data analytics systems. In conjunction with common sense reasoning based on domain knowledge (e.g., represented as knowledge graphs or probabilistic models), ML techniques will continue to play a central role in the lab’s upcoming research agenda. 

 

Instance-Optimized Clouds  

Achieving instance optimization at the cloud scale introduces a new set of challenges and opportunities for research. The increasing complexity of cloud service infrastructures and their cost-performance tradeoffs are getting harder for cloud developers and users to navigate. More fundamentally, the disaggregation of data services in the cloud challenges the performance of traditional data system architectures due to their monolithic designs. In a joint vision paper published at the Conference on Innovative Data Systems Research (CIDR) earlier this year, Intel and MIT proposed a new metadata-rich cloud storage format called Self-organizing Data Containers (SDCs) to enable flexible data layouts that can self-adapt to client workloads. SDCs have three key properties that will enable automated performance optimizations in disaggregated database architectures:  

They flexibly support a variety of complex physical data layouts beyond simple column orientation via replication and partitioning. 
They explicitly represent rich metadata that can be used for optimizations, such as histograms and data access patterns. 
They self-organize over time as they are exposed to client query workloads.
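The three properties above can be sketched as a single data structure: a container that answers queries, records access-pattern metadata as it does so, and periodically re-partitions itself around the hottest filter column. The class and policy below are illustrative assumptions, not the published SDC design.

```python
from collections import Counter

class SelfOrganizingContainer:
    """Toy sketch of a Self-organizing Data Container (SDC): stores rows,
    keeps access-pattern metadata, and re-sorts itself on the column that
    queries filter on most often. Names and policy are hypothetical."""

    def __init__(self, rows):
        self.rows = rows                 # list of dicts (one per record)
        self.access_counts = Counter()   # per-column filter frequency (property 2)
        self.sort_column = None          # current physical layout

    def query(self, column, value):
        """Filter rows and record the access pattern as metadata."""
        self.access_counts[column] += 1
        return [r for r in self.rows if r[column] == value]

    def reorganize(self):
        """Adopt a new layout driven by the observed workload (properties 1, 3)."""
        if not self.access_counts:
            return
        hottest, _ = self.access_counts.most_common(1)[0]
        if hottest != self.sort_column:
            self.rows.sort(key=lambda r: r[hottest])
            self.sort_column = hottest
```

A real SDC would support richer layouts (replication, multi-dimensional partitioning) and richer metadata (histograms), but the feedback loop of query, record, reorganize is the same shape.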

Preliminary experiments with real-world visual dashboarding applications indicate that even simple layout optimizations enabled by workload awareness of SDCs can achieve 3-10x speedups over traditional range partitioning. This work represents a foundational first step toward achieving instance optimization in modern cloud databases. 

 

Instance-Optimized Video Processing  

Video processing is a prime example of a data-intensive application domain that can substantially benefit from instance optimization. High volumes of video data are generated daily by a wide variety of applications, from social media to traffic monitoring. Applying state-of-the-art ML algorithms to efficiently analyze these datasets in real-world settings presents an interesting set of challenges and opportunities. Prior research by MIT DSAIL (e.g., MIRIS) and Intel Labs (e.g., VDMS) has demonstrated that there is potential for significant performance gains by tailoring these algorithms to the specific data and workload contexts that they are used in. Going forward, the DSAIL team will explore extending these efforts on multiple fronts to enable automated video search and analytics optimizations. 

For instance, Video Extract-Transform-Load (V-ETL) is one of the research problems that the DSAIL team is currently investigating in the context of large-scale video data warehouses. To prepare live video streams for analytical queries, streams with varying content dynamics must be processed through user-defined ingestion pipelines that consist of expensive computer vision tasks, such as object detection and tracking. For resource and cost efficiency, pipeline parameters (e.g., frame rates, image resolutions) must be tuned adaptively as video dynamics change. The team is working on a novel approach that continuously maintains high video content quality within a low cloud cost budget, even under peak load conditions.
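The adaptive tuning described above can be sketched as a small feedback controller: when the measured ingestion cost exceeds the budget, it steps the frame rate down; when there is ample headroom, it steps back up. The knobs, thresholds, and names here are hypothetical illustrations, not the DSAIL V-ETL design.

```python
class IngestTuner:
    """Toy controller for a video ingestion pipeline: step the frame rate
    down when cost exceeds budget, back up when there is headroom."""

    FRAME_RATES = [1, 5, 10, 15, 30]  # frames per second, low to high

    def __init__(self, cost_budget):
        self.cost_budget = cost_budget      # e.g., dollars (or GPU-seconds) per hour
        self.level = len(self.FRAME_RATES) - 1  # start at full quality

    @property
    def frame_rate(self):
        return self.FRAME_RATES[self.level]

    def observe(self, measured_cost):
        """Adjust at most one step per observation to avoid oscillation."""
        if measured_cost > self.cost_budget and self.level > 0:
            self.level -= 1   # over budget: cheapen the pipeline
        elif measured_cost < 0.5 * self.cost_budget and self.level < len(self.FRAME_RATES) - 1:
            self.level += 1   # ample headroom: restore quality
```

A production tuner would also co-tune resolution and model choice, and would weigh the accuracy lost at each step against the cost saved, but the budget-driven feedback loop is the essential idea.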

Given the success of the first phase of DSAIL, we are excited to support continued research in this area. This work has the potential to directly inform future design decisions within cloud data centers and enable a wide range of new applications. We look forward to jointly exploring these opportunities with our DSAIL collaborators at MIT and co-sponsors Amazon and Google. 


Habana this week at re:Invent ‘22

The Habana® team is excited to be at re:Invent 2022, November 28 – December 1.  We’re proud that Amazon EC2 DL1 instances featuring Habana Labs Gaudi deep learning accelerators are providing an alternative to GPU-based EC2 instances, delivering a new level of performance and efficiency in developing, training and deploying deep learning models and workloads.


CUMULATIVE_THROUGHPUT Enables Full-Speed AI Inferencing with Multiple Devices

OpenVINO™ now enables automatic selection of the most suitable target device for AI inferencing.


Innovation 2022: AI Productivity and Performance at Scale

Highlights from Innovation 2022, panel discussion: “Productivity and Performance at Scale.”
