Why is "Attention" getting so much attention lately?
Surely, attention is as old as humanity itself. It is a consequence of the limits we face in living in a complex environment: our perception cannot cope with all of our sensory experiences at once. We may not always be aware of our attention, but it is almost certainly an evolutionary trait that allows us to filter, select, and direct our focus.
Meditators practice attention to calm their minds, or to gain insight.
From the philosophers’ point of view, Descartes believed that attention was necessary for knowledge acquisition, Bishop Berkeley believed that it was necessary for abstraction, and Locke believed that it was a mode of thought (see the Stanford Encyclopedia of Philosophy).
In psychology, Wilhelm Wundt was one of the first to study attention experimentally, beginning around 1879; William James and Ivan Pavlov were also early investigators.
"Every one knows what attention is. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought" —- William James, The Principles of Psychology (1890).
[In the following, we use the word attention to refer to self-attention.]
Theories of attention in cognition.
There are two major branches of theories of attention. The first sees attention as arising from capacity limitations of the brain and/or of thinking. For example, there can be bottlenecks in information processing, and such capacity limits force us to prioritize and focus on the most important information.
The second branch is characterized by the requirement of maintaining coordination and coherence. Here attention takes the form of constraints that give processing a selective direction, which also means excluding inputs unrelated to that direction.
For example, when you are reading a book, you need to use attention to follow the storyline, to understand the characters' motivations, and to visualize the setting. You also need to use attention to keep track of what you have read so far and to make sure that you understand the overall meaning of the book.
In both types of theories, attention is always selection, whether by prioritization or by directiveness, and both are shaped by our past experiences and our current goals. When we see a pattern in a set of stimuli, we are more likely to pay attention to it; Gestalt psychology describes this tendency.
Attention in computing.
In computing, interest in the concept of attention and in transformers surged after the paper Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua, "Neural Machine Translation by Jointly Learning to Align and Translate", arXiv:1409.0473 (2014), which was followed by a paper from the Google research team: Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia, "Attention Is All You Need", arXiv:1706.03762 (2017).
It was these papers that tied attention to the Transformer architecture of deep learning, which in turn led to large language models. As Google Trends confirms, it was at this moment that interest in the “attention mechanism” surged.
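For reference, the scaled dot-product attention defined in the Vaswani et al. paper maps queries Q, keys K, and values V (with key dimension d_k) to

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V

The softmax turns query-key similarity scores into a probability distribution over positions, which is the computational analogue of the “selection with prioritization” discussed above.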
The concept of attention in computing predates the above papers. Jürgen Schmidhuber had already discussed attention in several papers, but his research was ahead of its time and was not widely adopted by the Anglo-Saxon research community.
Some of Schmidhuber’s early papers were:
Learning Attentive Vision (1990)
Reducing the Ratio between Learning Complexity and Number of Time-Varying Variables in Fully Recurrent Nets (1993)
Which computing paradigms have attention mechanisms?
We may also ask why computing paradigms such as Turing machines and the gradient method are not considered attention mechanisms, whereas Transformers are. The answer is that attention in computing mimics attention in cognition: it must have the characteristics of selection and exclusion, pattern recognition, and coherence maintenance.
In a Turing machine, the head moves according to what it reads on the tape; it has no selection mechanism. In the gradient method, there is selection, but it is driven by an external objective function rather than by patterns to be recognized. Transformers fulfill the requirements: they recognize patterns in language while maintaining coherence, as the sketch below illustrates.
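To make “selection and exclusion” concrete, here is a minimal self-attention sketch in NumPy. It is illustrative only: the function name, the random projection matrices, and the toy dimensions are assumptions made for the example, not taken from any particular library.

    # Minimal self-attention sketch (illustrative; names and sizes are arbitrary).
    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Scaled dot-product self-attention over a sequence of token vectors X."""
        Q = X @ Wq                                  # queries: what each token looks for
        K = X @ Wk                                  # keys: what each token offers
        V = X @ Wv                                  # values: the content that gets mixed
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # pairwise relevance of tokens to tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax: selection with exclusion
        return weights @ V                          # each output is a weighted blend of values

    # Toy usage: 4 tokens with 8-dimensional embeddings and projections.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)

Each row of the softmax weights is a probability distribution over all tokens: large weights select, near-zero weights exclude, and the patterns being recognized are whatever the learned projections make similar in the query-key space.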
Attention has now started to be studied from a mathematical point of view, for example in terms of kernel functions and sparse distributed memory. These approaches can be used to draw precise distinctions.
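As a sketch of the kernel view (one common formulation in this literature, not the only one): for a query q and key-value pairs (k_i, v_i), attention can be written as a kernel smoother,

    \mathrm{attn}(q) = \frac{\sum_i \kappa(q, k_i)\, v_i}{\sum_j \kappa(q, k_j)},
    \qquad \kappa(q, k) = \exp\!\left( \frac{q \cdot k}{\sqrt{d_k}} \right).

With this exponential kernel the normalized weights are exactly the softmax of scaled dot products, so standard attention is recovered; swapping in other kernels yields the linear and sparse variants, which is what allows such precise distinctions to be drawn.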