While Attention has come to be an important mechanism in deep learning, there remains limited intuition for why it works so well.
Summary generated with ChatGPT (chat.openai.com):
In the paper "Attention Approximates Sparse Distributed Memory" by Trenton Bricken and Cengiz Pehlevan, the authors propose that attention mechanisms in deep learning can be seen as approximating the read and write operations of Sparse Distributed Memory (SDM). SDM is a memory model designed to handle large amounts of data; it stores memories in a high-dimensional vector space, where each dimension represents a different feature. A read is performed by comparing the query vector against the memory matrix via dot products, and a write is performed by adding the new memory vector to the memory matrix. The authors show that attention mechanisms can be seen as approximating these SDM operations and provide experimental evidence to support this claim. In short, the paper argues that attention approximates the read and write operations of Sparse Distributed Memory, which helps explain why attention works so well in deep learning.
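To make the read/write picture in the summary concrete, here is a minimal toy sketch (my own, not the authors' code) of a key-value memory whose read is softmax attention: a write adds a new (address, content) pair to the memory matrices, and a read takes a softmax over query-address dot products and returns the weighted sum of contents. The dimension, number of stored memories, and the `beta` sharpness parameter are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AttentionMemory:
    """Toy key-value memory whose read is single-head softmax attention."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))    # stored addresses
        self.values = np.empty((0, dim))  # stored contents

    def write(self, key, value):
        # Write = add the new memory (address, content) to the memory matrices.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query, beta=8.0):
        # Read = softmax over query-key dot products, then a weighted sum of values.
        weights = softmax(beta * self.keys @ query)
        return weights @ self.values

rng = np.random.default_rng(1)
mem = AttentionMemory(dim=16)

# Store a few random unit-norm patterns autoassociatively (value == key).
for _ in range(5):
    k = rng.normal(size=16)
    k /= np.linalg.norm(k)
    mem.write(k, k)

# Query with a noisy version of one stored pattern and check what comes back.
target = mem.keys[2]
query = target + 0.1 * rng.normal(size=16)
query /= np.linalg.norm(query)
out = mem.read(query)
print("cosine similarity of read-out to target:", target @ out / np.linalg.norm(out))
```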
~
Here, we show that Transformer Attention can be closely related under certain data conditions to Kanerva’s Sparse Distributed Memory (SDM), a biologically plausible Associative Memory model. We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We discuss the implications of the Attention-SDM map and provide new computational and biological interpretations of Attention.
~
BRICKEN, Trenton and PEHLEVAN, Cengiz, 2021. Attention Approximates Sparse Distributed Memory. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 15301–15315. [Accessed 16 January 2023]. Available from: https://proceedings.neurips.cc/paper/2021/hash/8171ac2c5544a5cb54ac0f38bf477af4-Abstract.html
SDM is an associative memory model developed in 1988 to solve the “Best Match Problem”, where we have a set of memories and want to quickly find the “best” match to any given query [13, 14]. In developing its solution, SDM respected fundamental biological constraints, such as Dale’s law (synapses are fixed to be either excitatory or inhibitory and cannot dynamically switch); see Section 1 for an SDM overview and [13] or [15] for a deeper review. Despite being developed independently of neuroanatomy, SDM’s biologically plausible solution maps strikingly well onto the cerebellum [13, 16].
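To make the model above concrete, the following is a minimal sketch of a classical binary SDM (a toy illustration, not the authors' implementation): a fixed set of random "hard" neuron addresses, a write that adds a ±1-encoded word to the counters of every neuron within a Hamming-distance radius of the write address, and a read that sums and thresholds the counters of every neuron within the same radius of the query. The dimension, neuron count, and radius are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 256          # dimension of binary address/data words
m = 2000         # number of hard "neuron" addresses
radius = 112     # Hamming-distance read/write radius (illustrative)

neuron_addresses = rng.integers(0, 2, size=(m, n))   # fixed random addresses
counters = np.zeros((m, n))                          # each neuron keeps a counter vector

def within_radius(word):
    """Boolean mask of neurons whose address is within Hamming distance `radius` of `word`."""
    hamming = np.count_nonzero(neuron_addresses != word, axis=1)
    return hamming <= radius

def sdm_write(address, data):
    """Add the +1/-1 encoded data word to every neuron inside the write circle."""
    counters[within_radius(address)] += 2 * data - 1   # map {0,1} -> {-1,+1}

def sdm_read(query):
    """Sum the counters of all neurons inside the read circle, then threshold back to bits."""
    summed = counters[within_radius(query)].sum(axis=0)
    return (summed > 0).astype(int)

# Store one pattern autoassociatively and read it back from a corrupted query.
pattern = rng.integers(0, 2, size=n)
sdm_write(pattern, pattern)

noisy = pattern.copy()
noisy[rng.choice(n, size=20, replace=False)] ^= 1      # flip 20 bits

recovered = sdm_read(noisy)
print("bits wrong after read:", np.count_nonzero(recovered != pattern))
```

Because the read and write circles of nearby words overlap heavily in high dimensions, the corrupted query still activates many of the neurons that were written to, and the thresholded sum recovers the stored pattern.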
This cerebellar relationship is made additionally compelling by the fact that cerebellum-like neuroanatomy exists in many other organisms, including numerous insects (e.g., the Drosophila Mushroom Body) and potentially cephalopods [17, 18, 19, 20, 21].
Abstractly, the relationship between SDM and Attention exists because SDM’s read operation uses intersections between high-dimensional hyperspheres that approximate the exponential over a sum of exponentials that is Attention’s softmax function (Section 2). Having established mathematically that Attention approximates SDM, we test the relationship in pre-trained GPT2 Transformer models [3] (Section 3) and in simulations (Appendix B.7). We use the Query-Key Normalized Transformer variant [22] to show directly that the relationship to SDM holds well, and then use original GPT2 models to confirm this result and make it more general.
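One quick way to see the flavour of this approximation (a Monte Carlo sketch under assumed, illustrative parameters; Section 2 of the paper gives the exact combinatorial treatment) is to estimate how the volume of the intersection of two fixed-radius Hamming circles shrinks as their centres move apart. If that shrinkage is roughly exponential in the query-memory distance, then normalising intersection sizes across memories behaves like a softmax with some fitted inverse temperature beta.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 64             # address dimension (kept small so Monte Carlo is cheap)
radius = 26        # Hamming read/write radius (illustrative)
n_samples = 100_000

def intersection_fraction(dv):
    """Monte Carlo estimate of the fraction of address space lying within
    Hamming distance `radius` of BOTH of two points that are `dv` bits apart."""
    a = np.zeros(n, dtype=int)
    b = a.copy()
    b[:dv] = 1                                    # place b exactly dv bits from a
    x = rng.integers(0, 2, size=(n_samples, n))   # random addresses
    da = np.count_nonzero(x != a, axis=1)
    db = np.count_nonzero(x != b, axis=1)
    return np.mean((da <= radius) & (db <= radius))

distances = np.arange(0, 21, 2)
weights = np.array([intersection_fraction(dv) for dv in distances])
weights /= weights.sum()                          # normalise, as SDM's read does across memories

# If the intersection decays ~exponentially with distance, these weights should
# line up with a softmax over negative distances for some fitted beta.
beta = -np.polyfit(distances, np.log(weights), 1)[0]
softmax_w = np.exp(-beta * distances)
softmax_w /= softmax_w.sum()

for dv, w, s in zip(distances, weights, softmax_w):
    print(f"d={dv:2d}  SDM weight={w:.4f}  softmax fit={s:.4f}")
```

The agreement is only approximate and depends on the chosen radius; the paper derives the conditions under which the exponential approximation, and hence the softmax correspondence, holds.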
Using the SDM framework, we are able to go beyond Attention and interpret the Transformer architecture as a whole, providing deeper intuition (Section 4). Motivated by this mapping between Attention and SDM, we discuss how Attention can be implemented in the brain by summarizing SDM’s relationship to the cerebellum (Section 5). In related work (Section 6), we link SDM to other memory models [23, 24], including how SDM is a generalization of Hopfield Networks and, in turn, how our results extend work relating Hopfield Networks to Attention [25, 26]. Finally, we discuss limitations and future research directions that could leverage our work (Section 7).