The Hamming distance threshold is crucial because it determines how many neurons the read and write operations are distributed across.
The optimal Hamming distance for the read and write circles, denoted d*, depends upon the number and distribution of patterns in the vector space and what the memories are being used for (e.g. maximizing the number of memories that can be stored versus the memory system's robustness to query noise). We provide three useful reference d* values, using equations outlined in Appendix B.5. The Signal-to-Noise Ratio (SNR) optimal d*_SNR maximizes the probability a noise-free query will return its target pattern [15]. The memory capacity optimal d*_Mem maximizes the number of memories that can be stored with a certain retrieval probability and also assumes a noise-free query. The critical distance d*_CD maximizes, for a given number of patterns, the amount of noise that can be applied to a query such that it will converge to its correct pattern [15].
These d*s are only approximate reference points for later comparisons to Transformer Attention, first and foremost because they assume random patterns to make their derivations tractable. In addition, Transformer Attention will not be optimizing for just one of these objectives, and likely interpolates between these optimal d*s, as it wants both a good critical distance to handle noisy queries and a reasonable memory capacity. These optimal d* values are a function of n, r and m. For the Transformer Attention setting [1], where n = 64, r = 2^n and m = 1024, d*_SNR = 11, d*_Mem = 5, d*_CD = 15, as derived in Appendix B.5.
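As a rough sanity check on what these d* values mean in practice, the short Python sketch below (mine, not from the paper) counts how many addresses of the 2^64 space fall within Hamming distance d of a query for d = 5, 11 and 15; with r = 2^n neurons tiling the space, this is also the expected number of neurons activated by a read or write.

```python
from math import comb

n = 64                      # address length in bits (the Transformer setting quoted above)
for d in (5, 11, 15):       # the reference d*_Mem, d*_SNR, d*_CD values
    ball = sum(comb(n, k) for k in range(d + 1))   # addresses within Hamming distance d
    print(f"d = {d:2d}: {ball:.3e} activated addresses, "
          f"{ball / 2**n:.3e} of the 2^64 space")
```

Even at d*_CD = 15 the read/write circles cover only a vanishingly small fraction of the space, which is what makes the distributed storage sparse.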
Figure 1: Summarizing the SDM read and write operations. Top Row: three patterns being written into nearby neurons. 1. The first write operation; 2. Patterns are stored inside nearby neurons and the original pattern location is shown; 3. Writing a second pattern; 4. Writing a third pattern, with neurons storing a superposition of multiple patterns. Bottom Row: two isomorphic perspectives of the read operation. Neuron view (left): the query reads from nearby neurons, with the inset showing the number of times each pattern is read; the four blue patterns are a majority, which would result in one-step convergence. Pattern view (right): crucial to relating SDM to Attention and defined in Eq. 1 below. We abstract away the neurons by assuming they are uniformly distributed through the space, which allows us to consider the circle intersection between the query and the original location of each pattern, where blue has the largest circle intersection.
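To make the neuron view in Figure 1 concrete, here is a minimal NumPy sketch of the write and read operations under toy sizes (n, r and d below are illustrative choices of mine, not the paper's settings): each pattern is written into every neuron whose address lies within Hamming distance d of it, and a read pools the contents of neurons near the query and takes an elementwise majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, d = 16, 2000, 5                                 # toy address length, neuron count, Hamming radius
neuron_addresses = rng.integers(0, 2, size=(r, n))    # fixed random binary neuron addresses
neuron_counters = np.zeros((r, n))                    # each neuron stores a superposition of written patterns

def hamming_to_all(v):
    """Hamming distance from vector v to every neuron address."""
    return np.count_nonzero(neuron_addresses != v, axis=1)

def write(pattern):
    """Add the pattern (as +/-1 counters) to all neurons within distance d of its address."""
    near = hamming_to_all(pattern) <= d
    neuron_counters[near] += 2 * pattern - 1

def read(query):
    """Sum the contents of neurons near the query and threshold (elementwise majority vote)."""
    near = hamming_to_all(query) <= d
    return (neuron_counters[near].sum(axis=0) > 0).astype(int)

patterns = rng.integers(0, 2, size=(3, n))            # three auto-associative patterns, as in the figure
for p in patterns:
    write(p)

noisy = patterns[0].copy()
noisy[:2] ^= 1                                        # corrupt two bits of the first pattern
print("one-step recovery of pattern 0:", np.array_equal(read(noisy), patterns[0]))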
BRICKEN, Trenton and PEHLEVAN, Cengiz, 2021. Attention Approximates Sparse Distributed Memory. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 15301–15315. [Accessed 16 January 2023]. Available from: https://proceedings.neurips.cc/paper/2021/hash/8171ac2c5544a5cb54ac0f38bf477af4-Abstract.html
Address busses work well and are at the heart of both processor and memory design at a variety of scales. Address busses make computers logical machines: when they are properly clocked, we can reason about them knowing that all elements have been considered. But this pattern is rare or nonexistent in nature. Let's understand why.
In the paper "Attention Approximates Sparse Distributed Memory" by Trenton Bricken and Cengiz Pehlevan (2021), the authors propose a new method for approximating Sparse Distributed Memory (SDM) using attention mechanisms. The SDM read and write operations involve the use of a "content-based addressing" mechanism, where the memory is accessed by looking up the similarity between the input and the stored patterns in the memory. The write operation involves adding a new pattern to the memory, while the read operation retrieves the closest pattern to the input based on the similarity measure. The authors demonstrate that this approach can achieve similar performance to traditional SDM methods while being more computationally efficient.