Robust High-dimensional Memory-augmented Neural Networks
Geethan Karunaratne,1, 2, a) Manuel Schmuck,1, 2, a) Manuel Le Gallo,1 Giovanni Cherubini,1 Luca Benini,2 Abu
Sebastian,1, b) and Abbas Rahimi1, c)
1)IBM Research – Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland.
2)Department of Information Technology and Electrical Engineering, ETH Zürich, Gloriastrasse 35, 8092 Zürich,
Switzerland.
Traditional neural networks require enormous amounts of data to build their complex mappings during a slow
training procedure that hinders their abilities for relearning and adapting to new data. Memory-augmented
neural networks enhance neural networks with an explicit memory to overcome these issues. Access to this
explicit memory, however, occurs via soft read and write operations involving every individual memory entry,
resulting in a bottleneck when implemented using the conventional von Neumann computer architecture. To
overcome this bottleneck, we propose a robust architecture that employs a computational memory unit as
the explicit memory performing analog in-memory computation on high-dimensional vectors, while closely
matching 32-bit software-equivalent accuracy. This is enabled by a content-based attention mechanism that
represents unrelated items in the computational memory with uncorrelated high-dimensional vectors, whose
real-valued components can be readily approximated by binary, or bipolar components. Experimental results
demonstrate the efficacy of our approach on few-shot image classification tasks on the Omniglot dataset using
more than 256,000 phase-change memory devices.
I. INTRODUCTION
Recurrent neural networks (RNNs) are able to learn and perform transformations of data over extended periods
of time, which makes them Turing-complete1. However, the intrinsic memory of an RNN is stored in the vector of
hidden activations, and this could lead to catastrophic forgetting2. Moreover, the number of weights, and hence the
computational cost grows exponentially with memory size. To overcome this limitation, several memory-augmented
neural network (MANN) architectures were proposed in recent years3–7 that separate the information processing from
memory storage.
What the MANN architectures have in common is a controller, which is a recurrent or feedforward neural network
model, coupled with a structured explicit memory. The controller can write to, and read from, the explicit memory,
which is implemented as a content addressable memory (CAM), also called associative memory in
many architectures3,4,8. Therefore, new information can be offloaded to the explicit memory, where it does not
risk overwriting previously learned information, subject to the memory capacity. This feature enables
one-/few-shot learning, where new concepts can be rapidly assimilated from a few training examples of never-seen-
before classes by writing them to the explicit memory6. The CAM in MANN architectures is composed of a key memory
(for storing and comparing learned patterns) and a value memory (for storing labels) that are jointly referred to as a
key-value memory5.
Access to the key memory occurs via soft read and write operations, which involve every individual memory entry
instead of a single discrete entry. Between the controller and the key memory there is a content-based attention mech-
anism that computes a similarity score for each memory entry with respect to a given query, followed by sharpening
and normalization functions. The resulting attention vector serves to read out the value memory3. This may lead to
extremely memory-intensive operations, contributing to 80% of the execution time9, and quickly forming a bottleneck when
implemented in conventional von Neumann architectures (e.g., CPUs and GPUs), especially for tasks demanding
thousands to millions of memory entries4,10. Moreover, complementary metal–oxide–semiconductor (CMOS) imple-
mentation of key memories is affected by leakage, area, and volatility issues, limiting their capabilities for lifelong
learning11.
One promising alternative is to realize a key memory with non-volatile memory (NVM) devices that can also serve
as computational memory to efficiently execute such memory intensive operations10,11. Initial simulation results have
suggested key memory architectures using NVM devices such as spintronic devices10, resistive random access memory
(RRAM)12, and ferroelectric field-effect transistors (FeFETs)13. To map a vector component in the key memory,
devices have either been simulated with high multibit precision10, or multiple ternary CAM (TCAM) cells have been combined with
a)These two authors contributed equally
b)Electronic mail: ase@zurich.ibm.com
c)Electronic mail: abr@zurich.ibm.com
arXiv:2010.01939v1 [cs.ET] 5 Oct 2020
intermediate mapping functions and encoding to obtain a binary code13,14. Besides these simulation results, a recent
prototype has demonstrated the use of a very small-scale 2×2 TCAM array based on FeFETs11.
However, the use of TCAM limits such architectures in many aspects. First, TCAM arrays find an exact match
between the query vector and the key memory entries, or in the best case can compute the degree of match up to
very few bits (i.e., limited-precision Hamming distance)11,15, which fundamentally restricts the precision of the search.
Secondly, a TCAM cannot support widely-used metrics such as cosine distance. Thirdly, a TCAM is mainly used for
binary classification tasks16, because it only finds the first-nearest neighbour (i.e., the minimum Hamming distance),
which degrades its performance for few-shot learning, where the similarities of a set of intra-class memory entries
should be combined. Furthermore, a key challenge associated with using NVM devices and in-memory computing is
the low computational precision resulting from the intrinsic randomness and device variability17. Hence there is a need
for learned representations that can be systematically transformed to robust bipolar/binary vectors at the interface
of controller and key memory, for efficient inference as well as operation at scale on NVM-based hardware.
One viable option is to exploit robust binary vector representations in the key memory as used in high-dimensional
(HD) computing18, also known as vector symbolic architectures19. This emerging computing paradigm postulates
the generation, manipulation, and comparison of wide vectors that take inspiration from attributes of brain circuits
including high-dimensionality and fully distributed holographic representation. When the dimensionality is in the
thousands, (pseudo)randomly generated vectors are virtually orthogonal to each other with very high probability20.
This leads to inherently robust and efficient behaviour tailor-made for RRAM21 and phase-change memory (PCM)22
devices operating at low signal-to-noise ratio conditions. Further, the disentanglement of information encoding and
memory storage is at the core of HD computing that facilitates rapid and lifelong learning18–20. According to this
paradigm, for a given classification task, generation and manipulation of the vectors are done in an encoder designed
using HD algebraic operations to correspond closely with the task of interest, whereas storage and comparison of the
vectors is done with an associative memory18. Instead, in this work, we provide a methodology to substitute the
process of designing a customized encoder with an end-to-end training of a deep neural network such that it can be
coupled, as a controller, with a robust associative memory.
In the proposed algorithmic-hardware codesign approach, first, we propose a differentiable MANN architecture in-
cluding a deep neural network controller that is adapted to conform with the HD computing paradigm for generating
robust vectors to interface with the key memory. More specifically, a novel attention function guides the powerful
representation capabilities of our controller to store unrelated items in the key memory as uncorrelated HD vectors.
Secondly, we propose approximations and transformations to instantiate a hardware-friendly architecture from our
differentiable architecture for solving few-shot learning problems with a key memory implemented as a computational
memory. Finally, we verify the integrated inference functionality of the architecture through large-scale mixed hard-
ware/software experiments, in which for the first time the largest Omniglot problem (100-way 5-shot) is established,
and efficiently mapped on 256,000 PCM devices performing analog in-memory computation on 512-bit vectors.
II. RESULTS
A. Proposed MANN architecture using in-memory computing
In the MANN architecture, the key-value memory remains mostly independent of the task and input type, while
the controller should be fitted to the task and especially the input type. Few-shot Omniglot23 image classification
has meanwhile established itself as a benchmark for the MANNs6,11,13,14. Commonly known as the transpose of
the MNIST dataset, the Omniglot dataset contains over 1600 classes but only a few samples per class. For this few-
shot classification problem, convolutional neural networks (CNNs) are excellent controllers that provide an embedding
function to map the input image to an internal feature representation. A few-shot classification problem is determined
by the number of ways (i.e., classes to distinguish), and shots (i.e., samples per class to learn from). We define the
support set as the collection of samples from different classes that the model learns from. The query batch is a
collection of samples drawn from the same set of classes as the support set. A key-value memory creates as many
entries as the product of the number of ways by the number of shots. These entries are not accessed by stating a
discrete address, but by comparing a query from the controller’s side with all entries.
In our proposed MANN architecture, schematically depicted in Fig. 1, we provide a methodology for training a CNN
controller to encode complex image inputs to vectors conforming with the HD computing properties. These properties
assign dissimilar images to uncorrelated HD vectors that can be stored, or compared with vectors already stored in an
associative memory as the key-value memory with extreme robustness. Our architecture keeps the interface between
the controller and key-value memory differentiable by using a novel sharpening function, and therefore optimizable
by gradient descent methods. The learning phase uses an episodic training procedure for the CNN by solving various
few-shot problem sets that gradually enhance the quality of the mapping by exploiting classification errors (see the
learning phase in Fig. 1). Those errors are represented as a loss, which is propagated all the way back to the controller,
whose parameters are then updated to counter this loss and to reach maturity; other optional loss functions can be
considered to closely tune the desired distribution of HD vectors. In this supervised step, the controller is updated
by learning from its own mistakes (also referred to as meta-learning). The controller finally learns to discern different
image classes, mapping them far away from each other in the HD feature space.
The inference phase comprises both giving the model a few examples—that are mutually exclusive from the classes
that were presented during the learning phase—to learn from, and inferring an answer with respect to those examples.
During this phase, when a never-seen-before image is encountered, the controller quickly encodes and updates the
key-value memory contents, that can be later retrieved for classification. This avoids relearning controller parameters
through otherwise expensive and iterative training (see Supplementary Note 1). While our architecture is kept
continuous to avoid violating the differentiability during the learning phase, it is simplified for the inference phase
by applying transformations and approximations to derive a hardware-friendly version. These transformations enable
the key memory to readily use equiprobable binary or bipolar representations with memristive devices, and the
approximations further simplify the inference operations (see the inference phase in Fig. 1). The memristive devices
are assembled in a crossbar array, and the similarity search is efficiently computed as dot product by exploiting
Kirchhoff’s circuit laws in O(1) time complexity. This combination of binary/bipolar computational memory and
the mature controller efficiently handles few-shot learning and classification of incoming unseen examples on the fly
without the need for fine tuning the controller weights.
B. A new attention mechanism appropriate for HD geometry
HD computing starts by assigning a set of random vectors to represent unrelated items, e.g., different letters of an
alphabet18. The HD vector representation can be of many kinds (e.g., real and complex24, bipolar25, or binary26);
however, the key properties are shared independent of the representation, and serve as a robust computational
infrastructure18,21. In HD space, two randomly chosen vectors are nearly orthogonal with very high probability,
which has significant consequences for robust implementation. For instance, when unrelated items are represented by
nearly orthogonal 10,000-bit vectors, more than a third of the bits of a vector can be flipped by randomness, device
variations, and noise, and the faulty vector can still be identified with the correct one, as it is closer to the original
error-free vector than to any unrelated vector chosen so far, with near certainty18. It is therefore highly desirable for
a MANN controller to map samples from different classes, which should be dissimilar in the input space, to nearly
orthogonal vectors in the HD feature space. Besides this inherent robustness, finding nearly orthogonal vectors in
high dimensions is easy and incremental18. In the following, we define conditions under which an attention function
achieves this goal.
Let α be a similarity metric (e.g., cosine similarity) and ε a sharpening function. Then σ is the attention function
$$\sigma(\mathbf{q}, \mathbf{K}_i) = \frac{\epsilon(\alpha(\mathbf{q}, \mathbf{K}_i))}{\sum_{j=1}^{mn} \epsilon(\alpha(\mathbf{q}, \mathbf{K}_j))}, \qquad \alpha(\mathbf{q}, \mathbf{K}_i) = \frac{\mathbf{q} \cdot \mathbf{K}_i^T}{\|\mathbf{q}\| \, \|\mathbf{K}_i\|}$$
where q is a query vector, Ki is a support vector in the key memory, m is the number of ways (i.e., classes), and n is
the number of shots. The attention function performs the (cosine) similarity comparison across the support vectors in
the key memory, followed by sharpening and normalization to compute its output as an attention vector w = σ(q,K)
(see Methods). The cosine similarity has a domain and range of α : Rd × Rd → [−1, 1], where α(x,y) = 1 means
x and y are perfectly similar or correlated, α(x,y) = 0 means they are perfectly orthogonal or uncorrelated, and
α(x,y) = −1 means they are perfectly anticorrelated. From the point of view of attention, two dissimilar (i.e.,
uncorrelated) vectors should lead to a focus close to 0. Therefore, ε should satisfy the following condition:
$$\epsilon(\alpha(\mathbf{x},\mathbf{y})) \approx 0 \quad \text{when} \quad \alpha(\mathbf{x},\mathbf{y}) \approx 0. \qquad (1)$$
Equation 1 ensures that there is no focus between a query vector and a dissimilar support vector. The sharpening
function should also satisfy the following inequalities:
$$\epsilon(\alpha) \geq 0 \qquad (2)$$
$$\epsilon(\alpha_1) \leq \epsilon(\alpha_2) \quad \text{when} \quad \alpha_1 < \alpha_2 \text{ and } \alpha_1, \alpha_2 > 0 \qquad (3)$$
$$\epsilon(\alpha_1) \geq \epsilon(\alpha_2) \quad \text{when} \quad \alpha_1 < \alpha_2 \text{ and } \alpha_1, \alpha_2 < 0, \qquad (4)$$
Equation 2 implies non-negative weights in the attention vectors, whereas Equations 3 and 4 imply a strictly mono-
tonically decreasing function on the negative axis and a strictly monotonically increasing function on the positive axis.
Among a class of sharpening functions that can meet the above-mentioned conditions, we propose a soft absolute
(softabs) function:
$$\epsilon(\alpha) = \frac{1}{1 + e^{-\beta(\alpha - 0.5)}} + \frac{1}{1 + e^{-\beta(-\alpha - 0.5)}}$$
where β = 10 is a stiffness parameter, which leads to ε(0) = 0.0134.
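For concreteness, a minimal NumPy sketch of the softabs sharpening and the resulting attention computation follows; this is illustrative code in our own notation, not the implementation used in the experiments.

```python
import numpy as np

def softabs(alpha, beta=10.0):
    """Soft absolute sharpening: close to zero for uncorrelated vectors (alpha ~ 0)
    and approaching one for strongly correlated or anticorrelated vectors (|alpha| -> 1)."""
    return (1.0 / (1.0 + np.exp(-beta * (alpha - 0.5)))
            + 1.0 / (1.0 + np.exp(-beta * (-alpha - 0.5))))

def attention(q, K):
    """Cosine similarity of a query against all rows of the key memory,
    followed by softabs sharpening and normalization (Section II B)."""
    alpha = (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))
    sharp = softabs(alpha)
    return sharp / sharp.sum()

print(round(softabs(0.0), 4))  # 0.0134, as stated above
```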
As a common attention function, in various works3,4,6,8 the cosine similarity is followed by a softmax operation
that uses an exponential function as sharpening function (ε(α) = e^α). However, the exponential sharpening function
does not satisfy the above-mentioned conditions, and leads to undesired consequences related to the cost function
optimization. In fact, when a query vector q belongs to a different class than some support vector Ki and they
are nearly orthogonal to each other, then nevertheless wi > 0, where wi = σ(q,Ki). This eventually leads to a
probability pj > 0 for class j of support vector i. During model training, a well chosen cost function will penalize
probabilities larger than zero for classes different from the query’s class, and thus force the probability towards 0. This
also forces eα towards 0, or α towards −∞. However, α only has a range of [−1, 1] and the optimization algorithm will
therefore try to make α as small as possible, corresponding to anticorrelation instead of uncorrelation. Consequently
the softmax function unnecessarily leads to anticorrelated instead of uncorrelated vectors, as the controller is forced
to map samples of different classes to those vectors.
The proposed softabs sharpening function leads to uncorrelated vectors for different classes, as they would have
been randomly drawn from the HD space to robustly represent unrelated items (see Fig. 2a and b). It can be seen
that the learned representations by softabs bring the support vectors of the same class close together in the HD
space, while pushing the support vectors of different classes apart. This vector assignment provides higher accuracy,
and retains robustness even when the HD real vectors are transformed to bipolar. Compared to the softmax, the
softabs sharpening function effectively improves the separation margin between inter-class and intra-class similarity
distributions (Fig. 2c and d), and therefore achieves up to 5.0%, 9.6%, and 19.6% higher accuracy in 5-way 1-shot, 20-way
5-shot, and 100-way 5-shot problems, respectively (Fig. 2e). By using this new sharpening function, our architecture
not only makes the end-to-end training with backpropagation possible, but also learns the HD vectors with the proper
direction. In the next sections, we describe how this architecture can be simplified, approximated, and transformed
to a hardware-friendly architecture optimized for efficient and robust inference on memristive devices.
C. Bipolar key memory: transforming real-valued HD vectors to bipolar
A key memory trained with real-valued support vectors results in two considerable issues for realization in memristive
crossbars. First, the representation of real numbers demands analog storage capability. This significantly increases the
requirements on the NVM device, and may require a large number of devices to represent a single vector component.
Second, a memristive crossbar which computes a matrix-vector product in a single cycle is not directly applicable for
computing cosine similarities that are at the very core of the MANN architectures. For a single query, the similarities
between the query vector q and all the support vectors in the key memory need to be calculated, which involves
computing the norm of mn + 1 vectors. An approximation strives to use the absolute-value norm instead of the square
root10; however, it still involves a vector-dependent scaling of each similarity metric, requiring additional circuitry to
be included in the computational memory.
HD computing offers the tools and the robustness to counteract the aforementioned shortcomings of the real
number representation by relying on dense bipolar or binary representation. As common properties in these dense
representations, the vector components can occupy only two states, and pseudo-randomness leads to approximately
equally likely occupied states (i.e., equiprobability). We propose simple and dimensionality-preserving transformations
to directly modify real-valued vectors to dense bipolar and dense binary vectors. This is in contrast to prior work11,13,14
that involves additional quantization, mapping, and coding schemes. In the following, we describe how our systematic
transition first transforms the real-valued HD vectors to bipolar. Subsequently, we describe how the resulting bipolar
HD vectors can be further transformed to binary vectors.
The output of the controller is a d-dimensional real vector as described in Section II B. During the training phase,
the real-valued vectors are directly written to the key memory. However, during the inference phase, the support
vector components generated by the mature controller can be clipped by applying an activation function as shown
in Fig. 1. This function is the sign function for bipolar representations. The key memory then stores the bipolar
components. Afterwards, the query vectors that are generated by the controller also undergo the same component
transformation, to generate a bipolar query vector during the inference phase. The reliability of this transformation
derives from the fact that clipping approximately preserves the direction of HD vectors27.
The main benefit of the bipolar representation is that every two-state component is mapped on two binary devices
(see Supplementary Figure 1). Further, bipolar vectors with the same dimensionality always have the same norm:
$\|\hat{\hat{\mathbf{x}}}\| = \sqrt{d}$, $\hat{\hat{\mathbf{x}}} \in \{-1,+1\}^d$, where $\hat{\hat{\mathbf{x}}}$ denotes a bipolar vector. This renders the cosine similarity between two vectors
as a simple, constant-scaled dot product, and turns the comparison between a query and all support vectors into a
single matrix-vector operation:
$$\alpha(\hat{\hat{\mathbf{a}}}, \hat{\hat{\mathbf{b}}}) = \frac{1}{d}\, \hat{\hat{\mathbf{a}}} \cdot \hat{\hat{\mathbf{b}}}^T \qquad (5)$$
$$\mathbf{w} = \frac{1}{d}\, \hat{\hat{\mathbf{q}}} \cdot \hat{\hat{\mathbf{K}}}^T \qquad (6)$$
As a result, the normalization in the cosine similarity (i.e., the product of norms in the denominator) can be removed
during inference. The requirement to normalize the attention vectors is also removed (see the inference phase in
Fig. 1).
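A minimal sketch of this inference-time simplification, with randomly generated vectors standing in for the mature controller's outputs:

```python
import numpy as np

def to_bipolar(x):
    """Sign activation: clip real-valued controller outputs to {-1, +1}."""
    return np.where(x >= 0, 1.0, -1.0)

d, m, n = 512, 100, 5
rng = np.random.default_rng(0)
K = to_bipolar(rng.standard_normal((m * n, d)))   # bipolar key memory entries
q = to_bipolar(rng.standard_normal(d))            # bipolar query vector

# Equations (5)-(6): with a fixed norm of sqrt(d), the cosine similarity reduces to
# a constant-scaled dot product; one matrix-vector product compares the query
# against all support vectors, so no normalization of the attention vector is needed.
w = (K @ q) / d
```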
D. Binary key memory: transforming bipolar HD vectors to binary
To obtain an even simpler binary representation for the key memory, we used the following simple linear equation
to transform the bipolar vectors into binary vectors
$$\hat{\mathbf{x}} = \frac{1}{2}\left(\hat{\hat{\mathbf{x}}} + 1\right) \qquad (7)$$
where $\hat{\mathbf{x}}$ denotes the binary vector. Unlike the bipolar vectors, the binary vectors do not necessarily maintain a constant
norm, which affects the simplicity of the cosine similarity in Equation 5. However, the HD property of pseudo-randomness
comes to the rescue. By initializing the controller’s weights randomly, and expanding the vector dimensionality,
we have observed that the vectors at the output of the controller exhibit the HD computing property of pseudo-
randomness. If $\hat{\hat{\mathbf{x}}}$ has a nearly equal number of −1 and +1 components, then after transformation with Equation 7
this also holds for $\hat{\mathbf{x}}$ in terms of the number of 0 and 1 components, leading to $\|\hat{\mathbf{x}}\| \approx \sqrt{d/2}$. Hence the transformation
given by Equation 7 approximately preserves the cosine similarity as shown below
$$\alpha(\hat{\mathbf{a}}, \hat{\mathbf{b}}) \approx \frac{2}{d}\, \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}^T \qquad (8)$$
$$= \frac{1}{2d}\, (\hat{\hat{\mathbf{a}}} + 1) \cdot (\hat{\hat{\mathbf{b}}} + 1)^T$$
$$= \frac{1}{2d} \Big( \hat{\hat{\mathbf{a}}} \cdot \hat{\hat{\mathbf{b}}}^T + \underbrace{\textstyle\sum_i \hat{\hat{a}}_i + \sum_i \hat{\hat{b}}_i}_{\approx\, 0} + d \Big)$$
$$\approx \frac{1}{2} \left( \frac{1}{d}\, \hat{\hat{\mathbf{a}}} \cdot \hat{\hat{\mathbf{b}}}^T + 1 \right)$$
$$= \frac{1}{2} \left( \alpha(\hat{\hat{\mathbf{a}}}, \hat{\hat{\mathbf{b}}}) + 1 \right),$$
where the approximation between the third and fourth lines is attributed to the approximately equal number of −1 and +1
components. We have observed that the transformed vectors at the output of the controller exhibit a 2.08% deviation
from the fixed norm of $\sqrt{d/2}$, for d = 512 (see Supplementary Note 2). Because this deviation is not significant, we
have used the transformed binary vectors in our inference experiments. We also show that this deviation can be
further reduced to 0.91% by training the controller to closely learn the equiprobable binary representations, using a
regularization method that drives the HD binary vectors towards a fixed norm (see Supplementary Note 2).
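The transformation of Equation 7 and the approximation of Equation 8 can be checked numerically; the sketch below uses pseudo-random bipolar vectors as a stand-in for controller outputs:

```python
import numpy as np

d = 512
rng = np.random.default_rng(1)
a2, b2 = rng.choice([-1.0, 1.0], size=(2, d))     # bipolar vectors
a1, b1 = 0.5 * (a2 + 1), 0.5 * (b2 + 1)           # binary vectors via Equation 7

print(np.linalg.norm(a1), np.sqrt(d / 2))         # norm is approximately sqrt(d/2)

cos_bipolar = (a2 @ b2) / d                        # Equation (5)
cos_binary = 2.0 / d * (a1 @ b1)                   # Equation (8), left-hand side
print(cos_binary, 0.5 * (cos_bipolar + 1))         # approximately equal
```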
The proposed architecture with a binary key memory is shown in Fig. 3. Its major block is the computational key
memory that is implemented in one memristive crossbar array with some peripheral circuitry for read-out. The key
memory stores the dense binary representations of support vectors, and computes the dot products as the similarities
thanks to the binary vectors with the approximately fixed norm. The value memory is 5×–100× smaller than
the key memory, depending upon the number of ways, and stores sparse one-hot support labels that are not robust
against variations (see Methods). Therefore, the value memory is implemented in software, where class-wise similarity
responses are accumulated, followed by finding the class with maximum accumulated response (for more details on
sum-argmax ranking see Supplementary Note 3).
E. Experimental results
Here, we present experimental results where the key memory is mapped to PCM devices and the similarity search
is performed using a prototype PCM chip. We use a simple two-level configuration, namely SET and RESET
conductance states, programmed with a single pulse (see Methods).
The experimental results for few-shot problems with varying complexities are presented. For the Omniglot
dataset, a few problems have established themselves as standards, such as 5-way and 20-way, each with 1-shot and 5-shot
settings11,13,14,28–32. There has been no effort to scale to more complex problems (i.e., more ways/shots) on the
Omniglot dataset so far. This is presumably due to the exponentially increasing computational complexity of the
involved operations, especially the similarity operation. While the “complexity” of writing the key memory scales
linearly with increasing number of ways/shots, the similarity operation (reading) has constant complexity O(1) on
memristive crossbars. We have therefore extended the repertoire of standard Omniglot problems up to 100-way
problems (combined with 1-shot and 5-shot). For each of these problems we show the software classification accuracy
for 32-bit floating point real number, bipolar and binary representations in Fig. 4a. To simplify the inference execu-
tions, we approximate the softabs sharpening function with a regular absolute function ($\epsilon_{\text{inference}}(\alpha) = |\alpha|$), which is
bypassed for the binary representation due to its always positive similarity scores (see Supplementary Note 4). This
is the only approximation made in the software inference, hence Fig. 4a reflects the net effect of transforming vector
representations: a maximum of 0.45% accuracy drop (94.53% vs. 94.08%) is observed by moving from the real to the
bipolar representation among all three problems. The accuracy drop from the bipolar to the binary is rather limited
to 0.11% because both representations use the cosine similarity, otherwise the drop can be as large as 1.13% by using
the dot product (see Supplementary Note 4). This accuracy drop in the binary representation can be reduced by
using a regularizer as shown in Supplementary Note 2.
We then show in Fig. 4b the classification accuracy of our hardware-friendly architecture that uses the dot product to
approximate the cosine similarity. The architecture adopts both binary and bipolar representations in three settings:
1) an ideal crossbar in the software with no PCM variations; 2) a PCM model to capture the non-idealities such
as drift variability and read out noise variability (see Methods); 3) the actual experiments on the PCM hardware.
As shown the PCM model accuracy is closely matched (±0.2%) by the PCM experiments. By going from the ideal
crossbar to the PCM experiment, a maximum of 1.12% accuracy drop (92.95% vs. 91.83%) is observed for the 100-
way 5-shot problem when using the binary representation (or 0.41% when using the bipolar representation). This
accuracy drop is caused by the non-idealities in the PCM hardware that could be otherwise larger without using the
sum-argmax ranking as shown in Supplementary Note 3. For the other smaller problems, the accuracy differences are
within 0.58%. Despite the variability of the SET state of the key memory crossbar at the selected conductance
(see Supplementary Note 5), our binary representations are therefore sufficiently robust against these deviations.
To further verify the robustness of the key memory, we conducted a set of simulations with the PCM model in
Fig. 4c and d. We take the 5-way 1-shot and the 100-way 5-shot problems and compute the accuracy achieved by the
architecture with respect to different levels of relative conductance variations. It can be seen that both the binary
and the bipolar architectures closely maintain their original accuracies (with a maximum of 0.75% accuracy drop) for
up to 31.7% relative conductance deviation in the two problems. This robustness is accomplished by associating each
individual item in the key memory with a HD vector pointing to the appropriate direction. At the extreme case of
100% conductance variation, the binary architecture accuracy degrades by 5.1% and 4.1%, respectively, for the 5-way
1-shot and the 100-way 5-shot problems. The accuracy of the bipolar architecture, with a number of devices doubled
with respect to the binary architecture, degrades only by 0.93% and 0.58%, respectively, for the same problems. The
bipolar architecture exhibits higher classification accuracy and robustness compared to the binary architecture, even
with an equal number of devices, as further discussed in the next section, and illustrated in Supplementary Figure 2.
III. DISCUSSION
HD computing offers a framework for robust manipulations of large patterns to such an extent that even ignoring up
to a third of vector coordinates still allows reliable operation18. This makes it possible to adopt noisy, but extremely
efficient devices for offloading similarity computations inside the key memory. Memristive devices such as PCM often
exhibit high conductance variability in an array, especially when the devices are programmed with a single-shot (i.e.,
one RESET/SET pulse) to avoid iterative program-and-verify procedures that require complex circuits and much
higher energy consumption22. While the RESET state variations are not detrimental because of the small conductance
value, the significant SET state variations of up to a relative standard deviation of 50% could affect the computational
accuracy. Equation (9) provides an intuition about the relationship between deviations in the cosine similarity and
the relative SET state variability:
$$\sigma(\Lambda) = \sqrt{\frac{2\alpha}{d}}\, \sigma_{\mathrm{rel}} \qquad (9)$$
where Λ denotes the result of the noisy cosine similarity operation, α denotes the cosine similarity value between the
noise-free vectors, d is the dimensionality of the vectors, and σrel is the relative SET state variability. It states that
the standard deviation of the noisy cosine similarity inversely scales with the square root of the vector dimensionality.
See Supplementary Note 6 for the proof of Equation (9). Supplementary Figure 3 provides a graphical illustration
of how the robustness of the similarity measurement improves by increasing the vector dimensionality. Hence, even
with an extremely high conductance variability, the deviations in the measured cosine similarity are tolerable when
going to higher dimensions. In the case of our experiments with σrel = 31.7% (see Methods), a dimensionality of
d = 512 and a theoretical cosine similarity of α = 0.5 (e.g., for uncorrelated vectors in the binary representations),
the standard deviation in the measured cosine distance, σ(Λ), is ≈ 0.015.
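Plugging the experimental values into Equation (9) reproduces this number:

```python
import numpy as np

d, alpha, sigma_rel = 512, 0.5, 0.317
sigma_lambda = np.sqrt(2 * alpha / d) * sigma_rel   # Equation (9)
print(sigma_lambda)                                 # ~0.014, i.e. the ~0.015 quoted above
```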
The values for the standard deviations chosen in Fig. 4c and d are extremely high, yet the performance, particularly
that of the bipolar architecture, is impressive. This could be mainly due to using twice the number of devices per
vector, as the bipolar vectors have to be transformed into binary vectors with dimensionality 2d to be stored on the
memristive crossbar. However, when the binary and bipolar architectures use an equal number of devices (i.e., a bipolar
architecture operating at half dimension of binary architecture), the bipolar architecture still exhibits lower accuracy
degradation as the conductance variations are increased (see Supplementary Figure 2). This could be attributed to
better approximating the cosine similarity than the architecture with the transformed binary representation. Moreover,
the softabs sharpening function is well-matched to the bipolar vectors that are produced by directly clipping the real-
valued vectors, whereas there could be other sharpening functions that favor learning the binary vectors.
Using a single nanoscale PCM device to represent each component of a 512-bit vector leads to a very high density
key memory. Furthermore, the reliance on analog in-memory computing can lead to substantial energy saving. It
was previously shown that an associative memory architecture based on PCM devices with a similar structure but
10,000-bit vectors and 21 classes achieved over 600% higher energy efficiency compared to a digital implementation22.
The key memory can also be realized using other forms of in-memory computing based on resistive random access
memory33 or even charge-based approaches34. There are also several avenues to improve the efficiency of the controller.
Currently it is realized as a deep neural network with four convolutional layers and one fully connected layer (see
Methods). To achieve further improvements in the overall energy efficiency, the controller could be formulated as a
binary neural network35, instead of using the conventional deep network with a clipping activation function at the
end. Another potential improvement for the energy efficiency of the controller is by implementing each of the deep
network layers on memristive crossbar arrays36,37.
Besides the few shot classification task that we highlighted in this work, there are several tantalizing prospects for
the HD learned patterns in the key memory. They form vector-symbolic representations that can directly be used
for reasoning, or multi-modal fusion across separate networks38. The key-value memory also becomes the central
ingredient in many recent models for unsupervised and contrastive learning39–41 where a huge number of prototype
vectors should be efficiently stored, compared, compressed, and retrieved.
In summary, we propose to exploit the robust binary vector representations of HD computing in the context of
MANNs, to perform analog in-memory computing. We provide a novel methodology to train the CNN controller to
conform with the HD computing paradigm that aims at first, generating holographic distributed representations with
equiprobable binary or bipolar vector components. Subsequently, dissimilar items are mapped to uncorrelated vectors
through a similarity-preserving item-to-vector assignment. The former goal is closely met by setting the controller–memory
interface to operate in the HD space, by random initialization of the controller, and by real-to-binary transformations
that preserve the dimensionality and approximately the distances. The quality of representations can be further
improved by a regularizer if needed. The latter goal is met by defining the conditions under which an attention
function can be found to guide the item to vector assignment such that the semantically unrelated vectors are pushed
further away than the semantically related vectors. With this methodology, we have shown that the controller
representations can be directed toward robust bipolar or binary representations. This allows implementation of the
binary key memory on 256,000 noisy but highly efficient PCM devices, with less than 2.7% accuracy drop compared to
the 32-bit real-valued vectors in software (94.53% vs. 91.83%) for the largest problem ever-tried on the Omniglot. The
bipolar key memory causes less than 1% accuracy loss. The critical insight provided by our work, namely, directed
engineering of HD vector representations as explicit memory for MANNs, facilitates efficient few-shot learning
tasks using in-memory computing. It could also enable applications beyond classification such as symbolic-level fusion,
compression, and reasoning.
METHODS
Experimental details
For the experiments, we use a host computer running a Matlab environment to coordinate the experiments, which
is connected via Ethernet with an experimental platform comprising two FPGAs and an analog front end that
interfaces with a prototype PCM chip22. The phase-change memory (PCM) chip contains PCM cells that are based
on doped Ge2Sb2Te5 (d-GST) and are integrated in 90 nm CMOS baseline technology. In addition to the PCM cells,
the prototype chip integrates the circuitry for cell addressing, on-chip 8-bit ADC for cell readout, and voltage- or
current-mode cell programming. The experimental platform comprises the following main units:
• a high-performance analog-front-end (AFE) board that contains the digital-to-analog converters (DACs) along
with discrete electronics, such as power supplies, voltage, and current reference sources,
• an FPGA board that implements the data acquisition and the digital logic to interface with the PCM device
under test and with all the electronics of the AFE board, and
• a second FPGA board with an embedded processor and Ethernet connection that implements the overall system
control and data management as well as the interface with the host computer.
The PCM array is organized as a matrix of 512 word lines (WL) and 2048 bit lines (BL). The PCM cells were
integrated into the chip in 90 nm CMOS technology using the key-hole process42. The selection of one PCM cell
is done by serially addressing a WL and a BL. The addresses are decoded and they then drive the WL driver and
the BL multiplexer. The single selected cell can be programmed by forcing a current through the BL with a voltage-
controlled current source. It can also be read by an 8-bit on-chip ADC. For reading a PCM cell, the selected BL is
biased to a constant voltage of 300 mV by a voltage regulator. The sensed current, Iread, is integrated by a capacitor,
and the resulting voltage is then digitized by the on-chip 8-bit cyclic ADC. The total time of one read is 1 µs. For
programming a PCM cell, a voltage Vprog generated off-chip is converted on-chip into a programming current, Iprog.
This current is then mirrored into the selected BL for the desired duration of the programming pulse. The RESET
pulse is a box-type rectangular pulse with a duration of 400 ns and an amplitude of 450 µA. The SET pulse is a ramp-down
pulse with a total duration of approximately 12 µs. This programming scheme yields a 0 S conductance for the RESET
state and a 22.8 µS average conductance with 31.7% variability for the SET state (see Supplementary Table 1).
For the experiments on Omniglot classification, the MANN is implemented as a TensorFlow43 model. For testing,
the binarized, or bipolarized query and support vectors are stored in files. These are then accessed by the Matlab
environment and either programmed onto the PCM devices (support vectors) or applied as read-out voltages (query
vectors) in sequence.
When it comes to programming, in the case of binary representation, all elements of a support vector are pro-
grammed along a bit line so that binary 1 elements are programmed at SET state and binary 0 elements are pro-
grammed at RESET state. For bipolar representation, the support vectors are programmed along a pair of adjacent
bit lines so that +1 elements are programmed to SET state at the corresponding wordline indices of the left bit line
while -1 elements are programmed to SET state at the corresponding wordline indices of the right bit line. The rest of
the locations in the PCM array are programmed to RESET state in bipolar experiments (see Supplementary Figure
1). The relative placement of each support vector is arbitrarily determined for each episode independently.
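A sketch of this placement follows; the function and conductance values are illustrative, and in the experiments the programming is performed through the chip's addressing and programming circuitry described above.

```python
import numpy as np

def map_bipolar_to_crossbar(K_bipolar, g_set=22.8e-6, g_reset=0.0):
    """Map each bipolar support vector to a pair of adjacent bit lines:
    +1 components -> SET on the left bit line of the pair,
    -1 components -> SET on the right bit line; all other cells stay RESET."""
    mn, d = K_bipolar.shape
    G = np.full((d, 2 * mn), g_reset)          # word lines x bit lines
    for col, k in enumerate(K_bipolar):
        G[k == +1, 2 * col] = g_set
        G[k == -1, 2 * col + 1] = g_set
    return G

def map_binary_to_crossbar(K_binary, g_set=22.8e-6, g_reset=0.0):
    """Binary case: one bit line per support vector, 1 -> SET and 0 -> RESET."""
    return np.where(K_binary.T == 1, g_set, g_reset)
```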
Since the PCM devices are only accessible sequentially, we measure the analog read-out currents for each device
separately using the on-chip ADC and compute the reduced sum along the bitlines digitally in order to obtain the
attention values. The attention values are in turn stored in files again, which are accessed by the TensorFlow model
to finalize the emulation of the key memory.
Omniglot dataset: evaluation and symbol augmentation
The Omniglot dataset is the most popular benchmark for few-shot image classification23. It is comprised of 1623
different characters from 50 alphabets, each drawn by 20 different people, hence 32460 samples in total. These
data are organized into a training set comprising samples from 964 character types (approximately 60%) from 30
alphabets, and a testing set comprising 659 character types (approximately 40%) from 20 alphabets, such that there
is no overlap of characters (hence, classes) between the training set and the testing set. Before going into the details
of the procedure we used to evaluate a few-shot model, we will present some terminology. A problem is defined as a
specific configuration of number of ways and shots parameters. A run is defined as a fresh random initialization of
the model (with respect to its weights), followed by training the model (i.e., the learning phase), and finally testing
its performance (i.e., the inference phase).
During a run, a model that is trained usually starts underfitted, at some point reaches the optimal fit and then
overfits. Therefore it makes sense to validate the model at frequent checkpoint intervals during training. A certain
proportion of the training set data (typically 15%) is reserved as the validation set for this purpose. The number of
queries that is evaluated on a selected support set is called batch size. We set the batch size to 32 during both learning
and inference phases.
One evaluation iteration of the model and update of weights in the learning phase, concerning a certain query
batch, is called an episode. During an episode, first the support set is formed by randomly choosing n samples (shots)
from randomly chosen m classes (ways) from training/validation/testing set. Then the query batch is formed by
randomly choosing from the remaining samples from the same classes used for the support set. At the end of an
episode, the ratio between number of correctly classified queries in the batch versus the total queries in the batch is
calculated. This ratio when averaged across the episodes is called training accuracy, validation accuracy, or testing
accuracy depending on the source of the data used for the episode.
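A sketch of how a single episode could be assembled from a labelled image split; `images_by_class` is a hypothetical mapping from class name to a list of image arrays, not the authors' data pipeline.

```python
import numpy as np

def sample_episode(images_by_class, m_ways, n_shots, batch_size=32, rng=None):
    """Draw an m-way n-shot support set and a query batch from the same m classes."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(images_by_class), size=m_ways, replace=False)
    support, queries = [], []
    for label, c in enumerate(classes):              # relative labelling per episode
        order = rng.permutation(len(images_by_class[c]))
        support += [(images_by_class[c][i], label) for i in order[:n_shots]]
        queries += [(images_by_class[c][i], label) for i in order[n_shots:]]
    rng.shuffle(queries)
    return support, queries[:batch_size]
```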
Our evaluation setup consists of a maximum of 50,000 training episodes in the learning phase, with a validation checkpoint
frequency of once every 500 training episodes. At a validation checkpoint, the model is further evaluated on 250
validation episodes. At the end of training, the model checkpoint with the highest validation accuracy is used for the
inference phase. The final classification accuracy that is used to measure the efficacy of a model is the average testing
accuracy across 1000 testing episodes of a single run. This can be further averaged across multiple runs (typically 10)
pertaining to different initializations of the model, since the model’s convergence towards the global minimum of the
loss function is dependent on the initial parameters.
To prevent overfitting and to gain more meaningful representations of the Omniglot symbols, we augmented the
dataset by shifting and rotating the symbols. Specifically, every time we draw a new support set or query batch from
the dataset during training, we randomly augment each image in the batch. For that we have two parameters s and
r that we draw from a normal distribution with mean µ = 0 and a certain standard deviation for every image, and
shift it by s and rotate it by r. We have found that a shifting standard deviation of σs = 2.5 pixels and a rotation
standard deviation of $\sigma_r = \pi/12$ work well for 32 × 32 pixel images.
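A minimal version of this augmentation for a single 32 × 32 grayscale array, using SciPy; the parameter names follow the text, while the interpolation settings are our assumption.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng, sigma_s=2.5, sigma_r=np.pi / 12):
    """Randomly shift (pixels) and rotate (radians) one Omniglot image."""
    s = rng.normal(0.0, sigma_s, size=2)          # shift drawn per image
    r = rng.normal(0.0, sigma_r)                  # rotation drawn per image
    img = shift(img, s, mode='nearest')
    return rotate(img, np.degrees(r), reshape=False, mode='nearest')
```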
The CNN as a controller for the MANN architecture
For the Omniglot few-shot classification task, we design the embedding function f(x;θ) of our controller as a CNN
inspired by the embedding proposed in Ref. 14. The input is given by grayscale 32 × 32 pixel images, randomly augmented
by shifting and rotating them before being mapped. The embedding function is a non-linear mapping
$f : \mathbb{B}^{32 \times 32} \rightarrow \mathbb{R}^d.$
The CNN bears the following structure:
• two convolutional layers, each with 128 filters of shape 5× 5,
• a max-pooling layer with a 2× 2 filter of stride 2,
• another two convolutional layers, each with 128 filters of shape 3× 3,
• another max-pooling layer with a 2× 2 filter of stride 2,
• a fully connected layer with d units,
where the last layer defines the dimensionality of the feature vectors. Each convolutional layer uses a ReLU activation
function. The output of the last dense layer directly feeds into the key memory during learning. During inference the
output of the last dense layer is subjected to a sign or step activation (depending on the representation being bipolar
or binary) before feeding into the key memory. The Adam optimizer44 is used during training with a learning
rate of $10^{-4}$. For more details of the training procedure, refer to Supplementary Note 1.
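A Keras sketch of this controller is given below; the padding, d = 512, and layer-ordering details are our assumptions, with Ref. 14 describing the embedding that inspired it.

```python
import tensorflow as tf

def make_controller(d=512):
    """CNN embedding f: 32x32 grayscale image -> d-dimensional feature vector."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu',
                               input_shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(128, 5, padding='same', activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(d),   # real-valued keys during learning; sign/step at inference
    ])

controller = make_controller()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```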
Details of attention mechanism for the key-value memory
When a key, generated from the controller, belongs to the few-shot support set, it is stored in the key memory as a
support vector during the learning phase and its corresponding label in the value memory as a one-hot support label.
When the key corresponds to a query, it is compared to all other keys (i.e., the support vectors stored in the key
memory) using a similarity metric. As part of an attention mechanism, the similarities then have to be transformed
into weightings to compute a weighted sum of the vectors in the value memory. The output of the value memory
represents a probability distribution over the available labels. The weighting (i.e., attention) vector is normalized to sum to one
such that the weighted sum of one-hot labels represents a valid probability distribution. For an m-way n-shot problem
with si support samples (i ∈ {1, . . . , mn}) and a query sample x, there is a parameterized embedding function fθ, with
p trainable parameters in the controller, that maps samples to the feature space Rd, where d is the dimensionality
of the feature vectors. Hence, the set of support vectors Ki, which will be stored in the key memory, and the query
vector q are defined as follows:
$$\mathbf{K}_i = f_{\boldsymbol{\theta}}(s_i), \qquad \mathbf{q} = f_{\boldsymbol{\theta}}(x)$$
$$\mathbf{K} \in \mathbb{R}^{mn \times d}, \quad \mathbf{q} \in \mathbb{R}^d, \quad \boldsymbol{\theta} \in \mathbb{R}^p.$$
The attention mechanism is a comparison of vectors followed by sharpening and normalization. Let α be a similarity
metric (e.g., cosine similarity) and ε a sharpening function (e.g., exponential function) with $\alpha : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$, $\epsilon : \mathbb{R} \rightarrow \mathbb{R}$. Then,
$$\sigma(\mathbf{q}, \mathbf{K}_i) = \frac{\epsilon(\alpha(\mathbf{q}, \mathbf{K}_i))}{\sum_{j=1}^{mn} \epsilon(\alpha(\mathbf{q}, \mathbf{K}_j))}, \qquad \sum_{j=1}^{mn} \sigma(\mathbf{q}, \mathbf{K}_j) = 1, \qquad \sigma : \mathbb{R}^d \times \mathbb{R}^{mn \times d} \rightarrow [0, 1]^{mn}$$
is the attention function for a query vector q and key memory K, and its output is the attention vector w = σ(q,K).
Similar support vectors to the query lead to a higher focus at the corresponding index. The normalized attention
vector (i.e., $\sum_{i=1}^{mn} w_i = 1$) is used to read out the value memory. The value memory contains the one-hot labels of the
support samples in the proper order. A relative labelling is used that enumerates the support set. Using the value
memory ($\mathbf{V} \in \mathbb{B}^{mn \times m}$), the output probability distribution and the predicted label are derived as:
$$\mathbf{p} = \mathbf{w} \cdot \mathbf{V}, \qquad l_{\text{predicted}} = \underset{i \in \{1,\ldots,m\}}{\arg\max}\; p_i.$$
Note that the output probability distribution p is the weighted sum of one-hot labels (i.e., the probabilities of individual
shots within a class are summed together). We call this ranking sum-argmax; it results in higher accuracy in the
PCM inference experiments compared to a global-argmax, where there is no summation of the individual probabilities
per class. See Supplementary Note 3 for a comparison between these two ranking criteria.
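Putting the pieces together, a sketch of the key-value read-out with sum-argmax ranking; softabs is as defined in Section II B, and V holds the mn one-hot support labels:

```python
import numpy as np

def predict(q, K, V, sharpen):
    """Attention over the key memory K followed by a weighted sum over the one-hot
    value memory V; summing the shot probabilities per class and taking the argmax
    is the sum-argmax ranking described above."""
    alpha = (K @ q) / (np.linalg.norm(K, axis=1) * np.linalg.norm(q))
    eps = sharpen(alpha)
    w = eps / eps.sum()            # attention vector, components sum to one
    p = w @ V                      # probability over the m relative labels
    return int(np.argmax(p))
```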
PCM model and simulations
For the simulations of our architecture we use TensorFlow. A model implemented with the appropriate API calls
can easily be accelerated on a GPU. We have also made use of the high-level library Keras, which is part of TensorFlow
and enables quick and simple construction of deep neural networks in a plug-and-play like fashion. This was mainly
utilized for the construction of the controller. For modeling the PCM computational memory, the low-level library
API was used, since full control over tensors of various shapes and sizes had to be ensured. In order to model the
most important PCM non-idealities, a simple conductance drift behavior has been assumed:
$$G(t) = G_{t_0} \cdot \left(\frac{t}{t_0}\right)^{-\nu} \qquad (10)$$
where G(t) is the conductance at time t after programming, $G_{t_0}$ is the conductance at the reference time $t_0$, and ν is the drift coefficient. Since we fit these parameters to our measurements, we
can simply choose the reference time $t_0 = 1$ s so that Equation 10 becomes
$$G(t) = G_0 \cdot t^{-\nu}.$$
We then introduce several parameters to model variations (see Supplementary Table 1). The variations are assumed
to be of Gaussian nature. Our final model of the conductance of a single PCM device is the following:
$$G(t) = \mathcal{N}(0, \tilde{G}_r^2) + \left(G_0 \cdot \mathcal{N}(1, \tilde{G}_p^2)\right) \cdot t^{-\nu \cdot \mathcal{N}(1, \tilde{\nu}^2)},$$
with $\mathcal{N}(\mu, \sigma^2)$ being the normal distribution with mean µ and standard deviation σ. Since we model a whole crossbar,
and the time between successive query evaluations within a batch (estimated 1 µs) is negligible compared to the time elapsed
since programming when the first query of the batch is evaluated (estimated 20 s), our simulation setup is simplified
to batch-wise processing of multiple inputs (i.e., queries) to the crossbar, and thus we solve
$$\mathbf{I} = \mathbf{U} \cdot \mathbf{G}^T$$
in one step, where U contains the read-out voltages representing the batch of query vectors, G contains the conductance values of the
PCM array at evaluation time (20 s), and I contains the corresponding current values obtained for each query in the batch.
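A sketch of this noise model applied to a binary key memory and a batch of queries; the parameter values are those quoted in the text and Supplementary Table 1, while the function and variable names are ours.

```python
import numpy as np

def noisy_conductances(G0, t=20.0, nu=0.0715, g_p=0.317, nu_var=0.225,
                       g_r=0.926e-6, rng=None):
    """Apply programming variability, drift (with drift variability) and additive
    read-out noise to the programmed conductances G0, per the model equation above."""
    rng = rng or np.random.default_rng()
    programmed = G0 * rng.normal(1.0, g_p, G0.shape)
    drift = t ** (-nu * rng.normal(1.0, nu_var, G0.shape))
    return rng.normal(0.0, g_r, G0.shape) + programmed * drift

rng = np.random.default_rng(2)
K = rng.integers(0, 2, size=(500, 512))                 # binary key memory (mn x d)
G = noisy_conductances(np.where(K == 1, 22.8e-6, 0.0), rng=rng)
U = 0.3 * rng.integers(0, 2, size=(32, 512))            # batch of binary queries, 300 mV read
I = U @ G.T                                             # one step: I = U . G^T
```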
To derive the PCM model parameters, we SET 10,000 devices and measure their conductance over a time spanning
5 orders of magnitude. The distribution of the devices’ conductance at two time instants is shown in Supplementary
Figure 4(a) and 4(b). The drift leads to a narrower conductance distribution over time, yet the relative standard
deviation increases. Considering the time scales used for the experiments, the PCM model simulations use $\sigma_{\mathrm{rel}} =$
31.7%.
In a second step, we fit a linear curve with offset G0 and steepness −ν in a log-log regime of the measurements
to each device measured (see Supplementary Figure 4(c) for a set of example measurements and their fitted curves).
The mean values of all fitted G0 and ν give us the parameters for the model. They are calculated to be 22.8 µS and
0.0715, respectively. Their relative standard deviations give us the programming variability $\tilde{G}_p$ and the drift variability
$\tilde{\nu}$, which are calculated as 31.7% and 22.5%, respectively. In order to derive the read-out noise $\tilde{G}_r$, we calculate the
deviation of measured conductance from the conductance value obtained from the fit line for each point on the curve
to retrieve the standard deviation. This gives us the standard deviation of the read-out noise as 0.926 µS.
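The per-device fit described here amounts to a linear regression in log-log space; for a single device with measurement times t and conductances g, a sketch is:

```python
import numpy as np

def fit_drift(t, g):
    """Fit log10(G) = log10(G0) - nu * log10(t); returns (G0, nu) for one device."""
    slope, intercept = np.polyfit(np.log10(t), np.log10(g), deg=1)
    return 10.0 ** intercept, -slope
```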
REFERENCES
1Siegelmann, H. & Sontag, E. On the computational power of neural nets. Journal of Computer and System Sciences 50, 132 – 150
(1995).
2Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A. & Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based
neural networks. In Proceedings of International Conference on Learning Representations (ICLR) (2014).
3Graves, A., Wayne, G. & Danihelka, I. Neural turing machines. CoRR abs/1410.5401 (2014). URL http://arxiv.org/abs/1410.5401.
1410.5401.
4Graves, A. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016).
5Weston, J., Chopra, S. & Bordes, A. Memory networks. In Proceedings of International Conference on Learning Representations
(ICLR) (2015).
6Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D. & Lillicrap, T. P. One-shot learning with memory-augmented neural networks.
CoRR abs/1605.06065 (2016). URL http://arxiv.org/abs/1605.06065. 1605.06065.
7Wu, Y., Wayne, G., Graves, A. & Lillicrap, T. The Kanerva machine: A generative distributed memory. In Proceedings of International
Conference on Learning Representations (ICLR) (2018).
8Sukhbaatar, S., Szlam, A., Weston, J. & Fergus, R. End-to-end memory networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama,
M. & Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, 2440–2448 (2015).
9Stevens, J. R., Ranjan, A., Das, D., Kaul, B. & Raghunathan, A. Manna: An accelerator for memory-augmented neural networks. In
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’52, 794–806 (2019).
10Ranjan, A. et al. X-mann: A crossbar based architecture for memory augmented neural networks. In Proceedings of the 56th Annual
Design Automation Conference 2019, DAC ’19, 130:1–130:6 (2019).
11Ni, K. et al. Ferroelectric ternary content-addressable memory for one-shot learning. Nature Electronics 2, 521–529 (2019).
12Liao, Y. et al. Parasitic resistance effect analysis in RRAM-based TCAM for memory augmented neural networks. In 2020 IEEE International
Memory Workshop (IMW), 1–4 (2020).
13Laguna, A. F., Yin, X., Reis, D., Niemier, M. & Hu, X. S. Ferroelectric FET based in-memory computing for few-shot learning. In
Proceedings of the 2019 on Great Lakes Symposium on VLSI, GLSVLSI ’19, 373–378 (2019).
14Laguna, A. F., Niemier, M. & Hu, X. S. Design of hardware-friendly memory enhanced neural networks. In 2019 Design, Automation
Test in Europe Conference Exhibition (DATE) (2019).
15Rahimi, A., Ghofrani, A., Cheng, K., Benini, L. & Gupta, R. K. Approximate associative memristive memory for energy-efficient gpus.
In 2015 Design, Automation Test in Europe Conference Exhibition (DATE), 1497–1502 (2015).
16Wu, T. F. et al. Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: Hyperdimensional computing case study.
In 2018 IEEE International Solid - State Circuits Conference - (ISSCC), 492–494 (2018).
17Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R. & Eleftheriou, E. Memory devices and applications for in-memory computing.
Nature Nanotechnology 15, 529–544 (2020).
18Kanerva, P. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random
vectors. Cognitive Computation 1, 139–159 (2009).
19Gayler, R. W. Vector symbolic architectures answer Jackendoff’s challenges for cognitive neuroscience. In Proceedings of the Joint
International Conference on Cognitive Science. ICCS/ASCS, 133–138 (2003).
20Kanerva, P. Sparse Distributed Memory (MIT Press, Cambridge, MA, USA, 1988).
21Rahimi, A. et al. High-dimensional computing as a nanoscalable paradigm. IEEE Transactions on Circuits and Systems I: Regular
Papers 64, 2508–2521 (2017).
22Karunaratne, G. et al. In-memory hyperdimensional computing. Nature Electronics 3, 327–337 (2020).
23Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science
350, 1332–1338 (2015).
24Plate, T. A. Holographic reduced representations. IEEE Transactions on Neural Networks 6, 623–641 (1995).
25Gayler, R. W. Multiplicative binding, representation operators & analogy. Advances in analogy research: Integration of theory and data
from the cognitive, computational, and neural sciences 1–4 (1998).
26Kanerva, P. Binary spatter-coding of ordered k-tuples. In von der Malsburg, C., von Seelen, W., Vorbrüggen, J. C. & Sendhoff, B.
(eds.) Artificial Neural Networks — ICANN 96, 869–873 (1996).
27Anderson, A. G. & Berg, C. P. The high-dimensional geometry of binary neural networks. In Proceedings of International Conference
on Learning Representations (ICLR) (2018).
28Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th
International Conference on Machine Learning, ICML’17, 1126–1135 (2017).
29Vinyals, O., Blundell, C., Lillicrap, T., kavukcuoglu, k. & Wierstra, D. Matching networks for one shot learning. In Lee, D. D., Sugiyama,
M., Luxburg, U. V., Guyon, I. & Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, 3630–3638 (2016).
30Li, A., Luo, T., Xiang, T., Huang, W. & Wang, L. Few-shot learning with global class representations. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV) (2019).
31Sung, F. et al. Learning to compare: Relation network for few-shot learning. In 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 1199–1208 (2018).
32Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, NIPS’17, 4080–4090 (2017).
33Liu, Q. et al. A fully integrated analog ReRAM based 78.4 TOPS/W compute-in-memory chip with fully parallel MAC computing. In
Proc. of International Solid-State Circuits Conference (ISSCC), 500–502 (IEEE, 2020).
34Verma, N. et al. In-memory computing: Advances and prospects. IEEE Solid-State Circuits Magazine 11, 43–55 (2019).
35Al Bahou, A., Karunaratne, G., Andri, R., Cavigelli, L. & Benini, L. XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional
neural networks. In 2018 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS), 1–3 (2018).
36Joshi, V. et al. Accurate deep neural network inference using computational phase-change memory. Nature Communications 11, 1–13
(2020).
37Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature 577, 641–646 (2020).
38Mitrokhin, A., Sutor, P., Summers-Stay, D., Fermüller, C. & Aloimonos, Y. Symbolic representation and learning with hyperdimensional
computing. Frontiers in Robotics and AI 7, 63 (2020).
39Wu, Z., Xiong, Y., Stella, X. Y. & Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (2018).
40Caron, M. et al. Unsupervised learning of visual features by contrasting cluster assignments. Preprint at http://arxiv.org/abs/2006.09882 (2020).
41Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. Preprint at http://arxiv.org/abs/1906.05849 (2019).
42Breitwisch, M. et al. Novel lithography-independent pore phase change memory. In Proceedings of the Symposium on VLSI Technology,
100–101 (2007).
43Abadi, M. et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16), 265–283 (2016).
44Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning
Representations (ICLR) (2015).
45Kingma, D. & Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (2014).
46Prechelt, L. Early Stopping — But When?, 53–67 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012).
ACKNOWLEDGEMENTS
This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation programme (grant agreement number 682675).
[Figure 1 schematic: during the learning phase, the support image sets and query image sets pass through the controller to generate high-dimensional support and query vectors; the real-valued key memory and the value memory (support labels) feed a cosine similarity, sharpening, normalization, and weighted-sum chain that produces prediction probabilities, which are compared against the ground truth for backpropagation; during inference, an activation stage and a bipolar/binary key memory replace their real-valued counterparts.]
FIG. 1. Proposed robust HD MANN architecture. The learning phase of the proposed MANN involves a controller
which first propagates images in the support set to generate the HD support vector representations that are stored in the
key memory. The corresponding support labels are stored in the value memory. For the evaluation, the controller propagates
the query images to produce the HD vectors for the query. A cosine similarity module then compares the query vector
with each of the support vectors stored in the key memory. Subsequently, the resulting similarity scores are subject to a
sharpening function, normalization, and weighted sum operations to produce prediction probabilities with the value memory.
The prediction probabilities are compared against the ground truth labels to generate an error which is backpropagated through
the network to update the weights of the controller (see red arrows). This episodic training process is repeated across batches
of support and query images from different problem sets until the controller reaches maturity. In the inference phase, we use
a hardware-friendly version of our architecture by simplifying vector representations, similarity, normalization, and sharpening
functions. The mature controller is employed along with an activation function that readily clips the real-valued vectors to
obtain bipolar/binary vectors at the output of controller. The modified bipolar or binary support vectors are stored in the
key memristive crossbar array (i.e., bipolar/binary key memory). Similarly, when the query image is fed through the mature
controller, its HD bipolar or binary representation, as a query vector, is used to obtain similarity scores against the stored
support vectors in the memristive crossbar array. The key memristive crossbar approximates cosine similarities between a query
and all the support vectors with the constant-scaled dot products in O(1) by employing in-memory computing. The results are
weighted and summed by the support labels (in the value memory) after an approximate sharpening step and the maximum
response index is output as the prediction.
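As an illustration of the read-out just described, the following minimal NumPy sketch traces a single query through cosine similarity, sharpening, normalization, and the weighted sum over the value memory; function and variable names are illustrative, and the plain absolute value stands in for the softabs sharpening function defined in the main text.

    import numpy as np

    def attention_readout(query, K, V, sharpen=np.abs):
        """Soft read from the key-value memory (learning-phase pipeline of Fig. 1).
        query: (d,) real-valued HD query vector from the controller.
        K:     (mn, d) real-valued support vectors (key memory).
        V:     (mn, m) one-hot support labels (value memory)."""
        # content-based attention: cosine similarity against every key entry
        sims = K @ query / (np.linalg.norm(K, axis=1) * np.linalg.norm(query) + 1e-12)
        # sharpening followed by normalization yields the attention weights
        w = sharpen(sims)
        w = w / w.sum()
        # weighted sum over the value memory gives the prediction probabilities
        return w @ V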
[Figure 2 plots: (a, b) pairwise cosine similarity matrices over the class index (way); (c, d) intra-class and inter-class cosine similarity (10th-90th percentile ranges) versus sorted testing episode for the softmax and softabs sharpening functions; (e) classification accuracy (%) box plots for the 5-way 1-shot, 20-way 5-shot, and 100-way 5-shot problems with the softmax and softabs sharpening functions (average accuracies annotated: 92.3, 88.4, 74.3 and 97.3, 98.0, 93.9).]
FIG. 2. The role of sharpening functions. The pairwise cosine similarity matrix from the support set of a single testing
episode of the 20-way 5-shot problem learned using the softmax (a) and the softabs (b) as the sharpening function. Intra-class
and inter-class cosine similarity spread across 1000 testing episodes of the 20-way 5-shot problem with the softmax (c) and the
softabs (d) as the sharpening function (the episodes are sorted by the intra-class to inter-class cosine similarity ratio, from highest
to lowest). In the case of the softmax sharpening function, the margin between the 10th percentile of the intra-class similarity and the
90th percentile of the inter-class similarity is reduced, and sometimes even becomes negative due to overlapping distributions. In
contrast, the softabs function leads to a relatively larger margin separation (1.75×, on average) without causing any overlap.
The average margin for the softabs is 0.1371, compared with 0.0781 for the softmax. (e) Classification accuracy in the form of
a box plot from 1000 few-shot episodes, where each episode consists of a batch of 32 queries. The softabs sharpening function
achieves better overall accuracy and less variation across episodes for all few-shot problems. The average accuracy is depicted
in each case.
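For reference, the margin reported in panels (c) and (d) can be computed from a support-set similarity matrix as in the following sketch (an illustrative helper, not part of the released code):

    import numpy as np

    def episode_margin(S, labels):
        """S: (mn, mn) pairwise cosine similarity matrix of the support set;
        labels: (mn,) class index of each support vector.
        Returns the margin between the 10th percentile of intra-class and the
        90th percentile of inter-class similarities (off-diagonal entries only)."""
        same = labels[:, None] == labels[None, :]
        off_diag = ~np.eye(len(labels), dtype=bool)
        intra = S[same & off_diag]
        inter = S[~same]
        return np.percentile(intra, 10) - np.percentile(inter, 90)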
[Figure 3 schematic: support images, support labels, and the query image enter the mature controller (feature-map sizes 1x32x32, 128x16x16, 128x8x8, 128x4x4, 128x2x2, 512x1), producing 512-bit vectors; the binary key memory is a memristive crossbar with m×n columns (K1, K2, K3, K4, ..., Km×n) driven by 512 wordline drivers and read out by an 8-bit ADC array (crossbar hardware), while the value memory with m×n elements and the max comparator are implemented in software.]
FIG. 3. The MANN architecture with the binary key memory using analog in-memory computations. The
architecture is simplified for efficient few-shot inference: 1) The transformed HD support vectors are stored in a memristive
crossbar array as the binary key memory; the query vectors are binarized too. 2) The cosine similarity (α) between the input
query vectors and the support vectors is computed through in-memory dot products in the crossbar using Equation (8). 3) To
further simplify the inference pipeline, the normalization of the attention vectors and the regular absolute sharpening function
are bypassed. The accumulation of similarity responses belonging to the same support label in the value memory and finding
the class with maximum accumulated response are implemented in software. The binary query/support vectors have 512
dimensions. m and n stand for ‘way’ and ‘shot’ of the illustrated problem respectively. A similar architecture with the bipolar
key memory is shown in Supplementary Figure 1.
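A compact software model of this simplified inference path is sketched below; it assumes the constant-scaled dot product of Supplementary Note 6 as the in-memory similarity and the sum-argmax selection of Supplementary Note 3, with illustrative names and array shapes.

    import numpy as np

    def binary_inference(q, K_bin, labels, m):
        """q: (d,) binary {0,1} query vector; K_bin: (mn, d) binary support vectors
        stored in the crossbar; labels: (mn,) class index of each support vector."""
        d = q.shape[0]
        sims = (2.0 / d) * (K_bin @ q)       # in-memory dot products, constant-scaled
        per_class = np.zeros(m)
        np.add.at(per_class, labels, sims)   # accumulate responses per support label
        return int(np.argmax(per_class))     # class with the maximum accumulated response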
[Figure 4 plots: (a) classification accuracy (%) of the real, bipolar, and binary representations on the 5-way 1-shot, 20-way 5-shot, and 100-way 5-shot problems; (b) classification accuracy (%) of the bipolar and binary architectures for the ideal crossbar, the PCM model, and the PCM experiments on the same three problems; (c, d) classification accuracy (%) versus conductance variation (%) in the PCM model for the bipolar and binary architectures on the 5-way 1-shot and 100-way 5-shot problems.]
FIG. 4. Experiments on Omniglot classification. (a) Average software classification accuracy with the real, bipolar, and
binary vector representations on three problems, each using the approximate sharpening function (i.e., the regular absolute)
and the precise similarity function (i.e., the cosine), over 10 test runs each containing 1000 few-shot episodes (effectively
10,000 episodes); these capture the net effect of changing vector representations in software. (b) Classification accuracy results
with the hardware-friendly inference architecture on an ideal crossbar without any PCM variations, a crossbar simulated with
the PCM model (see Methods), and the actual experiments with the PCM devices (see Methods). The ideal and the PCM
model simulations are conducted over 10 test runs each containing 1000 few-shot episodes (effectively 10,000 episodes). The
experiments were conducted over one test run containing 1000 episodes. Classification accuracy as a function of the percentage
of device conductance variation in the PCM model with the bipolar and binary architectures for the 5-way 1-shot problem (c), and
the 100-way 5-shot problem (d).
SUPPLEMENTARY FIGURES
Supplementary Figure 1: Architecture to implement bipolar key memory using PCM crossbar arrays
[Supplementary Figure 1 schematic: the bipolar key memory stores each 512-bit support vector and its complement in paired crossbar columns (K1, K2, ..., Km×n, shown twice), driven by 512 wordline drivers; bitline currents pass through abs units and an 8-bit ADC array followed by a max comparator (crossbar hardware), while the value memory with m×n elements and the support labels are handled in software.]
Supplementary Figure 1. The MANN architecture with the bipolar key memory using analog in-memory computations. This
architecture is different from the one presented in Fig. 3 in the main manuscript for the binary representations in the following
ways: First, the activation function used at the output of the embedding function in the controller is changed to a sign function,
generating bipolar query and support vectors. Second, the crossbar utilizes twice the number of columns compared to the binary
architecture to store the complementary versions of the support vectors on the crossbar. This effectively doubles the number
of memristive devices. Third, a regular abs function approximates the softabs sharpening function during inference. Fourth,
there are some changes to the peripheral circuits in the way the original support/query vectors are fed from the controller: the
complementary version is fed to the wordline drivers in a time-multiplexed manner. Furthermore, the resulting current
on the bitline from the original support vectors is saved in an array of capacitors and subtracted from the current measured
on the corresponding complementary bitline before sending the net current to the ADC array. The blue columns represent the
original support vectors, whereas the red columns indicate the complementary versions.
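The complementary-column scheme rests on the identity that a bipolar dot product equals the number of matching minus the number of mismatching positions, each of which is a binary dot product. The following sketch illustrates this identity only; it abstracts away the time multiplexing and the capacitor-based current subtraction described above.

    import numpy as np

    def bipolar_dot_from_binary(a_bip, b_bip):
        """a_bip, b_bip: bipolar {-1,+1} vectors. Returns a_bip . b_bip computed
        only from dot products of binary {0,1} vectors and their complements,
        i.e. (#matching positions) - (#mismatching positions)."""
        a, b = (a_bip + 1) // 2, (b_bip + 1) // 2   # binary originals
        a_c, b_c = 1 - a, 1 - b                     # binary complements
        matches = a @ b + a_c @ b_c
        mismatches = a @ b_c + a_c @ b
        return matches - mismatches

    # sanity check against the direct bipolar dot product
    rng = np.random.default_rng(0)
    x = rng.choice([-1, 1], size=512)
    y = rng.choice([-1, 1], size=512)
    assert bipolar_dot_from_binary(x, y) == x @ y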
Supplementary Figure 2: Robustness of bipolar versus binary architectures
[Supplementary Figure 2 plots: classification accuracy (%) versus conductance variation (%) for (a) the 5-way 1-shot problem and (b) the 100-way 5-shot problem, comparing the bipolar architecture (1024 devices, dimensionality 512) against binary architectures with 512 devices/dimensionality 512 and 1024 devices/dimensionality 1024.]
Supplementary Figure 2. Classification accuracy when using the PCM model as a function of device conductance variations
for the 5-way 1-shot (a) and the 100-way 5-shot problems (b). The bipolar architecture (with dimensionality d = 512 and
hence 1024 devices) is compared against the binary architecture with i) the same dimensionality (d = 512), and ii) the same
number of devices (d = 1024). At zero or low conductance variations, the binary architecture with 1024 dimensions outperforms
the same architecture with 512 dimensions. This indicates a loss of information when the representation dimensionality is lowered
from 1024 to 512. This loss is orthogonal to the loss incurred by non-idealities because the conductance variation is on the lower
side. The bipolar architecture, on the other hand, achieves a relatively higher accuracy at the lower dimensionality of 512 and
retains it even when extreme conductance variations are present, compared to both binary architectures with the same
dimensionality or the same number of devices. This implies that the bipolar architecture is more robust against the non-idealities of
in-memory computing for the same power and area constraint. The accuracy distributions were obtained from 8 test runs each
containing 1000 episodes.
Supplementary Figure 3: Robustness of similarity measurement for different vector dimensionalities
[Supplementary Figure 3 plots: theoretic deviation of the computed cosine similarity for several dimensionalities d, shown in four panels for σrel = 0.05, 0.2, 0.317, and 0.5.]
Supplementary Figure 3. Theoretic deviations in the computed cosine similarity for different vector dimensionalities d and
different SET state variabilities σrel. The case σrel = 0.317 represents the variability observed on our PCM crossbar array, as
shown in Supplementary Note 5. In this case, the uncorrelated binary vectors (i.e., α = 0.5) exhibit merely ≈ 0.015 standard
deviation in the measured cosine similarity when using 512-bit vectors.
Supplementary Figure 4: PCM Measurements and Model Fitting
[Supplementary Figure 4 plots: (a, b) conductance distributions [S] of the devices at t = 11.38 s and t = 99997.39 s; (c) example conductance [S] versus time [s] curve fittings for four sample devices.]
Supplementary Figure 4. Conductance distribution of 10,000 devices in the SET state at the beginning of the experiment (σrel = 31.7%) (a) and
at a timescale three orders of magnitude later (σrel = 36.6%) (b). SET state measurements from 4 example
devices, with their fitted curves (c).
SUPPLEMENTARY TABLES
Supplementary Table 1: PCM model parameters
Supplementary Table I. Derived values of the model parameters.
Symbol   Description                        Type             Value
G0       mean conductance at time t = 1 s   -                22.8 × 10^-6 S
ν        mean drift exponent                -                0.0598
G̃p       programming variability            multiplicative   31.7 %
G̃r       read-out noise                     additive         0.496 × 10^-6 S
ν̃        drift variability                  multiplicative   9.07 %
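For illustration, one plausible way to sample device conductances from these parameters is sketched below, assuming a power-law drift G(t) = Gp (t/t0)^(-ν) with t0 = 1 s, multiplicative programming and drift variability, and additive read-out noise; the exact model used in the Methods may differ in detail.

    import numpy as np

    def sample_pcm_conductance(t, n_devices, rng=None):
        """Sample SET-state conductances (in S) of n_devices at read time t (in s),
        using the parameter values of Supplementary Table I (assumed drift law)."""
        rng = np.random.default_rng() if rng is None else rng
        G0, nu = 22.8e-6, 0.0598
        Gp = G0 * (1.0 + 0.317 * rng.standard_normal(n_devices))     # programming variability (31.7 %)
        nu_i = nu * (1.0 + 0.0907 * rng.standard_normal(n_devices))  # drift variability (9.07 %)
        G = Gp * t ** (-nu_i)                                        # conductance drift
        G += 0.496e-6 * rng.standard_normal(n_devices)               # additive read-out noise
        return np.clip(G, 0.0, None)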
SUPPLEMENTARY NOTES
Supplementary Note 1: Training and Inference for Key-Value Memory Network
In the following, we describe in detail two main phases, the learning phase and the inference phase, for our proposed
MANN architecture. For a summary, refer to Supplementary Table II that shows the steps of each phase.
1. Learning Phase
The learning phase starts by randomly initializing the trainable parameters θ of the embedding function fθ of
the controller. Randomness is important for the feature vectors to adopt certain important properties of high-
dimensional computing. For example, the number of positive and negative components of the feature vectors should
be approximately equal. The learning phase includes the following steps.
a. Support set loading step The step after initializing the parameters is the very first training step. For that, we
(randomly) draw a support set from the training dataset, which is then mapped to the feature space via the controller
and stored in the key memory. More specifically, this step generates support vectors from the forward pass through
the controller, and writes them in the key memory. Optionally, the dataset training samples can be augmented by
shifting and rotating the symbols to improve the learned representations, as we described in the Methods. Each
class in the support set gets assigned a unique one-hot label and for each support vector in the key memory, the
corresponding one-hot support label is stored in the value memory (see Supplementary Figure 5(a)). After this step
both key and value memories have been written and will remain fixed until the next training episode is presented (see
Supplementary Figure 5(b)). Henceforth, one can query arbitrarily often without altering the architecture’s state at
all. It should be emphasized that, as we have just initialized the parameters, predictions will be random, because the
controller is still immature.
b. Query evaluation step During one episode of the learning phase, a whole batch of query samples is processed
together and later produces a single loss value. There is a maximum size for the query batch which is dependent on
the number of available samples per class in the training dataset. As the query samples stem from the same classes
as the samples in the support set, problems with a higher number of shots leave fewer samples for the query batch.
Then, the query batch is mapped to the feature space in the same way as the support set and compared against the stored support vectors in the key-value memory. This yields a batch of
probability distributions over the potential labels, as shown in Supplementary Figure 5(c).
c. Backpropagation step This step has to be supervised, i.e., the labels of the query batch need to be available.
From the ground truth one-hot labels Y and the output of the previous step P, the logarithmic loss λi is computed for
every query i ∈ {1, ..., b}. It is important to employ the logarithmic loss including the additional second term in
Supplementary Equation (1), rather than the plain cross-entropy loss, so that vectors from different classes are pushed further apart.
Supplementary Table II. Summary of the learning and inference phases.
                                        State Before                       State After
                                        Controller       Key-value Memory  Controller    Key-value Memory
Learning Phase:
  Support set loading step (a)          Immature         Empty/Obsolete    Unchanged     Rewritten
  Query evaluation step (b)             Immature         Written           Unchanged     Unchanged
  Backpropagation step (c)              Immature         Written           More Mature   Unchanged
  Episodic training by repeating (d)    Slightly Mature  Written           Mature        Repeatedly Rewritten
Inference Phase:
  Support set loading step (e)          Mature           Empty/Obsolete    Unchanged     Rewritten
  Query evaluation step (f)             Mature           Written           Unchanged     Unchanged

(a) Support set from the training dataset to fill the key-value memory
(b) Query batch from the training dataset to evaluate predictions
(c) Loss computed based on classification errors in the query phase, followed by backpropagation
(d) The three above steps are repeated by randomly redrawing support sets and query batches from the training dataset
(e) Support set from the test dataset to fill the key-value memory
(f) Query batch from the test dataset
The average loss (Supplementary Equation (2)) represents the objective function that has to be minimized
using an appropriate optimization algorithm; e.g., a variant of stochastic gradient descent such as "Adam"45
works well. Notice that only the controller is affected in this step: errors are backpropagated through all modules
using the chain rule, while the memories remain fixed, as shown in Supplementary Figure 5(d). Hence, the controller
can learn from its own mistakes to progressively identify and distinguish different classes in general, thereby realizing
meta-learning.
\lambda_i = -\sum_{j=1}^{m} \big( Y_{i,j} \log(P_{i,j}) + (1 - Y_{i,j}) \log(1 - P_{i,j}) \big) \qquad (1)

\mathrm{loss} = \frac{1}{m} \sum_{i=1}^{b} \lambda_i \qquad (2)

Y \in \mathbb{R}^{b \times m}, \quad P \in \mathbb{R}^{b \times m}, \quad \lambda \in \mathbb{R}^{b}
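For concreteness, a short NumPy sketch of Supplementary Equations (1) and (2); variable names are illustrative, and the aggregation follows the 1/m scaling as written above.

    import numpy as np

    def episode_loss(Y, P, eps=1e-12):
        """Y, P: (b, m) one-hot ground-truth labels and predicted probabilities.
        Per-query logarithmic loss (Eq. (1)) aggregated as in Eq. (2)."""
        lam = -np.sum(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps), axis=1)
        m = Y.shape[1]
        return lam.sum() / m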
d. Episodic training by repeating the above steps The three aforementioned steps form one training episode.
Several hundred or even thousands of such training episodes should be conducted so that the model can perform well
and provide meaningful predictions. Each episode is administered on a different (random) subset of the training
classes, which prevents the model from overfitting. In the process, the parameters θ of the embedding function fθ are
updated such that the objective function in Supplementary Equation (2) is minimized. This procedure is called maturing
the controller.
A mature controller is an optimally fitted embedding function, right on the verge of under- and overfitting.
To avoid any overfitting, an early stopping46 strategy can be applied. It relies on a subset of the training dataset kept
aside as a validation set. As we are operating in the realm of few-shot learning, "a subset of the training set" implies
non-overlapping classes. The validation set should be chosen large enough to properly represent the data, but not too
large, as those samples are excluded from training.
During the training procedure, the model's performance is frequently evaluated on the validation set, without
computing the loss or updating the controller's parameters. The performance can be measured with an accuracy
metric that is computed per few-shot problem and states the fraction of correctly classified queries in a batch of size b:
\mathrm{accuracy} = \frac{1}{b} \sum_{i=1}^{b} \Big[ 1 \text{ if } \operatorname*{arg\,max}_{j \in \{1,\dots,m\}} P_{i,j} = l_i \text{ else } 0 \Big]
A moderate number of queries b should be presented per problem and a rather large number of problems drawn from
the validation set in order to average out fluctuations in the problem difficulty during evaluation. The state of the
model yielding the best performance represents the mature controller.
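A corresponding sketch of the per-batch accuracy that early stopping monitors (again with illustrative names):

    import numpy as np

    def episode_accuracy(P, l):
        """P: (b, m) predicted probabilities; l: (b,) ground-truth class indices.
        Fraction of correctly classified queries in the batch."""
        return float(np.mean(np.argmax(P, axis=1) == l))

    # early stopping then selects the controller state with the best validation accuracy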
2. Inference Phase
The outcome of the learning phase is the mature controller that is ready to learn and classify images from never-
seen-before classes. During the inference phase, there is no update of the parameters of the mature controller (i.e.,
they are frozen), but the key-value memory will be updated by the controller upon encountering a new few-shot
problem. The inference phase has two main steps similar to the learning phase: the support set loading step, and the
query evaluation step. The first step generates support vectors from the forward pass through the mature controller,
and writes them into the key memory followed by their labels into the value memory. This essentially leads to learning
prototype vectors for the classes that are never exposed in the learning phase. By the end of this loading step, the
key-value memory is programmed for the few-shot classification problem at hand. Then the query evaluation step
similarly generates query vectors at the output of the controller that will be compared to the stored support vectors
generating prediction labels. In a nutshell, if the backpropagation step in Supplementary Figure 5(d) is skipped
and the support set and query batches are sampled from the test split instead of the train split, the sequence in
Supplementary Figure 5 becomes similar to the inference phase (see also Supplementary Table II).
[Supplementary Figure 5 schematic, four panels: (a) support set loading step (support samples embedded by the immature controller and written into the key memory, support labels written into the value memory); (b) state after the support set loading step (both memories written); (c) query evaluation step (query samples embedded and compared against the written memories to produce the predicted distribution); (d) backpropagation step (loss between prediction and ground truth backpropagated into the controller while the memories remain unchanged).]
Supplementary Figure 5. Illustration of the learning phase with its steps.
Supplementary Note 2: The CNN Controller in Conformity with Dense High-dimensional Representations
We evaluate our training methodology to verify whether the trained CNN controller obeys the "laws" of high-
dimensional computing. Specifically, the pseudo-randomness of the dense representations is crucial for our findings to
be valid. In the theory of high-dimensional computing18, the fraction of 1s among the components of the bipolar
vectors (respectively, of the binary vectors), termed the occupancy ratio, closely approximates 1/2. This
holds when the components of a vector are drawn randomly from {−1, 1} (respectively {0, 1}) with equal probability.
In order to verify whether our controller conforms with this property, we calculate the occupancy ratio for the
embeddings of multiple Omniglot samples for different dimensionalities. Supplementary Figure 6 shows the mean
and standard deviation of the occupancy ratio at different dimensionalities together with their target values according
to the high-dimensional computing theory. While the mean should be constant at µ = 0.5, the standard deviation
depends on the dimensionality d; in fact, σ = 1/(2√d). As shown, with increasing dimensionality (d ≥ 512), the
controller generates vectors that closely follow the equiprobable property of vectors in the high-dimensional computing
theory. We particularly select d = 512 as it also provides sufficient resiliency against the variations in the PCM chip,
as shown in Supplementary Figure 3, and leads to accuracy comparable with the real-valued vector representations,
as shown in Supplementary Figure 7.
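The occupancy-ratio statistics of Supplementary Figure 6 can be gathered as in the following sketch, which also shows the theoretical reference 1/(2√d) for random equiprobable vectors (helper names are illustrative):

    import numpy as np

    def occupancy_stats(K_bip):
        """K_bip: (num_vectors, d) bipolar {-1,+1} embeddings.
        Returns the mean and standard deviation of the occupancy ratio
        (fraction of +1 components per vector)."""
        ratios = (K_bip > 0).mean(axis=1)
        return ratios.mean(), ratios.std()

    # theory: random equiprobable components give mean 0.5 and std 1/(2*sqrt(d))
    d = 512
    rng = np.random.default_rng(0)
    random_vectors = rng.choice([-1, 1], size=(10000, d))
    mean_r, std_r = occupancy_stats(random_vectors)
    print(mean_r, std_r, 1.0 / (2.0 * np.sqrt(d)))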
To show that the distribution is as desired and approximately Gaussian, the distribution of the occupancy ratios
for embeddings at d = 512 is shown in Supplementary Figure 8. We observe that the vector representations generated
by the controller are in conformity with the high-dimensional computing theory, but the means of the occupancy ratios
are slightly off. There is a 2.08% standard deviation in the occupancy ratio of the controller at d = 512. This deviation
causes a 1.02% accuracy drop (93.97% vs. 92.95%) for the 100-way 5-shot problem in the case of the binary representation
when the similarity metric is approximated by the dot product, as shown in Supplementary Figure 12. We therefore
seek an algorithmic solution to reduce the deviation of the occupancy ratio from the desired value in the following.
Introducing a regularizer to optimize the occupancy ratio
In order to penalize a controller for generating output vectors with an occupancy ratio that deviates from the
desired value, a regularizing term is introduced into the loss function:
L_{oc} = -\frac{1}{mn} \sum_{i=1}^{mn} \left( \frac{1}{d} \sum_{j=1}^{d} \left( \frac{1}{2}\tanh(a K_i(j)) + \frac{1}{2} \right) - 0.5 \right)^{2} \qquad (3)
where the function (1/2) tanh(ax) + 1/2, also known as softstep, is a differentiable smoothed version of the step function.
This loss is minimized when the occupancy ratio is 0.5, in other words, when the number of positive vector components
(Ki(j) > 0) equals the number of negative vector components (Ki(j) < 0) for the i-th support vector, hence
leading to the desired occupancy ratio. However, there is another condition that can alternatively minimize the loss
function by driving the support vector components towards the origin (Ki(j) = 0).
[Supplementary Figure 6 plots: (a) mean and (b) standard deviation of the occupancy ratio versus dimensionality (10^1 to 10^3), comparing the statistics from the controller with the HD theory target.]
Supplementary Figure 6. Occupancy ratio of the embeddings generated from the CNN controller at different dimensionalities
versus the high-dimensional (HD) theory target: Mean (left) and standard deviation (right) of the occupancy ratio.
[Supplementary Figure 7 plot: classification accuracy (%) versus dimensionality d (10^1 to 10^3) for the real, bipolar, and binary representations.]
Supplementary Figure 7. Classification accuracy of the 100-way 5-shot problem as a function of dimensionality for software models
with three different representations: real, bipolar, and binary. The results are averaged over 3 independent runs (3000 episodes).
[Supplementary Figure 8 plot: probability density of the occupancy ratio of the embeddings for d = 1000 and d = 512.]
Supplementary Figure 8. Distribution of the occupancy ratios of embeddings for two different dimensionalities, normalized so
that they form a probability density function (PDF). Their target PDFs are shown as solid lines.
It was observed that the controller may reach this undesired condition, which sets the values of the support vector
components near 0, because the softstep function reaches the same target occupancy ratio of 0.5 when all support
vector components approach 0.
L_{aux} = -\frac{1}{mnd} \sum_{i=1}^{mn} \sum_{j=1}^{d} \left( \frac{1}{2}\big(\tanh(a(K_i(j) + \delta)) + 1\big) - \frac{1}{2}\big(\tanh(a(K_i(j) - \delta)) + 1\big) \right) \qquad (4)
where the parameters a and δ in Supplementary Equations (3) and (4) are chosen as 100 and 0.0001,
respectively, based on the distribution of the real-valued support vector elements Ki(j). The loss term Loc is assigned a
weight of 10, and the loss term Laux is assigned a weight of 0.1, to keep these losses in a range comparable
to the original logarithmic loss given in Supplementary Equation (1).
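A minimal sketch of how such regularizing terms can be computed is given below; it treats the occupancy term as a squared-deviation penalty (i.e., without the leading sign of Supplementary Equation (3)) and uses the weights stated above, so it illustrates the idea rather than reproducing the exact training implementation.

    import numpy as np

    def softstep(x, a=100.0):
        # differentiable smoothed step function (1/2) tanh(a x) + 1/2
        return 0.5 * np.tanh(a * x) + 0.5

    def occupancy_regularizers(K, a=100.0, delta=1e-4):
        """K: (mn, d) real-valued support vectors from the controller.
        Returns the occupancy term (squared deviation of the soft occupancy
        ratio from 0.5, averaged over support vectors) and the auxiliary term
        that rewards components away from the origin."""
        occ = softstep(K, a).mean(axis=1)                    # soft occupancy ratio per vector
        l_oc = np.mean((occ - 0.5) ** 2)                     # penalty form of Eq. (3)
        l_aux = -np.mean(softstep(K + delta, a) - softstep(K - delta, a))   # Eq. (4)
        return l_oc, l_aux

    # total training loss: base log loss + 10 * L_oc + 0.1 * L_aux (weights from the text)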
After introducing the above regularizing terms, the output of the controller conforms more closely with the behavior
demanded by the high-dimensional computing theory. For example, the standard deviation of the occupancy ratio
from the target 0.5 dropped from 2.08% to 0.91% for d = 512. That is effectively equivalent to the standard deviation
of pseudo-randomly generated vectors with d ≈ 3000. As a result, the accuracy consistently increases across all
three problems, by up to 0.74%, when using the regularizer, as shown in Supplementary Table III. We have shown that
the binary architecture using the regularizer and the dot product can almost reach the accuracy obtained from
the cosine similarity without using the regularizer (at most 0.28% lower accuracy); see Supplementary Table III.
Supplementary Figure 9 also compares the resulting occupancy ratio distributions with and without the regularizer.
[Supplementary Figure 9 plot: probability density of the occupancy ratio of the embeddings for d = 512, with and without the regularizer.]
Supplementary Figure 9. Distribution of the occupancy ratios of embeddings for d = 512 with and without the regularizer,
normalized so that they form a probability density function (PDF). The expected PDFs with and without the regularizer are
shown as lines.
Supplementary Table III. Comparison of average classification accuracy (%) from 10,000 episodes (10 runs) with and without
using the regularizer.

                     without regularizer                         with regularizer
Problem              cosine similarity   dot product similarity  dot product similarity
5-way 1-shot         97.44               96.92                   97.26
20-way 5-shot        97.79               97.38                   97.92
100-way 5-shot       93.97               92.95                   93.69
Supplementary Note 3: Comparison of classification accuracy between the sum-argmax and the global argmax ranking criteria
[Supplementary Figure 10 plot: classification accuracy drop (%) from the ideal crossbar to the PCM experiment for the sum-argmax and the global argmax across the different problems.]
Supplementary Figure 10. Comparison of the classification accuracy drop from the ideal crossbar to the PCM experiment for the
sum-argmax versus the global argmax across different problems, each evaluated on 1000 testing episodes of a single run. In all problems except
the bipolar 20-way 5-shot problem, the sum-argmax results in a lower accuracy drop, indicating its robustness as the ranking criterion.
We investigate two different selection criteria to choose the final predicted class from the produced attention vector
during inference. We first consider the typical argmax of the components across the whole attention vector (w). We
call this the global argmax; it returns the label of the support vector, among all mn support vectors, whose probability
is the highest. As the second criterion, we propose the sum-argmax, where the components of the attention
vector belonging to the same class are first summed before applying the argmax over the m summed values (i.e., one per
class). For an m-way n-shot problem, the global argmax requires O(nm) comparison operations, while the sum-argmax
requires O(m) comparisons together with O(m log(n)) addition operations; these operations are much fewer than the
number of operations involved in the key memory for the similarity search with the high-dimensional vectors. Hence,
the computational complexity of the selection criterion is not dominant and does not affect the overall computational
complexity, which is dominated by the key memory.
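The two criteria can be contrasted with the following sketch, where labels holds the integer class index of each of the mn support entries (names are illustrative):

    import numpy as np

    def global_argmax(w, labels):
        """Return the label of the single support entry with the highest attention."""
        return labels[np.argmax(w)]

    def sum_argmax(w, labels, m):
        """Sum attention components of the same class, then take the argmax over classes."""
        per_class = np.zeros(m)
        np.add.at(per_class, labels, w)
        return int(np.argmax(per_class))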
However, as shown in Supplementary Figure 10, the sum-argmax exhibits a clear advantage over the global argmax
in mitigating the accuracy drop in the presence of noise when the key memory is implemented on the PCM devices,
as opposed to the ideal software model without any variations. This lower accuracy drop is observed for both bipolar
and binary representations, especially for large problem sizes. For instance, the 100-way 5-shot problem with the
binary representation exhibits an up to 1% smaller accuracy drop when using the sum-argmax instead of the global
argmax function. This is mainly because, in the sum-argmax function, the variations among the various
classes programmed on the PCM array can be better averaged out by adding the intra-class probabilities. As a
further observation, the sum-argmax, which yields a more robust ranking criterion, cannot be implemented in
TCAM-based architectures because a TCAM can only compute the argmax via the speed of a matchline discharge.
Supplementary Note 4: Effect of approximating the sharpening function and the similarity function
[Supplementary Figure 11 plots: classification accuracy (%) of the real, bipolar, and binary representations with the precise versus the approximate sharpening function for (a) 5-way 1-shot, (b) 20-way 5-shot, and (c) 100-way 5-shot.]
Supplementary Figure 11. Classification accuracy comparison between the approximate and the precise sharpening functions for
three problems: 5-way 1-shot (a), 20-way 5-shot (b), and 100-way 5-shot (c), from 10 independent runs each with 1000
episodes. For all problems, the precise similarity function is used.
[Supplementary Figure 12 plots: classification accuracy (%) of the bipolar and binary representations with the precise versus the approximate similarity function for (a) 5-way 1-shot, (b) 20-way 5-shot, and (c) 100-way 5-shot.]
Supplementary Figure 12. Classification accuracy comparison between the approximate and the precise similarity functions for
three problems: 5-way 1-shot (a), 20-way 5-shot (b), and 100-way 5-shot (c), from 10 independent runs each with 1000
episodes. For all problems, the approximate sharpening function is used.
Supplementary Figure 11 shows the impact of using the softabs as the precise sharpening function versus the
regular absolute function as the approximate version during inference, for three different representations. As shown,
the softabs function consistently reaches a higher accuracy for both the real and bipolar representations across all
three problems. However, the softabs has a negative impact on the accuracy of the binary representation. Therefore,
using the regular absolute function, which in fact can be bypassed in the binary representation, not only simplifies
the inference architecture but also slightly improves the accuracy.
Supplementary Figure 12 shows the impact of using the cosine similarity as the precise similarity metric versus the
dot product as the approximate version. The approximate sharpening function is used for each configuration. A
significant accuracy drop is observed in the binary case when the cosine similarity is approximated by the dot product.
This drop is due to the fact that the controller produces binary vectors that do not have a fixed norm but a norm
approximately equal to √(d/2), whereas the bipolar vectors have a constant norm of √d. When the
similarity metric is not approximated, the binary representation is on average better than the bipolar one, indicating that the similarity
approximation is the cause of the accuracy drop in the case of the binary representation. The results are collected from
10 independent runs for each configuration.
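The effect can be reproduced numerically with the following sketch: for binary vectors whose occupancy ratio deviates from 1/2, the constant-scaled dot product (2/d)(â · b̂) differs from the exact cosine similarity, whereas for bipolar vectors the dot product divided by d equals the cosine exactly (the occupancy value and the names are illustrative).

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # binary vectors with occupancy ratio 0.55 instead of 0.5
    a_bin = (rng.random(d) < 0.55).astype(float)
    b_bin = (rng.random(d) < 0.55).astype(float)
    print(cosine(a_bin, b_bin), 2.0 / d * (a_bin @ b_bin))   # scaled dot product deviates

    # bipolar vectors: the dot product scaled by 1/d matches the cosine exactly
    a_bip, b_bip = 2 * a_bin - 1, 2 * b_bin - 1
    print(cosine(a_bip, b_bip), (a_bip @ b_bip) / d)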
Supplementary Note 5: Spatial Variability on PCM Crossbar
The particular conductance levels determine the power consumption of the crossbars during read-outs and the
variability of the programmed states. Usually, the lower the conductance level of the state, the larger the deviations.
The conductance of the RESET state for such devices is usually low enough so that currents cannot be detected by
the ADC during read-out. Since our binary representations are extremely robust against deviations, we used low and
thus power-saving conductance levels for the SET state of the key memory crossbar. Supplementary Figure 13(a)
illustrates typical SET state variations encountered at programming time on the prototype chip used for the experiments.
[Supplementary Figure 13 plots: (a) spatial map of the SET state conductance (μS) across the PCM array; (b) average SET state conductance (μS) per bitline; (c) histogram of the per-bitline average SET state conductance.]
Supplementary Figure 13. (a) Spatial variability of the SET state conductance of the devices across the PCM array. (b) Average SET
state conductance per bitline. (c) Distribution of the average SET state conductance per bitline.
The relative standard deviation of the spatial variability of the SET state conductance (σrel) at programming
timescales is observed to be 31.7% across the entire region of the chip utilized for the 100-way 5-shot problem. When
the SET state conductance is averaged per bitline, there is a 5.38% relative standard deviation in the average SET state
conductance across bitlines, as shown in Supplementary Figure 13(b) and (c). This spatial variability causes a
further classification accuracy loss of 0.11% in the PCM experiments when compared with PCM model simulations using
the same parameters as the prototype chip used for the experiments.
Supplementary Note 6: Proof of robustness of noisy cosine similarity with binary high-dimensional vectors
Let us take two binary high-dimensional vectors â and b̂, which fulfill the property ‖x̂‖ ≈ √(d/2). We write b̂ to
the computational memory and use â as the readout voltage. Thus, the latter remains accurate whereas the former
(written into the memory) has to be modeled as a vector of normal random variables B̂ with

\hat{B}_i = \begin{cases} X & \text{if } \hat{b}_i = 1 \\ 0 & \text{otherwise} \end{cases}, \qquad E(X) = 1, \quad \mathrm{Var}(X) = \sigma_{rel}^{2}.
In the next step, we consider the (approximate) cosine similarity between the original vectors, a fixed value α, and
rearrange its expression:
\alpha = \frac{2}{d}\,\hat{a}\cdot\hat{b} = \frac{2}{d}\sum_{i=1}^{d}\hat{a}_i\hat{b}_i = \frac{2}{d}\sum_{i=1}^{d} \mathbb{1}[\hat{a}_i = \hat{b}_i = 1] = \frac{2}{d}\sum_{i=1}^{n} 1 = \frac{2n}{d},
where n denotes the number of positions where both aˆ and bˆ equal 1. Analogously, we obtain
\Lambda = \frac{2}{d}\,\hat{a}\cdot\hat{B} = \frac{2}{d}\sum_{i=1}^{d} X\,\mathbb{1}[\hat{a}_i = \hat{b}_i = 1] = \frac{2}{d}\sum_{i=1}^{n} X,
where the random variable Λ denotes the result of the noisy cosine similarity operation.
Applying the rules of probability theory, we compute the expected value and standard deviation of Λ as
E(\Lambda) = E\!\left(\frac{2}{d}\sum_{i=1}^{n} X\right) = \frac{2}{d}\sum_{i=1}^{n} E(X) = \frac{2n}{d} = \alpha,

\mathrm{Var}(\Lambda) = \mathrm{Var}\!\left(\frac{2}{d}\sum_{i=1}^{n} X\right) = \left(\frac{2}{d}\right)^{2}\sum_{i=1}^{n}\mathrm{Var}(X) = \left(\frac{2}{d}\right)^{2} n\,\sigma_{rel}^{2} = \left(\frac{2}{d}\right)^{2}\frac{\alpha d}{2}\,\sigma_{rel}^{2} = \frac{2\alpha\sigma_{rel}^{2}}{d}.
This finally leads to
\sigma(\Lambda) = \sqrt{\frac{2\alpha}{d}}\,\sigma_{rel}.
This states that the standard deviation of the cosine similarity is inversely proportional to the square root of the
dimensionality, and is thus able to diminish the influence of the large SET state variability. Furthermore, a deviation
of the expected value E(X) from 1 has a direct influence on the expected value of Λ:
E(Λ) = εα when E(X) = ε.
Therefore, the SET states of all crossbar devices should exhibit the same mean; deviations from this can only be compensated
by the quality of the representations themselves.
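The predicted σ(Λ) can be verified with a small Monte Carlo sketch that mirrors the model above, drawing each stored 1 of b̂ as a sample of X ~ N(1, σ²rel); the parameter values and names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma_rel, trials = 512, 0.317, 20000

    a = (rng.random(d) < 0.5).astype(float)
    b = (rng.random(d) < 0.5).astype(float)
    alpha = 2.0 / d * (a @ b)                 # noise-free similarity

    # noisy stored vector: each 1 of b is replaced by a sample of X ~ N(1, sigma_rel^2)
    B = b * (1.0 + sigma_rel * rng.standard_normal((trials, d)))
    Lam = 2.0 / d * (B @ a)                   # noisy similarity per trial

    print(Lam.std(), np.sqrt(2.0 * alpha / d) * sigma_rel)   # empirical vs. predicted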