A Taxonomy for Neural Memory Networks by Ma, Ying & Principe, Jose
Ying Ma, Jose Principe, Life Fellow, IEEE
Abstract—In this paper, a taxonomy for memory networks is
proposed based on their memory organization. The taxonomy
includes all the popular memory networks: vanilla recurrent
neural network (RNN), long short-term memory (LSTM), neural
stack, neural Turing machine, and their variants. The taxonomy
puts all these networks under a single umbrella and shows
their relative expressive power, i.e. vanilla RNN ⊆ LSTM ⊆ neural
stack ⊆ neural RAM. The differences and commonalities between
these networks are analyzed. These differences are also connected
to the requirements of different tasks, which can guide the user
in choosing or designing an appropriate memory
network for a specific task. As a conceptually simplified class of
problems, four tasks on synthetic symbol sequences (counting,
counting with interference, reversing, and repeat copying) are
developed and tested to verify our arguments. We also use two
natural language processing problems to discuss how this taxonomy
helps in choosing the appropriate neural memory network
for real-world problems.
Index Terms—RNN, LSTM, neural stack, neural Turing Machine,
DNC, memory network, taxonomy
I. INTRODUCTION
Memory has a pivotal role in human cognition, and many
different types are well known and intensively studied [1]. In
neural networks and signal processing, the use of memory
is concentrated on preserving, in some form (by storing past
samples or using a state model), the information from the past.
A system is said to include memory if the system's output is a
function of the current and past samples. Feedforward neural
networks are memoryless, but the time delay neural network
[2], the gamma neural model [3] and recurrent neural networks
are memory networks. An important theoretical result showed
that these networks are universal in the space of myopic
functions [4]. A methodology to quantify linear memories was
presented in [3], which proposed an analytic expression for
the compromise between memory depth (how much of the past
is remembered) and memory resolution (how specifically the
system remembers a past event). A similar compromise exists
for nonlinear dynamic memories (i.e. those using nonlinear state
variables to represent the past), but it depends on the type
of nonlinearity and there is no known closed-form solution. It
is fair to say that the most widely used neural memory for
sequence learning is currently the recurrent neural network (RNN).
Compared to the time delay neural network, the RNN keeps a
processed version of the past signal in its state. The first classic
version of the RNN was proposed in [5], [6]; it introduces memory by
adding a feedback from the hidden layer output to its input
for sequence recognition. Such networks are often referred to as vanilla
RNNs nowadays. A large body of work has also used RNNs for
dynamic modeling of complex dynamical and even chaotic
systems [7].
The authors are with the Computational NeuroEngineering Laboratory,
University of Florida, Gainesville, FL 32611 USA (e-mail: maying-
bit2011@gmail.com; principe@cnel.ufl.edu).
However, although RNNs are theoretically Turing
complete if well trained, they usually do not perform well
when the sequence is long. Long short-term memory (LSTM)
[8] was proposed to provide more flexibility to RNNs by
employing an external memory, called the cell state, to deal with the
vanishing gradient problem. Three logic gates are also introduced
to adjust this external memory and the internal memory. Later on,
several variants were proposed [9]. One example worth
mentioning is the Gated Recurrent Unit [10], which combines
the forget and input gates and makes some other changes
to simplify the model. Another variant is the peephole
network [11], which lets the gate layers look not only at the
input and hidden state but also at the cell state. With the help of
the external memory, the network does not need to squeeze
all the useful past information into the state variable; the cell
state helps to save information from the distant past. The
cooperation of the internal memory and the external memory
outperforms the vanilla RNN in many real-world tasks
such as language translation [12] and video prediction [13], [14].
However, all these sequence learning models have
difficulty accomplishing some simple memorization tasks
such as copying and reversing a sequence. The problem for
LSTM and its variants is that previous memories are
erased when they are updated, which also happens continuously
with the RNN state. To solve this problem, the
concept of extracting events online in time is necessary and
can be conceptually captured with an external memory bank.
However, a learning system with external event memory must
also learn when to store an event, as well as when to use it in the
future. In this way, the old memory does not need to be erased
to make space for the new memory. The neural stack is an example
that uses a stack as its external memory bank and accesses
the content at the top of the stack through push and pop operations. Its
two operations are controlled by either a feedforward network
or an RNN such as [6], [8]. Research on neural stack
networks originated in [15], [16] to mimic the working
mechanism of pushdown automata. The continuous push and
pop operations introduced there showed all subsequent papers
how to render a discrete data structure continuous.
The authors of [17], [18] adapted these operations to queues
and double queues to accomplish more complex tasks, such
as sequence prediction. Moreover, [17] extended the number
of stacks in their model. More recently, more powerful network
structures such as the neural Turing machine [19] and the differentiable
neural computer [20] have been proposed. In these networks, all the
contents of the memory bank can be accessed. At the same
time, Weston [21] also proposed a sequence prediction method
using an addressable memory and tested it on language and
reasoning tasks. Since the accessible content is not restricted
to the top of the memory as in the neural stack, the neural RAM has
more flexibility in handling its memory bank.
arXiv:1805.00327v1 [cs.LG] 1 May 2018
Many memory networks have emerged recently: some of
them adopt internal memory, some adopt external
memory, some adopt logic gates, and some
adopt an attention mechanism. As expected, all of them have
advantages for some specific tasks, but it is hard to decide
which one is optimal for a new task unless we have a
clear understanding of the functions of all the components in the
memory networks. Although we can try and test them one by
one from simple to complex, this is a very time-consuming
process. It is also not a good choice to always go for the
most complex network, because it needs more resources and
takes more time to train well. Intuitively, we all know
that if a network involves more components, it can make use
of more information, but what kind of extra information
it uses and how useful this extra information is are still
unknown in the current literature. Understanding the essential
differences and relations between these memory networks and
connecting these differences to the requirements of different
targets is the key to choosing the right network. Furthermore,
it can guide us in designing an appropriate memory network
according to the features of the specific sequences to be
learned.
In this paper, we analyze the capabilities of different memory
networks based on how they organize their memories.
Moreover, we propose a memory network taxonomy which
covers the four main classes of memory networks: vanilla
RNN, LSTM, neural stack and neural RAM.
Each class of network has several different realizations;
since the basic idea of all the variants is the same, we
choose only one typical network architecture for each class for
the analysis. Our conclusion is that there is a hierarchical
organization in the sense that each one of them can be seen
as a special case of another one given the following order, i.e.
vanilla RNN ⊆ LSTM ⊆ neural stack ⊆ neural RAM. This is no
surprise, because it resembles the hierarchical organization of
language grammars, but here we are specifically interested in
linking the mapping architecture of the learning machine to
its descriptive power, which has not been addressed before. This
inclusion relation is both proved mathematically and verified
with four synthetic tasks (counting, counting with interference,
reversing and repeat copying) whose requirements on past
information are increasing.
Our paper is organized as follows: after this introduction
(Section I), Section II outlines the architecture of the four classes
of memory networks. Section III describes the proposed taxonomy
for memory networks and discusses how they organize
their memories; it also describes the four tasks developed to
test the capabilities of different networks. In Section IV, the
proposed taxonomy is corroborated by conducting four test
experiments. Section V gives the conclusion and discusses some
future work.
II. MODEL
In this section, we introduce four typical network architectures,
which represent the four classes of memory networks
adopted in this paper: vanilla RNN, LSTM, neural stack and
neural RAM. Their basic architectures and training methods
are described in detail.
Figure 1. Vanilla RNN
A. Vanilla RNN
The RNN [6] is composed of three layers: an input layer, a
hidden recurrent layer and an output layer. Besides all the feedforward
connections, there is a feedback connection from the hidden
layer to itself. The numbers of neurons in the three layers are
$K_i$, $K_h$ and $K_o$ respectively. The architecture is shown in Fig. 1.
Given a sequence of symbols $S: s_1, s_2, \ldots, s_T$, each
symbol is encoded as a single vector and fed as input to the
network one at a time. The dynamics of the hidden layer can
be written as
$$ h_t = f(w_{xh}^T x_t + w_{hh}^T h_{t-1} + b_h), \tag{1} $$
where $x_t$ is the input at time $t$, which is the encoding vector
of symbol $s_t$; $w_{xh}$ is the $K_h \times K_i$ weight matrix from the input
layer to the hidden layer; $w_{hh}$ is the $K_h \times K_h$ recurrent weight matrix;
and $b_h$ is the $K_h \times 1$ bias. $f(x)$ is a nonlinear activation
function such as the sigmoid function $\frac{1}{1+e^{-x}}$. The
output at time $t$ is
$$ o_t = f(w_{ho}^T h_t + b_o), \tag{2} $$
where $o_t$ is the $K_o \times 1$ output vector, $w_{ho}$ is the $K_o \times K_h$
output weight matrix and $b_o$ is the $K_o \times 1$ bias. RNNs can be cascaded
and trained with gradient descent methods called real-time
recurrent learning (RTRL) and backpropagation through time
(BPTT) [22].
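As a minimal sketch of Eq. (1) and Eq. (2), the forward pass of a vanilla RNN could look as follows. The layer sizes, the random weights and the helper name `rnn_step` are illustrative assumptions, not taken from the paper; the weights are stored with shapes chosen so that the products $w^T x$ in the equations are well defined.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h, w_ho, b_o, f=sigmoid):
    """One time step of the vanilla RNN (hypothetical helper)."""
    h_t = f(w_xh.T @ x_t + w_hh.T @ h_prev + b_h)   # Eq. (1)
    o_t = f(w_ho.T @ h_t + b_o)                     # Eq. (2)
    return h_t, o_t

rng = np.random.default_rng(0)
K_i, K_h, K_o = 3, 4, 2                 # illustrative layer sizes
w_xh = rng.normal(size=(K_i, K_h))      # stored so that w_xh.T @ x_t is K_h-dimensional
w_hh = rng.normal(size=(K_h, K_h))
w_ho = rng.normal(size=(K_h, K_o))
b_h, b_o = np.zeros(K_h), np.zeros(K_o)

h = np.zeros(K_h)                       # initial state
for x in np.eye(K_i):                   # three one-hot encoded symbols
    h, o = rnn_step(x, h, w_xh, w_hh, b_h, w_ho, b_o)
```

The only memory carried from one step to the next is `h`, which is why everything the network remembers about the past has to be squeezed into the hidden state.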
The memory of the past at time $t$ is encoded in the hidden
layer variable $h_t$. Although the upper limit of the differential
entropy of $h_t$ is always larger than the total entropy of the past
inputs $x_1, x_2, \ldots, x_{t-1}$ (i.e. the information is always smaller),
the information captured is highly dependent on the network
weights. Even apart from the vanishing gradient problem [23],
there is always a compromise between memory depth and
memory resolution in the RNN. In particular, for very long
memory depths, the information is spread among many
samples, so there is a chance of overlap among many
past events.
B. LSTM
Figure 2. LSTM
In LSTM [8], as shown in Fig. 2, the feedback connection is
a weighted vector $m$ of the current state and a long-term state.
The feedback method is described in Eq. (3) to Eq. (6):
$$ c_t = f(w_{hc}^T h_{t-1} + b_c), \tag{3} $$
$$ m_t = g_{i,t} c_t + g_{f,t} m_{t-1}, \tag{4} $$
$$ r_t = m_t, \tag{5} $$
$$ h_t = f(w_{xh}^T x_t + w_{rh}^T g_{o,t} r_t + b_h), \tag{6} $$
where $g_{i,t}$, $g_{f,t}$ and $g_{o,t}$ are the input gate, forget gate and output
gate at time $t$ respectively. $m_t$ is the $N \times 1$ long-term memory
at time $t$, which is initialized to zero; $c_t$ is the $N \times 1$ candidate
vector to be put into the long-term memory, and $w_{hc}$ is the
corresponding weight matrix of size $N \times K_h$. Unlike in the
vanilla RNN, the memory of the past is a combination of the
long-term memory $m_{t-1}$ and the current state variable $c_t$. The
weighting between these two kinds of memory is decided
by two gates: the forget gate $g_{f,t}$ decides how relevant the long-term
memory is, and the input gate $g_{i,t}$ decides how relevant
the current state is. Moreover, whether the calculated memory
affects the next state is decided by the output gate $g_{o,t}$. These three
gates are calculated as follows:
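$$ g_{i,t} = s(w_{hg_i}^T h_{t-1} + w_{xg_i}^T x_t + b_{g_i}), \tag{7} $$
$$ g_{f,t} = s(w_{hg_f}^T h_{t-1} + w_{xg_f}^T x_t + b_{g_f}), \tag{8} $$
$$ g_{o,t} = s(w_{hg_o}^T h_{t-1} + w_{xg_o}^T x_t + b_{g_o}), \tag{9} $$
where $w_{hg_i}$, $w_{hg_f}$, $w_{hg_o}$ are $K_h \times 1$ weights, $w_{xg_i}$,
$w_{xg_f}$, $w_{xg_o}$ are $K_i \times 1$ weights, and $b_{g_i}$, $b_{g_f}$ and $b_{g_o}$ are biases.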
These three gates give flexibility to operate on memories. For
example, the memory from the distant past can be retained by
setting the forget gate to 1 and the input gate to 0 for several
consecutive time steps. However, when the memory $m_t$ is
updated as in Eq. (4), the old value $m_{t-1}$ is erased. Hence, for
tasks which need more than one previous memory, we
must use several feedback loops in parallel, as shown in Fig. 3.
The output of the network is either the same as the one in
the RNN, as shown in Eq. (2), or a function of the external memory,
$$ o_t = t(w_{mo}^T m_t + b_o). \tag{10} $$
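The gated memory update of Eq. (3) to Eq. (9) can be sketched as below. This is only an illustration under stated assumptions: $s$ is taken to be the logistic sigmoid, $f$ is taken to be tanh, and the parameter dictionary `p`, its key names and the sizes are hypothetical. The gate biases are chosen so that the input gate is close to 0 and the forget gate is close to 1, which reproduces the "hold the old memory" behaviour described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, p, f=np.tanh):
    """One step of the LSTM as written in Eq. (3)-(9), with scalar gates."""
    def gate(name):  # Eq. (7)-(9), assuming s is the logistic sigmoid
        return sigmoid(p["w_hg" + name] @ h_prev + p["w_xg" + name] @ x_t + p["b_g" + name])
    g_i, g_f, g_o = gate("i"), gate("f"), gate("o")
    c_t = f(p["w_hc"].T @ h_prev + p["b_c"])                            # Eq. (3)
    m_t = g_i * c_t + g_f * m_prev                                      # Eq. (4)
    r_t = m_t                                                           # Eq. (5)
    h_t = f(p["w_xh"].T @ x_t + p["w_rh"].T @ (g_o * r_t) + p["b_h"])   # Eq. (6)
    return h_t, m_t

rng = np.random.default_rng(0)
K_i, K_h, N = 3, 4, 5                   # illustrative sizes
p = {
    "w_hc": rng.normal(size=(K_h, N)), "b_c": np.zeros(N),
    "w_xh": rng.normal(size=(K_i, K_h)), "w_rh": rng.normal(size=(N, K_h)),
    "b_h": np.zeros(K_h),
    "w_hgi": rng.normal(size=K_h), "w_xgi": rng.normal(size=K_i), "b_gi": -50.0,  # input gate ~ 0
    "w_hgf": rng.normal(size=K_h), "w_xgf": rng.normal(size=K_i), "b_gf": 50.0,   # forget gate ~ 1
    "w_hgo": rng.normal(size=K_h), "w_xgo": rng.normal(size=K_i), "b_go": 0.0,
}
m_prev = rng.normal(size=N)
h_prev = np.tanh(rng.normal(size=K_h))
x_t = np.eye(K_i)[0]
h_t, m_t = lstm_step(x_t, h_prev, m_prev, p)   # m_t stays (numerically) equal to m_prev
```

With the gates saturated this way the single memory slot is preserved, but any later write with a nonzero input gate overwrites it, which is the limitation the parallel loops of Fig. 3 are meant to address.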
Figure 3. LSTM with parallel memory slots
Figure 4. Neural Stack
C. Neural Stack
In this subsection, the neural network with an extra stack is
introduced. The diagram of the network is shown in Fig. 4.
One stack property is that only the topmost content of the
stack can be read or written. Writing to the stack is
implemented by three operations: push, which adds an element
to the top of the stack; pop, which removes the topmost element of the
stack; and no-operation, which keeps the stack unchanged. These three
operations help the machine organize the memory in a
way that reduces the error. In order to train the network with
BPTT, all operations have to be implemented by continuous
functions over a continuous domain. According to [16], [17],
[18], the domain of the operations is relaxed to any real
value in [0, 1]. This extension adds an amplitude dimension
to the operations. For example, if the push signal $d^{\text{push}} = 1$,
the current vector is pushed onto the stack as it is; if
$d^{\text{push}} = 0.8$, the current vector is first multiplied by 0.8 and
then pushed onto the stack. In this paper, the dynamics of the
stack follow [17]. Specifically, the elements in the stack are
updated as follows,
$$ s_t(0) = d_t^{\text{push}} c + d_t^{\text{pop}} s_{t-1}(1) + d_t^{\text{no-op}} s_{t-1}(0), \tag{11} $$
$$ s_t(i) = d_t^{\text{push}} s_{t-1}(i-1) + d_t^{\text{pop}} s_{t-1}(i+1) + d_t^{\text{no-op}} s_{t-1}(i), $$
where $s_t(i)$ is the content of the stack at time $t$ in position $i$,
$s_t(0)$ is the topmost content, $c$ is the candidate content to be
pushed onto the stack, and $d_t^{\text{push}}$, $d_t^{\text{pop}}$ and $d_t^{\text{no-op}}$ are the push, pop
and no-operation signals.
The neural network interacts with the stack memory through $d_t^{\text{push}}$,
$d_t^{\text{pop}}$, $d_t^{\text{no-op}}$, $c_t$ and $r_t$. $r_t$ is the read vector at time $t$,
$r_t = g_o s_t(0)$. $d_t^{\text{push}}$, $d_t^{\text{pop}}$, $d_t^{\text{no-op}}$ and $c_t$ are decided by the hidden
layer outputs and the corresponding weights,
$$ d = [d_t^{\text{push}}, d_t^{\text{pop}}, d_t^{\text{no-op}}]^T = s(w_{hd}^T h_t + b_{op}), $$
where $w_{hd}$ is the $3 \times K_h$ weight matrix and $b_{op}$ is the $3 \times 1$ bias, and
$$ c_t = g(w_{hc}^T h_t + b_c). $$
Since the recurrence is introduced by the stack memory,
$$ h_t = g(w_{xh}^T x_t + w_{rh}^T r_t + b_h), \tag{12} $$
where $r_t$ is the read vector at time $t$. The output of the
network is the same as in Eq. (2). As all the variables and weights
are continuous, the error can be backpropagated to update the
weights and biases.
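The continuous stack update of Eq. (11) can be illustrated with a short sketch. The function name `stack_update`, the fixed stack depth and the zero padding below the bottom element are assumptions made for this example; the paper does not specify how the bottom of a finite stack is handled.

```python
import numpy as np

def stack_update(stack, c, d_push, d_pop, d_noop):
    """Continuous stack update following Eq. (11).

    stack: array of shape (depth, N); row 0 is the top of the stack.
    c:     candidate vector of shape (N,).
    """
    pad = np.zeros((1, stack.shape[1]))
    shifted_down = np.vstack([c[None, :], stack[:-1]])  # s_{t-1}(i-1), with c entering at the top
    shifted_up = np.vstack([stack[1:], pad])            # s_{t-1}(i+1), zeros assumed below the bottom
    return d_push * shifted_down + d_pop * shifted_up + d_noop * stack

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

s = np.zeros((3, 2))
s = stack_update(s, a, 1.0, 0.0, 0.0)            # hard push of a
s = stack_update(s, b, 1.0, 0.0, 0.0)            # hard push of b
top_after_push = s[0].copy()                     # b is on top
s = stack_update(s, np.zeros(2), 0.0, 1.0, 0.0)  # hard pop
top_after_pop = s[0].copy()                      # a is on top again

soft = stack_update(np.zeros((3, 2)), a, 0.8, 0.0, 0.0)  # soft push: 0.8 * a lands on top
```

With signals restricted to 0 or 1 the update behaves like an ordinary stack; intermediate values blend the three shifted copies, which is what makes the operation differentiable.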
With this external memory, all the useful information is
retained. Unlike in the internal memory, the content of the
past is not altered; it is stored in its original form or in a
transformed form. Moreover, as the content and the
operations of the past are separated, we can efficiently select
the useful content from this structured memory rather than
using a mixture of all the previous content. The external
memory thus circumvents the compromise between memory
depth and memory resolution that is always present in the
state memory.
D. Neural RAM
The last and most powerful network is the neural RAM,
shown in Fig. 5. The neural RAM can be seen as an
improvement of the neural stack in the sense that all the
contents in the memory bank can be read from and written
to. The challenge for this network is that memory
addresses are discrete in nature. In order to learn the read
and write addresses by error backpropagation [24], they have
to be continuous. Papers [19], [20], [21] give a solution to
this difficulty: reading from and writing to all the positions with
different strengths. These strengths can also be interpreted as
the probabilities that each position is read from or written
to. Note that the read and write positions do not
need to be the same. Specifically, the read vector at
time step $t$ is
$$ r_t = \sum_{i=0}^{M-1} a_t(i) m_t(i), \tag{13} $$
where $m$ is the memory bank with $M$ memory locations and $a_t(i)$
is the normalized weight for the $i$th location at time $t$, which
satisfies
$$ \sum_i a_t(i) = 1, \quad 0 \le a_t(i) \le 1. \tag{14} $$
For the writing process, the forget and input gate arrangement
is also applicable here. For memory location $i$,
$$ c_t(i) = f(w_{hc}^T h_{t-1} + b_c), \tag{15} $$
$$ m_t(i) = g_{i,t}(i) c_t(i) + g_{f,t}(i) m_{t-1}(i), \quad \forall i, \tag{16} $$
where $g_{i,t}(i)$ and $g_{f,t}(i)$ together can be seen as the write head
for memory slot $i$ at time $t$. The dynamics of the hidden layer
are
$$ h_t = f(w_{xh}^T x_t + w_{rh}^T r_t + b_h). \tag{17} $$
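A small sketch may help to see how the soft read of Eq. (13) and the gated write of Eq. (16) act on a memory bank. The helper names, the memory contents and the choice $g_f = 1 - g_i$ are illustrative assumptions for this example only.

```python
import numpy as np

def ram_write(memory, c, g_i, g_f):
    """Eq. (16): m_t(i) = g_i(i) * c_t(i) + g_f(i) * m_{t-1}(i) for every slot i."""
    return g_i[:, None] * c + g_f[:, None] * memory

def ram_read(memory, a):
    """Eq. (13): r_t = sum_i a_t(i) * m_t(i), with a satisfying Eq. (14)."""
    return a @ memory

M, N = 4, 3                                            # slots and slot width (illustrative)
memory = np.arange(M * N, dtype=float).reshape(M, N)
c = np.tile(np.array([9.0, 9.0, 9.0]), (M, 1))         # same candidate offered to every slot

g_i = np.array([0.0, 0.0, 1.0, 0.0])                   # write head focused on slot 2
g_f = 1.0 - g_i                                        # illustrative coupling of the two gates
new_memory = ram_write(memory, c, g_i, g_f)

r_hard = ram_read(new_memory, np.array([0.0, 0.0, 1.0, 0.0]))   # reads slot 2 exactly
r_soft = ram_read(new_memory, np.array([0.5, 0.5, 0.0, 0.0]))   # average of slots 0 and 1
```

Because the read and write weights are independent vectors, the network can write to one slot and read from another in the same step, which a stack cannot do.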
The read weight $a_t(i)$ can be learned as
$$ a_t(i) = f(w_{ha}^T h_{t-1} + b_a), \tag{18} $$
where $w_{ha}$ is the $M \times K_h$ weight matrix. The nonlinear activation function
$f$ is usually set to the softmax function. The write weight and the
output gate can be learned in the same way as Eq. (7) and Eq. (8)
in LSTM,
$$ g_{i,t} = s(w_{hg_i}^T h_{t-1} + w_{xg_i}^T x_t + b_{g_i}), \tag{19} $$
$$ g_{f,t} = s(w_{hg_f}^T h_{t-1} + w_{xg_f}^T x_t + b_{g_f}), \tag{20} $$
where $g_{i,t} = [g_{i,t}(1), g_{i,t}(2), \ldots, g_{i,t}(M)]^T$, $g_{f,t} =
[g_{f,t}(1), g_{f,t}(2), \ldots, g_{f,t}(M)]^T$, $w_{hg_i}$, $w_{hg_f}$, $w_{hg_o}$ are
$K_h \times M$ weights, $w_{xg_i}$, $w_{xg_f}$, $w_{xg_o}$ are $K_i \times M$ weights,
and $b_{g_i}$, $b_{g_f}$ and $b_{g_o}$ are $M \times 1$ biases. In practice, instead of
learning the read and write heads from scratch, some methods
have been proposed to simplify the learning process. For example,
in the neural Turing machine [19], $g_{f,t}(i)$ is coupled with $g_{i,t}(i)$:
$g_{f,t}(i) = 1 - g_{i,t}(i)$. The read weight $a_t$ and write weight
$g_{i,t}$ are obtained by content-addressing and location-addressing
mechanisms. The content-addressing mechanism gives the
weights $a_t(i)$ (or $g_{i,t}(i)$) by checking the similarity of the key
$d$ with all the contents in the memory; the normalized version
is
$$ a_t(i) = \frac{\exp(\alpha K[d, m_t(i)])}{\sum_j \exp(\alpha K[d, m_t(j)])}, $$
where $\alpha$ is a parameter controlling the precision of the focus and
$K$ is a similarity measure. The weights are then further
adapted by the location-addressing mechanism. For example,
the weights obtained by content addressing can first be blended
with the previous weights and then shifted by several steps,
$$ a_t(i) = g_t a_{t-1}(i) + (1 - g_t) a_t(i), $$
$$ a_t(i) = a_t([i-n]_M), $$
where $g_t$ is the gate that balances the previous weights and the current
weights, $n$ is the number of shift steps, and $[i-n]_M$ denotes a circular
shift over $M$ entries. Since the shifting operation is not
differentiable, the method in [19] should be used as an
approximation.
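The addressing steps above can be sketched as follows. Several choices here are assumptions made for illustration: cosine similarity is used for the similarity measure $K$ (the text only requires some similarity measure), the shift is a hard integer shift (the non-differentiable version, not the approximation of [19]), and the function names are hypothetical.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity, used here as one possible choice of K."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def content_weights(key, memory, alpha):
    """Softmax over alpha * K[key, m(i)] across all memory slots."""
    scores = alpha * np.array([cosine(key, row) for row in memory])
    scores -= scores.max()                 # numerical stability only
    e = np.exp(scores)
    return e / e.sum()

def location_adjust(a_content, a_prev, g, n):
    """Blend with the previous weights, then circularly shift by n slots."""
    blended = g * a_prev + (1.0 - g) * a_content
    return np.roll(blended, n)             # result[i] = blended[(i - n) mod M]

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
key = np.array([0.0, 1.0, 0.0])

a_content = content_weights(key, memory, alpha=20.0)   # sharply focused on slot 1
a_prev = np.full(4, 0.25)
a_new = location_adjust(a_content, a_prev, g=0.0, n=1) # focus moves to slot 2
```

A larger `alpha` sharpens the focus on the best-matching slot, while `g` closer to 1 keeps the head where it was at the previous step.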
Another example is [20], which improves the performance
even further. Specifically, for reading, a matrix that remembers
the order in which memory locations are written to can be
introduced. With this matrix, the read weight is a combination
of the content lookup and the iteration through the memory
locations in the order in which they were written. For writing, a
usage vector is introduced, which guides the network to write
preferentially to unused memory. With these modifications,
the neural RAM gains a flexibility similar to the working memory of
human cognition, which makes it more suitable for intelligent
prediction. The training time for the
neural RAM is also reduced.
Figure 5. Neural RAM
It should be pointed out that the RAM network can be seen
as an LSTM with several parallel feedback loops which are coupled
together in a non-trivial way.
III. A UNIFYING MEMORY NETWORK FRAMEWORK
A. Memory network taxonomy
In this section, we propose a hierarchy developed according
to the way the four kinds of networks described in the last
section organize their memory, as shown in Fig. 6. A network in an outer circle can
always implement a network in an inner circle as a special
case; however, a network in an inner circle does not have
the capability to implement the functionality of a network in an
outer circle, and so it will display poorer performance on tasks that
need that functionality. This
hierarchy can help us choose the proper network for a specific
task. Our principle is always to choose the simplest sufficient network,
since a more complex network needs more resources. In this
section, we first prove the inclusion relationship and
then visualize how these networks organize their memory
space.
B. Inclusion Relationship Derivations
Figure 6. Memory Network Taxonomy
In this subsection, we prove that the neural stack is a special
case of the neural RAM, LSTM is a special case of the neural stack,
and the vanilla RNN is a special case of LSTM.
1) From Neural RAM to Neural Stack: The neural RAM is
more powerful than the neural stack because it has access to all
the contents in the memory bank. If we restrict the read and
write vectors, the neural RAM degrades to the neural stack.
Specifically, for the read head $a_t$, if all the read weights except
the topmost are set to zero, then
$$ a_t(i) = \begin{cases} 0, & \text{if } i \neq 0 \\ t(w_{ha}^T h_{t-1} + b_a), & \text{if } i = 0 \end{cases} \tag{21} $$
where $w_{ha}$ is a $1 \times K_h$ vector and $b_a$ is a scalar.
In the writing process, instead of learning all the
contents to be written to the stack as in Eq. (15), only the
content to be put into $M_0$ is learned, as
$$ c_t(0) = t(w_{hc}^T h_{t-1} + b_c) + \alpha c_{t-1}(1), \tag{22} $$
and all other contents are calculated as
$$ c_t(i) = \alpha c_{t-1}(i-1), \quad \text{if } i \neq 0. \tag{23} $$
Finally, only the input and forget gates for the topmost
element are learned; all others just copy the values of the
topmost element's gates,
$$ g_{i,t}(i) = g_{i,t}(0), \tag{24} $$
$$ g_{f,t}(i) = g_{f,t}(0). \tag{25} $$
Since Eq. (21) is a special case of Eq. (18), Eq. (22) and Eq. (23)
can be seen as a special case of Eq. (15), and Eq. (24) and Eq. (25)
can be seen as a special case of Eq. (19) and Eq. (20), the
neural stack can be treated as a special case of the neural RAM.
Here $a_t(0)$ works as the output gate, and $g_i(0)$, $\alpha g_i(0)$ and $g_f(0)$
work as the push, pop and no-operation signals respectively.
2) From Neural Stack to LSTM: According to Eq. (6) and
Eq. (12), the dynamics of the neural stack have a form similar
to LSTM except for the read vector, i.e., the read vector
is $r_t = g_o s_t(0)$. If we set the pop signal to zero, $d_t^{\text{pop}} = 0$,
and no operation is available on the stack contents other than the topmost
element, then
$$ s_t(0) = d_t^{\text{push}} c + d_t^{\text{no-op}} s_{t-1}(0), $$
$$ s_t(i) = 0, \quad \text{if } i \neq 0. $$
Since $d_t^{\text{push}}$ and $d_t^{\text{no-op}}$ are calculated in the same way as the
input gate $g_{i,t}$ and forget gate $g_{f,t}$ in LSTM, as
shown in Eq. (7) and Eq. (8), $d_t^{\text{push}}$ can be seen as the input gate
and $d_t^{\text{no-op}}$ as the forget gate. This
is exactly how LSTM organizes its memory. Hence,
LSTM can be seen as a special case of the neural
stack.
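The reduction can be checked numerically with a short sketch: when the pop signal is zero, the top-of-stack update of Eq. (11) is the same expression as the LSTM memory update of Eq. (4), with the push signal acting as the input gate and the no-op signal acting as the forget gate. The function names and the random test vectors are illustrative assumptions.

```python
import numpy as np

def stack_top_update(s_top_prev, s_below_prev, c, d_push, d_pop, d_noop):
    """Top-of-stack update, Eq. (11)."""
    return d_push * c + d_pop * s_below_prev + d_noop * s_top_prev

def lstm_memory_update(m_prev, c, g_i, g_f):
    """LSTM long-term memory update, Eq. (4)."""
    return g_i * c + g_f * m_prev

rng = np.random.default_rng(1)
c, top, below = rng.normal(size=(3, 4))   # candidate, old top, old second element
d_push, d_noop = 0.7, 0.4                 # arbitrary signal values in [0, 1]

# With d_pop = 0 the element below the top never influences the result.
stack_top = stack_top_update(top, below, c, d_push, 0.0, d_noop)
lstm_mem = lstm_memory_update(top, c, g_i=d_push, g_f=d_noop)
```

Here `stack_top` and `lstm_mem` coincide for any choice of `below`, which is the sense in which the restricted stack behaves like a single LSTM memory slot.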
3) From LSTM to RNN: Compared to the RNN, LSTM introduces
an external memory and the gate operation mechanism.
So if we set the output gate $g_o = 1$, the input gate $g_i = 1$ and the
forget gate $g_f = 0$ instead of learning them from the sequences, the
dynamics of LSTM degrade to those of the RNN as follows,
\begin{align}
h_t &= t(w_{xh}^T x_t + w_{rh}^T g_o r_t + b_h) \tag{26} \\
    &= t[w_{xh}^T x_t + w_{rh}^T r_t + b_h] \tag{27} \\
    &= t[w_{xh}^T x_t + w_{rh}^T m_t + b_h] \notag \\
    &= t[w_{xh}^T x_t + w_{rh}^T (g_i c_t + g_f m_{t-1}) + b_h] \notag \\
    &= t(w_{xh}^T x_t + w_{rh}^T c_t + b_h) \tag{28} \\
    &= t[w_{xh}^T x_t + w_{rh}^T t_1(w_{hc}^T h_{t-1} + b_c) + b_h] \tag{29} \\
    &= t(w_{xh}^T x_t + w_{rh}^T h_{t-1} + b_h) \tag{30}
\end{align}
Here (27) is due to $g_o = 1$ (the output gate passes the read vector
unchanged), (28) is due to $g_i = 1$ and
$g_f = 0$, and (30) holds because the weight $w_{hc}$ and bias $b_c$ are set
to constants and the activation function $t_1(x)$ is set to the linear
activation function,
$$ w_{hc} = I, \quad b_c = 0, \quad t_1(x) = x. $$
Since Eq. (26) is the dynamics of LSTM and Eq. (30) is the
dynamics of the RNN (with $w_{rh}$ playing the role of $w_{hh}$ in Eq. (1)),
the argument that the RNN is a special case of
LSTM is proved.
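The chain from Eq. (26) to Eq. (30) can also be verified numerically. The sketch below assumes the output gate is fully open ($g_o = 1$, so that the read term passes through unchanged as the step from (26) to (27) requires), together with $g_i = 1$, $g_f = 0$, $w_{hc} = I$, $b_c = 0$ and a linear candidate activation. The function names, sizes and the choice of tanh for $t$ are illustrative assumptions.

```python
import numpy as np

def lstm_hidden(x_t, h_prev, m_prev, w_xh, w_rh, w_hc, b_c, b_h,
                g_i, g_f, g_o, t=np.tanh, t1=np.tanh):
    """Hidden state of the LSTM as written in Eq. (3)-(6)."""
    c_t = t1(w_hc.T @ h_prev + b_c)
    m_t = g_i * c_t + g_f * m_prev
    r_t = m_t
    return t(w_xh.T @ x_t + w_rh.T @ (g_o * r_t) + b_h)

def rnn_hidden(x_t, h_prev, w_xh, w_hh, b_h, t=np.tanh):
    """Hidden state of the vanilla RNN, Eq. (1)."""
    return t(w_xh.T @ x_t + w_hh.T @ h_prev + b_h)

rng = np.random.default_rng(2)
K_i, K_h = 3, 4                            # here the memory size N equals K_h
x_t = rng.normal(size=K_i)
h_prev = rng.normal(size=K_h)
m_prev = rng.normal(size=K_h)              # ignored once g_f = 0
w_xh = rng.normal(size=(K_i, K_h))
w_rh = rng.normal(size=(K_h, K_h))
b_h = rng.normal(size=K_h)

h_lstm = lstm_hidden(x_t, h_prev, m_prev, w_xh, w_rh,
                     w_hc=np.eye(K_h), b_c=np.zeros(K_h), b_h=b_h,
                     g_i=1.0, g_f=0.0, g_o=1.0, t1=lambda v: v)
h_rnn = rnn_hidden(x_t, h_prev, w_xh, w_rh, b_h)   # w_rh plays the role of w_hh
```

Under these settings `h_lstm` and `h_rnn` agree to floating-point precision, so the constrained LSTM computes exactly the vanilla RNN recursion.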
Remark: From the derivation we can conclude that
the innovation of LSTM is the incorporation of an external
memory and three gates to balance the external memory and
the internal memory; the innovation of the neural stack is to extend
one external memory to several external memories and to
propose a method to visit the memory slots in a certain order;
the innovation of the neural RAM is to remove the constraint on
the memory visiting order, which means any memory slot can
be visited at any time.
C. Memory Space Visualization
The analysis in this section ignores the influence of the input.
Fig. 7 shows the state transition diagram of the RNN, where
$s_0, s_1, \ldots, s_4$ represent the states at times $t_0, t_1, \ldots, t_4$
respectively. The blue arrows show the dependencies between
variables. For example, state $s_1$ is decided by $s_0$, $s_2$ is decided
by $s_1$, and so on. The vanilla RNN makes a hidden Markov
assumption about its input sequences; in other words, the current
state can be decided if the previous state is given. Thus, for
Markov sequences, the RNN always performs well. However,
for many of the sequences we need to deal with, the Markov
assumption is not valid. In this situation, the past memory
helps a lot when we need to decide what the next state is.
LSTM is the first architecture to take this memory into
consideration.
In Fig. 8, a blue belt named $M_0$ is used to save previous
memory: at time $t_0$, memory $M_{00}$ is generated and saved; at time
$t_2$, $M_{00}$ is updated to $M_{01}$; and at time $t_4$, $M_{01}$ is updated
to $M_{02}$. Every state is decided by its immediately previous state and
some older memory. The weighting between these two kinds of
past information is decided by the input and forget gates. For
example, $s_2$ is decided by $s_1$ and $M_{00}$, and $s_4$ is decided by $s_3$
and $M_{01}$. A property of this memory is that the older
content is forgotten after it is updated. For instance, at time $t_2$, when
the memory is updated from $M_{00}$ to $M_{01}$, $M_{00}$ is forgotten.
Thus, the future states $s_3, s_4, s_5, \ldots$ do not have access
to memory $M_{00}$. Because of this property, this architecture
is extremely useful when the previous states do not need to
be addressed again after they are updated. The capability of
LSTM is greater than that of the vanilla RNN. In fact, the RNN is an LSTM
without the memory belt. In other words, LSTM is an RNN if
the previous memory is not used (the forget gate is 0 and the input
gate is 1), as shown in the green dashed block in Fig. 8.
A more advanced memory neural network is the neural
stack. It should be mentioned that, in the current literature,
there are no forget and input gates embedded in the neural stack
structure, which makes it work worse for some specific tasks.
However, these gates can be added to the network in the same way as in
LSTM. The push and pop operations provide a way to save and
address the previous memory, as shown in Fig. 9. Let us assume
the network first saves state $M_{00}$ in belt $M_0$ and updates it to
$M_{01}$. At time $t_1$, instead of replacing $M_{01}$ with a new state
$M_{10}$, a new belt $M_1$ is created to save $M_{10}$. In this way,
both $M_{01}$ and $M_{10}$ are kept. Similarly, at time $t_5$, $M_{20}$ is
saved in another belt $M_2$. At time $t_5$, the content of the stack
is $M_{01}$, $M_{10}$, $M_{20}$, and $M_{20}$ is the topmost element. Since
all the useful past states are saved, they can be addressed in
the future. However, although the network can go back to previous
memory, it has two constraints. Firstly, it cannot jump to an arbitrary
memory position; the previous memory must be addressed
and updated sequentially. For example, as shown in the second
line of Fig. 9, if we want to go back to the memory in belt $M_1$,
we have to go past the memory in belt $M_2$ first. Secondly, each
memory can only be accessed twice; in other words, after the
memory content is popped off the stack, it is forgotten.
For example, at time $t_{14}$, the memory in belt $M_2$ is popped,
so in the future time steps shown in the third line, the content
of belt $M_2$ cannot be accessed or updated any more.
LSTM can be seen as a special case of the neural stack
if only the push operation is allowed, as shown in the green
dashed block in Fig. 9. Since all the contents of the stack below
the topmost element will never be addressed, one belt is
enough. Hence, the stack can be squeezed to length 1, as shown in
Fig. 9(b), which has exactly the same structure as LSTM.
From the state transition analysis above, we can conclude
that for tasks where the previous memory needs
to be addressed sequentially and at most twice, the neural stack
is our first choice.
The most powerful memory access architecture in our
hierarchy is the neural RAM. Unlike the neural stack,
this kind of network saves all the previous memory
and can access any of it. There is no requirement on
the order of saving, updating and accessing the memory. For
example, in Fig. 10(a), at time $t_0$, memory $M_{00}$ is saved in
belt $M_0$; at time $t_1$, the network can directly jump to belt $M_2$. The
neural RAM can be degraded to the stack if the order
of memory saving and accessing is restricted, as shown in Fig.
10(b). Hence, it can be further degraded to LSTM, as shown
in Fig. 10(c).
Figure 7. Vanilla RNN
Figure 8. LSTM
Figure 9. Neural Stack
Figure 10. Neural RAM
All in all, since the neural RAM can be degraded to the neural
stack, the neural stack can be degraded to LSTM, and LSTM can be
degraded to the RNN, our inclusion hierarchy is valid.
D. Architecture Verification
In order to verify the proposed taxonomy, four types of
tasks using strings of characters, ordered from easy to difficult, are
developed: counting, counting with interference, reversing and
repeat copying.
For the counting task, the input is a sequence of symbols
and the output sequence is the running count of the symbol a. For instance,
when the input sequence is aaabcaa, the output sequence
is 1233345. For this kind of sequence, a state variable
is needed to remember the number of a's. As long as the
network has a feedback loop, the counting can be achieved.
Hence, the vanilla RNN is the best among all the memory networks
(in terms of error rate, all of them have very small errors after
training for long enough; in terms of training speed, the RNN
outperforms all other methods since it uses the fewest resources).
This experiment also shows that the claim “LSTM is always
better than RNN” is not correct.
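For concreteness, the target sequence of the counting task can be generated with a few lines of code. The function name `counting_targets` is a hypothetical helper; only the example string and its expected output come from the text above.

```python
def counting_targets(seq, symbol="a"):
    """Running count of `symbol`; other symbols leave the count unchanged."""
    count = 0
    targets = []
    for ch in seq:
        if ch == symbol:
            count += 1
        targets.append(count)
    return targets

targets = counting_targets("aaabcaa")
as_string = "".join(str(v) for v in targets)   # "1233345", as in the example above
```

The generator needs only one integer of state, which mirrors the argument that a single feedback loop is enough for this task.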
For the counting with interference task, the input sequence
is a mixture of a, b and c. We still want to count the number
of a's, but if the input is b or c, the output should also be b or
c. For example, if the input is aabbaca, the output sequence
is 12bb3c4. Assuming we have three neurons in both the input
layer and the output layer, the input and output sequences
after vector coding are:
time step   input sequence   output sequence
1           [1 0 0]          [1 0 0]
2           [1 0 0]          [2 0 0]
3           [0 1 0]          [0 1 0]
4           [0 1 0]          [0 1 0]
5           [1 0 0]          [3 0 0]
6           [0 0 1]          [0 0 1]
7           [1 0 0]          [4 0 0]
For this kind of problem, an external memory cache is
required, because when b or c is encountered, the hidden
layer's output value is overwritten. If we want to recall
the number of a's, it needs to be saved in an
external memory for future use. The memory bank $m$ in
LSTM, the neural stack and the neural RAM can work as this external
memory. Hence, all of them except the vanilla RNN can complete
the counting with interference task.
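The vector coding used in the table can be reproduced with the sketch below. The function name `interference_targets` is hypothetical; the encoding (the count in the first component for a, one-hot vectors for b and c) follows the table in the text.

```python
def interference_targets(seq):
    """Return the output symbols and their 3-dimensional vector coding."""
    count = 0
    symbols, vectors = [], []
    for ch in seq:
        if ch == "a":
            count += 1                     # the count must survive b and c
            symbols.append(str(count))
            vectors.append([count, 0, 0])
        elif ch == "b":
            symbols.append("b")
            vectors.append([0, 1, 0])
        else:                              # ch == "c"
            symbols.append("c")
            vectors.append([0, 0, 1])
    return symbols, vectors

symbols, vectors = interference_targets("aabbaca")
output_string = "".join(symbols)           # "12bb3c4", as in the example above
```

Note that `count` plays the role of the external memory here: it is left untouched while b and c are being echoed to the output.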
The third task is sequence reversing. For example, if the
input sequence is abacdeδ−−−−−−, the output sequence
should be −−−−−−−edcaba, where δ is the delimiter symbol and
− means any symbol. When δ is encountered in the input
sequence, no matter what the following symbols are, the output
is the input symbols before δ in reverse order. For
this task, all the useful past information should be stored and
then retrieved in reverse order. Hence, the memory should
have the ability to store more than one content, and the read
order is related to the write order. Since the RNN does not have
such a memory bank and LSTM's memory is forgotten after
it is updated, these two networks fail at this task. On the
other hand, since both the neural stack and the neural RAM can save more
than one content and the task satisfies the “first in, last out”
principle, they can solve this task.
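The first in, last out behaviour required by the reversing task can be made explicit with an ordinary (non-neural) stack. This is only a sketch of the target behaviour, not of the trained network: the function name is hypothetical, and the ASCII character "-" stands in for the "any symbol" placeholder used in the text.

```python
def reverse_targets(seq, delim="δ", blank="-"):
    """Emit blanks up to and including the delimiter, then pop the stored symbols."""
    stack = []
    out = []
    seen_delim = False
    for ch in seq:
        if not seen_delim:
            out.append(blank)
            if ch == delim:
                seen_delim = True
            else:
                stack.append(ch)           # push every symbol before the delimiter
        else:
            out.append(stack.pop() if stack else blank)   # pop in reverse order
    return "".join(out)

result = reverse_targets("abacdeδ" + "-" * 6)
```

Each stored symbol is written once and read once, so the pop operation losing the content afterwards does no harm in this task.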
Figure 11. Learning rate comparison for counting task
Figure 12. Vanilla RNN: Internal memory content
The last task adopted here to verify the capability of the
networks is the repeat copying task, in which the
output sequence repeats the input
sequence several times. For example, if the input sequence is adbcδ3 −
− − − − − − − − − − − − − , the output should be
− − − − − − −adbcadbcadbcaε. ε is the end symbol. That
is, when the symbol δ is encountered, the output is
the previous input sequence repeated three times. For this kind of
task, not only must more than one past content be saved, but the contents
must also be retrieved more than once (here, three times).
In the neural stack, all saved information is forgotten
after being popped out, so it cannot be revisited. Thus,
neural RAM is the only network that can handle this kind of
task.
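The difference between destructive pops and non-destructive addressed reads can be made concrete with plain Python lists. This is only an illustration of the argument, with hard (non-differentiable) memory operations; both function names are introduced here and are not from the paper.

```python
def repeat_from_stack(symbols, repeats):
    """Try to repeat by popping: popped items are gone after one pass."""
    stack = list(symbols)
    out = []
    for _ in range(repeats):
        while stack:
            out.append(stack.pop())  # destructive read
    return "".join(out)


def repeat_from_ram(symbols, repeats):
    """Repeat by addressed reads: the memory slots are left intact."""
    memory = list(symbols)  # one symbol per addressable slot
    out = []
    for _ in range(repeats):
        for address in range(len(memory)):
            out.append(memory[address])  # non-destructive read
    return "".join(out)


print(repeat_from_stack("adbc", 3))
print(repeat_from_ram("adbc", 3))
```

The stack version yields a single pass (and in reversed order), whereas the addressed memory can be traversed as many times as required.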
These four synthetic tasks are good examples to show
how the memory networks operate on their respective memories
to achieve a certain goal. The details of the memory
working mechanism are shown in the simulation results in Section
IV.
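The capability claims made for the four tasks can be collected in a small lookup table, which also makes the nesting of the taxonomy explicit. The sketch below only restates the analysis of this section; the names `SOLVED_BY` and `tasks_solved` are hypothetical.

```python
# Capability claims from the analysis above, written as a lookup table.
NETWORKS = ["vanilla RNN", "LSTM", "neural stack", "neural RAM"]

SOLVED_BY = {
    "counting": set(NETWORKS),
    "counting with interference": {"LSTM", "neural stack", "neural RAM"},
    "reversing": {"neural stack", "neural RAM"},
    "repeat copying": {"neural RAM"},
}


def tasks_solved(network):
    """Return the set of tasks the given network is claimed to solve."""
    return {task for task, nets in SOLVED_BY.items() if network in nets}


# Each network solves every task solved by the one before it in the chain.
for previous, current in zip(NETWORKS, NETWORKS[1:]):
    assert tasks_solved(previous) <= tasks_solved(current)

for network in NETWORKS:
    print(network, sorted(tasks_solved(network)))
```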
IV. EXPERIMENT
A. Synthetic data
In this section, preliminary experimental results are
presented on four synthetic symbol sequences. The goal is to
test the capability of different memory networks and show
how they organize their memories. For all the experiments,
four models are compared: vanilla RNN, LSTM, neural stack and
neural RAM.
Figure 13. LSTM: External memory content
Figure 14. Neural stack: stack content
1) Counting: From the analysis in Section III, as long as
the network has a feedback loop to introduce memory of the
past, it can count. In this experiment, sequences of the symbols
a, b and c are tested. The symbols are fed into the network
one at a time as one-hot encoded input vectors, and the
networks are trained with BPTT.
In vanilla RNN, the activation function in the hidden layer is
ReLU and the activation function in the output layer is sigmoid.
In LSTM, the external memory’s contents are initialized to
zero. In the neural stack, the push, pop and no-op operations
are initialized randomly with mean 0 and variance 1. At first,
there is only one content in the stack, which is initialized to
zero. The depth of the stack can increase to any number as
required. In neural RAM, the word size and memory depth
are set to 3. The lengths of the read and write vectors are also set
to 3. In LSTM, neural stack and neural RAM, the nonlinear
activation functions for all the gates are sigmoid functions and
the others are tanh. The numbers of input neurons, hidden neurons
and output neurons are all 3. All the weights are initialized as
random variables with mean 0 and variance 1, and all the biases are
initialized to 0.1.
The models are trained with synthetic sequences of length up to
20. When the input is a, the first element of the
output vector is increased by one; otherwise, the output vector is
unchanged.
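The claim that a single feedback loop is enough to count can be checked with a hand-set recurrent unit. The sketch below is an assumption-laden illustration rather than the trained network: the weights are fixed by hand instead of learned with BPTT, the hidden state is a single ReLU unit, and the sigmoid output layer is omitted.

```python
def one_hot(symbol, alphabet="abc"):
    """Return a one-hot list for `symbol` over `alphabet`."""
    return [1.0 if s == symbol else 0.0 for s in alphabet]


def relu(value):
    return max(0.0, value)


def rnn_count(sequence, w_in=(1.0, 0.0, 0.0), w_rec=1.0):
    """Run h_t = relu(w_rec * h_{t-1} + w_in . x_t) and return all h_t.

    With w_rec = 1 the previous state is carried over unchanged, and
    w_in adds 1 only when the input is the symbol 'a'.
    """
    h = 0.0  # assumed initial hidden state
    trace = []
    for symbol in sequence:
        x = one_hot(symbol)
        h = relu(w_rec * h + sum(w * xi for w, xi in zip(w_in, x)))
        trace.append(h)
    return trace


print(rnn_count("bbacacbabababcc"))
```

With these weights the hidden value steps up by one on every a and is held on b and c, which is the pattern the counting target asks for.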
The learning curve measured in mean square error (MSE)
at the output layer is shown in Fig.11. From the results we can
see that, after 1000 training sequences, the errors of all four models
are less than 0.1. Fig.12 to Fig.15 show the details of
the memory contents after the models are well trained. They
are tested on the input sequence bbacacbabababcc. Fig.12 shows
that when a is received, the first element of the hidden layer is
increased by 1. Fig.13 shows that when receiving a, the first
element of the memory is decreased by 0.3; the second and
third elements have a similar pattern, but the step is not
exactly constant. However, as long as at least one
element in the memory learns the pattern, after multiplying
with the weight vector, the output of the network can give the
expected values. Fig.14 shows how the neural stack uses its
memory. Although the neural stack has the potential to use
an unbounded number of stack contents, it only uses the topmost
content here, i.e. the push and no-op operations cooperate to learn
the pattern. Fig.15 shows the memory contents of the neural
RAM and its corresponding operations. From Fig.15(a),
we can see that all three memory banks learn the
pattern; hence, all three elements of the read vector are
around 0.3, as shown in Fig.15(b). From this experiment we
can see that two memory banks are redundant here.
Figure 15. Neural RAM: Memory bank content and corresponding read and
write operation. (a) memory content, (b) read, (c) write
The goal of the counting experiment is to show, on
the one hand, the capability of the four memory networks
to remember one past state and, on the other hand,
the redundancy of the LSTM, neural stack and neural RAM,
i.e., the gate mechanism in LSTM, the unbounded number of
stack contents in neural stack, and the multiple memory
contents in neural RAM.
Figure 16. Learning rate comparison for counting with interference task
Figure 17. LSTM: External memory content
Figure 18. Neural stack: stack content
Figure 19. Neural RAM: Memory bank content and corresponding read and
write operation. (a) memory content, (b) read, (c) write
2) Counting with interference: The second experiment
tests the external memory, i.e., the capability of putting the
memory aside and using it when it is needed. This is the new
feature of LSTM, neural stack and neural RAM, implemented
by the gate mechanism. All the settings for this experiment
are the same as in the counting task except the desired response
when the input is b or c. In the counting task, the desired response
for b and c is the same as the one in the previous time step. Here,
the desired response is [0, 1, 0] when the input is b and [0, 0, 1]
when the input is c. In order to accomplish this task, the useful
memory of the past should be put aside and not disturbed.
In vanilla RNN, the only memory of the past is the
internal memory, which is overwritten when the input is b or
c, so vanilla RNN cannot learn the pattern. Fig.16 shows the
learning curves for the four networks: all the networks except
the vanilla RNN reach errors below 0.1 after enough training
samples, which is consistent with our analysis.
Fig.17 to Fig.19 show the memory usage of LSTM, neural
stack and neural RAM. Fig.17 shows that every time the symbol
a is input, the third element of the memory content
increases by around 0.2. Fig.18 and Fig.19 show similar
incremental patterns for neural stack and neural RAM. A
notable difference between Fig.17-Fig.18 and Fig.13-Fig.14
is the usage of the memory. In the counting task,
the output gates are always 1; in the
counting with interference task, however, the output gates are 0 when
the input is b or c, which helps to cut off the interference from the
memory. Similarly, the read vector is always around [0.3, 0.3,
0.3] in Fig.15, whereas in Fig.19 the read vector’s elements
are almost zero when encountering b and c. The read vector
here works like the output gates in LSTM and neural stack, which
also shows why neural RAM does not need an output gate.
Figure 20. Learning rate comparison for reversing task
Figure 21. Neural stack: stack content
All in all, the goal of this experiment is to show the effect
of the gate mechanism and the redundancy of the unbounded
stack content in neural stack and the multiple memory banks in
neural RAM.
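The role of the output gate described above can be illustrated with a hand-set gated memory cell. This is a simplified sketch, not the LSTM equations used in the experiments: the gates take hard values 0 or 1 instead of sigmoid outputs, the memory stores the raw count, and the b and c outputs are passed straight through from the input. The function name and its `use_output_gate` flag are assumptions made for the illustration.

```python
def gated_counter(sequence, use_output_gate=True):
    """Count 'a' in a protected memory and gate its exposure to the output."""
    memory = 0.0
    outputs = []
    for symbol in sequence:
        is_a = 1.0 if symbol == "a" else 0.0
        input_gate = is_a  # write to memory only when 'a' arrives
        output_gate = is_a if use_output_gate else 1.0
        memory += input_gate * 1.0  # b and c leave the memory untouched
        outputs.append([
            output_gate * memory,            # count, visible only if gated open
            1.0 if symbol == "b" else 0.0,   # pass-through for b
            1.0 if symbol == "c" else 0.0,   # pass-through for c
        ])
    return outputs


print(gated_counter("aabbaca"))
print(gated_counter("aab", use_output_gate=False))
```

With the gate closed on b and c the stored count does not leak into the output; with the gate forced open, the third step of the second call returns [2.0, 1.0, 0.0] instead of the desired [0, 1, 0].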
3) Reversing: For the sequence reversing problem, every
symbol in the first half of the sequence is randomly picked
from the set {a, b, c, d, e}, then a delimiter symbol δ follows,
and the second half of the sequence is the reverse of the first
half. The performance is measured by the error rate in output
entropy for the second half.
Figure 22. Neural RAM: Memory bank content and corresponding read and
write operation. (a) memory content, (b) read, (c) write
Figure 23. Learning rate comparison for repeat copying task
In this experiment, some settings are different from the first
two experiments. In vanilla RNN, the activation function in the
hidden layer is the sigmoid function since we use entropy instead
of mean square error as the cost function. In neural RAM,
the word size and memory depth are set to 16. The lengths
of the read and write vectors are also set to 16. The numbers of
input neurons, hidden neurons and output neurons are 6, 64 and
6. The models are trained with sequences of length up to 20. To
finish this task, all the input samples have to be saved in
memory and visited in reverse order. Hence networks
with no external memory, such as RNN, or with a single external
memory, such as LSTM, fail. Learning curves of the four networks are shown
in Fig.20. We can see that vanilla RNN and LSTM do not
have the capability of reversing the sequence no matter how
many samples are used for training. Fig.21 shows how the neural
stack utilizes its stack memory to solve this problem. Since
each memory bank’s word size is 16, we use colors
instead of specific numbers to show the values of the contents
in memory. Unlike in the first two tasks, the function of
the stack is finally exploited. In the first half of the sequence, the
input symbols are encoded as 16-element vectors and pushed
onto the stack. In the second half of the sequence, the contents
of the stack are popped out sequentially. It should be noticed
that once the contents are popped out, they cannot be
revisited. Unlike in the neural stack, the contents in
neural RAM are not wiped after being read, as shown in Fig.22. The contents
in the memory banks are only wiped if they are useless in the
future or if there are not enough memory banks, so that they have to
be wiped to make space for new contents. Another feature of
neural RAM is that the memory banks are
not necessarily used in sequential order such as M0, M1, M2, ... In this example, the
memory banks are used in the order M0, M2, M7, M13, ... But
as long as the network knows the writing order, the task can
be accomplished. Fig.22(b) and (c) show the reading and writing
weights. We can see that the second half of the reading weights
mirrors the first half of the writing
weights, which means the network learns to reverse.
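The observation that the write order need not be sequential, as long as the reads mirror it, can be sketched with a dictionary used as an addressable memory. This is an illustration with hard addresses rather than the soft read and write weightings of the trained network; the function name is hypothetical, and the slot order 0, 2, 7, 13 is taken from the example above.

```python
def reverse_with_ram(symbols, write_order):
    """Write each symbol to its own slot, then read the slots in mirror order."""
    memory = {}
    used = write_order[:len(symbols)]
    for symbol, address in zip(symbols, used):
        memory[address] = symbol  # write at an arbitrary, non-sequential slot
    read_order = list(reversed(used))  # reads mirror the writes
    reversed_symbols = "".join(memory[address] for address in read_order)
    return reversed_symbols, read_order


result, read_order = reverse_with_ram("abcd", [0, 2, 7, 13])
print(result, read_order)
```

Nothing is erased by the reads, so the same memory could be traversed again in any other order if the task required it.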
Overall, the goal of this task is to show the advantage of the
multiple memory banks and how the neural RAM and neural
stack organize their memory banks.
Figure 24. Neural RAM: Memory bank content and corresponding read and
write operation. (a) memory content, (b) read, (c) write
4) Repeat copying: The last and hardest problem we
implement is the repeat copying task. To accomplish
this task, the contents saved in the memory banks must not be
wiped after they are used once. Hence, neural RAM
is the only network type that can handle this problem. In
this experiment, the training sequences are composed of a
starting symbol, some symbols from the set {a, b, c, d, e} followed
by a repeating number symbol δ, and some random symbols.
The starting symbol and a, b, c, d, e are one-hot encoded with on value 1 and off
value 0; δ is encoded with on value n and off value 0, where n is
the repeating number. Fig.23 shows the learning curves for
the four networks and the fact that neural RAM is the only
network that can handle this problem. Fig.24 shows how the
neural RAM solves this problem. From the writing weights we
can see that the starting symbol is saved in M0, and the symbols
to be repeated are saved in M2, M4, M6 and M9. After
t=4, the network reads from M2/M5, M4, M6 and M9. At the
beginning of every loop, the network reads from both M2 and
M5, probably because the repeating number symbol δ is saved in
M5. The value in M5 can tell the network whether to continue
repeating or to output the ending symbol. We can see from
Fig.24(b) that at time t=22, after reading from M2 and M5, the
network stops reading from M4 to M9 and turns to M0.
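The read pattern described for Fig.24 can be written out as an explicit schedule of slot addresses. The sketch below is one possible reading of that description, not a trace of the trained network: it assumes hard addresses, assumes the counter slot is consulted once at the start of each loop, and ignores the exact time indices. The function and parameter names are hypothetical.

```python
def read_schedule(symbol_slots, counter_slot, start_slot, repeats):
    """Return the sequence of slots read, one tuple of addresses per step."""
    schedule = []
    for _ in range(repeats):
        # Start of a loop: read the first symbol together with the counter.
        schedule.append((symbol_slots[0], counter_slot))
        # Remaining symbols of the stored sequence, in write order.
        schedule.extend((slot,) for slot in symbol_slots[1:])
    # One more look at the counter, which now signals that repetition is over.
    schedule.append((symbol_slots[0], counter_slot))
    # Final step: turn to the slot holding the starting symbol.
    schedule.append((start_slot,))
    return schedule


schedule = read_schedule([2, 4, 6, 9], counter_slot=5, start_slot=0, repeats=3)
for step, slots in enumerate(schedule):
    print(step, ["M%d" % slot for slot in slots])
```

The symbol slots are visited three times without being rewritten, which is only possible because reads from the memory banks are non-destructive.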
The goal of this task is to test whether the networks can
operate their memory banks and show the advantage of neural
RAM compared to other memory networks.
B. Real world problem
In this section, we use two natural language processing
problems to show the different capabilities of these four kinds
of networks. These two examples also give some hints on
how to choose the right memory network according to the
specific task.
1) Sentiment Analysis: The first experiment is a sentiment
analysis problem: given a paragraph of
text, determine whether the emotional tone of the text is
negative or positive. For example, a review from the IMDb
movie review dataset with negative emotion is,
“Outlandish premise that rates low on plausibility and
unfortunately also struggles feebly to raise laughs or interest.
Only Hawn’s well-known charm allows it to skate by on very
thin ice. Goldie’s gotta be a contender for an actress who’s
done so much in her career with very little quality material at
her disposal.”
And a positive text is,
“I absolutely loved this movie. I bought it as soon as I
could find a copy of it. This movie had so much emotion,
and felt so real, I could really sympathize with the characters.
Every time I watch it, the ending makes me cry. I can really
identify with Busy Phillip’s character, and how I would feel
if the same thing had happened to me. I think that all high
schools should show this movie, maybe it will keep people
from wanting to do the same thing. I recommend this movie
to everybody and anybody. Especially those who have been
affected by any school shooting. It truly is one of the greatest
movies of all time.”
The output of the neural network should be [1, 0] for the first
paragraph and [0, 1] for the second paragraph. After encoding
all the words into vectors, they are fed into the network one by
one. The decision on the tone of the paragraph is made
at the end of the paragraph. Here we use a pretrained model,
GloVe [25], to create our word vectors. The matrix contains
400,000 word vectors, each with a dimensionality of 50. The
matrix is created in a way that words having similar definitions
or context reside in relatively the same position in the vector
space. The dataset adopted here is the IMDb movie review data,
which has 12500 positive reviews and 12500 negative reviews.
Here we use 11500 reviews for training and 1000 for
testing. In neural RAM, the word size and memory depth are
set to 64. The numbers of read and write heads are 4 and 1. In
LSTM, neural stack and neural RAM, the nonlinear activation
functions for all the gates are sigmoid. The activation function
at the output layer is sigmoid and the others are tanh. The numbers
of input neurons, hidden neurons and output neurons are 50,
64 and 2.
Table I
ERROR RATE FOR MOVIE REVIEW
             vanilla RNN   LSTM     neural stack   neural RAM
error rate   31±5          19±2.5   23±10          20±9
In order to judge the emotional tone of the text at the end,
an external memory whose value is affected by some
key words is useful. Since the goal here is to classify the
emotional tone as either 1 or 0, the specific contents are not
very important, so there is no need to store all of them.
Hence, multiple memory banks do not show advantages here. Since
LSTM has this external memory and needs less time to train
than neural stack and neural RAM, it should be the best
choice among these four kinds of networks for this task.
Table I shows the error rates averaged over 5 runs. We can see
that all the networks with external memories have similar
performance, which agrees with our analysis.
2) Question Answering: In this section, we investigate
the performance of these four networks on three question
answering tasks. The target is to give an answer after reading
a short story followed by a question. For example, the story is
“Mary got the milk there. John moved to the bedroom. Sandra
went back to the kitchen. Mary travelled to the hallway.”
And the question is, “Where is the milk?”. The machine
is expected to give the answer “hallway”. For this problem,
in order to give the right answer, the machine should memorize
the facts that Mary got the milk and travelled to the hallway.
Moreover, since the machine does not know the question
when reading the story, it has to store all the useful facts in
the story. Thus a large memory bank in which all the contents
can be visited is useful here. According to our analysis in Section
III, neural RAM should perform the best here.
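Why every fact has to be kept until the question arrives can be seen in a purely symbolic version of the example. The sketch below is not a neural model and not the bAbI evaluation code: it uses hand-written parsing rules, and the verb lists and function names are assumptions made for this illustration.

```python
MOVE_VERBS = {"moved", "went", "travelled", "journeyed"}  # assumed verb list
GRAB_VERBS = {"got", "took", "grabbed"}                   # assumed verb list


def read_story(story):
    """Store every fact, since the question is unknown while reading."""
    facts = []
    for sentence in story.split("."):
        words = [w for w in sentence.split() if w != "there"]
        if len(words) < 3:
            continue
        actor, verb, target = words[0], words[1], words[-1]
        if verb in MOVE_VERBS:
            facts.append(("at", actor, target))
        elif verb in GRAB_VERBS:
            facts.append(("has", actor, target))
    return facts


def where_is(item, facts):
    """Combine two stored facts: who holds the item, and where that person is."""
    holder = None
    for kind, actor, target in facts:
        if kind == "has" and target == item:
            holder = actor
    location = None
    for kind, actor, target in facts:
        if kind == "at" and actor == holder:
            location = target  # the latest move wins
    return location


story = ("Mary got the milk there. John moved to the bedroom. "
         "Sandra went back to the kitchen. Mary travelled to the hallway.")
facts = read_story(story)
print(where_is("milk", facts))
```

Answering the question requires revisiting two separate entries of the stored list, which corresponds to reading several memory banks that were written at different times.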
In order to verify our conjecture, we test these four networks
on three tasks from the bAbI dataset [26]. For each task, we use
the 10,000 questions to train and report the error rates on
the test set in Table II. In vanilla RNN and neural stack, the
nonlinear activation functions for all the gates are sigmoid. The
activation function at the output layer is sigmoid and the others
are tanh. The numbers of input neurons, hidden neurons and
output neurons are 150, 64 and 150. The experimental settings for
LSTM and neural RAM are the same as in [20], and the results
for these two networks are taken from [20]. From the results, we
can see that neural RAM achieves the best performance. One
thing to be mentioned here is that, although the mean error rate
of the neural RAM is the lowest, its variance is larger than
that of all the others. We believe the reason for this is the complexity of
the network, which leads to too many local minima.
Since the point here is to check the capabilities of different
neural memory networks, a better way to utilize the external
memory and train the network is left for future work.
Table II
ERROR RATE FOR THREE TASKS FROM BABI TASKS
Task                 vanilla RNN   LSTM       neural stack   neural RAM
1 supporting fact    52±1.5        28.4±1.5   41±2.0         9.0±12.6
2 supporting facts   79±2.5        56.0±1.5   75±6           39.2±20.5
3 supporting facts   85±2.5        51.3±1.4   78±6.4         39.6±16.4
From these two examples, we can see that the taxonomy
proposed in this paper can help us analyze the properties
of a specific problem and choose the right neural network.
However, users still have to analyze the problem by themselves.
A data-driven method to analyze the tasks and choose the
appropriate network is our next step.
V. CONCLUSION AND FUTURE WORK
In this paper, we propose a taxonomy based on the state
structure for the recently proposed memory networks. The
taxonomy is proved mathematically and verified with simple
synthetic sequences. However, this work only analyzes what
tasks these networks can or cannot do; the next step is to
analyze the performance of these networks and explore
methods to improve the memory utilization efficiency. How to
use this taxonomy to design an appropriate network for a
real-world problem is our future work.
REFERENCES
[1] Joaquin Fuster. The prefrontal cortex. Academic Press, 2015.
[2] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro
Shikano, and Kevin J Lang. Phoneme recognition using time-delay
neural networks. In Readings in speech recognition, pages 393–404.
Elsevier, 1990.
[3] Bert De Vries and Jose C Principe. The gamma model - a new neural
model for temporal processing. Neural networks, 5(4):565–576, 1992.
[4] Irwin W Sandberg and Lilian Xu. Uniform approximation of multidi-
mensional myopic maps. IEEE Transactions on Circuits and Systems I:
Fundamental Theory and Applications, 44(6):477–500, 1997.
[5] Jeffrey L Elman. Finding structure in time. Cognitive science,
14(2):179–211, 1990.
[6] Michael I. Jordan. Attractor dynamics and parallelism in a connectionist
sequential machine. In Proceedings of the Eighth Annual Conference of
the Cognitive Science Society, pages 531–546. Hillsdale, NJ: Erlbaum,
1986.
[7] Simon Haykin and Jose Principe. Making sense of a complex world
[chaotic events modeling]. IEEE Signal Processing Magazine, 15(3):66–
81, 1998.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
[9] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and
Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE transactions
on neural networks and learning systems, 2017.
[10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078, 2014.
[11] Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and
count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the
IEEE-INNS-ENNS International Joint Conference on, volume 3, pages
189–194. IEEE, 2000.
[12] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence
learning with neural networks. In Advances in neural information
processing systems, pages 3104–3112, 2014.
[13] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive
coding networks for video prediction and unsupervised learning. arXiv
preprint arXiv:1605.08104, 2016.
[14] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsu-
pervised learning of video representations using lstms. In International
conference on machine learning, pages 843–852, 2015.
[15] Guo-Zheng Sun. Learning context-free grammar with enhanced neural
network pushdown automaton. In Grammatical Inference: Theory,
Applications and Alternatives, IEE Colloquium on, pages P6–1. IET,
1993.
[16] GZ Sun, C Lee Giles, HH Chen, and YC Lee. The neural network
pushdown automaton: Model, stack and learning simulations. arXiv
preprint arXiv:1711.05738, 2017.
[17] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with
stack-augmented recurrent nets. In Advances in neural information
processing systems, pages 190–198, 2015.
[18] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil
Blunsom. Learning to transduce with unbounded memory. In Advances
in Neural Information Processing Systems, pages 1828–1836, 2015.
[19] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.
arXiv preprint arXiv:1410.5401, 2014.
[20] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka,
Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward
Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing
using a neural network with dynamic external memory. Nature,
538(7626):471–476, 2016.
[21] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks.
CoRR, abs/1410.3916, 2014.
[22] Paul J Werbos. Backpropagation through time: what it does and how to
do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[23] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty
of training recurrent neural networks. In International Conference on
Machine Learning, pages 1310–1318, 2013.
[24] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533,
1986.
[25] Jeffrey Pennington, Richard Socher, and Christopher D. Manning.
Glove: Global vectors for word representation. In Empirical Methods
in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[26] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart
van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-
complete question answering: A set of prerequisite toy tasks. arXiv
preprint arXiv:1502.05698, 2015.
[27] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical
exploration of recurrent network architectures. In Proceedings of the
32nd International Conference on Machine Learning (ICML-15), pages
2342–2350, 2015.
[28] Michael Sipser. Introduction to the Theory of Computation, volume 2.
Thomson Course Technology Boston, 2006.
[29] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learn-
ing internal representations by error propagation. Technical report,
California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
