Analog computation via neural networks  by Siegelmann, Hava T & Sontag, Eduardo D
Theoretical Computer Science 131 (1994) 331-360 
Elsevier 
Analog computation via neural 
networks* 
Hava T. Siegelmann 
Department of Mathematics and Computer Science, Bar-Ilan University, Ramat-gan, Israel
Eduardo D. Sontag 
Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, USA 
Communicated by Eli Shamir 
Received October 1992 
Revised June 1993 
Abstract 
Siegelmann, H.T. and E.D. Sontag, Analog computation via neural networks, Theoretical Computer 
Science 131 (1994) 331-360. 
We pursue a particular approach to analog computation, based on dynamical systems of the type 
used in neural networks research. 
Our systems have a fixed structure, invariant in time, corresponding to an unchanging number of 
“neurons”. If allowed exponential time for computation, they turn out to have unbounded power. 
However, under polynomial-time constraints there are limits on their capabilities, though being 
more powerful than Turing machines. (A similar but more restricted model was shown to be 
polynomial-time equivalent to classical digital computation in the previous work (Siegelmann and 
Sontag, 1992)) Moreover, there is a precise correspondence between nets and standard nonuniform 
circuits with equivalent resources, and as a consequence one has lower bound constraints on what 
they can compute. This relationship is perhaps surprising since our analog devices do not change in 
any manner with input size. 
We note that these networks are not likely to solve polynomially NP-hard problems, as the 
equality “P=NP” in our model implies the almost complete collapse of the standard polynomial
hierarchy. 
In contrast to classical computational models, the models studied here exhibit at least some 
robustness with respect to noise and implementation errors. 
Correspondence to: H.T. Siegelmann, Department of Mathematics and Computer Science, Bar-Ilan
University, Ramat-gan, Israel. Email: hava@sunlight.cs.biu.ac.il; sontag@control.rutgers.edu.
*This research was supported in part by US Air Force Grant AFOSR-91-0343. 
0304-3975/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0304-3975(93)E0165-Z 
1. Introduction 
“Neural networks” have attracted much attention lately as models of analog 
computation. Such nets consist of a finite number of simple processors, each of which
computes a scalar - real-valued, not binary - function of an integrated input. This 
scalar function, or “activation”, is meant to reflect the graded response of biological 
neurons to the net sum of excitatory and inhibitory inputs affecting them. The 
existence of feedback loops in the interconnection graph gives rise to a dynamical 
system. In this paper, we introduce a mathematical model for such recurrent neural 
networks, and we study their computational abilities. 
1.1. Main results 
We focus on recurrent neural networks. In these networks, the activation of each 
processor is updated according to a certain type of piecewise affine function of the 
activations ($x_j$) and inputs ($u_j$) at the previous instant, with real coefficients - also
called weights - ($a_{ij}$, $b_{ij}$, $c_i$). Each processor’s state is updated by an equation of the
type 
$$x_i(t+1) = \sigma\Big( \sum_{j=1}^{N} a_{ij} x_j(t) + \sum_{j=1}^{M} b_{ij} u_j(t) + c_i \Big), \quad i = 1, \ldots, N, \qquad (1)$$
where N is the number of processors and M is the number of external input signals. 
The function $\sigma$ is the simplest possible “sigmoid”, namely the saturated-linear
function:
$$\sigma(x) := \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \qquad (2)$$
We will give later a precise definition of language acceptance for these computational 
models. 
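To make the update rule concrete, the following minimal Python sketch performs one synchronous step of equation (1) with the activation of equation (2). The names `sigma` and `step`, the list-of-lists layout of the weights, and the particular two-processor weights are illustrative assumptions and are not taken from the paper.

```python
def sigma(x):
    """Saturated-linear activation of equation (2)."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x


def step(A, B, c, x, u):
    """One synchronous update x(t) -> x(t+1) following equation (1).

    A is N x N, B is N x M, c has length N, x has length N, u has length M.
    """
    N, M = len(x), len(u)
    return [
        sigma(
            sum(A[i][j] * x[j] for j in range(N))
            + sum(B[i][j] * u[j] for j in range(M))
            + c[i]
        )
        for i in range(N)
    ]


# Hypothetical weights for a network with N = 2 processors and M = 1 input.
A = [[0.5, 0.0], [1.0, 1.0]]
B = [[1.0], [0.0]]
c = [0.0, -0.5]

x1 = step(A, B, c, [0.0, 0.0], [1.0])   # an input pulse arrives
x2 = step(A, B, c, x1, [0.0])           # the input line is silent afterwards
```

All processors read the previous state `x` and write a fresh list, which is what makes the update synchronous.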
We prove that neural networks can recognize in polynomial time the same class of 
languages as those recognized by Turing machines that consult sparse oracles in 
polynomial time (the class P/poly); they can recognize all languages, including of 
course noncomputable ones, in exponential time. Furthermore, we show that almost 
every language requires exponential recognition time. (For simplicity, we give our 
main results in terms of recognition; it is also possible to provide a more general 
version regarding the computation of more general functions.) 
The proofs of the above results will be consequences of the following equivalence. 
For functions $T: \mathbb{N} \to \mathbb{N}$ and $S: \mathbb{N} \to \mathbb{N}$, let NET(T) be the class of all functions computed
by neural networks in time T(n) - that is, recognition of strings of length n is in time at 
most T(n) - and let CIRCUIT(S) be the class of functions computed by nonuniform 
families of circuits of size S(n) - that is, circuits for input vectors of length n have size at 
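most S(n). We show that if F is so that $F(n) \ge n$, then
$$\mathrm{NET}(F(n)) \subseteq \mathrm{CIRCUIT}(\mathrm{Poly}(F(n)))$$
and
$$\mathrm{CIRCUIT}(F(n)) \subseteq \mathrm{NET}(\mathrm{Poly}(F(n))).$$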
This equivalence will allow us to make use of results from the theory of (nonuniform) 
circuit complexity. 
As our model is highly homogeneous and extremely simple, one may suspect that it 
is weaker than other possible more complex models. For example, in many applications
of neural networks to language recognition, each neuron is allowed to compute
inside its sigma function a polynomial combination of its input values rather than
affine combinations only. Furthermore, in both applications and biologically plausible
models, the activation function is usually more complicated than the saturated-linear
function used in our model; for instance, one encounters the classical sigmoid
$1/(1 + e^{-x})$ or other activations.
We show that if one allows multiplications in addition to only linear operations in 
each neuron, that is, if one considers instead what are often called high-order neural 
nets, the computational power does not increase. Even further, and perhaps more 
surprising, no increase in computational power (up to polynomial time) can be 
achieved by letting the activation function be not necessarily the simple saturated-linear
one in equation (2) but any function which satisfies certain reasonable assumptions.
Also, no increase results even if the activation functions are not necessarily
identical in the different processors. 
One might ask about using such analog models, maybe high-order nets, to “solve” 
NP-hard problems in polynomial time. We introduce a nondeterministic model and 
show that the equality P=NP is unlikely in the nets model. 
The models used here have a weak property of “robustness” to noise and to 
implementation error, in the sense that small enough changes in the network would 
not affect the computation. The robustness includes changes in the precise form of the 
activation function, in the weights of the network, and even an error in the update. In 
classical models of (digital) computation, this type of robustness cannot even be 
properly defined. 
1.1.1. A previous related result
In our work [20], we showed that if one restricts attention to those nets all 
interconnection weights of which are rational numbers (rational nets), then one 
obtains a model of computation that is polynomially related to Turing machines. 
More precisely, given any multi-tape Turing machine, one can simulate it in real time 
by some network with rational weights, and of course the converse simulation in 
polynomial time is obvious. Here we are interested in the case when weights are 
arbitrary real numbers. (It turns out that, as far as the results given here, the existence 
of just one irrational weight is all that is needed.) 
1.2. The model 
The model we work with is that described by an iteration equation such as (1). 
For notational simplicity, we often summarize this equation, writing “$x^+(t)$” instead
of “$x(t+1)$” and then dropping arguments t; we also write this in vector form,
as
$$x^+ = \sigma(Ax + Bu + c), \qquad (3)$$
where x is now a vector of size N = number of processors, u is a vector of size M = 
number of inputs, c is an N-vector, and A and B are, respectively, real matrices of sizes 
$N \times N$ and $N \times M$. (Now $\sigma$ denotes application of $\sigma$ to each coordinate of its argument.) Of
course, one can drop the vector c from this description at the cost of adding 
a coordinate x0 = 1, but it is often useful to have c explicitly, and this allows us to take 
the initial state to be x = 0, which corresponds to the intuitive idea that the system is at
rest before the first input appears. 
As part of the description, we assume that we have singled out a subset of the 
N processors, say $x_{i_1}, \ldots, x_{i_p}$; these are the p output processors, and they are used to
communicate the outputs of the network to the environment. Thus a net is specified 
by the data (A,B,c) together with a subset of its nodes. 
In our further development, both input and output channels will be forced to carry 
only binary data. Input and output are streams, that is, one input letter is transferred 
at each time (via M binary lines) and one output letter is produced at a time (and 
appears in the output via p binary lines). As opposed to the I/O, the computations 
inside the network will in general involve continuous real values. 
We call a system defined by equations such as (3) simply a network or processor 
network. In the neural network literature, these are called recurrent first-order neural
nets. We show later that considering higher-order nets, those in which multiplications 
of activations and/or inputs are allowed, does not result in any gain in computational 
capabilities (up to a polynomial increase in time). 
1.2.1. The finite structure 
We should emphasize from the outset that our networks are built up of finitely many
processors, whose number does not increase with the length of the input. There is 
a small number of input channels (just two in our main result), into which inputs get 
presented sequentially. We assume that the structure of the network, including the 
values of the interconnection weights, does not change in time but rather remains 
constant. What changes in time are the activation values, or outputs of each processor,
which are used in the next iteration. (A synchronous update model is used.) In this
sense our model is very “uniform” in contrast with certain models used in the past, 
including those used in [9] or in the cellular automata literature, which allow the 
number of units to increase over time and often even the structure to change 
depending on the length of inputs being presented. 
1.2.2. The meaning of (noncomputable) real weights
One may ask about the meaning of real weights. In response, we recall that our 
intention is to model systems in which certain real numbers - corresponding to values 
of resistances, capacitances, physical constants, and so forth - may not be directly 
measurable, indeed may not even be computable real numbers, but they affect the 
“macroscopic” behavior of the system. For instance, imagine a spring/mass system. 
The dynamical behavior of this system is influenced by several real-valued constants, 
such as stiffness and friction coefficients. On any finite time interval, one could replace 
these constants by rational numbers, and the same qualitative behavior is observed, 
but the long-term characteristics of the system depend on the true values. We take 
this use of real numbers as a basic feature of analog computation. (Another characteristic
would be the use of differential as opposed to difference equations, but
technical difficulties make that further study harder, and we will defer it to future 
work.) 
What is interesting is to find a class of such systems which on the one hand is rich 
enough to exhibit behavior that is not captured by digital computation, while still 
being amenable to useful theoretical analysis, and in particular so that the imposition 
of resource constraints results in nontrivial reduction of computational power. That 
this is in accordance with models currently used in neural net studies is especially 
attractive. 
1.3. Previous work 
Many authors have reported successful applications when using neural networks 
for various computational tasks, including classification and optimization problems. 
Special-purpose analog chips are being built to implement these solutions directly in 
hardware; see for instance [1,6]. However, very little work has been done in the 
direction of exploring the ultimate capabilities of such devices from a theoretical 
standpoint. Part of the problem is that, much interesting work notwithstanding,
analog computation is hard to model, as difficult questions about precision of data 
and readout of results are immediately encountered - see for instance [21], and the 
many references therein. 
With the constraint of an unchanging structure, it is easy to see that classical 
McCulloch-Pitts - i.e. binary - neurons would have no more power than finite
automata, which is not an interesting situation from a theoretical complexity point of 
view. Therefore, and also because this is what is done in practical applications of 
neural nets, and because it provides a closer analogy to biological systems, we take 
our neurons to have a graded, analog, response. For mathematical tractability, we 
pick this response function to be the saturation function $\sigma$ defined in equation (2). This
is a “sigmoidal” nonlinearity; one could also develop a theory using instead of
$\sigma$ a differentiable function such as
$$1/(1 + e^{-x}),$$
but this presents technical difficulties which we prefer to avoid in this presentation. 
We show in Section 8 that sigmoidal networks are not more powerful, when considering
the discrete input-output convention, than networks with the saturated-linear
function. (One may ask about the capabilities of sigmoidal networks with specific 
activation functions such as the above. One step in understanding this issue, for 
first-order nets, was taken in [11]. On the other hand, if high-order nets are allowed,
such sigmoidal nets can be proved to have the same power as the ones considered in 
this paper.) 
It is important to note that graded responses, as opposed to a threshold-binary 
output, are more reasonable in models of computing devices, as it is not reasonable to 
assume that physical devices can discern clearly two values which are arbitrarily close. 
This continuity in behavior is a basic characteristic of our model. 
In [22], Wolpert studies a class of machines with just linear activation functions, 
and shows that this class is at least as powerful as any Turing machine (and clearly has 
super-Turing capabilities as well). It is essential in that model, however, that the 
number of “neurons” be allowed to be infinite - as a matter of fact, in [22] the number 
of such units is even uncountable - as the construction relies on using different 
neurons to encode different possible tape configurations in Turing machines. 
The work closest to ours seems to be that on real-number-based computation 
started by Blum, Shub and Smale (see e.g. [4]); we believe that our setup is far simpler, 
and is much more appropriate if one is interested in studying neural networks or 
distributed processor environments. In the related previous paper [12], there were 
different models for each input size, the model allowed for no loops, and the emphasis 
was on comparisons with similar models made up of binary processors. 
The remainder of this paper is organized as follows. Section 2 includes the basic 
definitions of networks and circuits, and states the main theorem regarding the 
relationships between these two models. Sections 3 and 4 contain the proof of this 
theorem: Section 3 shows that $\mathrm{CIRCUIT}(F(n)) \subseteq \mathrm{NET}(\mathrm{Poly}(F(n)))$ and Section 4 proves
that $\mathrm{NET}(F(n)) \subseteq \mathrm{CIRCUIT}(\mathrm{Poly}(F(n)))$. In Section 5, we show the equivalence between
networks and threshold circuits. As Boolean and threshold circuits are polynomially 
equivalent, this proof does not add any conceptually new ideas to those in previous 
sections. Nonetheless, the direct connection and simulation may shed insight when 
a finer comparison is desired. Furthermore, the proof techniques differ in the two 
proofs. Section 6 states some corollaries for neural networks which follow from the 
above relation with circuits. We also define there a notion of nondeterministic 
network. In Section 7, we briefly compare our model to the Blum, Shub, and Smale 
model of computation over the reals. In Section 8, we show that our model does 
not gain power if one lets each neuron compute a polynomial function - rather 
than just affine combinations - of the activations of all the neurons and the external
inputs, or by allowing more general activation functions than the piecewise
linear one. We conclude in Section 9 with a discussion on analog and non-Turing 
computation. 
We now turn to precise definitions. 
2. Basic definitions 
As discussed above, we consider synchronous networks which can be represented as 
dynamical systems whose state at each instant is a real vector $x(t) \in \mathbb{R}^N$. The ith
coordinate of this vector represents the activation value of the ith processor at time t. 
In matrix form, the equations are as in (3), for suitable matrices A,B and vector c. 
Given a system of equations such as (3), an initial state x(1), and an infinite input
sequence
$$u = u(1), u(2), \ldots,$$
we can define iteratively the state x(t) at time t, for each integer $t \ge 1$, as the value
obtained by recursively solving the equations. This gives rise, in turn, to a sequence of 
output values, by restricting our attention to the output processors; we refer to this 
sequence as the “output produced by the input u” starting from the given initial state.
2.1. Recognizing languages 
To define what we mean by a net recognizing a language 
$$L \subseteq \{0,1\}^+,$$
we must first define a formal network, a network which adheres to a rigid encoding of 
its input and output. We proceed as in [20] and define formal nets with two binary 
input lines. The first of these is a data line, and it is used to carry a binary input signal; 
when no signal is present, it defaults to zero. The second is the validation line, and it 
indicates when the data line is active; it takes the value “1” while the input is present 
there and “0” thereafter. We use “D” and “V” to denote the contents of these two lines,
respectively, so
$$u(t) = (D(t), V(t)) \in \{0,1\}^2$$
for each t. We always take the initial state x(1) to be zero and to be an equilibrium 
state, i.e. 
$$\sigma(A0 + B0 + c) = \sigma(c) = 0.$$
We assume that there are two output processors, which also take the role of data and 
validation lines and are denoted $O_d(t)$, $O_v(t)$, respectively.
(The convention of using two input lines allows us to have all external signals 
binary. Of course, many other conventions are possible and would give rise to the 
same results; for instance, one could use a three-valued input, say with values 
$\{-1, 0, 1\}$, where “0” indicates that no signal is present, and $\pm 1$ are the two possible
binary input values.) 
We now encode each word 
$$\alpha = \alpha_1 \cdots \alpha_k \in \{0,1\}^+$$
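as follows. Let
$$u_\alpha(t) = (D_\alpha(t), V_\alpha(t)), \quad t = 1, 2, \ldots,$$
where
$$V_\alpha(t) = \begin{cases} 1 & \text{if } t = 1, \ldots, k, \\ 0 & \text{otherwise} \end{cases}$$
and
$$D_\alpha(t) = \begin{cases} \alpha_t & \text{if } t = 1, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}$$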
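Given a formal net $\mathcal{N}$, with two inputs as above, we say that a word $\alpha$ is classified in
time $\tau$ if the following property holds: the output sequence
$$(O_d(t), O_v(t)), \quad t = 1, 2, \ldots,$$
produced by $u_\alpha$ when starting from x(1) = 0 has the form
$$O_d = \underbrace{0 \cdots 0}_{\tau - 1}\, \eta_\alpha\, 0 0 0 \cdots, \qquad O_v = \underbrace{0 \cdots 0}_{\tau - 1}\, 1\, 0 0 0 \cdots,$$
where $\eta_\alpha = 0$ or 1.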
Let $T: \mathbb{N} \to \mathbb{N}$ be a function on natural numbers. We say that the language
$L \subseteq \{0,1\}^+$ is recognized in time T by the formal net $\mathcal{N}$ provided that each word
$\alpha \in \{0,1\}^+$ is classified in time $\tau \le T(|\alpha|)$, and $\eta_\alpha$ equals 1 when $\alpha \in L$ and 0 otherwise.
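The input and output conventions can be illustrated with a short Python sketch. It builds a finite prefix of the two input streams (data line, validation line) for a binary word and checks whether finite prefixes of the two output streams have the form required for classification. The helper names `encode_input` and `classify` and the finite horizon are assumptions made for illustration; the definitions in the text concern infinite streams.

```python
def encode_input(word, horizon):
    """Return [(D(t), V(t)) for t = 1..horizon] for a nonempty binary word.

    The data line carries the letters of the word, the validation line is 1
    while the word is being presented, and both lines are 0 afterwards.
    """
    k = len(word)
    return [(int(word[t]), 1) if t < k else (0, 0) for t in range(horizon)]


def classify(o_d, o_v):
    """Return (tau, eta) if the output prefixes have the required form, else None.

    Required form: the validation output is 1 at exactly one time tau, and the
    data output is 0 everywhere except possibly at time tau, where it is eta.
    """
    if len(o_d) != len(o_v) or any(v not in (0, 1) for v in o_v):
        return None
    if o_v.count(1) != 1:
        return None
    tau = o_v.index(1) + 1              # times are counted from 1
    if any(o_d[t] != 0 for t in range(len(o_d)) if t != tau - 1):
        return None
    eta = o_d[tau - 1]
    return (tau, eta) if eta in (0, 1) else None


stream = encode_input("101", horizon=5)
accepted = classify([0, 0, 1, 0], [0, 0, 1, 0])
rejected = classify([0, 0, 0, 0], [0, 0, 1, 0])
malformed = classify([0, 1, 0, 0], [0, 0, 1, 0])
```

Under this reading, a word is in the recognized language exactly when the returned `eta` is 1 and `tau` respects the time bound.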
2.2. Circuit families 
We briefly recall some of the basic definitions of nonuniform families of circuits. 
A Boolean circuit is a directed acyclic graph. Its nodes of in-degree 0 are called input 
nodes, while the rest are called gates and are labeled by one of the Boolean functions 
AND, OR, or NOT (the first two seen as functions of many variables, the last one as 
a unary function). One of the nodes, which has no outgoing edges, is designated as the 
output node. The size of the circuit is the total number of gates. Adding, if necessary, 
extra gates, we assume that nodes are arranged into levels 0, 1, . . . , d, where the input 
nodes are at level zero, the output node is at level d, and each node only has incoming 
edges from the previous level. The depth of the circuit is d and its width is the 
maximum size of each level. A gate computes the corresponding Boolean function of 
the values from the previous level, and the value obtained is considered as an input to 
be used by the successive level; in this fashion each circuit computes a Boolean 
function of the inputs. 
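As an illustration of the leveled evaluation just described, here is a minimal Python sketch. The representation (a list of levels, each gate a pair of a gate type and the indices of the previous-level nodes feeding it) and the small example circuit are assumptions chosen for this sketch.

```python
def evaluate(levels, inputs):
    """Evaluate a leveled Boolean circuit on a tuple of input bits.

    levels[i-1] lists the gates of level i as (op, sources), where sources
    are indices of nodes in the previous level.
    """
    values = [int(b) for b in inputs]        # level 0: the input nodes
    for level in levels:                     # levels 1, ..., d
        nxt = []
        for op, sources in level:
            args = [values[j] for j in sources]
            if op == "AND":
                nxt.append(int(all(args)))
            elif op == "OR":
                nxt.append(int(any(args)))
            elif op == "NOT":
                nxt.append(1 - args[0])      # NOT is unary
            else:
                raise ValueError(f"unknown gate type: {op}")
        values = nxt
    return values[0]                         # the single output node at level d


def size(levels):
    """Total number of gates."""
    return sum(len(level) for level in levels)


def depth(levels):
    """Number of gate levels d."""
    return len(levels)


# Hypothetical circuit on three inputs with two levels of gates.
example = [
    [("AND", [0, 1, 2]), ("OR", [0, 2]), ("NOT", [1])],
    [("OR", [0, 1, 2])],
]
```

Because each gate reads only the previous level, one pass over the levels suffices, which is the property the later simulation relies on.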
A family of circuits C is a set of circuits
$$\{c_n, \; n = 1, 2, \ldots\}.$$
These have sizes $S_C(n)$, depth $D_C(n)$, and width $W_C(n)$, n = 1, 2, ..., which are assumed
to be monotone nondecreasing functions. If $L \subseteq \{0,1\}^+$, we say that the language L is
computed by the family C if the characteristic function of
$$L \cap \{0,1\}^n$$
is computed by $c_n$, for each $n \in \mathbb{N}$.
The qualifier “nonuniform” serves as a reminder that there is no requirement that 
circuit families be recursively described. It is this lack of classical computability that 
makes circuits a possible model of resource-bounded “computing”, as emphasized in 
[16]. We will show that recurrent neural networks, although more “uniform” in the 
sense that they have an unchanging physical structure, share exactly the same power. 
If L is recognized by the formal net $\mathcal{N}$ in time T, we write $\phi_{\mathcal{N}} = L$ and $T_{\mathcal{N}} = T$. If L is
computed by the family of circuits C, we write $\phi_C = L$. We are interested in comparing
the functions $T_{\mathcal{N}}$ and $S_C$ for formal nets and circuits so that $\phi_{\mathcal{N}} = \phi_C$.
2.3. Statement of result 
Recall that NET(T(n)) is the class of languages recognized by formal networks (with
real weights) in time T(n) and that CIRCUIT(S(n)) is the class of languages recognized by
(nonuniform) families of circuits of size S(n).
Theorem 2.1. Let F be such that $F(n) \ge n$. Then, $\mathrm{NET}(F(n)) \subseteq \mathrm{CIRCUIT}(\mathrm{Poly}(F(n)))$ and
$\mathrm{CIRCUIT}(F(n)) \subseteq \mathrm{NET}(\mathrm{Poly}(F(n)))$.
More precisely, we prove the following two facts. For each function $F(n) \ge n$:
• $\mathrm{CIRCUIT}(F(n)) \subseteq \mathrm{NET}(n F^2(n))$.
• $\mathrm{NET}(F(n)) \subseteq \mathrm{CIRCUIT}(F^3(n))$.
3. Circuit families are simulated by networks 
We start by reducing circuit families to networks. The proof will construct a fixed, 
“universal” net, having roughly N= 1000 processors, which, through the setting of 
a particular real weight which encodes an entire circuit family, can simulate that 
family. 
Theorem 3.1. There exists a positive integer N such that the following property holds:
for each circuit family C of size $S_C(n)$ there exists an N-processor formal network
$\mathcal{N} = \mathcal{N}(C)$ such that $\phi_{\mathcal{N}} = \phi_C$ and $T_{\mathcal{N}}(n) = O(n S_C^2(n))$.
The proof is provided in the remainder of this section. 
3.1. The circuit encoding 
Given a circuit c - with size s, width w, and $w_i$ gates in the ith level - we encode it as
a finite sequence over the alphabet {0,2,4,6} as follows.
• The encoding of each level i starts with the letter 6. Levels are encoded successively,
starting with the bottom level and ending with the top one.
• At each level, gates are encoded successively. The encoding of a gate g consists of
three parts - a starting symbol, a two-digit code for the gate type, and a code to
indicate which gate feeds into it:
- It starts with the letter 0.
- A two-digit sequence {42,44,22} denotes the type of the gate, {AND, OR, NOT},
respectively.
- If gate g is in level i, then the input to g is represented as a sequence in $\{2,4\}^{w_{i-1}}$,
such that the jth position in the sequence is 4 if and only if the jth gate of the
(i-1)th level feeds into gate g.
The encoding of a gate g in level i is of length $(w_{i-1} + 3)$. The length of the encoding of
a circuit c is $l(c) \equiv |en(c)| = O(sw)$.
Example 3.2. The circuit $c_1$ in Fig. 1 is encoded as
$$en[c_1] = 6\, \underbrace{042444}_{g_1}\, \underbrace{044424}_{g_2}\, \underbrace{022242}_{g_3}\, 6\, \underbrace{044444}_{g_4}.$$
For instance, the NOT gate corresponds to the subsequence “022242”: it starts with 
the letter 0, followed by the two digits “22”, denoting that the gate is of type NOT, and 
ends with “242”, which indicates that only the second input feeds into the gate. 
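The encoding rules can be checked mechanically. The Python sketch below encodes a leveled circuit following the rules of Section 3.1 and reproduces the string of Example 3.2. The circuit representation and the name `encode_circuit` are assumptions of this sketch; the wiring of the example is read off from the encoding itself (g1 is an AND of all three inputs, g2 an OR of inputs 1 and 3, g3 a NOT of input 2, and g4 an OR of g1, g2, g3).

```python
GATE_CODE = {"AND": "42", "OR": "44", "NOT": "22"}


def encode_circuit(levels, n_inputs):
    """Encode a leveled circuit over the alphabet {0, 2, 4, 6}.

    levels lists the gate levels from bottom to top; each gate is a pair
    (op, sources) with sources indexing the nodes of the previous level.
    """
    pieces = []
    prev_width = n_inputs
    for level in levels:
        pieces.append("6")                       # every level starts with 6
        for op, sources in level:
            wiring = "".join(
                "4" if j in sources else "2"     # 4 marks a node feeding the gate
                for j in range(prev_width)
            )
            pieces.append("0" + GATE_CODE[op] + wiring)
        prev_width = len(level)
    return "".join(pieces)


# The circuit of Example 3.2, with 0-based indices for the sources.
c1 = [
    [("AND", [0, 1, 2]), ("OR", [0, 2]), ("NOT", [1])],
    [("OR", [0, 1, 2])],
]
en_c1 = encode_circuit(c1, n_inputs=3)
```

Each gate contributes one start symbol, two type digits, and one digit per node of the previous level, matching the stated gate length of $w_{i-1} + 3$.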
We encode a nonuniform family of circuits, C, of size S(n) as an infinite sequence
$$e(C) = 8\, \widetilde{en}[c_1]\, 8\, \widetilde{en}[c_2]\, 8\, \widetilde{en}[c_3] \cdots, \qquad (4)$$
where $\widetilde{en}[c_i]$ is the encoding of $c_i$ in the reversed order.
Fig. 1. Circuit $c_1$.
Let b be a natural number, and $r = r_1 r_2 \ldots$ a finite or infinite sequence of natural
numbers smaller than b. The interpretation of the sequence r in base b is the number
$$r|_b = \sum_{i \ge 1} \frac{r_i}{b^i}.$$
Generally, two different sequences may result in the same encoding. For instance,
both r = 0999... and r = 1000... provide $r|_{10} = 0.1$. However, restricted to the sequences
we will consider, the encoding is one-to-one.
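For finite sequences the base-b interpretation can be computed exactly with rational arithmetic. The sketch below is illustrative only; the name `interpret` and the use of Python's `fractions` module are choices made here. It also shows how truncations of 0999... approach the value of 1000... in base 10, which is the ambiguity mentioned above.

```python
from fractions import Fraction


def interpret(seq, b):
    """Return r|_b = sum_i r_i / b**i for a finite digit sequence r_1 r_2 ..."""
    if any(not 0 <= r < b for r in seq):
        raise ValueError("digits must be natural numbers smaller than b")
    return sum(Fraction(r, b ** i) for i, r in enumerate(seq, start=1))


one_tenth = interpret([1, 0, 0, 0], 10)
# A truncation of 0999... falls short of 1/10 by exactly 10**-4 here.
gap = one_tenth - interpret([0, 9, 9, 9], 10)
# A base-9 value built only from even digits, as in the circuit encodings.
sample = interpret([8, 6, 0, 4, 2], 9)
```

With digits restricted to {0, 2, 4, 6, 8} in base 9, no sequence ends in an infinite run of the largest digit paired with a neighbouring expansion, so the collision seen in base 10 cannot occur.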
We can interpret formula (4) in base 9. We denote this representation of the family
of circuits C as $\hat{C}$:
$$\hat{C} = 8\, \widetilde{en}[c_1]\, 8\, \widetilde{en}[c_2]\, 8\, \widetilde{en}[c_3] \cdots |_9. \qquad (5)$$
Let $c_i$ be the ith circuit in the family. We denote by $\widehat{en}[c_i]$ the encoding $en[c_i]$
interpreted in base 9.
3.2. Cantor-like set encoding 
A number which encodes a family of circuits, or one which is a suffix of such an 
encoding, is a number between 0 and 1. However, not every number in the range [0, 1]
can appear in this manner. If the first digit to the right of the decimal point is 0, then
the value of the encoding ranges in [0, 1/9]; if it is 2, the value ranges in [2/9, 3/9], and so
forth. The number cannot lie in any of the open ranges ((2i - 1)/9, 2i/9), for i = 1, 2, 3, 4. The
second digit after the decimal point decides the possible range relative to the current
candidate range; see Fig. 2.
In summary, not every value in [0, l] appears. The set of possible values is not 
continuous and has “holes”. Such a set of values “with holes” is a Cantor set. Its 
self-similar structure means that bit (base 9) shifts preserve the “holes”. 
Fig. 2. Values of the circuit encoding 
The advantage of this approach is that there is never a need to distinguish between 
two very close numbers in order to read the desired circuit out of the encoding; the 
circuit can then be retrieved with finite-precision operations employing a finite 
number of neurons. 
3.3. A circuit retrieval 
Lemma 3.3. For each (nonuniform) family of circuits C there exists a 16-processor
network $\mathcal{N}_R(C)$ with one input line such that, starting from the zero initial state and
given the input signal
$$u(1) = \underbrace{11 \cdots 1}_{n} 00 \cdots |_2 = 1 - 2^{-n}, \qquad u(t) = 0 \text{ for } t > 1,$$
$\mathcal{N}_R(C)$ outputs
$$x_r = \underbrace{000 \cdots 0}_{2n + 2\sum_{i=1}^{n} l(c_i) + 4}\, \widehat{en}[c_n]\, 000 \cdots.$$
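Proof. Let $\Sigma = \{0,2,4,6,8\}$. Denote by $\mathcal{C}_9$ the “Cantor 9-set”, which consists of all
those real numbers q which admit an expansion of the form
$$q = \sum_{i=1}^{\infty} \frac{\alpha_i}{9^i} \qquad (6)$$
with each $\alpha_i \in \Sigma$. Let $\Lambda: \mathbb{R} \to [0,1]$ be the function
$$\Lambda[x] := \begin{cases} 0 & \text{if } x < 0, \\ 9x - \lfloor 9x \rfloor & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \qquad (7)$$
Let $\Xi: \mathbb{R} \to [0,1]$ be the function
$$\Xi[x] := \begin{cases} 0 & \text{if } x < 0, \\ 2 \lfloor 9x/2 \rfloor & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \qquad (8)$$
Note that, for each
$$q = \sum_{i=1}^{\infty} \frac{\alpha_i}{9^i} \in \mathcal{C}_9,$$
we may think of $\Xi[q]$ as the “select left” operation, since
$$\Xi[q] = \alpha_1,$$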
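and of $\Lambda[q]$ as the “shift left” operation, as
$$\Lambda[q] = \sum_{i=1}^{\infty} \frac{\alpha_{i+1}}{9^i} \in \mathcal{C}_9.$$
For each $i \ge 0$, $q \in \mathcal{C}_9$,
$$\Xi[\Lambda^i[q]] = \alpha_{i+1}.$$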
The following procedure summarizes the task to be performed by the network 
constructed below, which in turn satisfies the requirements of the lemma. 
Procedure Retrieval(Ĉ, n)
Variables counter, y, z
Begin
  counter ← 0, y ← 0, z ← Ĉ,
  While counter < n
    Parbegin
      z ← Λ[z]
      if Ξ[z] = 8 then increment counter
    Parend,
  While Ξ[z] < 8
    Parbegin
      z ← Λ[z]
      y ← (1/9)(y + Ξ[z])
    Parend,
  Return(y)
End
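The procedure can be exercised on a finite prefix of a family encoding using exact rational arithmetic. The following Python sketch rests on stated assumptions: the Parbegin/Parend blocks are read as simultaneous updates, so both statements see the old value of z; the family encoding is cut off after two circuits and closed with a final separator 8; and the second circuit encoding `en2` is a made-up string over {0, 2, 4, 6}. The function names are hypothetical.

```python
from fractions import Fraction


def interpret(digits, base=9):
    """Base-9 value of a finite digit string."""
    return sum(Fraction(int(d), base ** i) for i, d in enumerate(digits, start=1))


def shift_left(z):
    """Lambda on [0, 1]: drop the leading base-9 digit."""
    return 9 * z - (9 * z) // 1


def select_left(z):
    """Xi on [0, 1]: read the leading (even) base-9 digit."""
    return 2 * ((9 * z) // 2)


def retrieval(c_hat, n):
    counter, y, z = 0, Fraction(0), c_hat
    while counter < n:
        # Simultaneous update: the test reads the old z.
        z, counter = shift_left(z), counter + (1 if select_left(z) == 8 else 0)
    while select_left(z) < 8:
        # Each digit read becomes the new leading digit of y.
        z, y = shift_left(z), (y + select_left(z)) / 9
    return y


en1 = "60424440444240222426044444"   # en[c_1] from Example 3.2
en2 = "604244"                       # hypothetical second encoding
c_hat = interpret("8" + en1[::-1] + "8" + en2[::-1] + "8")
```

Since the family stores each encoding in reversed order and the second loop prepends every digit it reads, the returned value is the encoding of the nth circuit in its original order, interpreted in base 9.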
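The functions $\Lambda$ and $\Xi$ cannot be programmed within the neural network model
due to their discontinuity. However, we can program the functions $\tilde{\Lambda}$, $\tilde{\Xi}$, which
coincide with $\Lambda$, $\Xi$, respectively, on $\mathcal{C}_9$ (see Figs. 3 and 4):
$$\tilde{\Lambda}[q] = \sum_{j=0}^{8} (-1)^j \sigma(9q - j), \qquad (9)$$
$$\tilde{\Xi}[q] = 2 \sum_{j=0}^{3} \sigma(9q - (2j + 1)). \qquad (10)$$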
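A quick numerical check of equations (9) and (10) is possible with exact rationals. The sketch below, whose function names are hypothetical, evaluates both sums on every point of the Cantor 9-set with a three-digit expansion and compares the results with the shifted expansion and the leading digit. Only finite expansions are tested, which is a restriction of this sketch rather than of the claim.

```python
from fractions import Fraction
from itertools import product


def sigma(x):
    """Saturated-linear activation of equation (2)."""
    return min(max(x, 0), 1)


def shift_tilde(q):
    """Equation (9): continuous stand-in for the shift-left map."""
    return sum((-1) ** j * sigma(9 * q - j) for j in range(9))


def select_tilde(q):
    """Equation (10): continuous stand-in for the select-left map."""
    return 2 * sum(sigma(9 * q - (2 * j + 1)) for j in range(4))


def cantor_point(digits):
    """Base-9 value of a finite sequence of digits from {0, 2, 4, 6, 8}."""
    return sum(Fraction(d, 9 ** i) for i, d in enumerate(digits, start=1))


def agrees_on_cantor_prefixes(length=3):
    """Check shift and select on all expansions of the given length."""
    for digits in product((0, 2, 4, 6, 8), repeat=length):
        q = cantor_point(digits)
        if shift_tilde(q) != cantor_point(digits[1:]):
            return False
        if select_tilde(q) != digits[0]:
            return False
    return True
```

Inside a gap of the Cantor set the continuous version interpolates: at q = 1/6, for example, `select_tilde` returns 1, a value that is not a digit of the alphabet.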
The retrieval procedure is then achieved by the following network: 
$$x_i^+ = \sigma(9 x_{10} - i), \quad i = 0, \ldots, 8,$$
$$x_9^+ = \sigma(2u),$$
Fig. 3. The function $\tilde{\Lambda}[x]$.
Fig. 4. The function $\tilde{\Xi}[x]$.
+ x14=42x13+x,-2), 
x:5=&3-x7), 
+ x~fj=fJ(x~~+x,-l). 
If the input u arrives at time 1, then $x_{10}(2k+3) = \Lambda^k[\hat{C}]$ (because of equation (9)).
Processors $x_{13}$, $x_{14}$, $x_{15}$ serve to implement the counter, and processor $x_{16}$ is the
output processor. This network satisfies the requirements of the lemma. □
3.4. Circuit simulation by a network 
Let $\alpha \in \{0,1\}^n$ be a binary sequence. Denote by $en[\alpha]$ the sequence in $\{2,4\}^n$ that
substitutes $(2\alpha_i + 2)$ for each $\alpha_i$, and by $\widehat{en}[\alpha]$ the interpretation of $en[\alpha]$ in base 9,
i.e. $en[\alpha]|_9$. We next construct a “universal net” for interpreting circuits.
Lemma 3.4. There exists a network, $\mathcal{N}_S$, such that for each circuit c and binary sequence
$\alpha$, starting from the zero initial state and applying the input signal
$$u_1 = \widehat{en}[c]\, 0 0 \cdots, \qquad u_2 = \widehat{en}[\alpha]\, 0 0 \cdots,$$
$\mathcal{N}_S$ outputs
$$x_d = \underbrace{0 0 \cdots 0}_{T}\, y\, 0 0 \cdots, \qquad x_v = \underbrace{0 0 \cdots 0}_{T}\, 1\, 0 0 \cdots,$$
where y is the response of circuit c on the input $\alpha$, and $T = O(l(c) + |\alpha|)$.
Proof. It is easy to verify that, given any circuit, there is a three-tape Turing machine 
which can simulate the given circuit in time $O(l(c) + |\alpha|)$. This Turing machine would
employ its tapes to store the circuit encoding, the input and output encoding, and the 
current level’s calculation, respectively. Now we can simulate this machine by a net. 
Indeed, we proved in [20] that if M is a k-tape Turing machine with s states which 
computes in time T a function f on binary input strings, then there exists a rational 
network $\mathcal{N}$, which consists of
processors, that computes the same function f in time O(T). More careful counting
shows that fewer than 1000 processors suffice. □
Remark 3.5. If the lemma only required an estimate of a polynomial number of
processors, as opposed to the more precise estimate that we obtain, the proof would
have been immediate from the consideration of the circuit value problem (CVP). This is
the problem of recognizing the set of all pairs (x, y), where $x \in \{0,1\}^+$, and y encodes
a circuit with |x| input lines which outputs 1 on input x. It is known that CVP $\in$ P
[3, Vol. I, p. 110].
3.5. Proof: Circuit families are simulated by networks 
Proof of Theorem 3.1. Let C be a circuit family. We construct the required formal 
network as a composition of the following three networks: 
• An input network, $\mathcal{N}_I$, which receives the input
$$u_1 = \alpha\, 0 0 \cdots, \qquad u_2 = \underbrace{1 1 \cdots 1}_{|\alpha|}\, 0 0 \cdots,$$
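and computes $\widehat{en}[\alpha]$ and $u_2|_2$, for each $\alpha \in \{0,1\}^+$. This network is trivial to
implement.
• A retrieval network, $\mathcal{N}_R(C)$, as described in Lemma 3.3, which receives $u_2|_2$ from $\mathcal{N}_I$,
and computes $\widehat{en}[c_{|\alpha|}]$. (Note that during the encoding operation, network $\mathcal{N}_I$ produces
an output of zero, and $\mathcal{N}_R(C)$ remains in its initial state 0.)
• A simulation network, $\mathcal{N}_S$, as stated in Lemma 3.4, which receives $\widehat{en}[c_{|\alpha|}]$ and
$\widehat{en}[\alpha]$, and computes
$$x_d = \underbrace{0 0 \cdots 0}_{T}\, \phi_C(\alpha)\, 0 0 \cdots, \qquad x_v = \underbrace{0 0 \cdots 0}_{T}\, 1\, 0 0 \cdots.$$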
Notice that out of the above three networks, only $\mathcal{N}_R$ depends on the specific family
of circuits C. Moreover, all weights can be taken to be rational numbers, except for the 
one weight that encodes the entire circuit family. 
The time complexity to compute the response of C to the input $\alpha$ is dominated by
that of retrieving the circuit description. Thus, the complexity is of order
$$T = O\Big( \sum_{i=1}^{|\alpha|} l(c_i) \Big).$$
We remarked that the length of the encoding $l(c_i)$ is of order $O(W_C(i) S_C(i)) = O(S_C^2(i))$.
Since $S_C(i) \le S_C(i+1)$ for i = 1, 2, ..., we achieve the claimed bound
$T = O(|\alpha| S_C^2(|\alpha|))$. □
Remark 3.6. In the case of bounded fan-in, the “standard encoding” of circuit $c_n$ is of
length $l(c_n) = O(S_C(n) \log(S_C(n)))$. The total running time of the algorithm is then
$O(n S_C(n) \log(S_C(n)))$.
4. Networks are simulated by circuit families 
We next state the reverse simulation, of nets by nonuniform families of circuits. 
Theorem 4.1. Let $\mathcal{N}$ be a formal network that computes in time $T: \mathbb{N} \to \mathbb{N}$. There exists
a nonuniform family of circuits $C(\mathcal{N})$ of size $O(T^3)$, depth $O(T \log(T))$, and width $O(T^2)$
that accepts the same language as $\mathcal{N}$ does.
The proof is given in Sections 4.1 and 4.2. In the first part, we replace a single 
formal network by a family of formal networks with small rational weights. (This is 
unrelated to the standard fact for threshold gates that weights can be taken to have 
$n \log n$ bits.) In the second part, we simulate such a family of formal networks by
circuits. 
4.1. Linear precision suffices
Define a processor to be a designated output processor if its activation value is used 
as an output of the network (i.e. it is an output processor) and is not fed into any other 
processor. A formal network whose two output processors are designated is
called an output designated network. Its processors other than the designated
output processors are called internal processors.
For the next result, we introduce the notion of a $q$-truncation net. This is a processor
network in which the update equations take the form
$$x_i^+ = q\text{-truncation}\left[\sigma\left(\sum_{j=1}^{N} a_{ij} x_j + \sum_{j=1}^{M} b_{ij} u_j + c_i\right)\right],$$
where $q$-truncation means the operation of truncating after $q$ bits.
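The update rule of a $q$-truncation net can be illustrated with a minimal Python sketch. It assumes that $\sigma$ is the saturated-linear activation used for formal networks and that truncation means dropping all bits beyond the first $q$ after the binary point of a nonnegative value; the weights and the helper names (`sigma`, `truncate`, `truncated_step`) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import floor

def sigma(x):
    """Saturated-linear activation: clamp x to the interval [0, 1]."""
    return max(Fraction(0), min(Fraction(1), x))

def truncate(x, q):
    """Keep only the first q bits after the binary point (x is assumed nonnegative)."""
    return Fraction(floor(x * 2**q), 2**q)

def truncated_step(x, u, a, b, c, q):
    """One update: x_i+ = truncate(sigma(sum_j a_ij x_j + sum_j b_ij u_j + c_i), q)."""
    n = len(x)
    return [
        truncate(
            sigma(
                sum(a[i][j] * x[j] for j in range(n))
                + sum(b[i][j] * u[j] for j in range(len(u)))
                + c[i]
            ),
            q,
        )
        for i in range(n)
    ]

# Hypothetical network with N = 2 processors and M = 1 input line.
a = [[Fraction(1, 3), Fraction(1, 2)], [Fraction(0), Fraction(2, 3)]]
b = [[Fraction(1, 4)], [Fraction(1, 5)]]
c = [Fraction(0), Fraction(1, 7)]
x_next = truncated_step([Fraction(0), Fraction(0)], [1], a, b, c, q=4)
```

Exact rationals are used so that the only loss of precision is the truncation itself.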
Lemma 4.2. Let $\mathcal{N}$ be an output designated network. If $\mathcal{N}$ computes in time $T$, there
exists a family of $T(n)$-truncation output designated networks $\mathcal{N}_1(n)$ such that:
• For each $n$, $\mathcal{N}_1(n)$ has the same number of processors and input and output channels as
$\mathcal{N}$ does.
• The weights feeding into the internal processors of $\mathcal{N}_1(n)$ are like those of $\mathcal{N}$, but
truncated after $O(T(n))$ bits.
• For each designated output processor in $\mathcal{N}$, if this processor computes $x^+ = \sigma(f)$,
where $f$ is a linear function of processors and inputs, then the respective processor in
$\mathcal{N}_1(n)$ computes $\sigma(2\tilde{f} - 0.5)$, where $\tilde{f}$ is the same as the linear function $f$ but applied
instead to the processors of $\mathcal{N}_1(n)$ and with weights truncated at $O(T(n))$ bits.
• The respective output processors of $\mathcal{N}$ and $\mathcal{N}_1(n)$ have the same activation values at
all times $t \le T(n)$.
Proof. We first measure the difference (error) between the activations of the corresponding
internal processors of $\mathcal{N}_1(n)$ and $\mathcal{N}$ at time $t \le T(n)$. This calculation is
analogous to that of the chop error in floating point computation [2].
We use the following notations:
- $N$ is the number of processors, $M$ the number of input lines, $L \equiv N + M + 1$.
- $W'$ is the largest absolute value of the weights of $\mathcal{N}$, $W \equiv W' + 1$.
- $x_i(t)$ is the value of processor $i$ of network $\mathcal{N}$ at time $t$.
- $\delta_w \in (0, 1)$ and $\delta_p > 0$ are the truncation errors at weights and processors, respectively.
- $\varepsilon_t > 0$ is the largest accumulated error at time $t$ in processors of $\mathcal{N}_1(n)$.
- $u \in \{0, 1\}^M$ is the input to both $\mathcal{N}$ and $\mathcal{N}_1(n)$. ($u(t) = 0^M$ for $t > n$.)
- $a_{ij}$, $b_{ij}$, and $c_i$ are the weights influencing processor $i$ of network $\mathcal{N}$.
- $\tilde{x}_i(t)$, $\tilde{a}_{ij}$, $\tilde{b}_{ij}$, and $\tilde{c}_i$ are the respective activation values of processors, and weights of
$\mathcal{N}_1(n)$.
Network $\mathcal{N}_1(n)$ computes at each step
$$\tilde{x}_i^+ = q\text{-truncation}\left[\sigma\left(\sum_{j=1}^{N} \tilde{a}_{ij} \tilde{x}_j + \sum_{j=1}^{M} \tilde{b}_{ij} u_j + \tilde{c}_i\right)\right].$$
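We assume by induction on $t$ that, for all internal processors $i, j$,
$$|\tilde{x}_i(t) - x_i(t)| \le \varepsilon_t,$$
$$|\tilde{a}_{ij} - a_{ij}| \le \delta_w,$$
$$|\tilde{b}_{ij} - b_{ij}| \le \delta_w,$$
$$|\tilde{c}_i - c_i| \le \delta_w.$$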
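Using the global Lipschitz property $|\sigma(a) - \sigma(b)| \le |a - b|$, it follows that
$$\varepsilon_t \le N(W' + \delta_w)\varepsilon_{t-1} + (N + M + 1)\delta_w + \delta_p \le LW\varepsilon_{t-1} + L\delta_w + \delta_p.$$
Therefore,
$$\varepsilon_t \le \sum_{i=0}^{t-1} (LW)^i (L\delta_w + \delta_p) \le (LW)^t (L\delta_w + \delta_p).$$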
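The accumulated-error estimate can be checked numerically. The following is a minimal sketch under stated assumptions: a hypothetical network with $N = 2$ processors and a single input line ($M = 1$), exact rational arithmetic for the reference network, and floor-based truncation after $q$ bits for both weights and activations, so that $\delta_w = \delta_p = 2^{-q}$. It compares the observed error with the bound $(LW)^t (L\delta_w + \delta_p)$.

```python
from fractions import Fraction
from math import floor

def sigma(x):
    """Saturated-linear activation."""
    return max(Fraction(0), min(Fraction(1), x))

def trunc(x, q):
    """Floor-based truncation after q bits; the error lies in [0, 2**-q)."""
    return Fraction(floor(x * 2**q), 2**q)

def step(x, u, a, b, c, q=None):
    """One network update; activations are truncated only when q is given."""
    out = []
    for i in range(len(x)):
        s = sigma(sum(a[i][j] * x[j] for j in range(len(x))) + b[i] * u + c[i])
        out.append(s if q is None else trunc(s, q))
    return out

# Hypothetical weights, chosen so that truncation actually loses information.
a = [[Fraction(1, 3), Fraction(-2, 3)], [Fraction(3, 7), Fraction(1, 5)]]
b = [Fraction(1, 3), Fraction(2, 7)]
c = [Fraction(1, 9), Fraction(1, 11)]
q = 24
ta = [[trunc(w, q) for w in row] for row in a]
tb = [trunc(w, q) for w in b]
tc = [trunc(w, q) for w in c]

N, M = 2, 1
L = N + M + 1
W = max(abs(w) for w in a[0] + a[1] + b + c) + 1   # W = W' + 1
delta = Fraction(1, 2**q)                          # delta_w = delta_p

x, tx = [Fraction(0)] * N, [Fraction(0)] * N
inputs = [1, 0, 1, 1, 0, 0, 0, 0]
errors, bounds = [], []
for t, u in enumerate(inputs, start=1):
    x = step(x, u, a, b, c)            # exact network
    tx = step(tx, u, ta, tb, tc, q)    # truncated network
    errors.append(max(abs(p - r) for p, r in zip(x, tx)))
    bounds.append((L * W) ** t * (L * delta + delta))
```

In practice the observed error is far below the bound, which is a worst-case estimate growing geometrically in $t$.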
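We now analyze the behavior of the output processors. We need to prove that
$\sigma(2\tilde{f} - 0.5) = 0, 1$ when $\sigma(f) = 0, 1$, respectively. That is,
$$f \le 0 \;\Rightarrow\; \tilde{f} < \tfrac{1}{4}$$
and
$$f \ge 1 \;\Rightarrow\; \tilde{f} > \tfrac{3}{4}.$$
This happens if $|f - \tilde{f}| < \tfrac{1}{4}$. Arguing as earlier, the condition $\varepsilon_t < \tfrac{1}{4}$ suffices. This is
translated into the requirement
$$(LW)^t (L\delta_w + \delta_p) < \tfrac{1}{4}.$$
If both $\delta_w$ and $\delta_p$ are bounded by a suitable constant multiple of $(LW)^{-(t+1)}$, this inequality holds. This happens
when the weights and the processor activations are truncated after $O(t \log(LW))$ bits.
As $L$ and $W$ are constants, we conclude as desired that a sufficient truncation for
a computation of length $T$ is $O(T)$. □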
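To make the linear growth of the required precision concrete, the sketch below computes a number of bits $q$ for which $2^{-q} \le \tfrac{1}{8}(LW)^{-(t+1)}$ and evaluates the resulting error bound. The factor $\tfrac{1}{8}$ and the function names are assumptions made for this illustration only; any sufficiently small constant would serve.

```python
from fractions import Fraction
from math import ceil, log2

def sufficient_bits(t, L, W):
    """Bits q with 2**-q <= (1/8) * (L*W)**-(t+1); the factor 1/8 is an assumed constant."""
    return ceil((t + 1) * log2(L * W)) + 3

def error_bound(t, L, W, q):
    """Worst-case accumulated error (L*W)**t * (L*delta_w + delta_p) with both deltas 2**-q."""
    delta = Fraction(1, 2**q)
    return Fraction(L * W) ** t * (L * delta + delta)

# Hypothetical constants L = 4 and W = 2, so that log2(L*W) = 3 bits per step.
bits_50 = sufficient_bits(50, 4, 2)
bits_100 = sufficient_bits(100, 4, 2)
```

With these constants each additional time step costs three more bits, which is the $O(t \log(LW))$ growth stated in the proof.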
4.2. The network simulation by a circuit 
Lemma 4.3. Let $\mathcal{N}_1$ be a family of $T(n)$-truncation output designated networks, where
all networks $\mathcal{N}_1(n)$ consist of $N$ processors and the weights are all rational numbers with
$O(T)$ bits. Then, there exists a circuit family $C$ of size $O(T^3)$, depth $O(T \log(T))$, and
width $O(T^2)$ such that $c_n$ accepts the same language as $\mathcal{N}_1(n)$ does on $\{0, 1\}^n$.
Proof. We sketch the construction of the circuit $c_n$ which corresponds to the $T(n)$-truncation
output designated net $\mathcal{N}_1(n)$.
The network $\mathcal{N}_1(n)$ has two input lines: data and validation, where the validation
line sees $n$ consecutive 1's followed by 0's. We think of the $n$ data bits on the data line
which appear simultaneously with the 1's in the validation line, as data input of size $n$.
These $n$ bits are fed simultaneously into $c_n$ via $n$ input nodes.
To simulate the sequential input in $\mathcal{N}_1(n)$, we construct an input subcircuit which
preserves the input as it is to be released one bit at a time in later times of the 
computation. The input subcircuit is of size &I,(n). 
Let
$$p, \quad p = 1, \ldots, N,$$
be a processor of $\mathcal{N}_1(n)$. We associate with each processor $p$ a subcircuit $sc(p)$. Each
processor $p$ computes a truncated sum of up to $N + 2$ numbers, $N$ of which are
multiplications of two $T$-bit numbers. Hardwiring the weights, we can say that each
processor computes a sum of $(TN + 2)$ $(2T)$-bit numbers. Using the carry-look-ahead
method [19], the summation can be computed via a subcircuit of depth
$O(\log(TN))$, width $O(T^2 N)$, and size $O(T^2 N)$. (This depth is of the same order as the
lower bound of similar tasks, see [5, 7].)
As for the saturation, one gate, $p_u$, is sufficient for the integer part. As only $O(T)$ bits
are preserved, the activation of each processor can be represented in binary by the unit
gate, $p_u$, and the most significant gates
$$p_i, \quad i = 1, \ldots, O(T),$$
after the operation
$$\mathrm{AND}(p_i, \neg(p_u)), \quad i = 1, \ldots, O(T).$$
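The gate-level saturation just described can be sketched in a few lines of Python. The sketch assumes a nonnegative value below 2, given as a unit bit $p_u$ and a list of fractional bits $p_1, p_2, \ldots$ (most significant first); the function name `saturate_bits` is hypothetical.

```python
def saturate_bits(unit_bit, frac_bits):
    """Apply AND(p_i, NOT p_u) to every fractional bit.

    If the unit bit is 1 the value is at least 1, so the saturated result is
    exactly 1.00...0; otherwise the fractional bits pass through unchanged.
    """
    return unit_bit, [bit & (1 - unit_bit) for bit in frac_bits]

above_one = saturate_bits(1, [0, 1, 1])   # 1.011 in binary saturates to 1.000
below_one = saturate_bits(0, [1, 0, 1])   # 0.101 in binary is left as it is
```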
Let $sc(p')$ be a subcircuit of largest depth. Pad the other $sc(p)$'s with “demi-gates”
(e.g. an AND gate of a single input) so that all $sc(p)$'s are of equal depth. The output of
circuit $sc(p)$ is called the activation of $sc(p)$.
We place the $N$ subcircuits
$$sc(p), \quad p = 1, \ldots, N,$$
to compute in parallel. We call this subcircuit a layer. A layer corresponds to one step
in the computation of $\mathcal{N}_1(n)$. As $\mathcal{N}_1(n)$ computes in time $T(n)$, $T(n)$ layers are
connected sequentially. Each layer $i$ receives the $i$th input bit from the input subcircuit,
and the $N$ activation values of its preceding layer (except for layer 1, which receives
input only). This main subarchitecture is of size $O(T^3)$, depth $O(T \log(T))$, and width
$O(T^2)$, where $T = T(n)$.
As $\mathcal{N}_1(n)$ may compute the response to different strings of size $n$ in different times of
order $O(T)$, we construct an output subcircuit which forces the response to every string
of size $n$ to appear at the top of the circuit.
For each layer $i = 1, \ldots, T$, we apply the AND function to the output of the
subcircuits $sc(p_1)$, $sc(p_2)$, where $p_1$, $p_2$ are the output processors of $\mathcal{N}_1(n)$. We transfer
these values and apply the OR functions to them. The resulting value is the output of
the circuit. When OR is applied at each layer, only $O(T(n))$ gates are needed for this
subcircuit.
The resources of the total circuit are dominated by those of the main subarchitecture. □
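The resource bounds of the main subarchitecture follow from a short count, spelled out here as a sketch. It treats the number of processors $N$ as a constant and uses only the per-subcircuit bounds quoted in the proof (depth $O(\log(TN))$, width and size $O(T^2 N)$).

```latex
\begin{align*}
\text{one layer ($N$ subcircuits in parallel):}\quad
  &\text{size } N \cdot O(T^2 N) = O(T^2), \quad
   \text{width } N \cdot O(T^2 N) = O(T^2), \quad
   \text{depth } O(\log(TN)) = O(\log T); \\
\text{$T$ layers in sequence:}\quad
  &\text{size } T \cdot O(T^2) = O(T^3), \quad
   \text{width } O(T^2), \quad
   \text{depth } T \cdot O(\log T) = O(T \log T).
\end{align*}
```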
The proof of Theorem 4.1 follows immediately from Lemmas 4.2 and 4.3. 
5. Real networks versus threshold circuits 
A threshold circuit is defined similarly to a Boolean circuit, but the function
computed by each node is now a linear threshold function rather than one of the
Boolean functions (AND, OR, NOT). Each gate $i$ computes
$$f_i : \mathbb{B}^{n_i} \to \mathbb{B},$$
where $\mathbb{B} = \{0, 1\}$, thus giving rise to the activation updates
$$x_i(t+1) = f_i(x_{i_1}, x_{i_2}, \ldots, x_{i_{n_i}}) = \mathcal{H}\left(\sum_{j=1}^{n_i} a_{ij} x_{i_j}(t) + c_i\right). \qquad (11)$$
Here $x_{i_j}$ are the activations of the processors feeding into it, and the $a_{ij}$ and $c_i$ are
integer constants associated with the gate. Without loss of generality, one may assume
that these constants can each be expressed in binary with at most $n_i \log(n_i)$ bits; see
[15]. If $x_i$ is on the bottom level, its input is the external input. The function $\mathcal{H}$ is the
threshold function
$$\mathcal{H}(z) = \begin{cases} 1, & z \ge 0, \\ 0, & z < 0. \end{cases} \qquad (12)$$
The relationships between threshold circuits and Boolean circuits are well studied (see
for example [17]). They are known to be polynomially equivalent in size. We provide
here an alternative direct relationship between threshold circuits and real networks,
without passing through Boolean circuits.
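As a small illustration of the threshold-gate definition, the following Python sketch implements a gate with integer weights and bias, and expresses AND, OR, and NOT as single threshold gates. The particular weight choices are standard textbook constructions, not taken from this paper, and the function names are hypothetical.

```python
def heaviside(z):
    """Threshold function: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def threshold_gate(weights, bias, inputs):
    """Linear threshold gate: heaviside(sum_j a_j * x_j + c)."""
    return heaviside(sum(w * x for w, x in zip(weights, inputs)) + bias)

def gate_and(x, y):
    return threshold_gate([1, 1], -2, [x, y])   # fires only when x + y >= 2

def gate_or(x, y):
    return threshold_gate([1, 1], -1, [x, y])   # fires when x + y >= 1

def gate_not(x):
    return threshold_gate([-1], 0, [x])         # fires when -x >= 0, i.e. x == 0
```

Each Boolean gate therefore costs one threshold gate, which is the easy direction of the size equivalence mentioned above.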
5.1.1. Statement of result 
Recall that NET(T(n)) is the class of languages recognized by formal networks (with
real weights) in time T(n) and define T-CIRCUIT(S(n)) as the class of languages
recognized by (nonuniform) families of threshold circuits of size S(n).
Theorem 5.1. Let F be such that $F(n) \ge n$. Then, NET(F(n)) $\subseteq$ T-CIRCUIT(Poly(F(n))) and
T-CIRCUIT(F(n)) $\subseteq$ NET(Poly(F(n))).
More precisely, we prove the following two facts. For each function $F(n) \ge n$:
• T-CIRCUIT(F(n)) $\subseteq$ NET($n F^3(n) \log(F(n))$).
• NET(F(n)) $\subseteq$ T-CIRCUIT($F^2(n)$).
The first implication is proven similarly to the Boolean circuit case. Each threshold
gate is encoded in a Cantor-like way, including the description of the weights. We next
state the reverse simulation, of nets by nonuniform families of threshold circuits.
Theorem 5.2. Let $\mathcal{N}$ be a formal network that computes in time $T: \mathbb{N} \to \mathbb{N}$. There exists
a nonuniform family of threshold circuits $C(\mathcal{N})$ of size $O(T^2)$, depth $O(T)$, and width
$O(T)$ that accepts the same language as $\mathcal{N}$ does.
We start with simulating $\mathcal{N}$ by the family of $T(n)$-truncation output designated
networks $\mathcal{N}_1(n)$ as described in Lemma 4.2. Next, we simulate this family of networks
of depth $T(n)$ and size $O(T(n))$ via a family of threshold circuits of depth $2T(n)$ and size
$O(T^2(n))$.
Assume $\mathcal{N}' = \mathcal{N}_1(n)$ is an $m$-truncation network for input of size $n$; $\mathcal{N}'$ has depth
$T(n)$ and $m = O(T(n))$. Each gate of $\mathcal{N}'$ computes an addition of $N$ $m$-bit numbers;
then, it applies the $\sigma$ function to it. Using a technique similar to the one provided in
[17, Ch. 7] (threshold circuits), we show how to simulate each $\sigma$ gate of $\mathcal{N}'$ via
a threshold circuit of size $O(m)$ and depth 2. We achieve the simulation in two steps:
first we add the $N$ numbers and then we simulate the application of the saturation
functions.
5.1.2. Simulating a saturated gate in an m-truncation network by a threshold circuit 
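Step 1: Adding $N$ $m$-bit numbers. Suppose the numbers are
$$z_1, \ldots, z_N,$$
each having an $m$-bit representation:
$$z_i = z_{i1} z_{i2} \cdots z_{im}, \quad i = 1, \ldots, N.$$
The sum of the $N$ $m$-bit numbers has $\le m + \lfloor \log N \rfloor + 1$ bits in the representation (as
the upper bound on the absolute value of the result is $N(2^m - 1)$). Generally, the sum is
$$\begin{array}{rccccccc}
 & & & & z_{11} & z_{12} & \cdots & z_{1m} \\
+ & & & & \vdots & & & \vdots \\
 & & & & z_{N1} & z_{N2} & \cdots & z_{Nm} \\ \hline
y_{-l} & \cdots & y_{-1} & y_0 & y_1 & y_2 & \cdots & y_m
\end{array}$$
As the network is an $m$-truncation network, we only need to compute $y_0, y_1, \ldots, y_m$.
We show below how to compute $y_k$, $k \ge 1$. The circuit for $y_0$ is very similar.
To compute $y_k$, we need to consider only $z_{ij}$ for all $i$ and $j \ge k$. Look at the sum:
$$\begin{array}{rcccccc}
 & & & & z_{1k} & \cdots & z_{1m} \\
+ & & & & \vdots & & \vdots \\
 & & & & z_{Nk} & \cdots & z_{Nm} \\ \hline
c_{-l} & \cdots & c_{-1} & c_0 & y_k & \cdots & y_m
\end{array}$$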
It is easy to verify that
$$\tilde{z}_k \equiv c_{-l} \cdots c_{-1} c_0 y_k \cdots y_m = \sum_{i=1}^{N} \sum_{j=k}^{m} z_{ij} 2^{m-j}.$$
To extract from the sum the bit $y_k$, we build the following circuit.
Level 1: For each possible value $i$ of $c_{-l} \cdots c_{-1} c_0$ ($i = 1, \ldots, 2^{l+1}$), we have a pair of
threshold gates
$$\hat{y}_{ki0} = \mathcal{H}\big(\tilde{z}_k - c_{-l} \cdots c_{-1} c_0\, 1 \underbrace{0 0 \cdots 0}_{m-k}\big),$$
$$\hat{y}_{ki1} = \mathcal{H}\big(-\tilde{z}_k + c_{-l} \cdots c_{-1} c_0\, 1 \underbrace{1 1 \cdots 1}_{m-k}\big).$$
If $y_k = 0$, exactly one of each pair is active; if $y_k = 1$, one of the pairs has both gates
active and the rest one only. Thus, the $y_k$ bit can be computed by counting if more
than half of the gates in the first level are active.
Level 2: It includes one gate only that computes the desired bit:
$$y_k = \mathcal{H}\left(\sum_{i=1}^{2^{l+1}} (\hat{y}_{ki0} + \hat{y}_{ki1}) - (2^{l+1} + 1)\right). \qquad (13)$$
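The two-level construction of Step 1 can be checked exhaustively on small instances. The sketch below is an illustration under stated assumptions: each number is a list of $m$ bits with the most significant bit first, the carry patterns are indexed from $0$ to $2^{l+1} - 1$ rather than from $1$, and all function names are hypothetical.

```python
from itertools import product
from math import floor, log2

def H(z):
    """Threshold function."""
    return 1 if z >= 0 else 0

def bit_via_threshold(z, k):
    """Compute y_k (1 <= k <= m) of the sum of the rows of z with a depth-2 threshold circuit."""
    N, m = len(z), len(z[0])
    l = floor(log2(N))
    s = m - k                                  # number of bits to the right of y_k
    z_k = sum(z[i][j - 1] << (m - j) for i in range(N) for j in range(k, m + 1))
    level1 = []
    for i in range(2 ** (l + 1)):              # every carry pattern c_{-l} ... c_0
        level1.append(H(z_k - ((i << (s + 1)) + (1 << s))))             # pattern,1,00...0
        level1.append(H(-z_k + ((i << (s + 1)) + (1 << (s + 1)) - 1)))  # pattern,1,11...1
    return H(sum(level1) - (2 ** (l + 1) + 1))  # level 2: more than half active?

def bit_direct(z, k):
    """Reference value: bit y_k of the ordinary integer sum."""
    m = len(z[0])
    total = sum(int("".join(map(str, row)), 2) for row in z)
    return (total >> (m - k)) & 1

def check_all(N, m):
    """Compare both computations on every choice of N numbers with m bits."""
    for bits in product([0, 1], repeat=N * m):
        z = [list(bits[r * m:(r + 1) * m]) for r in range(N)]
        if any(bit_via_threshold(z, k) != bit_direct(z, k) for k in range(1, m + 1)):
            return False
    return True
```

The first level uses $2 \cdot 2^{l+1}$ gates per bit, which is a constant when $N$ is fixed.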
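Step 2: Applying the saturated function. The value of the $k$th bit is
$$b_k = \begin{cases} y_k, & c_0 = 0, \\ 0, & c_0 = 1. \end{cases}$$
First, we have to compute $c_0$. We allocate $2^l$ pairs of threshold gates in the first level:
$$\hat{c}_{ki0} = \mathcal{H}\big(\tilde{z}_k - c_{-l} \cdots c_{-1}\, 1 \underbrace{0 0 \cdots 0}_{m+1-k}\big), \qquad \hat{c}_{ki1} = \mathcal{H}\big(-\tilde{z}_k + c_{-l} \cdots c_{-1}\, 1 \underbrace{1 1 \cdots 1}_{m+1-k}\big).$$
The majority of these gates is the value of $c_0$:
$$c_0 = \sum_{i=1}^{2^l} (\hat{c}_{ki0} + \hat{c}_{ki1}) - 2^l.$$
We change equation (13) to compute $b_k$ directly without computing first $y_k$:
$$b_k = \mathcal{H}\left(\sum_{i=1}^{2^{l+1}} (\hat{y}_{ki0} + \hat{y}_{ki1}) - (2^{l+1} + 1) - c_0\right). \qquad (14)$$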
The size of the circuit that computes the $k$th bit is then $O(2^l)$, where $l = \lfloor \log N \rfloor$. We
copy this circuit for each of the $m$ bits to simulate one threshold gate. Thus, each $\sigma$ gate
is simulated via a threshold circuit of depth 2 and size $O(m)$. The network itself is
hence simulated via $Nm$ copies of those. As $m = O(T)$ and $N$ is considered a constant,
the simulating threshold circuit has the size $O(T^2)$, and it doubles the depth of the
network $\mathcal{N}'$.
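Step 2 can be sketched in the same style. The code below assumes that the partial sum for bit $k$ is already available as an integer `z_k` smaller than $2^{l+s+2}$, where $s = m - k$ is the number of bits to the right of $y_k$, and it treats $c_0$ as an affine combination of first-level gates, so the whole circuit keeps depth 2. The function names are hypothetical.

```python
def H(z):
    """Threshold function."""
    return 1 if z >= 0 else 0

def pair_count(z_k, patterns, shift):
    """Number of active gates among `patterns` pairs that test whether bit `shift` of z_k is 1."""
    total = 0
    for i in range(patterns):
        base = i << (shift + 1)
        total += H(z_k - (base + (1 << shift)))              # pattern,1,00...0
        total += H(-z_k + (base + (1 << (shift + 1)) - 1))   # pattern,1,11...1
    return total

def saturated_bit(z_k, l, s):
    """b_k = y_k if c_0 == 0, else 0, computed as in the modified second-level gate."""
    c0 = pair_count(z_k, 2 ** l, s + 1) - 2 ** l             # equals 0 or 1, no extra level
    return H(pair_count(z_k, 2 ** (l + 1), s) - (2 ** (l + 1) + 1) - c0)

# Hypothetical check with l = 1 and s = 2: compare against direct bit tests.
l, s = 1, 2
agrees = all(
    saturated_bit(z, l, s) == (((z >> s) & 1) if ((z >> (s + 1)) & 1) == 0 else 0)
    for z in range(2 ** (l + s + 2))
)
```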
6. Corollaries 
Let NET-P and NET-EXP be the classes of languages accepted by formal networks in 
polynomial time and exponential time, respectively. Let CIRCUIT-P and CIRCUIT-EXP be 
the classes of languages accepted by families of circuits in polynomial and exponential 
size, respectively. 
Corollary 6.1. NET-P=CIRCUIT-P and NET-EXP=CIRCUIT-EXP. 
The class CIRCUIT-P is often called “P/poly” and coincides with the class of languages 
recognized by Turing machines “with advice sequences” in polynomial time. The 
following corollary states that this class also coincides with the class of languages 
recognized in polynomial time by Turing machines that consult oracles, where the 
oracles are sparse sets. A sparse set S is a set in which, for each length n, the number of 
words in S of length at most n is bounded by some polynomial function. For instance, 
any tally set, i.e. a subset of $1^*$, is an example of a sparse set. The class P(S), for a given
sparse set S, is the class of all languages computed by Turing machines in polynomial 
time and using queries from the oracle S. 
From [3, Vol. I, Theorem 5.5, p. 112] and Corollary 6.1, we conclude as follows.
Corollary 6.2.
$$\text{NET-P} = \bigcup_{S\ \text{sparse}} P(S).$$
From [3, Vol. I, Theorem 5.11, p. 122] (originally [14]), we conclude as follows.
Corollary 6.3. NET-EXP includes all possible binary languages. Furthermore, most 
Boolean functions require exponential time complexity. 
6.1.1. Nondeterministic neural networks 
The concept of a nondeterministic circuit family is usually defined by means of an 
extra input, whose role is that of an oracle. Similarly, we define a nondeterministic 
network to be a network having an extra binary input line, the guess line, in addition to 
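the data and validation lines. A language $L$ is said to be accepted by a nondeterministic
formal network $\mathcal{N}$ in time $B$ if
$$L = \{\alpha \mid \exists \text{ a guess } \gamma,\ \phi_{\mathcal{N}}(\alpha, \gamma) = 1,\ T_{\mathcal{N}}(\alpha, \gamma) \le B(|\alpha|)\}.$$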
It is easy to see that Corollary 6.1, stated for the deterministic case, holds for the 
nondeterministic case as well. That is, if we define NET-NP to be the class of languages 
accepted by nondeterministic formal networks in polynomial time, and CIRCUIT-NP to 
be the class of languages accepted by nondeterministic nonuniform families of circuits 
of polynomial size, then we have the following corollary. 
Corollary 6.4. NET-NP = CIRCUIT-NP. 
Since NP $\subseteq$ NET-NP (one may simulate a nondeterministic Turing machine by a nondeterministic
network with rational weights), the equality NET-NP = NET-P implies
NP $\subseteq$ CIRCUIT-P = P/poly. Thus, from [10], we conclude the following theorem.
Theorem 6.5. If NET-NP = NET-P then the polynomial hierarchy collapses to $\Sigma_2$.
The above result says that a theory of computation similar to that which arises in 
the classical case of Turing machine computation is also possible for our model of 
analog computation. In particular, even though the two models have very different 
power, the question of knowing if the verification of solutions to problems is really 
easier than finding solutions, at the core of modern computational complexity, has 
a precise corresponding version in our setup, and its solution will be closely related to 
that of the classical case. Of course, it follows from this that it is quite likely that 
NET-NP is strictly more powerful than NET-P. 
7. Complexity over the reals 
Blum, Shub, and Smale introduced in [4] a powerful model of computation over 
the real numbers. This model allows one possible formalization of the notion of 
analog computing. Three main characteristics differentiate our neural network model 
from the BSS model, namely: 
• The BSS model allows real-valued inputs rather than only binary.
• In the BSS model, values can be compared for exact equality to any particular
value, e.g. zero. That is, exact precision is available. This is not possible in our
model, as discontinuities are not allowed. By an appropriate choice of weights, we
are able, however, to distinguish, for any fixed $\varepsilon > 0$, between any two values $x$ and
$y$ with $|y - x| > \varepsilon$.
• The BSS model allows for an infinite range of values in registers - which correspond
to our “neurons” - whereas our model restricts the possible range of values to
an adjustable, but finite, bound.
The BSS model is closely related to the model that is obtained if two types of 
neurons are available: “Heaviside” neurons that compute linear threshold functions 
and identity neurons. This model allows for discontinuous branching, as in the BSS 
model. We conclude that in polynomial time, the BSS model computes at least
the class NET-P (P/poly), and the nondeterministic version computes at least
NET-NP (NP/poly).
8. Equivalence of different dynamical systems 
We show that a large class of different networks and dynamical systems has no 
more computational power than our neural (first-order) model with real weights. 
Analogously to Church’s thesis of computability (see e.g. [23, p. 98]), our results 
suggest the following thesis of time-bounded analog computing: “Any reasonable 
analog computer will have no more power (up to polynomial time) than first-order 
recurrent networks.” 
We consider dynamical systems - which we will call generalized processor networks -
with far less restrictive structure than the recurrent neural network model
which was described above.
Let $N, M, p$ be natural numbers. A generalized processor network is a dynamical
system $D$ that consists of $N$ processors $x_1, x_2, \ldots, x_N$, and receives its input
$u_1(t), u_2(t), \ldots, u_M(t)$ via $M$ input lines. A subset of the $N$ processors, say $x_{i_1}, \ldots, x_{i_p}$, is
the set of output processors of the system, used to communicate the output of the
system to the environment. In vector form, a generalized processor network $D$ updates
its processors via the dynamic equation
$$x^+ = f(x, u),$$
where $x$ is the current state of the network (a vector), $u$ is an external input (also
possibly a vector), and $f$ is a composition of functions:
$$f = \psi \circ \pi,$$
where
$$\pi: \mathbb{R}^{N+M} \to \mathbb{R}^N$$
is some vector polynomial in $N + M$ variables with real coefficients, and
$$\psi: \mathbb{R}^N \to \mathbb{R}^N$$
is any vector function which has a bounded range and is locally Lipschitz. (Thus, the
composite function $f = \psi \circ \pi$ again satisfies the same properties.)
We also assume, as part of the definition of a generalized processor network, that, at 
least for binary inputs of the type considered in the definition of “formal networks”, 
$D$ outputs “soft” binary information. That is, there exist two constants, $\alpha$, $\beta$, satisfying
$\alpha < \beta$ and called the decision thresholds, such that each output neuron of $D$ outputs
a stream of numbers each of which is either smaller than $\alpha$ or larger than $\beta$. We
interpret the outputs of each output neuron $y$ as a binary value:
$$\mathrm{binary}(y) = \begin{cases} 0 & \text{if } y \le \alpha, \\ 1 & \text{if } y \ge \beta. \end{cases}$$
In the usual model we studied earlier, the values are always binary, but we allow more 
generality to show that even if one allows more general analog values, no increase in 
computational power is attained, at least up to polynomial time. 
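The “soft binary” reading of an output neuron is easy to state in code. The following minimal Python sketch assumes thresholds $\alpha < \beta$ supplied by the caller and raises an error for values in the gap between them, which by assumption never occur in a valid computation; the function name `soft_binary` is hypothetical.

```python
def soft_binary(y, alpha, beta):
    """Decode an analog output: 0 if y <= alpha, 1 if y >= beta."""
    if not alpha < beta:
        raise ValueError("decision thresholds must satisfy alpha < beta")
    if y <= alpha:
        return 0
    if y >= beta:
        return 1
    raise ValueError("output lies strictly between the decision thresholds")

# Hypothetical thresholds alpha = 0.25 and beta = 0.75.
decoded = [soft_binary(y, 0.25, 0.75) for y in (0.1, 0.9, 0.25, 0.75)]
```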
Remark 8.1. The above assumptions imply that, for each $\rho > 0$, there exists a constant
$C$ such that, for all $x$ and $\tilde{x}$ satisfying
$$|x - \tilde{x}| < \rho \quad \text{and} \quad x \in \mathrm{Range}(\psi)$$
(the absolute value sign indicates Euclidean norm), the following property holds:
$$|\pi(x, u) - \pi(\tilde{x}, u)| \le C |x - \tilde{x}|$$
for any binary vector $u$. A similar property holds for $f$.
Let $T: \mathbb{N} \to \mathbb{N}$ be a function from integers into integers. We say that a generalized
processor network $D$ computes in time $T$ if, for every input of size $n \in \mathbb{N}$, $D$ completes its
output in no more than $T(n)$ steps.
A neural network is a special case of a generalized processor network, in which all
coordinates of the function $\psi$ compute the same piecewise linear function $\sigma$, and the
polynomial $\pi$ is a first-order polynomial, i.e. an affine function.
8.1. Generalized networks with bounded precision 
Let $D$ be a generalized processor network
$$D: \quad x^+ = \psi(\pi(x, u))$$
as above. Let $Q$ be a positive integer. The $Q$-truncation of $D$, denoted
$$Q\text{-truncation}(D),$$
is the network with dynamics defined by
$$x^+ = Q\text{-truncation}[\psi(\pi(x, u))],$$
where “$Q$-truncation” represents the operation of truncating after $Q$ bits. The $Q$-chop
of $D$ is the network with dynamics defined by
$$x^+ = Q\text{-chop}[\psi(\pi(x, u))] = Q\text{-truncation}[\psi(\tilde{\pi}(x, u))],$$
where $\tilde{\pi}$ is the polynomial $\pi$ but with coefficients truncated after $Q$ bits.
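The difference between the $Q$-truncation and the $Q$-chop of a network can be seen on a one-dimensional example. The sketch below is illustrative only: the second-order polynomial, its coefficients, and the choice of the saturation as the bounded Lipschitz map are all assumptions made here, and exact rationals are used so that truncation is the only source of error.

```python
from fractions import Fraction
from math import floor

def trunc(x, Q):
    """Truncate after Q bits (floor-based)."""
    return Fraction(floor(x * 2**Q), 2**Q)

def poly(x, u, coeffs):
    """Hypothetical scalar second-order polynomial c0 + c1*x + c2*u + c3*x*x."""
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * x + c2 * u + c3 * x * x

def psi(v):
    """A bounded, Lipschitz map; the saturation is used purely as an example."""
    return max(Fraction(0), min(Fraction(1), v))

def step_truncation(x, u, coeffs, Q):
    """Q-truncation: exact coefficients, state truncated after Q bits."""
    return trunc(psi(poly(x, u, coeffs)), Q)

def step_chop(x, u, coeffs, Q):
    """Q-chop: coefficients are truncated after Q bits as well."""
    return trunc(psi(poly(x, u, [trunc(c, Q) for c in coeffs])), Q)

coeffs = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 7), Fraction(1, 5)]
x0 = Fraction(1, 2)
after_truncation = step_truncation(x0, 1, coeffs, Q=4)
after_chop = step_chop(x0, 1, coeffs, Q=4)
```

With only four bits the two variants already differ in the last retained bit; the lemmas that follow bound how such discrepancies accumulate over time.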
The next observations ensure that round-off errors due to truncation or chopping
are not too large.
Lemma 8.2. Assume $D$ computes in time $T$ with decision thresholds $\alpha$, $\beta$. Then, there is
a constant $c$ such that the function
$$q(n) = cT(n)$$
satisfies the following property. For each positive integer $n$, let $Q = q(n)$. Then $Q$-truncation($D$)
computes the same function as $D$ on inputs of length at most $n$, with
decision thresholds
$$\alpha' = \alpha + \frac{\beta - \alpha}{3} \quad \text{and} \quad \beta' = \beta - \frac{\beta - \alpha}{3}.$$
Proof. Let $D$ be a generalized processor network satisfying the above conditions, and
let $\tilde{D} = Q$-truncation($D$), with $Q$ still to be decided upon. Let $\delta$ be the error due to
truncating after $Q$ bits, i.e. $\delta = c_1 2^{-Q}$ for some constant $c_1$. Finally, let $\varepsilon_t$ be the largest
accumulated error in all the processors by time $t$. The following estimates are obtained
using the Lipschitz property of $f$:
$$\varepsilon_t \le \delta \sum_{i=0}^{t-1} C^i = \delta\, \frac{C^t - 1}{C - 1},$$
where $C$ is the Lipschitz constant of $f$ for $\rho = 1$ (cf. Remark 8.1). To bound the error
with the amount $\gamma = (\beta - \alpha)/3$, we require
$$\delta\, \frac{C^T - 1}{C - 1} \le \gamma,$$
i.e.
$$\delta \le \tilde{c}\, C^{-T}$$
for some constant $\tilde{c}$. This requirement is met when $\delta$ is the truncation error corresponding to
$$Q = \log\left(\frac{c_1 C^T}{\tilde{c}}\right),$$
so we can take
$$q(n) = cT(n). \qquad \Box$$
As a corollary of Lemma 4.2, and using an argument exactly as in the proof of 
Lemma 8.2, we conclude the following. 
Lemma 8.3. Lemma 8.2 holds for the Q-chop network as well. 
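The linear dependence of the precision $Q$ on the running time can be illustrated numerically from the geometric-sum estimate used in the proof of Lemma 8.2. The sketch below assumes a Lipschitz constant $C > 1$, a truncation error $\delta = c_1 2^{-Q}$ with $c_1 = 1$ by default, and a brute-force search for the smallest sufficient $Q$; the function names and the sample constants are hypothetical.

```python
from fractions import Fraction

def accumulated_error(delta, C, t):
    """Upper estimate delta * sum_{i=0}^{t-1} C**i of the error after t steps."""
    return delta * sum(Fraction(C) ** i for i in range(t))

def bits_needed(T, C, alpha, beta, c1=1):
    """Smallest Q whose truncation error keeps the estimate within gamma = (beta - alpha) / 3."""
    gamma = Fraction(beta - alpha) / 3
    Q = 0
    while accumulated_error(Fraction(c1, 2**Q), C, T) > gamma:
        Q += 1
    return Q

# Hypothetical constants: C = 2, thresholds 1/4 and 3/4.
q_short = bits_needed(10, 2, Fraction(1, 4), Fraction(3, 4))
q_long = bits_needed(20, 2, Fraction(1, 4), Fraction(3, 4))
```

Doubling the running time from 10 to 20 steps adds $10 \log_2 C$ bits here, in line with the choice $q(n) = cT(n)$.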
8.2. Equivalence of neural and generalized networks 
Definition 8.4. Given a vector function $f = \psi \circ \pi$ as above, we say that $f$ is approximable
in time $A_f(n)$ if there is a Turing machine $M$ that computes $T(n)$-truncation($f$) in time
$A_f(n)$ on each input having total bit size $n$.
Example 8.5. If $f = \psi \circ \pi$, $\psi$ is approximable, and $\pi$ has rational coefficients, then $f$ is
approximable (as $\pi$ is approximable in this case).
Lemma 8.6. Let $L(T)$ be the class of languages recognized by generalized processor
networks in time $T$, for which the function $f$ is approximable in time $A_f$, and the function
$T$ is computable in time $M(n)$. Then, $L(T)$ is included in the class of languages recognized
by Turing machines in time $O(M(n) + T(n) A_f(T(n)))$.
Proof. Given a generalized processor network D satisfying the above assumptions, 
a Turing machine which approximates it can be built as follows. The machine receives 
an input string of length n. As a first step, it computes the function T(n), and it 
estimates the required precision Q=q(n) as in the previous lemma. Finally, it simu- 
lates the generalized processor network step by step, forgetting all information but the 
first $Q$ required bits. This Turing machine computes the required function in the stated
time. □
Corollary 8.7. Let D be a network which computes in polynomial time T, and such that 
f is approximable in polynomial time. Then the language recognized by D is in P. 
Definition 8.8. Given a vector function $f = \psi \circ \pi$ as above, we say that $f$ is nonuniformly
$F(n)$-approximable in time $A_f(n)$ if there is a Turing machine $M$ that computes
$T(n)$-chop($f$) in time polynomial in $T(n)$ using an advice function (cf. [3, Vol. I,
pp. 99-115]) of length $F(n)$.
Example 8.9. Assume a generalized processor network D that computes in time T. 
A polynomial $\pi$ with general real coefficients is nonuniformly $T(n)$-computable: for
each input of size n, the machine receives the first O(T(n)) bits of each coefficient as an 
advice sequence, and then computes the polynomial. 
From the above results, we may conclude as follows. 
Theorem 8.10. Let $D$ be a generalized processor network which computes via a function
$f = \psi \circ \pi$. Assume $\psi$ is nonuniformly $F(n)$-approximable in polynomial time. Then there
exists a neural network $\mathcal{N}$ which recognizes the same language as $D$ and which does so
with at most polynomial-time slowdown. Furthermore, if $\psi$ is $F(n)$-approximable in
polynomial time and $\pi$ involves rational coefficients only, the weights of $\mathcal{N}$ are rational
numbers as well.
Corollary 8.11. Adding flexibility to the neural network model does not add power to the
model, except for a possible polynomial-time speedup. This flexibility includes
• using a high-order polynomial $\pi$ rather than an affine function,
• using other $\psi$ functions rather than the saturation we used earlier, including the
possibility of having different functions in different neurons,
• allowing for the output to be “soft binary” rather than pure binary.
Note that networks with high-order polynomials have appeared especially in the
language recognition literature (see e.g. [8] and the references therein). We emphasize
the relationship between these models. Let $\mathcal{N}_1$ be a neural network (of any order)
which recognizes a language $L$ in polynomial time. Then there is a first-order network
$\mathcal{N}_2$ which recognizes the same language $L$ in polynomial time.
Remark 8.12. The networks that we consider are mildly “robust to noise and to 
implementation error” in the sense that small enough perturbations in weights or 
(formulated in a suitable sense) in the sigmoid activation function do not affect the 
computation, as long as “soft binary” outputs are considered. Given any time $T$, there
is some $\varepsilon_T > 0$ so that an error of $\varepsilon_T$ would not affect the computation up to time $T$. This is
an easy consequence of the continuous dependence of the output on all the data. (A 
detailed proof involves defining precisely “perturbations of the activation function”; 
we omit the details.) 
9. Comments on analog and non-Turing “computation” 
In his recent, very popular - and very controversial - book [18], Penrose has
argued that the standard model of computing is not appropriate for modeling true 
biological intelligence. The author argues that physical processes, evolving at a quan- 
tum level, may result in computations which cannot be incorporated in Church’s 
thesis. It is interesting to point out that the work we report here does allow for such 
non-Turing power, while keeping track of computational constraints - and thus 
embedding a possible answer to Penrose’s challenge in more classical computer 
science. Note that Parberry, in [16], also insists that possible non-Turing theories 
should take account of such constraints, though he suggests a different approach, 
namely the use of probabilistic computations within the theory of circuit complexity. 
Finally, we remark that human cognition seems to be clearly based on “subsym- 
bolic” or “analog” components and modes of operation. As pointed out by many 
authors, in particular in the work of [13], the issue of understanding how macroscopic 
symbolic behavior arises from such a substrate is one of the most challenging ones in 
science. Perhaps our work, with its implicit use of infinite precision for internal 
computations, is not at all relevant to this understanding, because neurons are often 
taken to be low-precision devices. On the other hand, it is also possible that the 
precision issue should be understood solely in terms of limitations on observers and, 
more generally, interactions with the environment, and in that respect our model is 
not deficient, since input and output data are binary. 
References 
[1] J. Alspector and R.B. Allen, A neuromorphic VLSI learning system, in: P. Losleben, ed., Advanced
Research in VLSI: Proceedings of the 1987 Stanford Conference (MIT Press, Cambridge, MA, 1987)
313-349.
[2] K.E. Atkinson, An Introduction to Numerical Analysis (Wiley, New York, 1989). 
[3] J.L. Balcazar, J. Diaz and J. Gabarro, Structural Complexity (Springer, Berlin, 1988). 
[4] L. Blum, M. Shub and S. Smale, On a theory of computation and complexity over the real numbers: 
NP completeness, recursive functions, and universal machines, Bull. AMS 21 (1989) l-46. 
[5] A.K. Chandra, L. Stockmeyer and U. Vishkin, Constant depth reducibility, SIAM J. Comput. 13
(1984) 423-439. 
[6] S.P. Eberhardt, T. Daud, D.A. Kerns, T.X. Brown and A.P. Thakoor, Competitive neural architecture 
for hardware solution to the assignment problem, Neural Networks 4 (1989) 431-442. 
[7] M. Furst, J.B. Saxe and M. Sipser, Parity, circuits, and the polynomial-time hierarchy, in: Proc. 22nd 
IEEE Symp. Foundations of Comput. Sci. (1981) 260-270. 
[8] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun and Y.C. Lee, Learning and extracting finite
state automata with second-order recurrent neural networks, Neural Comput. 4 (1992) 393-405. 
[9] J.W. Hong, On connectionist models, Comm. Pure Appl. Math. 41 (1988) 1039-1050. 
[10] R.M. Karp and R. Lipton, Turing machines that take advice, Enseign. Math. 28 (1982) 191-209.
[11] J. Kilian and H.T. Siegelmann, On the power of sigmoid neural networks, in: Proc. 6th ACM
Workshop on Computational Learning Theory, Santa Cruz, 1993. 
[12] W. Maass, G. Schnitger and E.D. Sontag, On the computational power of sigmoid versus Boolean 
threshold circuits, in: Proc. 32nd IEEE Symp. Foundations of Comput. Sci. (1991) 767-776.
[13] B.J. MacLennan, Continuous symbol systems: the logic of connectionism, in: D.S. Levine and 
M. Aparicio IV, eds., Neural Networks for Knowledge Representation and Inference (Lawrence 
Erlbaum, Hillsdale, NJ, 1992). 
[14] D.E. Muller, Complexity in electronic switching circuits, IRE Trans. Electronic Comput. 5 (1956) 
15-19. 
[15] S. Muroga, Threshold Logic and its Applications (Wiley, New York, 1971). 
[16] I. Parberry, Knowledge, understanding, and computational complexity, Tech. Report CRPDC-92-2,
Center for Research in Parallel and Distributed Computing, Department of Computer Sciences, Univ. 
of North Texas, 1992. 
[17] I. Parberry, The Computational and Learning Complexity of Neural Networks, draft (MIT Press, 
Cambridge, 1994). 
[18] R. Penrose, The Emperor’s New Mind (Oxford Univ. Press, Oxford, 1989). 
[19] J.E. Savage, The Complexity of Computing (Wiley, New York, 1976). 
[20] H.T. Siegelmann and E.D. Sontag, On the computational power of neural nets, in: Proc. 5th ACM
Workshop on Computational Learning Theory, Pittsburgh, PA (1992) 440-449. 
[21] A. Vergis, K. Steiglitz and B. Dickinson, The complexity of analog computation, Math. Comput.
Simulation 28 (1986) 91-113. 
[22] D. Wolpert, A computationally universal field computer which is purely linear, Report LA-UR- 
91-2937, Los Alamos National Laboratory. 
[23] A. Yasuhara, Recursive Function Theory and Logic (Academic Press, New York, 1971). 
