Abstract
Introduction
There is a large class of problems that are at best poorly solved.
These problems involve the transformation of data across the boundary between the real world and the digital world. And they occur whenever a computer is sampling and/or acting on real world data.
Examples of these "boundary transformation" problems include the computer recognition of human speech, computer vision, textual and image content recognition, robot control, Optical Character Recognition (OCR), Automatic Target Recognition, etc. These are difficult problems to solve on a computer, since they require the computer to find complex structures and relationships in massive quantities of low precision, ambiguous, noisy data.
"Boundary transformations" are also very important. Our inability to adequately solve these problems constitutes a significant barrier to computer usage.
"I claim that if you take anything that's a human skill -speech, listening, hand-writing, touch -it's totally predictable that those are key technologies … that people should invest millions and millions of dollar in." Bill Gates, Upside Magazine, May 1992
We have made much progress at front end processing (such as in the Digital Signal Processing of one and two-dimensional signals), but the solution to complex recognition problems still eludes us. Neither Artificial Intelligence, Artificial Neural Networks, nor Fuzzy Logic have given us effective and robust solutions.
This paper discusses an approach that has the potential to move us closer to solving these problems. I begin with a discussion of Intelligent Signal Processing (ISP), which attempts to solve complex problems in recognition and control. I then look at biological computing models, which offer new insight into techniques for doing intelligent signal processing. However, these biological models are radically different and require radically different implementations.
In parallel to these revolutionary advances in computational neurobiology, silicon technology has been advancing at a phenomenal pace. Although previous attempts at combining these technologies were premature, silicon and computational neurobiology are beginning to merge to create a powerful and radically new form of computation. This synthesis will result in a large, new market in neuromorphic silicon for solving a number of important problems, ranging from genetic sequencing, Internet routing, and content recognition to robotic control and speech processing.
Intelligent Signal Processing
ISP techniques, in essence, enhance boundary transformations. In other words, ISP augments and enhances existing Digital Signal Processing (DSP) by incorporating contextual and higher level knowledge of the application domain into the data transformation process.
One of the most common ISP techniques in use today is the Hidden Markov Model (HMM) [1]. In an HMM, the states in the model are discrete activations, with transition and symbol emission probabilities obtained by training on real-world data. HMMs do not approach human capabilities. Higher-level structure is limited to keep model sizes under control, and only moderate parallelism is leveraged, further limiting model size.
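To make the roles of the transition and emission probabilities concrete, the following is a minimal sketch of the standard forward algorithm for a discrete HMM. The two-state model and all of its numbers are invented for illustration and are not taken from any system discussed here.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations | model) for a discrete HMM (forward algorithm).

    initial[s]        -- probability of starting in state s
    transition[r][s]  -- probability of moving from state r to state s
    emission[s][k]    -- probability of emitting symbol k while in state s
    """
    n_states = len(initial)
    # alpha[s]: probability of the symbols seen so far, ending in state s
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            emission[s][symbol]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model (illustrative values only).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood(initial, transition, emission, [0, 1, 0])
```

In a recognizer, one such model would typically be trained per word or phoneme, and the model with the highest likelihood for the observed symbol sequence would be selected.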
Researchers believe that what makes human beings so good at pattern recognition is that:
• we generate numerous hypotheses based on incomplete and noisy data;
• we select the "best" hypothesis based on previously observed data from the process in question, which could be a stored "model" that has evolved from repeated encounters with a particular context;
• we make efficient use of historical statistical information in the selection process; and
• we do all this in real-time.
An open question is, when attempting to recreate human-like intelligence in a computer, how accurately must one model the way humans perform these computations?
For many years, the symbolic, modestly parallel approach (which has little biological relevance) was used, and it has not achieved great success. Many researchers, even in the AI community, are beginning to agree that a key component of human intelligence is its ability to leverage more biological characteristics such as massive parallelism and statistical, fuzzy processing.
A thought experiment that demonstrates the significant differences between how computers currently perform boundary transformations and how real neural circuitry works is Jerry Feldman's "100 step rule." Take a simple cognitive task involving the brief exposure of the image of a character on a screen. The subject is to push a button if the character is a numeral, but do nothing if it is a letter. For humans, the time for processing such a task, after practice, is typically about half a second (500 milliseconds). Given that the typical switching time for neurons is on the order of a few milliseconds, the brain processes this complex task in roughly 100 sequential steps, which implies massive parallelism.
Looking at biological computing hardware this is an obvious conclusion, since human cerebral cortex is estimated to consist of about 10 billion, relatively slow neurons. A computer program designed to accomplish the same task would be mostly sequential and could easily take up to a billion steps, which can best be described as "massively sequential."
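The arithmetic behind the "100 step rule" can be written out as a short sketch. The 5 millisecond switching time is an assumed value standing in for "a few milliseconds"; the response time and the billion-step program figure are the ones quoted above.

```python
response_time_ms = 500          # human response time quoted in the text
neuron_switch_ms = 5            # assumption: "a few milliseconds" taken as 5 ms

# Longest chain of neuron firings that fits into one response
sequential_steps = response_time_ms // neuron_switch_ms

program_steps = 1_000_000_000   # "up to a billion steps" for a sequential program

# How many times deeper the sequential program is than the neural computation
depth_ratio = program_steps // sequential_steps
```

Under these assumptions the brain has room for only about 100 sequential steps, while the conventional program is some ten million times deeper, so the work per step in the brain must be spread across a very large number of neurons.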
The first real attempts to create more brain-like models for knowledge representation problems were the connectionist models [2]. They are a first step toward more neural-like representation. These models were often highly structured and problem specific: each node generally implied a specific meaning, and each connection a relationship. The networks were sparsely connected and activated, and computation was generally done in parallel using constraint relaxation (for example, by energy minimization). Other related models are spreading activation semantic networks and Bayesian networks [3].
Computers are getting faster and can execute larger and more complex versions of existing ISP techniques. But we need to do more than just rely on higher clock rates and larger memories to move to the next level in recognition capability -we need new solutions. We know that biological computing solves these problems. Perhaps we should turn there for inspiration.
Biological Computing Models
Even the most primitive biological systems are capable of performing complex ISP. In addition, biological computing is robust in the presence of faulty and failing hardware, and requires no intrinsic synchronization 1 . Biological computing is energy efficient, consists of networks of sparsely connected and sparsely activated nodes, and requires only moderate levels of precision (often binary).
Other current hypotheses about neurobiological computation include:
• Communication is expensive (mostly in energy), so biology tends to trade-off local computation for non-local communication;
• It is most likely that data representation is partially distributed or "vector encoded," where each node participates in a number of representations; this enhances fault tolerance and response time, and allows more efficient knowledge representation;
• There seems to be high-level linkage (hierarchical and bi-directional) between large subnetworks; and
• They tend to be dynamic with multiple feedback loops.
If these models are so promising, why haven't our current batch of Artificial Neural Network (ANN) algorithms been more successful at performing ISP? ANN models create a powerful set of tools for solving a number of interesting problems, but most ANN models have little biological relevance. Among other things they are too small and not dynamic enough. In addition, they are limited to moderate levels of parallelism, unlike biological networks which are massively parallel. For all these reasons biologically inspired models have much potential for providing us with new, scalable ISP algorithms.
Before we can be inspired by computational neurobiology, there have to be abstract functional representations of these systems. What may actually be the most important result from the recent resurgence in neural network research is a major shift in perspective in the neuroscience community. In the last 10-15 years many neuroscientists have been looking at functional models ("what does it compute?") and not just structural models ("what is it connected to?"). As researchers attempt to model ever more complex, higher order functionality, computational models are emerging from neuroscience laboratories all over the world. Such models will be the primary inspiration for the next generation of ISP algorithms.
There are a number of excellent examples of the reverse engineering of biological computing systems. These models are abstracted from the original biology and are scalable to large configurations.
An important model is the Cortronic Network, which has been developed at HNC Software under the leadership of Robert Hecht-Nielsen [4] . These networks are abstract models of cerebral cortex that perform association. They are sparsely connected and scalable to extremely large networks. The basic computation is straightforward and the models are stable. HNC is now using them to perform complex language processing tasks.
Another important set of models consists of those of Gary Lynch and Rick Granger at the University of California at Irvine [5]. They and their coworkers have "reverse engineered" olfactory pyriform cortex and hippocampus. Their hippocampus model performs Bayesian classification with Parzen windows using a network of a non-obvious and amazingly efficient design. It is sparsely connected and activated, and data are represented in a partially distributed manner where the network design leverages the statistical aspects of neuron connectivity. The models are now being used for solving many real-world pattern recognition problems by Thuris, Inc., a company started by Rick Granger.
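For readers unfamiliar with the term, the following is a minimal textbook sketch of Bayesian classification with Parzen windows in one dimension. It is not the Lynch/Granger network, which reaches a comparable result with a very different, sparsely connected design; the Gaussian kernel, the window width, and the toy data are all assumptions made for illustration.

```python
import math


def parzen_density(x, samples, width):
    """Parzen-window estimate of p(x | class) using a Gaussian kernel."""
    norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - s) / width) ** 2) for s in samples)
    return total / len(samples)


def classify(x, training, width=0.5):
    """Choose the class with the largest prior * density (Bayes decision rule)."""
    n_total = sum(len(samples) for samples in training.values())
    scores = {
        label: (len(samples) / n_total) * parzen_density(x, samples, width)
        for label, samples in training.items()
    }
    return max(scores, key=scores.get)


# Hypothetical one-dimensional training data for two classes.
training = {
    "low": [0.9, 1.0, 1.2, 1.1],
    "high": [2.9, 3.1, 3.0, 3.3],
}

label_a = classify(1.05, training)
label_b = classify(3.2, training)
```

Every stored sample contributes to every decision, which is why this computation is expensive on sequential hardware and a natural fit for a parallel network.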
The third set of models includes cortical models from Rodney Douglas's and Kevan Martin's group at ETH Zürich [6] . In addition to computational models, they have created silicon implementations, and are now applying these implementations to a number of real world applications in areas such as robotic control and computer vision.
Other interesting and relevant models include those of Ted Berger [7] and Christoph von der Malsburg [8] at USC, as well as those of Anders Lansner and his group [9, 10] at the Royal Institute of Technology in Stockholm. Lansner and his group are doing some interesting work in reinforcement-based learning [11].
Implementation Issues
It appears that one of the problems with current ANN models is a lack of sufficient parallelism. It is therefore most likely that successful, biologically inspired ISP will utilize large numbers of nodes. The ability to execute such models in real time requires radically new silicon architectures -even the fastest projected microprocessors will be insufficient. For example, emulating in real time a network of 1 million nodes and 1 billion connections (each node is connected to 1000 other nodes), where the network is updated once every 100 microseconds, requires more than 10 Tera-Ops/sec 2 . In addition, many of the envisioned applications require this performance in low power and inexpensive implementations.
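The throughput figure follows directly from the numbers in the example. In the sketch below, the assumption of exactly one operation per connection per update is ours; any realistic synapse model needs more than one, which is why the requirement is stated as "more than" 10 Tera-Ops/sec.

```python
nodes = 1_000_000
connections_per_node = 1_000
updates_per_second = 10_000      # one network update every 100 microseconds
ops_per_connection = 1           # assumption: a lower bound of one op per connection

connections = nodes * connections_per_node
ops_per_second = connections * ops_per_connection * updates_per_second
tera_ops_per_second = ops_per_second // 10**12
```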
People have built many neural network chips, but none have been commercial successes. These early efforts were either analog [12], where they suffered from design and technology limitations, or digital [13], where, with moderately parallel models and limited I/O and transistor counts, they found themselves competing directly with mainstream microprocessor and DSP technologies, and lost.
In addition to the need to provide more powerful ISP, another reason for looking to biology for inspiration for future VLSI structures is Moore's Law 3 , which the semiconductor industry has been following for almost 30 years. It has been said that it is not really a physical law, but one of faith. And now there is increasing pressure on our faith. As gate lengths shrink:
• quantum effects become more common,
• transistors are increasingly leaky, noisy, and unreliable,
• metal interconnect appears as long, slow transmission lines,
• communication becomes expensive relative to computation,
• it is increasingly difficult to synchronize an entire chip at multiple GigaHz clock rates, and
• it is almost impossible to perform design verification and validation of a 100 million transistor design.
Another threat to Moore's law is fabrication cost. Intel is building a $2.3 billion chip fab in Oregon -lots of chips need to be sold to amortize this kind of investment.
Looking at this problem differently, much of the pressure on Moore's law results from existing computational models, which:
• are fault intolerant,
• require high precision,
• are globally synchronized, and
• perform extensive global communication, which is required, for example, in high-precision, parallel multiplication and score-boarding and conditional execution.
The models being derived from work in computational neurobiology, the awesome capabilities of silicon, and the fact that transistors are starting to behave like neurons create a unique opportunity for radical new architectural models. These technologies, coupled with the significant need for more powerful ISP solutions, are creating what Intel's ex-CEO Andy Grove refers to as a strategic inflection point. This is the fundamental premise of our research project:
• that computational neurobiology will inspire new ISP models, and
• that these models will be massively parallel and require massively parallel silicon architectures for efficient execution.
The implication is not that Moore's law will end for traditional computing structures; it will continue for some time, though many now acknowledge that there will be a slowing down as the fabrication of deep submicron circuits becomes ever more complex and expensive, and the behavior of the transistors themselves becomes more problematic. The main point of the discussion here is that biological computing offers models that will allow more rapid scaling because they are fundamentally tolerant of many of the deleterious effects of extreme scaling. They will also lead to significantly cheaper implementations, providing massively parallel computation using small, low power, fault tolerant processors.
Just as it is clear that Moore's law will continue to hold (more or less) for traditional computational structures, so too the biologically inspired systems discussed here are NOT being considered as a replacement for current computing models. Rather, they will augment and enhance what we currently do with computers, acting as ISP co-processors.
The Impact On VLSI Architecture
To create silicon structures for emulating biologically inspired computing, we need a better understanding of biological computing models, and we need VLSI design techniques that emulate these models efficiently. We also need to identify which aspects of computational neurobiology are necessary and which are not. For example, analog computation has advantages in low precision computations and low power applications, and impressive computational density. However, analog computation also has disadvantages in stability, temperature sensitivity, communication, and ease of design. And it is not clear that analog's computational density is an advantage in sparsely activated, sparsely connected networks.
Digital technology is less area efficient, especially for certain types of functionality (e.g., leaky integration). It is also power intensive, and the representation of time tends to be more complicated (events are typically synchronized to a global clock). But digital allows for the efficient multiplexing of scarce computational and communication resources.
One of our research tasks is to determine the combination of implementation techniques that is best for this architecture space. We suspect that hybrid analog/digital, or "mixed-signal", techniques may very well constitute the optimum design point.
There are numerous other implementation issues in the adaptation of biological models to a vastly different implementation technology:
• capturing high-order and temporal information efficiently;
• stability;
• robustness in the face of faulty hardware -silicon also has different failure modes than biological structures; and
• connectivity -silicon does not have the same storage and connectivity capabilities as biology, which could ultimately limit silicon based ISP.
Of these, connectivity is one of the most important characteristics of biological neural structures. As Carver Mead expressed so eloquently in his ground breaking book on neural-inspired VLSI [16],
"Computation is always done in the context of neighboring information. For a neighborhood to be meaningful, nearby areas in the neural structure must represent information that is more closely related than that represented by areas further away. Visual areas in the cortex that begin the processing sequence are mapped retinotopically. Higher-level areas represent more abstract information, but areas that are close together still represent similar information. It is this map property that organizes the cortex such that most wires can be short and highly shared; it is perhaps the single most important architectural principle in the brain."
Unfortunately, connectivity is perhaps the one area where silicon is significantly less robust than biology. Communication in silicon is generally limited to a two-dimensional plane (though with several levels, 6-8 with today's semiconductor technologies). It is still one of the most important problems as we consider scaling to very large models. The following important result [17] demonstrates why.
Theorem: Assume an unbounded or very large rectangular array of silicon neurons where each neuron receives input from its N nearest neighbors, i.e., the fan-out (divergence) and fan-in (convergence) is N. Each such connection consists of a single metal line, and the number of two-dimensional metal layers is much less than N. Then the area required by the metal interconnect is $O(N^3)$.
If, for example, we double the fan-in from 100 to 200, the silicon area required for the metal interconnect increases by a factor of 8.
This unfortunate result means that even for moderate connectivity, the silicon area 4 devoted to metal interconnect will dominate. Research at OGI [17] has indicated that even moderate multiplexing of communication resources would significantly decrease the silicon area requirements without any real loss in performance. Means [18] studied the implementation of the Lynch/Granger pyriform cortex model with multiplexed and non-multiplexed communication and obtained a similar result.
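The scaling can be illustrated with a small sketch. It assumes, consistent with the factor of 8 quoted for a doubling of fan-in, that interconnect area grows as the cube of N; the function name and the reference fan-in are ours.

```python
def relative_interconnect_area(fan_in, reference_fan_in=100):
    """Interconnect area relative to a reference fan-in, assuming cubic growth in N."""
    return (fan_in / reference_fan_in) ** 3


doubling = relative_interconnect_area(200)      # fan-in 100 -> 200
cortex_like = relative_interconnect_area(1000)  # fan-in 100 -> 1000
```

Under this assumption, moving from a fan-in of 100 to the 1000 connections per node used in the examples in this paper multiplies the dedicated interconnect area by a factor of 1000, which is why some form of multiplexing becomes attractive.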
Concurrently, Carver Mead's group at Caltech and others developed "Address-Event-Representation" or AER communication [19, 20]. The address-event technique has also been expanded into a hierarchical structure by Lazzaro and Wawrzynek [21]. When analog computation is used, signals can be represented by action-potential-like "spikes" (generally a neuron unit exceeding its threshold). These signal "packets" or "pulses" are transmitted asynchronously at the moment they occur, by sending the originating unit's address on a single multiplexed bus. This "pseudo-digital" representation allows multiplexing of the bus and retention of temporal information, provided contention among the units sharing the bus is minimal.
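The idea can be sketched in software as follows. This is only an illustration of the representation, not of any particular AER circuit: in hardware the time of an event is implicit in when its address appears on the bus, whereas the sketch carries an explicit timestamp. The unit addresses, spike times, and bus slot length are invented.

```python
def encode_address_events(spike_times_by_unit):
    """Merge per-unit spike times into one time-ordered stream of (time, address)."""
    events = [
        (t, address)
        for address, times in spike_times_by_unit.items()
        for t in times
    ]
    return sorted(events)


def decode_address_events(events):
    """Rebuild per-unit spike times from the shared-bus event stream."""
    spikes = {}
    for t, address in events:
        spikes.setdefault(address, []).append(t)
    return spikes


def count_contentions(events, bus_slot_us):
    """Count events that arrive less than one bus slot after the previous event."""
    return sum(
        1 for (t0, _), (t1, _) in zip(events, events[1:]) if t1 - t0 < bus_slot_us
    )


# Hypothetical spike times in microseconds, keyed by unit address.
spikes = {3: [10.0, 55.0], 7: [12.5], 42: [55.2, 90.0]}
bus = encode_address_events(spikes)
contentions = count_contentions(bus, bus_slot_us=1.0)
```

With sparse activity almost all events are well separated, so the shared bus preserves spike timing; the single near-coincident pair in this toy example is the kind of contention an arbiter would have to resolve.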
Related to multiplexing is the issue of synchronization and clocking.
If significant multiplexing is used, then it becomes more difficult to operate in real time and a simulated virtual clock is required.
There are already some interesting techniques that have been developed for synchronizing large-scale SIMD systems that may be of use here [22, 23] .
In studying potential implementations of cortical structures, we have developed an efficient multiplexing architecture where data transfer occurs via overlapping, hierarchical buses [24] [25] [26] [27] . This structure, The Broadcast Hierarchy (TBH), allows simultaneous high-bandwidth local connectivity and long-range connectivity, thereby providing a reasonable match to many biological connectivity patterns. Braitenberg [28] postulates two general connectivity systems in cortex: "metric" (high density connections to physically local cells, based on actual two-dimensional layout), and "ametric" (low density point-to-point connections to all large groups of densely connected cells). Connectivity is significantly denser in the metric system, but with limited extent, whereas connectivity in the ametric system is very sparse and random. There are actually many other reasons for such bimodal connectivity schemes [29] . One hypothesis that we will be investigating is that these localized connectivity patterns actually enable certain kinds of advanced cognitive processing such as abstraction and hierarchical representations. So, it is possible that in solving the scaling problem, biological computation created a structure of great power and flexibility.
Assume the network discussed above with 1 million nodes and 1000 connections per node, which is 1 billion connections. If we have a simple analog processor per neuron, then we can compute all 1000 connections simultaneously. If we are using micropower techniques, each processor could take a few microseconds; assume 100 microseconds. So we are computing 1 billion connections in 100 microseconds, which gives us a computation rate of 10 trillion connections computed per second. Assume that 10% of the neurons are active (i.e., they produce output pulses) and that each active neuron communicates 10 pulses 5 on average during a single network update. This is about 100K pulses per second per active neuron. The entire communication network then must handle 1M x 10% x 100K, or 10 billion pulses per second.
Below is a figure of a simple, two-level broadcast hierarchy with four "nodes" (digital or analog processors) in each low level region. A node can broadcast to any other node in its low level region or any node in the high level region. Assume that we have a two-level broadcast hierarchy, and that 95% of the messages from each node are to other nodes that can be reached by broadcasting the pulses on the lower layer. In addition, assume that each lower level broadcast region is connected to 1000 neurons. That means that each low level broadcast region needs to handle about 9.5M pulses per second, which is not a terribly large bandwidth. Recall that there are 1000 of these lower layers, so the accumulated bandwidth is 9.5 billion pulses per second.
The top broadcast region, which will cover all 1M neurons, needs to handle 5% or 500 million pulses per second, which is certainly achievable in today's semiconductor technology.
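The bandwidth budget of this two-level example can be checked with a short sketch. All parameter values are the ones assumed in the text; the function and variable names are ours, and integer arithmetic is used so the figures come out exactly.

```python
def broadcast_budget(neurons, active_pct, pulses_per_update,
                     updates_per_second, local_pct, neurons_per_region):
    """Pulse rates (pulses per second) for a two-level broadcast hierarchy."""
    per_active_neuron = pulses_per_update * updates_per_second
    active_neurons = neurons * active_pct // 100
    total = active_neurons * per_active_neuron
    regions = neurons // neurons_per_region
    local_total = total * local_pct // 100   # traffic that stays on the lower level
    return {
        "per_active_neuron": per_active_neuron,
        "total": total,
        "regions": regions,
        "per_low_region": local_total // regions,
        "all_low_regions": local_total,
        "top_region": total - local_total,   # remaining traffic on the top level
    }


budget = broadcast_budget(
    neurons=1_000_000,
    active_pct=10,
    pulses_per_update=10,
    updates_per_second=10_000,   # one update every 100 microseconds
    local_pct=95,
    neurons_per_region=1_000,
)
```

The result is sensitive to the locality assumption: if, for instance, only 80% of the traffic stayed local, the top region would have to carry 2 billion pulses per second instead of 500 million.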
Messages can be "pipelined" through buffers and routers, since even in neural circuitry there is often a fair amount of signal delay -though it is important that the delay be reasonably consistent and predictable.
This example is quite simple; a real implementation would probably have several broadcast layers and possibly even some point-to-point connectivity. The main point here is that we believe that we can meet the connectivity requirements of large neural network models with current silicon techniques. However, it is necessary that the networks being emulated exhibit reasonably localized interconnect, which has been shown to be the case in cortical structures [28] [29] [30] .
When implementing neural-like structures in silicon, an important issue concerns how the synapse is represented, in particular, how information is stored in the synapse. For digital systems, such information storage is straightforward. Single bits can be stored in dynamic, static, or floating-gate devices, since even in a noisy environment, signal restoration to a 1 or 0 is reasonably straightforward. However, storing analog values is more error prone and complex. There has been much work in creating floating-gate structures for analog learning systems [31] . We intend to leverage this technology to the degree that we use mixed-signal (analog/digital) data representations. It should be pointed out that the models we are considering here use either single bit or at most a few bits to represent information at each synapse. It is very possible that multi-level logic would provide the best representation compromise and the most efficient utilization of scarce communication resources.
Another important issue affecting VLSI architecture is Fault Tolerance. Research at OGI [32] has shown that even with all-digital implementations, massively parallel hardware emulating neural network models has a reasonable degree of fault tolerance. This is due to the fact that the majority of silicon area contains circuitry whose failure has local functional impact.
Another interesting question concerns whether there is some degree of design fault tolerance. The Adaptive Solutions CNAPS chip had two design errors in the PN arithmetic unit which were not discovered for several years. The invisibility of this problem was mainly due to the fact that most applications did not need precise arithmetic results. More work is needed in this area as these architectures evolve.
Then there is manufacturing chip test. It is not clear exactly how one tests a faulty, mixed-signal chip like this, so the development of test strategies optimized for this kind of architecture is important. Although testing such chips efficiently is challenging, we do not view this as a "show-stopper" for the long-range success of the technology. However, we do foresee a fair amount of work required to create the necessary test techniques.
Commercial Realization
The goal of our project is to create a family of commercial ICs for use in a range of ISP solutions. Tentatively, we plan for the first commercial chip derived from this work to be the Associative Data Processor (ADP) 6 . The ADP will implement high speed, high capacity, best-match, associative memory. Because of algorithm and application dependencies, it is difficult to estimate at this time the implementation parameters of the resulting chip. However, Palm [33] has shown that these networks operate best with a very large number of nodes. Our goal will be tens of thousands of nodes for a preliminary implementation.
In addition to large numbers of nodes, Associative Memory operation is highly dependent on having a sparse data representation. Since few natural data representations are sparse, we will need input and output pre- and post-processing to "sparsify" and then "desparsify" input and output. Fields [34] has shown that sparse representations may be a more suitable representation for preprocessed data. Small networks with a moderate number of inputs and a large number of outputs can be trained to efficiently map application specific external representations to the distributed internal representations used by the network. A similar technique would be used for system output. Also, the input/output networks can be used to convert temporal to spatial information and vice-versa.
For the first generation of ADPs we envision the basic functionality will be "best-match" associative processing.
For the most part, the "content addressable" memory function that has been implemented to date is considered "exact" match. Examples include cache and virtual page addressing in modern microprocessors, and domain name to IP address lookup in Internet domain servers. In these cases a portion of a record is used as input, and the memory returns the rest of the record. In contrast, with best-match processing, an arbitrary subset of a record is input, which may not match any record in the memory. In this case, the memory returns the closest match (or matches) according to some metric. Incidentally, a best-match memory will always return the exact match first, if such a match exists.
Best-match is significantly more powerful and more difficult to implement using traditional computing structures (it generally requires a complete search of the data in the memory). But it can gracefully deal with errors and missing data, and perform reasonable "generalization."
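The difference between the two kinds of lookup can be sketched on a conventional computer as follows. The sketch uses a simple count of mismatching bits over the positions supplied in the input as its metric and performs the complete search mentioned above; the stored records and queries are invented, and the metric actually realized by an ADP would depend on the network and its training.

```python
def mismatch_count(record, partial):
    """Number of supplied positions where the record disagrees with the input."""
    return sum(record[i] != value for i, value in partial.items())


def exact_match(memory, partial):
    """Records agreeing with the input at every supplied position."""
    return [r for r in memory if mismatch_count(r, partial) == 0]


def best_match(memory, partial):
    """All records ordered closest first (requires a complete search)."""
    return sorted(memory, key=lambda r: mismatch_count(r, partial))


# Hypothetical stored records (bit vectors).
memory = [
    (1, 0, 1, 1, 0, 0),
    (0, 1, 1, 0, 1, 0),
    (1, 1, 0, 0, 0, 1),
]

# Partial input given as {position: bit}; position 2 is corrupted
# relative to the first stored record.
noisy = {0: 1, 1: 0, 2: 0, 3: 1}
clean = {0: 0, 1: 1}

exact_for_noisy = exact_match(memory, noisy)
best_for_noisy = best_match(memory, noisy)[0]
best_for_clean = best_match(memory, clean)[0]
```

The exact-match lookup returns nothing for the corrupted input, while the best-match lookup still recovers the first record; for the uncorrupted input, the best-match result is the exact match, as noted above.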
The metric used to determine how "close" an input vector is to a stored vector is generally an emergent property of the interconnection structure and the methods used in setting the connection or "synaptic" weights between nodes. For many applications the metric can be a simple vector distance measure. However, for more complex applications the metric becomes a function of the higher-order internal data representation.
For the algorithms we are considering, the resulting weights tend to create a "distance-likelihood" metric that approximates Bayesian 7 classification. That is, the ADP will return the match (or set of matches) that is most likely according to Bayesian rules.
A common operation in the applications being developed in other groups at OGI is that of finding complex higher order structure in data (sound waves, image pixels, Internet text). Although limited in capability, and compute intensive, Hidden Markov Models (HMMs) [1] are currently the best ISP technique for this task. One possibility we are considering is to use a temporal version of best-match to approximate HMM functionality. Most of the neural models we are considering are capable of some form of this type of processing. For example, one of the Lynch/Granger models has already been used in simple speech applications [37] . Just demonstrating superior results emulating HMMs would be a powerful existence proof for this technology.
And an implementation of HMMs on an ADP would be marketable functionality since HMMs have a ready set of applications in speech recognition, OCR, handwriting recognition, and genetic sequencing.
We also intend to leverage the significant fault tolerance of these models to increase yield. All but large area faults, such as those due to wafer processing and power/ground shorts, should be correctable. For this reason, no two chips will be exactly the same. Therefore, ADP chips will need to be trained rather than programmed. Although the training process will be non-trivial, we view this as an advantage, since it will be easier than programming a large parallel processor array. But it will be a custom operation performed by the customer (much like burning a particular set of words into an EEPROM).
7 Bayesian statistics [35, 36] guide the memory to return the stored vector most likely to have caused the input (assuming the input is a corrupted version of the returned value). Bayesian selection is optimal under certain conditions.
Conclusion
It is my belief that the convergence of high-density silicon and advanced computational models will lead to exciting new capabilities in Intelligent Signal Processing. For the research proposed here, our goal is to create a commercial product, based on simple models, which performs high-speed, adaptive, best-match, high-capacity associative data processing: the Associative Data Processor.
Returning to Moore's Law, and borrowing from Bob Lucky (IEEE Spectrum, September 98): Moore says there will be exponential progress and that doublings will occur every year and a half. One thing about exponentials is that at first they are easy, but later they become overwhelming -and we are starting to enter the "overwhelming" phase in semiconductors. Since the invention of the transistor, there have been about 32 doublings of the technology -the first half of a chessboard.
The exciting question is, what overwhelming implications await us now as we begin the second half of the board?
The next ten years will be an extraordinary time for silicon engineers and computer scientists.
The challenges of Moore's law, and the search for quantitatively better ISP solutions will lead to more experimentation in new silicon architectures, fueled in part by ideas from biological computation. Understanding and mapping biological computing models to silicon, and then to real applications will be difficult, but the rewards will be great. By 2010, massively parallel, biologically inspired computational models will account for a significant portion of the global semiconductor business.
At the IEEE Centenary in 1984 ("The Next 100 Years," IEEE Technical Convocation), Dr. Robert Noyce, co-founder of Intel and co-inventor of the Integrated Circuit, said:
" 
