This project focused on the design of the core and integration across a four-node chip. A follow-on project will focus on creating a three-dimensional stack of chips that is enabled by the low power usage. The chip incorporates structures to enable stacking in a small form factor. A third project will focus on system architecture issues, using many stacks to create a neuromorphic computing platform capable of more than 100 tera floating point operations per second (TFLOPS) in the space of a small rack, with power usage of less than 10 kW.
This report describes the completed design trades and architecture for the nodes and chip-level integration. At the end of the project, the chip design was nearly ready for fabrication. It will be fabricated in the first part of the follow-on project, which focuses on the multi-chip stacking architecture.
Introduction
A DoD need exists for small, autonomic systems in the battlefield. Autonomy allows the creation of unmanned systems to perform complex, high risk and/or covert operations in the battlefield without the need for constant human operation. Current computing systems are not optimized to perform intelligent operations, such as environmental awareness, learning and autonomic decisions in a size, weight and power form factor that matches platforms envisioned for future use.
This project created a new computer hardware architecture to provide massively parallel computing systems needed for future autonomic operations while dramatically improving computing power per system volume and computing power per energy demand ratios.
The term "cognitive operations" can cover numerous topics from fundamental perception to conscious reflection on the nature of self. In this project, the emphasis was on lower level operations that require massively parallel computing to perform in real time at a resolution rivaling human operations. An example is processing of visual images for object recognition.
The project focused on architecture development to enable massively parallel processing and the optimization of algorithms to utilize the new hardware architecture. The approach was to develop an architecture for processing the cognitive primitives that was not subject to the limitations on parallelism that restrict Von Neumann type systems.
The Von Neumann computer architecture consists of a sequential instruction based processor plus external memory for storing the program or sequence of instructions [1]. For 60 years, economic, fabrication engineering and algorithm availability issues encouraged computer designs to follow the Von Neumann path by pursuing a single, fast sequential processor. With the creation of each new generation of processor, users have thought up new applications that exceeded the new capabilities, creating the need for further development. The architecture allowed designers to increase processing capacity by adding chip area and energy. The disadvantage of designing for increased sequential speed over energy is that heat increases at a higher rate than the speed of the processor. Eventually, a limit was reached where it was no longer cost effective to increase the speed of the single processor. The solution to the heat limit was to slow the processor down and use more than one processor in parallel. However, a disadvantage of using large separate chip processors is that latency from processor to processor and processor to memory is high. The long interconnects further increase energy use.
The Von Neumann architecture is efficient for computation that requires fast serial instructions but is subject to Amdahl's law for parallel operations due to the sequential instruction path. Amdahl's law describes the maximum speed-up to be gained from adding parallel processors to a system which has concurrent instructions spanning the parallel interface [2]. The marginal processing gain from adding another processor diminishes with each added processor due to cumulative wait time for concurrent instructions.
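The diminishing return described above can be illustrated with the usual statement of Amdahl's law, S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel and N is the number of processors. The sketch below is illustrative only; the function name and the value p = 0.95 are assumptions chosen for the example and do not come from this project.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speed-up when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, for illustration only
    previous = amdahl_speedup(p, 1)
    for n in (2, 4, 8, 16, 1024):
        current = amdahl_speedup(p, n)
        # The gain over the previous row shrinks even as processors multiply.
        print(f"{n:5d} processors: speed-up {current:6.2f} (gain {current - previous:5.2f})")
        previous = current
```

With any serial fraction greater than zero, the speed-up is bounded by 1 / (1 - p) no matter how many processors are added, which is the limit the node-level architecture in this report is intended to avoid.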
Efforts are under way to emulate human-brain scale processing. There are multiple approaches which can be differentiated by the resolution level in the emulation used and the use for the output of the emulation, e.g. [3, 4]. Of contention is whether or not emulation down to the molecular level is required for the computing system to perform and not just simulate or emulate various levels of cognitive functions. What is not in contention is the issue of the energy needed to achieve human scale operations. Given the current rate of progress in energy efficiency, it is estimated that a human scale system using the current processor and large supercomputer architectures will require megawatts to operate [4].
Human scale systems (brains) work around the limits of Von Neumann and Amdahl by using a concurrent, dynamic, massively parallel processing network. In this project, the processor was significantly reduced in size versus commercial processors. The cognitive operation primitive was set at the functional level. We do not expect to need to emulate molecular biology to achieve performance of perception and semantic operations. As the scale of the total system is increased by clustering nodes, responsibility for cognitive primitives will move from the traditional arrangement, in which each processor performs many serial cognitive primitive operations, up to the level of a network of nodes. Each node will be responsible for a single cognitive primitive and be capable of performing the operation very quickly. This network work load architecture will require a very large number of nodes to accommodate a large range of knowledge for cognitive operations. In this manner, the system parallelism is pushed closer to the level at which the cognitive primitive is performed. The network node becomes the functional primitive hardware unit for semantic operations. A semantic network node architecture was impractical in the past because there was more commercial benefit in building one very large, fast processor in a fixed area than in dividing the same chip area into many smaller processors. Unfortunately, large processor systems cost too much in area, power and processing time to create the number of nodes needed to process millions of semantic primitives.
The features that make the architecture developed in this project useful for cognitive operations also make it useful for many other military applications. The architecture makes major progress in the trade space for size, weight, energy demand, cyber security, system reliability, processing speed, modularity, bandwidth internal to a cluster, and flexibility of operation and resource control. The Floating Point Unit, FPU in our ASP was optimized for 
Methods, Assumptions, Procedures
This project created a new modular, fungible computing node that enables compact massively parallel computing with a memory / processing / communications geometry that is optimized towards the natural geometry of basic autonomic operations. This entailed identification of the specification trade space, component design, system design and operating software development. Power efficiency was aggressively pursued at all levels of the core and cluster to enable high cluster system density.
Hardware Design
The goal of the hardware design was to maximize power efficiency, core-to-core connectivity and system modularity while minimizing communication latency. A core concept was chosen that consisted of a small Application Specific Processor (ASP), a block of core-associated Random Access Memory (RAM), and a block of Asynchronous Field Programmable Gate Array (AFPGA). A 128-bit AES hardware block was added to the core design. The fundamental concept was to optimize for the lowest area per core that would support several chosen applications, enabling the highest possible core density in a cluster system on a chip.
Power efficiency, connectivity and modularity facilitated high density.
A processor was designed for 65 nm fabrication technology with numerous features to enhance performance under size, weight and power restrictions. Design choices were also made to favor modularity, security and dynamic user interaction. For modularity, the cores were designed with independent components rather than as monolithic integrated blocks. This allows soft (post-fabrication) and hard (next-version fabrication) changes in individual blocks without requiring redesign of the whole core. New hardware based security features were continuously sought throughout the design process.
Fungibility, the ability to easily interchange units with other, like units, was a continuous design consideration in this project. The difference between fungible and homogeneous is the degree of interchangeability. Parts of a system can be homogeneous but not interchangeable.
Parts must be fungible to enable using a neighboring part to re-route around a malfunctioning part. A fungible core aids the creation of clusters that grow or shrink with the demands of the computation and in the process allows tailoring of power consumption to efficiently fit the computation.
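The re-routing benefit of fungible cores can be sketched as a shortest-path search that simply skips a failed core. This is a minimal illustration, not the project's routing logic: the rectangular grid, the four-neighbor links, and the `route` function are all assumptions made for the example.

```python
from collections import deque


def route(width, height, src, dst, failed=frozenset()):
    """Shortest hop path on a hypothetical 2D mesh that avoids failed cores.

    Cores are (x, y) tuples. Returns a list of cores from src to dst,
    or None when no working path exists.
    """
    if src in failed or dst in failed:
        return None
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in failed and nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(route(3, 3, (0, 0), (2, 0)))
    print(route(3, 3, (0, 0), (2, 0), failed={(1, 0)}))
```

Because every core is interchangeable with its neighbors, the search needs no knowledge of what each core does; any working neighbor is an acceptable detour.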
Some blocks, such as the floating point multiplier and floating point adder in the ASP, were adopted from a previous 130 nm project. Most hardware blocks were newly designed for this project.
Software Design
Control software design in the project was begun with an open version of the Real Time Executive for Multiprocessor Systems (RTEMS). This choice enabled the Air Force to retain the complete source code and knowledge of everything in the operating system (OS). Software needs for the project included a processor and node level operating system and software to enable the control functions envisioned for the AFPGA. The control software will direct resources for heterogeneous parallelization of computing tasks while effecting dynamic power efficiency measures. Dynamic resource allocation will enable future system architecture configurations such as the ability to turn processors on and off as needed and allocate memory.
Hardware and operating system features were added to the OS to allow user changes to the microcode. Modules were added to provide functionality for the AFPGA, security features and memory access. Concurrent design with hardware was used to maximize functionality and efficiency.
Applications
The cognitive operations models used in this project were chosen for their representation of the state of the art, user applicability, model accessibility and representation of a range of model geometries and HW/SW requirements. Concurrent development was used to increase efficiency. For example, the Fast Fourier Transform (FFT) uses the common vector-matrix multiply operation. On the hardware side, registers were added to the Floating Point Unit and configured to maximize throughput. On the software side, new microinstructions were created to take advantage of the new register configuration and reduce the number of read / write calls. On the application side, the algorithm was rewritten to make the most efficient use of the new microinstructions. The ability to perform concurrent development was made possible by the built-in modularity and flexible microcode OS.
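The link between the Fourier transform and the vector-matrix multiply can be shown with the direct form of the discrete Fourier transform, in which the output is the product of an n x n matrix of complex exponentials and the input vector. FFT algorithms compute the same result by factoring this product into smaller stages. The sketch below shows only the direct form; the function names are illustrative assumptions, and it is not the project's microcoded implementation.

```python
import cmath


def dft_matrix(n):
    """n x n DFT matrix W with W[k][j] = exp(-2*pi*i*j*k / n)."""
    return [
        [cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
        for k in range(n)
    ]


def mat_vec(matrix, vector):
    """Multiply a matrix by a column vector using multiply-accumulate steps."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dft(signal):
    """Direct discrete Fourier transform as a single matrix-vector product."""
    return mat_vec(dft_matrix(len(signal)), signal)


if __name__ == "__main__":
    for value in dft([1.0, 1.0, 1.0, 1.0]):
        print(f"{value.real:+.3f} {value.imag:+.3f}j")
```

Each output element is a chain of multiply-accumulate operations, which is the kind of inner loop that additional floating point registers and fewer read / write calls would be expected to help.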
Participants
The project was led by the Air Force Research Laboratory Advanced Computing Division. AFRL researchers provided the Application Specific Processor (ASP) design and project integration. The Asynchronous Field Programmable Gate Array design was provided by Cornell University. Cornell also provided some of the design flow facilities. ITT Industries provided the operating system, design integration expertise and a cognitive computing model. Oklahoma State contributed cell and processor design, design tool flow and block integration. Binghamton University contributed hardware analysis for timing and heat, and a cognitive model. There was input on applications and large scale hardware systems from several other basic research projects.
Schedule
The 
Results
The design of the chip is nearly complete as of the scheduled end of this project. The design work will continue under the Cognitive Cluster on a Chip project and will go to tape-out for fabrication in February 2010. This section describes progress to date in the areas of hardware architecture and chip design and operating system architecture.
Hardware
The goal of the hardware design was to maximize power efficiency, core to core 
Figure 2. Mesh Hardware Blocks Between Two Cores
The AFPGA block is organized into 256 reconfigurable tiles with 2Kb of SRAM per tile. The processor interacts with the FPGA in three ways:
• Configuration - Resetting the FPGA at start up and making changes in AFPGA
• Programming - Logic functions with standard bits
• Privileged bits - JTAG accessed for write protection control and normal operation (sending and receiving data to/from the FPGA).
There are three modes of interaction timing:
• Timing-driven - Data is sent to the FPGA and the result of the computation is retrieved after some number of cycles. Given the long latency of other methods, this is expected to be the most popular method of operation.
• Polling - Data is sent to the FPGA and the FPGA signals to the microprocessor when data is ready for retrieval; the core awaits the results in a microcode loop and proceeds upon arrival.
• Interrupt - Same as polling, except the core has proceeded with other work and takes an interrupt (IRQ1 = output_data_ready) when the result is delivered.
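The three timing modes can be contrasted with a small software simulation. Everything in this sketch is hypothetical: the `SimulatedFabric` class, its fixed latency, the doubling "computation" and the callback standing in for IRQ1 are assumptions made for illustration and do not describe the actual AFPGA interface or microcode.

```python
class SimulatedFabric:
    """Toy stand-in for the AFPGA (hypothetical, not the project's interface)."""

    def __init__(self, latency, on_ready=None):
        self.latency = latency      # cycles until a result is ready (assumed fixed)
        self.on_ready = on_ready    # optional callback standing in for IRQ1
        self._countdown = None
        self._input = None
        self.result = None
        self.ready = False

    def send(self, value):
        """Start a computation on `value`."""
        self._input = value
        self._countdown = self.latency
        self.ready = False

    def tick(self):
        """Advance the simulated clock by one cycle."""
        if self._countdown is None:
            return
        self._countdown -= 1
        if self._countdown <= 0:
            self.result = self._input * 2   # placeholder for the real logic function
            self.ready = True
            self._countdown = None
            if self.on_ready is not None:
                self.on_ready(self.result)

    def read(self):
        if not self.ready:
            raise RuntimeError("result read before it was ready")
        return self.result


def timing_driven(fabric, value, wait_cycles):
    """Send, wait a fixed cycle count, then read; the count must cover the latency."""
    fabric.send(value)
    for _ in range(wait_cycles):
        fabric.tick()
    return fabric.read()


def polling(fabric, value):
    """Send, then spin on the ready flag; returns (result, cycles spent waiting)."""
    fabric.send(value)
    cycles = 0
    while not fabric.ready:
        fabric.tick()
        cycles += 1
    return fabric.read(), cycles


def interrupt_driven(latency, value, work_cycles):
    """Send, keep doing other work, and let a handler collect the result."""
    delivered = []
    fabric = SimulatedFabric(latency, on_ready=delivered.append)
    fabric.send(value)
    other_work = 0
    for _ in range(work_cycles):
        other_work += 1     # the core proceeds with unrelated work
        fabric.tick()       # the handler fires in the cycle the result lands
    return delivered, other_work
```

In this model the timing-driven mode has no signalling overhead but fails if the wait is too short, polling ties up the core for the full latency, and the interrupt mode frees the core at the cost of handling the interrupt.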
The 128-bit Advanced Encryption Standard (AES) block designed by ACICS.ws was obtained from OpenCore [5]. The block was modified for the fabrication technology used in this project and found to require less than 0.1 mm² in area. The AES is addressed through JTAG with the ability to maintain 4 keys simultaneously. Three of the keys are used for data operations and one is reserved for local core use only. The National Security Agency (NSA) has approved the use of accredited 128-bit AES based encryption for protection of information up to the SECRET level. For TOP SECRET and above, 192-bit or 256-bit AES is required [6].
Information Assurance was built into all hardware and software blocks at all stages of the design. A nontrivial technique was used to maintain trusted access throughout the project to all designs. The personnel working on the project were approved before access was given, and secure areas were set up at collaborators' facilities to maintain security. Fabrication was arranged through the Trusted Foundry operated by NSA.
Operating System and Software
The operating system designed for this project began with the Real Time Executive for Multiprocessor Systems (RTEMS), which is a simple, open source operating system for parallel and embedded processor systems. The choice of RTEMS allowed the retention of the source code so that changes can be made to the operating system and the exact source code is known at all times. The microcode physical store was designed to provide for insertion of new microinstructions at the design stage and later, as needed by users. New instructions were created and emulated in about two weeks each to improve a Fast Fourier Transform algorithm, speed up hardware processing of morphing opcodes and implement AES encryption.
Microinstructions were written to securely control the JTAG interface. Microinstructions written for the AFPGA interface allow the user to tailor the rate of communication between the processor and the AFPGA. A user with an application that makes many data passes could set the interface to access at a set interval. A user with few data passes could save wait time by setting the interface call rate to only open the channel at an irregular demand signal.
The ability to tailor the operating system at the microinstruction level was used to quickly deliver a 10x improvement in the GFLOPS/Watt ratio for the FFT demonstration algorithm by creating and implementing pre-fetch and evict instructions. The addition of "squash"
Further information can be found in the Binghamton project final report [7].
Figure 4. Hardware Design for 256 128-neuron BSB models
The ITT cognitive task focused on large scale cognitive applications. Algorithms were investigated for resource requirements and prospects for optimization. The algorithms investigated include:
• Bayesian tree networks: hypothesized as a mechanism of visual perception [8, 9, 10]. The algorithm is described in detail in [11].
• BSB implementation as a mechanism of visual and auditory perception [12].
• Spiky neural model: the basis of hypothesis regarding the dynamical mechanisms of cognition [13].
• Simple cell model: Anatomical mechanism [14]. The geometry of receptive field patterns is fundamental to visual perception.
• Confabulation: An algorithm which hypothesizes how networks of neurons respond to sequences of stimuli [15].
Each of the algorithms and their conceptual models were examined for the cognitive primitive performed and the processing power and resource requirements needed to run them. The neuromorphic model of the visual cortex and the language confabulation model were chosen for further development. Resource requirements from those models were added to the ACS design trades and work was performed to improve the models for massively parallel systems. Both models were optimized to run on the Cell Broadband Engine architecture. Emulation was performed with a focus on areas to further optimize models and the system running the models. Further details are available in [16].
Conclusions

Future Work
This project focused on the design and fabrication of a modular core and local mesh.
Transitions for chips with this architecture could include very small systems that have very limited power available but require much more computing power than is currently available at that energy level. Another long term development track involves mounting many stacks on a board system to approach the computational scale of the complete neocortex. It is expected that the 3D integration into stacks would be a three year project and that full scale system development would entail another fabrication run at 32 nm which would be large enough to make thousands of chips. Figure 5 shows the planned development from nodes to stacks to mega-node systems. The use of state of the art commercial fabrication makes fabrication funding the limit on the potential scale of the final system. With reasonable expectations for the processing power per energy and peak watts per chip, it is possible to scale to 10 PetaFLOPS in a million node system that consumes less than 100 kW.
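The per-node budget implied by the full scale target can be checked with simple arithmetic. The sketch below uses only the figures quoted in this report (10 PetaFLOPS, one million nodes, 100 kW, and the earlier rack-scale goal of 100 TFLOPS in 10 kW); the function and key names are illustrative assumptions. Because the power figures are stated as upper limits, the resulting efficiencies are lower bounds.

```python
def per_node_budget(total_flops, total_watts, nodes):
    """Split a system-level performance and power target evenly across nodes."""
    return {
        "flops_per_node": total_flops / nodes,
        "watts_per_node": total_watts / nodes,
        "flops_per_watt": total_flops / total_watts,
    }


if __name__ == "__main__":
    # Full scale target quoted in the report: 10 PetaFLOPS, 1e6 nodes, < 100 kW.
    full_scale = per_node_budget(10e15, 100e3, 1_000_000)
    print(f"per node: {full_scale['flops_per_node'] / 1e9:.1f} GFLOPS "
          f"at up to {full_scale['watts_per_node']:.2f} W")
    print(f"efficiency: at least {full_scale['flops_per_watt'] / 1e9:.0f} GFLOPS/W")

    # Rack-scale goal quoted earlier: 100 TFLOPS in less than 10 kW.
    rack_efficiency = 100e12 / 10e3
    print(f"rack-scale goal: at least {rack_efficiency / 1e9:.0f} GFLOPS/W")
```

The two targets differ by a factor of ten in implied efficiency, which is consistent with the report's expectation that the full scale system would follow a further fabrication run at a smaller process node.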
It is envisioned that the ultimate version of the operating system will allow executive level control of simultaneous formation of clusters of processors within stacks and clusters of stacks in multi-stack systems in order to accomplish multiple large parallel tasks. The executive function would dynamically determine the cluster geometry required to perform a task, parse out resources and adjust for minimum energy use. The complete system would be able to perform multiple cognitive operations, system control and communication tasks simultaneously in response to internal and external demand. These are the control requirements for a cognitive platform capable of achieving an assigned mission under internal, autonomic control. 
