The Fifth NASA Symposium on VLSI Design
The fifth annual NASA Symposium on VLSI Design had 13 sessions including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next generation advances that will serve as a basis for future VLSI design
Floating Point Arithmetic for Transport Triggered Architectures
Computing systems are often subject to performance and power-consumption requirements that cannot be met with a general-purpose processor. On the other hand, designing hardware accelerators can require an unreasonable amount of work. The problem can be approached by using an Application-Specific Instruction set Processor (ASIP) that is tailored to the application but still programmable. Processor customization must be highly automated in order to save costs.
TTA-based Codesign Environment (TCE) is an ASIP development environment based on the Transport Triggered Architecture (TTA). As an architecture, TTA is easy to customize and scales from small cores to high-performance very long instruction word processors. Many scientific computing and signal processing applications that would especially benefit from TTA's scalability and instruction-level parallelism require support for hardware-accelerated floating-point computation.
In this Master of Science thesis, a set of floating-point units was designed and implemented for the TCE project. The design of the units aimed at platform independence and high performance on Field Programmable Gate Array (FPGA) platforms, even at the cost of compromising on the supported floating-point standard. The units include tools for half-precision floating-point computation. In addition, the thesis presents fast algorithms, based on special instructions, for computing floating-point division and square root.
The operation of the units was verified with an automated Register Transfer Level (RTL) testbench. In a comparison on an Altera Stratix II FPGA, the units came close to the performance of Altera's own floating-point units. On the newer Xilinx Virtex-6 FPGA, the highest possible performance would require denser pipelining
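As a rough illustration of what half-precision floating point means in practice, the sketch below rounds values to the IEEE 754 binary16 format using Python's standard `struct` module. The helper names `to_half` and `half_bits` are hypothetical, and the behaviour shown is that of the IEEE format; the units described in the abstract may deviate from the standard.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value and return it as a float.

    Hypothetical helper for illustration; it relies on struct's 'e' format.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def half_bits(x: float) -> int:
    """Return the 16-bit binary16 encoding of x as an integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


if __name__ == "__main__":
    for value in (1.0, 0.1, 2049.0, 65504.0):
        print(f"{value!r:>10} -> {to_half(value)!r:<18} bits=0x{half_bits(value):04X}")
```

With only 11 significand bits (10 stored), values such as 0.1 and 2049 are not exactly representable, which is the precision cost that half-precision hardware trades for smaller and faster units.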
Hardware implementation of a spiking neural network for fast synchronization
In this master's thesis, we present two different hardware implementations of the Oscillatory Dynamic Link Matcher (ODLM). The ODLM is an algorithm which uses the synchronization in a network of spiking neurons to realize different signal processing tasks. The main objective of this work is to identify the key design choices leading to the efficient implementation of an embedded version of the ODLM. The resulting systems have been tested with image segmentation and image matching tasks.
The first system is bit-slice and time-driven. The state of the whole network is updated at regular time intervals. The system uses a bit-slice architecture with a large number of processing elements. Each processing element, or slice, implements one neuron of the network and takes the form of a column on the hardware. The columns are placed side by side and are locally connected to their two neighbors. This local hardware connection scheme makes the system scalable, which means that columns can be easily added to increase the capacity of the system. Each column consists of a weight vector, a synapse model unit and a membrane model unit. The system can implement any network topology, making it very flexible. The function governing the time evolution of the neurons' membrane potential is approximated by a piecewise linear function to reduce the amount of logic resources required. With this system, a fully-connected network of 648 neurons can be implemented on a Xilinx Virtex-5 XC5VSX50T FPGA clocked at 100 MHz. The system is designed to process simultaneous spikes in parallel, reaching a maximum processing speed of 6 Mspikes/s. It can segment a 23×23 pixel image in 2 seconds and match two pre-segmented 90×30 pixel images in 550 ms.
The second system is event-driven. A single processing element sequentially processes the spikes. This processing element is a 5-stage pipeline which can process an average of 1 synapse per 7 clock cycles. The synaptic weights are not stored in memory in this system; they are computed on the fly as spikes are processed. The topology of the network is also resolved during operation, and the system supports various regular topologies such as 8-neighbor and fully-connected. The membrane potential time evolution function is computed with high precision using a look-up table. On the Virtex-5 FPGA, a network of 65 536 neurons can be implemented and a 406×158 pixel image can be segmented in 200 ms. The FPGA can be clocked at 100 MHz.
Most of the design choices made for the second system are well adapted to the hardware implementation of the ODLM. In the original ODLM, the weight values do not change over time and usually depend on a single variable. It is therefore beneficial to compute the weights on the fly rather than storing them in a huge memory bank. The event-driven approach is a very efficient strategy. It reduces both the number of computations required to run the network and the amount of data moved in and out of memory. Finally, the precise computation of the neurons' membrane potential increases the convergence speed of the network
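The piecewise linear approximation of the membrane potential mentioned for the first system can be illustrated with a short sketch. The abstract does not give the actual membrane function, so the exponential charging curve `membrane_exact`, the time constant `tau` and the segment counts below are assumptions chosen only to show how the approximation error shrinks as segments are added.

```python
import math


def membrane_exact(t, tau=1.0):
    """Assumed membrane charging curve (not taken from the thesis)."""
    return 1.0 - math.exp(-t / tau)


def make_piecewise_linear(f, t_max, segments):
    """Build a piecewise linear approximation of f on [0, t_max]."""
    xs = [t_max * i / segments for i in range(segments + 1)]
    ys = [f(x) for x in xs]  # breakpoint values, as a small table would hold them

    def approx(t):
        t = min(max(t, 0.0), t_max)                         # clamp to the table range
        i = min(int(t / t_max * segments), segments - 1)    # segment index
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (t - xs[i])

    return approx


def max_abs_error(f, g, t_max, samples=1000):
    """Largest absolute difference between f and g on a uniform sample grid."""
    points = (t_max * k / samples for k in range(samples + 1))
    return max(abs(f(t) - g(t)) for t in points)


if __name__ == "__main__":
    for n in (4, 16):
        pwl = make_piecewise_linear(membrane_exact, 5.0, n)
        print(n, "segments, max error:", max_abs_error(membrane_exact, pwl, 5.0))
```

Fewer segments need fewer stored breakpoints and simpler logic at the cost of accuracy, which mirrors the trade-off between the first system's approximation and the second system's higher-precision look-up table.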
Low-cost duplication for separable error detection in computer arithmetic
Low-cost arithmetic error detection will be necessary in the future to ensure correct and safe system operation. However, current error detection mechanisms for arithmetic either have high area and energy overheads or are complex and offer incomplete protection against errors. Full duplication is simple, strong, and separable, but often is prohibitively costly. Alternative techniques such as arithmetic error coding require lower hardware and energy overheads than full duplication, but they do so at the expense of high design effort and error coverage holes. The goal of this research is to mitigate the deficiencies of duplication and arithmetic error coding to form an error detection scheme that may be readily employed in future systems. The techniques described by this work use a general duplication technique that employs an alternate number system in the duplicate arithmetic unit. These novel dual modular redundancy organizations are referred to as low-cost duplication, and they provide compelling efficiency and coverage advantages over prior arithmetic error detection mechanisms.
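To make the contrast with arithmetic error coding concrete, the sketch below shows a conventional mod-3 residue check on an adder. This is the prior technique the abstract compares against, not the low-cost duplication scheme itself; the function names and the `fault_mask` fault-injection parameter are hypothetical, and operands are assumed to be non-negative integers with no overflow.

```python
def residue(x, modulus=3):
    """Residue of x, as a small checker circuit would compute it."""
    return x % modulus


def checked_add(a, b, fault_mask=0, modulus=3):
    """Add a and b, optionally corrupting the sum, and report whether the check passes.

    fault_mask is XORed into the main adder's result to emulate bit flips
    (a hypothetical fault-injection hook for illustration).
    """
    result = (a + b) ^ fault_mask                                  # main adder output
    expected = (residue(a, modulus) + residue(b, modulus)) % modulus  # checker path
    return result, residue(result, modulus) == expected


if __name__ == "__main__":
    print(checked_add(1234, 5678))                 # fault-free
    print(checked_add(1234, 5678, fault_mask=0b1))   # single-bit flip
    print(checked_add(1234, 5678, fault_mask=0b11))  # two-bit flip that changes the sum by 3
```

A single-bit flip changes the sum by a power of two, which is never a multiple of 3, so it is always caught. The two-bit example shifts the sum by exactly 3 and slips through, which is one instance of the coverage holes the abstract attributes to error coding.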
An instruction systolic array architecture for multiple neural network types
Modern electronic systems, especially sensor and imaging systems, are beginning to
incorporate their own neural network subsystems. In order for these neural systems to learn in
real-time they must be implemented using VLSI technology, with as much of the learning
processes incorporated on-chip as is possible. The majority of current VLSI implementations
literally implement a series of neural processing cells, which can be connected together in an
arbitrary fashion. Many do not perform the entire neural learning process on-chip, instead
relying on other external systems to carry out part of the computation requirements of the
algorithm.
The work presented here utilises two dimensional instruction systolic arrays in an attempt to
define a general neural architecture which is closer to the biological basis of neural networks - it
is the synapses themselves, rather than the neurons, that have dedicated processing units. A
unified architecture is described which can be programmed at the microcode level in order to
facilitate the processing of multiple neural network types.
An essential part of neural network processing is the neuron activation function, which can
range from a sequential algorithm to a discrete mathematical expression. The architecture
presented can easily carry out the sequential functions, and introduces a fast method of
mathematical approximation for the more complex functions. This can be evaluated on-chip,
thus implementing the entire neural process within a single system.
VHDL circuit descriptions for the chip have been generated, and the systolic processing
algorithms and associated microcode instruction set for three different neural paradigms have
been designed. A software simulator of the architecture has been written, giving results for
several common applications in the field
Design of approximate overclocked datapath
Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design.
In this thesis, we describe an alternative circuit design methodology when considering trade-offs between accuracy, performance and silicon area. We compare two different approaches that could trade accuracy for performance. One is the traditional approach where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantisation errors.
Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit serial operation, and synthesize unrolled digit parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, these result in errors in least significant digits with online arithmetic, causing less impact than conventional implementations.
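The claim that timing errors stem from long carry chains and occur rarely can be explored with a simplified model of an overclocked ripple-carry adder. The sketch assumes that within one clock period a carry can ripple through at most `max_chain` bit positions; this is an illustrative abstraction, not the timing model used in the thesis, and real violations depend on actual gate delays. The function names are hypothetical.

```python
def ripple_add_truncated(a, b, width, max_chain):
    """Add two width-bit numbers with carries limited to max_chain positions.

    The carry into bit i is computed only from the max_chain bits below it,
    emulating a clock period too short for longer carry chains to settle.
    The carry out of the top bit is discarded.
    """
    result = 0
    for i in range(width):
        carry = 0
        for j in range(max(0, i - max_chain), i):
            aj, bj = (a >> j) & 1, (b >> j) & 1
            carry = (aj & bj) | (carry & (aj ^ bj))   # generate or propagate
        sum_bit = ((a >> i) & 1) ^ ((b >> i) & 1) ^ carry
        result |= sum_bit << i
    return result


def error_rate(width, max_chain):
    """Fraction of all input pairs for which the truncated adder is wrong."""
    mask = (1 << width) - 1
    wrong = sum(
        ripple_add_truncated(a, b, width, max_chain) != ((a + b) & mask)
        for a in range(1 << width)
        for b in range(1 << width)
    )
    return wrong / (1 << (2 * width))


if __name__ == "__main__":
    print(ripple_add_truncated(0b01111111, 1, 8, 3))  # exact sum would be 128
    for k in (2, 4, 6):
        print("max_chain", k, "error rate", error_rate(6, k))
```

In this model the adder returns 112 instead of 128 for `0b01111111 + 1` with a three-position limit: the error lands in the high-order bits, which is why conventional binary arithmetic degrades poorly, while the error rate drops quickly as the permitted chain length grows.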
Pond IDE: Machine level program development environment and register transfer level simulator for a massively parallel computer architecture
As computing architectures are being implemented in late- and post-silicon technologies, fault tolerance and concurrent operation are becoming increasingly important. It is already common knowledge that manufacturers are putting two, four or even more cores on a single silicon die to improve computing performance. The proposed architecture far exceeds this number by grouping thousands or even millions of simple reduced instruction set computing (RISC) processors, each of which can perform a single operation at a time and communicate with its eight nearest neighbors. In this architecture, if a single core or a cluster of cores has defects at the time of manufacture, or develops them later in the life of the system, it is possible to test and disable them as necessary. A fine-grained architecture of this kind calls for a parallel programming style. One approach to this problem is the use of a parallelizing compiler. Another approach may be to use one of the several application programming interfaces (APIs) available for standard text-based programming languages, with some built-in features for parallel programming. This work has generated a solution for creating machine-level parallel programs for the massively parallel computer architecture described above using text and graphical means. To support this programming method, an integrated development environment (IDE) and a zero-communication-latency, register transfer level (RTL) simulator have been developed. Experimental results include the implementation of fundamental data processing algorithms and complex functions
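The eight-nearest-neighbor connectivity with disabled cores can be pictured with a small sketch. It assumes the processors sit on a rectangular grid without wrap-around at the edges, which the abstract does not state; the function name and the `disabled` set are hypothetical.

```python
def working_neighbors(row, col, rows, cols, disabled=frozenset()):
    """Return the coordinates of the usable neighbors of core (row, col).

    Assumes a rows x cols grid with no wrap-around; cores listed in
    `disabled` are treated as defective and skipped.
    """
    neighbors = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # a core is not its own neighbor
            r, c = row + d_row, col + d_col
            if 0 <= r < rows and 0 <= c < cols and (r, c) not in disabled:
                neighbors.append((r, c))
    return neighbors


if __name__ == "__main__":
    print(len(working_neighbors(1, 1, 3, 3)))                       # interior core
    print(len(working_neighbors(0, 0, 3, 3)))                       # corner core
    print(working_neighbors(1, 1, 3, 3, disabled={(0, 1), (2, 2)}))  # two cores disabled
```

Because each core only talks to adjacent cores, disabling a defective one removes a single entry from its neighbors' lists and leaves the rest of the grid untouched.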
An Ultra-Low-Power 75mV 64-Bit Current-Mode Majority-Function Adder
Ultra-low-power circuits are becoming more desirable as portable device markets grow, and nanometric scaling and CMOS reliability improvements are making them increasingly applicable in biomedical, pharmacy and sensor networking applications. In this thesis, three main achievements in ultra-low-power adders are presented. First, a new majority-function algorithm for carry and sum generation is presented. Second, using this algorithm and the new architecture it implies, we achieved a circuit that operates at a 75mV supply voltage. Last but not least, a 64-bit current-mode majority-function adder based on the new architecture and algorithm was successfully tested at a 75mV supply voltage. The circuit consumed 4.5nW or 3.8pJ in one of the worst conditions
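For readers unfamiliar with majority-function adders, the sketch below shows a textbook way to build a full adder from three-input majority gates and inverters. The abstract does not spell out the thesis's new algorithm, so this well-known formulation is offered only as background; the function names are hypothetical and the model is purely logical, with no relation to the current-mode circuit implementation.

```python
def maj(a, b, c):
    """Three-input majority: 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def full_adder(a, b, carry_in):
    """Full adder built only from majority gates and inverters."""
    carry_out = maj(a, b, carry_in)
    # Textbook identity: sum = MAJ(NOT carry_out, carry_in, MAJ(a, b, NOT carry_in))
    sum_bit = maj(carry_out ^ 1, carry_in, maj(a, b, carry_in ^ 1))
    return sum_bit, carry_out


def add(a, b, width=64):
    """Ripple the majority-based full adder across width bits."""
    result, carry = 0, 0
    for i in range(width):
        sum_bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= sum_bit << i
    return result, carry


if __name__ == "__main__":
    print(add(123456789, 987654321))
    print(add(2**64 - 1, 1))  # wraps to zero with a carry out
```

The carry is a single majority gate, which is what makes majority logic attractive for adders; the sum needs two more gates and two inversions.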