79 research outputs found

    Serial-data computation in VLSI


    The Fifth NASA Symposium on VLSI Design

    The fifth annual NASA Symposium on VLSI Design comprised 13 sessions, including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other featured presentations. The symposium provides insight into developments in VLSI and digital systems that can be used to increase data-system performance. The presentations cover next-generation advances that will serve as a basis for future VLSI design.

    Floating Point Arithmetic for Transport Triggered Architectures

    Computing systems are often subject to performance and power-consumption requirements that cannot be met with a general-purpose processor. On the other hand, designing hardware accelerators can demand an unreasonable amount of engineering effort. The problem can be approached with an Application-Specific Instruction set Processor (ASIP) tailored to the application, which nevertheless remains programmable. To keep costs down, the processor customization must be highly automated. The TTA-based Codesign Environment (TCE) is an ASIP development environment based on the Transport Triggered Architecture (TTA). As an architecture, the TTA is easy to customize and scales from small cores to high-performance long-instruction-word processors. Many scientific computing and signal processing applications that would particularly benefit from the TTA's scalability and instruction-level parallelism require support for hardware-accelerated floating-point arithmetic. In this master's thesis, a set of floating-point units was designed and implemented for the TCE project. The units were designed for platform independence and for high performance on Field Programmable Gate Array (FPGA) platforms, even at the cost of deviating from the supported floating-point standard. The units include support for half-precision floating-point arithmetic. In addition, the thesis presents fast algorithms, based on special instructions, for computing floating-point division and square root. The operation of the units was verified with an automated Register Transfer Level (RTL) test bench. In a comparison on an Altera Stratix-II FPGA, the units came close to the performance of Altera's own floating-point units. On the newer Xilinx Virtex-6 FPGA, reaching the highest possible performance would require deeper pipelining.
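    The abstract does not spell out the division and square-root algorithms. A common pattern for special-instruction-assisted floating-point division, shown in the hedged Python sketch below, is Newton-Raphson refinement of a coarse reciprocal seed that the hardware would supply; the seed formula, iteration count and function names here are textbook assumptions, not the thesis's actual instructions.

        import math

        def reciprocal_seed(d: float) -> float:
            # Stands in for a hardware "reciprocal estimate" special instruction;
            # a positive, finite d is assumed for this sketch.
            # Classic linear seed 48/17 - 32/17*m on the scaled mantissa,
            # accurate to within roughly 1/17 relative error.
            m, e = math.frexp(d)                     # d = m * 2**e with 0.5 <= m < 1
            return (48 / 17 - 32 / 17 * m) * 2.0 ** (-e)

        def nr_divide(n: float, d: float, iterations: int = 4) -> float:
            # Refine the seed with Newton-Raphson; the relative error is squared
            # each iteration, so a handful of multiply-adds reach full precision.
            x = reciprocal_seed(d)
            for _ in range(iterations):
                x = x * (2.0 - d * x)
            return n * x

        print(nr_divide(355.0, 113.0), 355.0 / 113.0)    # both ~3.14159292

    Square root admits an analogous treatment, with a reciprocal-square-root seed refined by its corresponding Newton-Raphson iteration.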

    Hardware implementation of a spiking neural network for fast synchronization

    In this master's thesis, we present two different hardware implementations of the Oscillatory Dynamic Link Matcher (ODLM). The ODLM is an algorithm which uses the synchronization in a network of spiking neurons to realize different signal processing tasks. The main objective of this work is to identify the key design choices leading to an efficient embedded implementation of the ODLM. The resulting systems have been tested with image segmentation and image matching tasks. The first system is bit-slice and time-driven. The state of the whole network is updated at regular time intervals. The system uses a bit-slice architecture with a large number of processing elements. Each processing element, or slice, implements one neuron of the network and takes the form of a column on the hardware. The columns are placed side by side and are locally connected to their two neighbors. This local hardware connection scheme makes the system scalable, which means that columns can easily be added to increase the capacity of the system. Each column consists of a weight vector, a synapse model unit and a membrane model unit. The system can implement any network topology, making it very flexible. The function governing the time evolution of the neurons' membrane potential is approximated by a piecewise-linear function to reduce the amount of logic resources required. With this system, a fully connected network of 648 neurons can be implemented on a Virtex-5 Xilinx XC5VSX50T FPGA clocked at 100 MHz. The system is designed to process simultaneous spikes in parallel, reaching a maximum processing speed of 6 Mspikes/s. It can segment a 23×23 pixel image in 2 seconds and match two pre-segmented 90×30 pixel images in 550 ms. The second system is event-driven. A single processing element sequentially processes the spikes. This processing element is a 5-stage pipeline which can process an average of one synapse per 7 clock cycles. The synaptic weights are not stored in memory in this system; they are computed on the fly as spikes are processed. The topology of the network is also resolved during operation, and the system supports various regular topologies such as 8-neighbor and fully connected. The membrane potential time evolution function is computed with high precision using a look-up table. On the Virtex-5 FPGA, a network of 65,536 neurons can be implemented and a 406×158 pixel image can be segmented in 200 ms. The FPGA can be clocked at 100 MHz. Most of the design choices made for the second system are well adapted to the hardware implementation of the ODLM. In the original ODLM, the weight values do not change over time and usually depend on a single variable. It is therefore beneficial to compute the weights on the fly rather than storing them in a huge memory bank. The event-driven approach is a very efficient strategy: it reduces the amount of computation required to run the network and the amount of data moved in and out of memory. Finally, the precise computation of the neurons' membrane potential increases the convergence speed of the network.
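    As a concrete illustration of the event-driven structure described above (spikes processed one at a time from a queue, weights recomputed on the fly from a single per-neuron variable, topology resolved during operation), here is a minimal Python sketch. The neuron update rule, threshold value and weight function are illustrative placeholders, not the actual ODLM equations.

        from collections import deque

        W, H = 8, 8                                    # toy 8x8 grid of neurons
        # One feature per neuron (e.g. a pixel value) from which weights derive.
        features = [[(3 * x + 5 * y) % 7 for x in range(W)] for y in range(H)]
        potential = [[0.0] * W for _ in range(H)]
        THRESHOLD = 1.0                                # placeholder firing threshold

        def weight(f_pre, f_post):
            # Weight computed on the fly from a single variable (feature similarity)
            # instead of being read from a large memory bank.
            return 1.0 / (1.0 + abs(f_pre - f_post))

        def neighbours(x, y):
            # 8-neighbour topology resolved during operation rather than stored.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if (dx or dy) and 0 <= x + dx < W and 0 <= y + dy < H:
                        yield x + dx, y + dy

        spikes = deque([(0, 0)])                       # seed spike event
        processed = 0
        while spikes and processed < 10_000:           # cap keeps the toy run finite
            x, y = spikes.popleft()
            processed += 1
            for nx, ny in neighbours(x, y):
                potential[ny][nx] += 0.3 * weight(features[y][x], features[ny][nx])
                if potential[ny][nx] >= THRESHOLD:     # neighbour fires in turn
                    potential[ny][nx] = 0.0
                    spikes.append((nx, ny))
        print("spikes processed:", processed)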

    An instruction systolic array architecture for multiple neural network types

    Modern electronic systems, especially sensor and imaging systems, are beginning to incorporate their own neural network subsystems. In order for these neural systems to learn in real time they must be implemented using VLSI technology, with as much of the learning process incorporated on-chip as possible. The majority of current VLSI implementations literally implement a series of neural processing cells, which can be connected together in an arbitrary fashion. Many do not perform the entire neural learning process on-chip, instead relying on other external systems to carry out part of the computational requirements of the algorithm. The work presented here utilises two-dimensional instruction systolic arrays in an attempt to define a general neural architecture which is closer to the biological basis of neural networks: it is the synapses themselves, rather than the neurons, that have dedicated processing units. A unified architecture is described which can be programmed at the microcode level in order to facilitate the processing of multiple neural network types. An essential part of neural network processing is the neuron activation function, which can range from a sequential algorithm to a discrete mathematical expression. The architecture presented can easily carry out the sequential functions, and introduces a fast method of mathematical approximation for the more complex functions. This can be evaluated on-chip, thus implementing the entire neural process within a single system. VHDL circuit descriptions for the chip have been generated, and the systolic processing algorithms and associated microcode instruction set for three different neural paradigms have been designed. A software simulator of the architecture has been written, giving results for several common applications in the field.
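    The abstract does not say which approximation scheme is used for the more complex activation functions. A typical hardware-friendly choice is a piecewise-linear approximation built from a small table of precomputed segments, needing only a compare, a multiply and an add per evaluation. The Python sketch below illustrates that general idea for the sigmoid; the segment count, breakpoints and error figure are assumptions, not the thesis's method.

        import math

        # Segments precomputed offline for sigmoid(x) on [0, 8), one per unit step.
        STEP, SEGMENTS = 1.0, []
        for i in range(8):
            x0, x1 = i * STEP, (i + 1) * STEP
            y0, y1 = 1 / (1 + math.exp(-x0)), 1 / (1 + math.exp(-x1))
            slope = (y1 - y0) / STEP
            SEGMENTS.append((slope, y0 - slope * x0))   # y = slope*x + intercept

        def sigmoid_pwl(x: float) -> float:
            # Exploit symmetry about 0 and saturation beyond the table range.
            ax = abs(x)
            if ax >= 8.0:
                y = 1.0
            else:
                slope, intercept = SEGMENTS[int(ax / STEP)]
                y = slope * ax + intercept
            return y if x >= 0 else 1.0 - y

        # Coarse 8-segment version: worst-case error is on the order of 1e-2.
        err = max(abs(sigmoid_pwl(k / 100) - 1 / (1 + math.exp(-k / 100)))
                  for k in range(-1200, 1200))
        print(f"max abs error: {err:.4f}")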

    Design of approximate overclocked datapath

    Embedded applications often impose stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology for trading off accuracy, performance and silicon area. We compare two different approaches that trade accuracy for performance. One is the traditional approach, where the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it is preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit-serial operation, and synthesize unrolled digit-parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, with online arithmetic they appear in the least significant digits, causing less impact than in conventional implementations.
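    The intuition behind the last point can be shown with a toy model: in a conventional ripple-carry adder, a clock period that only lets the carry propagate through a limited number of stages truncates long carry chains, and the resulting error lands in the most significant bits. The Python sketch below models exactly that truncation; the fault model is deliberately crude and is an illustrative assumption, not the gate-level timing analysis used in the thesis.

        def ripple_add_truncated(a: int, b: int, width: int, max_stages: int) -> int:
            # Ripple-carry addition in which a carry may propagate through at most
            # `max_stages` positions before the output register is sampled.
            result, carry, chain = 0, 0, 0
            for i in range(width):
                abit, bbit = (a >> i) & 1, (b >> i) & 1
                result |= (abit ^ bbit ^ carry) << i
                new_carry = (abit & bbit) | (carry & (abit ^ bbit))
                if new_carry:
                    chain = chain + 1 if carry else 1
                    if chain > max_stages:         # carry arrives after the clock edge
                        new_carry, chain = 0, 0
                else:
                    chain = 0
                carry = new_carry
            return result

        a, b = 0x0000FFFF, 0x00000001                  # worst-case 16-bit carry chain
        approx, exact = ripple_add_truncated(a, b, 32, 8), (a + b) & 0xFFFFFFFF
        print(hex(approx), hex(exact), "error =", exact - approx)

    With most-significant-digit-first online arithmetic, by contrast, the digits that arrive late are the least significant ones, so the same kind of truncation produces only a small-magnitude error.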

    Pond IDE: Machine level program development environment and register transfer level simulator for a massively parallel computer architecture

    As computing architectures are implemented in late- and post-silicon technologies, fault tolerance and concurrent operation are becoming increasingly important. It is already common knowledge that manufacturers are putting two, four or even more cores on a single silicon die to improve computing performance. The proposed architecture far exceeds this number by grouping thousands or even millions of simple reduced instruction set computing (RISC) processors, each of which is capable of performing a single operation at a time and of communicating with its eight nearest neighbors. In this architecture, if a single core or a cluster of cores has defects at the time of manufacture, or later in the life of the system, it is possible to test and disable them as necessary. A fine-grained architecture of this kind calls for a parallel programming style. One approach to this problem is the use of a parallelizing compiler. Another approach may be to use one of the several application programming interfaces (APIs) available for standard text-based programming languages, with some built-in features for parallel programming. This work has produced a solution for creating machine-level parallel programs for the massively parallel computer architecture described above using textual and graphical means. To support this programming method, an integrated development environment (IDE) and a zero-communication-latency, register transfer level (RTL) simulator have been developed. Experimental results include the implementation of fundamental data processing algorithms and complex functions.
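    To make the fabric described above concrete, the hedged Python sketch below models a small grid of single-operation cells that each read their eight neighbours' previous outputs in lock-step, with defective cells masked out. The cell operation and the fault map are invented for illustration; the real architecture defines its own instruction set and RTL behaviour.

        W, H = 4, 4
        FAULTY = {(2, 1)}                     # cells disabled after failing a test

        def neighbours(x, y):
            # Eight-nearest-neighbour connectivity on the grid.
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if (dx or dy) and 0 <= x + dx < W and 0 <= y + dy < H:
                        yield x + dx, y + dy

        # Every working cell starts with the value 1.0 and, as its single
        # operation per cycle, averages its working neighbours' previous outputs.
        state = [[0.0 if (x, y) in FAULTY else 1.0 for x in range(W)] for y in range(H)]

        def step(state):
            nxt = [[0.0] * W for _ in range(H)]
            for y in range(H):
                for x in range(W):
                    if (x, y) in FAULTY:
                        continue              # a disabled cell neither computes nor drives
                    vals = [state[ny][nx] for nx, ny in neighbours(x, y)
                            if (nx, ny) not in FAULTY]
                    nxt[y][x] = sum(vals) / max(len(vals), 1)
            return nxt

        for _ in range(3):
            state = step(state)
        print(state)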

    An Ultra-Low-Power 75mV 64-Bit Current-Mode Majority-Function Adder

    Ultra-low-power circuits are increasingly desirable because of the growing portable-device market, and they are becoming more applicable in biomedical, pharmaceutical and sensor-networking applications thanks to nanometric scaling and improvements in CMOS reliability. In this thesis, three main achievements in ultra-low-power adders are presented. First, a new majority-function algorithm for carry and sum generation is presented. Then, with this algorithm and the new architecture it implies, we achieve a circuit that operates from a 75 mV supply voltage. Last but not least, a 64-bit current-mode majority-function adder based on the new architecture and algorithm is successfully tested at a 75 mV supply voltage. The circuit consumes 4.5 nW, or 3.8 pJ, under one of the worst-case conditions.
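    The abstract does not give the carry and sum equations. One standard way to express a full adder purely with 3-input majority functions, behaviourally checked in the Python sketch below, is cout = MAJ(a, b, cin) and sum = MAJ(NOT cout, MAJ(a, b, NOT cin), cin); this is a well-known majority-logic formulation and not necessarily the exact algorithm developed in the thesis.

        import random

        def maj(x: int, y: int, z: int) -> int:
            # 3-input majority of single bits.
            return (x & y) | (y & z) | (x & z)

        def majority_add(a: int, b: int, width: int = 64) -> int:
            # Ripple adder in which every carry and sum bit is produced by
            # 3-input majority functions only.
            carry, result = 0, 0
            for i in range(width):
                abit, bbit = (a >> i) & 1, (b >> i) & 1
                cout = maj(abit, bbit, carry)
                s = maj(cout ^ 1, maj(abit, bbit, carry ^ 1), carry)
                result |= s << i
                carry = cout
            return result                      # result is taken modulo 2**width

        for _ in range(1000):
            a, b = random.getrandbits(64), random.getrandbits(64)
            assert majority_add(a, b) == (a + b) % 2**64
        print("majority-function adder matches plain integer addition")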

    Evolutionary design of digital VLSI hardware
