Search CORE

357 research outputs found

Recommended from our members

Silicon compilation

Author: Dutt Nikil D.
Gajski Daniel D.
Pangrle Barry M.
Publication venue: eScholarship, University of California
Publication date: 01/01/1987
Field of study

Silicon compilation is a term used for many different purposes. In this paper we define silicon compilation as a mapping from some higher level description into layout. We define the basic issues in structural and behavioral silicon compilation and some possible solutions to those issues. Finally, we define the concept of an intelligent silicon compiler in which the compiler evaluates the quality of the generated design and attempts to improve it if it is not satisfactory

eScholarship - University of California

Simulated annealing based datapath synthesis

Author: Neil John Paul
Publication venue: The University of Edinburgh
Publication date: 01/01/1994
Field of study

Edinburgh Research Archive

Synchronous elasticization: considerations for correct implementation and miniMIPS case study

Author: Kilada Eliyah
Stevens Kenneth
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010
Field of study

Journal ArticleLatency insensitivity is a promising design paradigm in the nanometer era since it has potential benefits of increased modularity and robustness to variations. Synchronous elasticization is one approach (among others) of transforming an ordinary clocked circuit into a latency insensitive design. This paper presents practical considerations of elasticizing reconvergent fanouts. It also investigates the suitability of previously published as well as new join and fork implementations for usage in the elastic control network. We demonstrate that elasticization comes at a cost. Measurements of a MiniMIPS processor fabricated in a 0.5 μm node show that elasticization results in area and dynamic and idle power penalties of 29%, 13% and 58.3%, respectively, without any loss in performance. These measurements do not exploit the capability of pipeline bubbles that occur if one needs to have unpredictable interface latency, or to insert extra bubbles into a pipeline due to wire delays. We finally show the architectural performance advantage of eager over lazy protocols in the presence of bubbles in the MiniMIPS

The University of Utah: J. Willard Marriott Digital Library

Digital implementation of the cellular sensor-computers

Author: Benthien
Bolotski
Catrysse
Diamantaras
Domínguez-Castro
Dudek
Dudek
Dudek
El Gamal
El Gamal
Espejo
Foldesy
Gielen
Grossberg
Hamamoto
Herbordt
Johansson
Kleinfelder
Linan
Liu
Liñán
Liñán
Morris
Murray
Nayar
Ohta
Roska
Roska
Roska
Rudack
Schneider
Serra
Sinno
Tian
Wagner
Wanhammar
Zarándy
Publication venue: 'Wiley'
Publication date: 01/01/2006
Field of study

Two different kinds of cellular sensor-processor architectures are used nowadays in various applications. The first is the traditional sensor-processor architecture, where the sensor and the processor arrays are mapped into each other. The second is the foveal architecture, in which a small active fovea is navigating in a large sensor array. This second architecture is introduced and compared here. Both of these architectures can be implemented with analog and digital processor arrays. The efficiency of the different implementation types, depending on the used CMOS technology, is analyzed. It turned out, that the finer the technology is, the better to use digital implementation rather than analog

Crossref

SZTAKI Publication Repository

Repository of the Academy's Library

Dynamically reconfigurable asynchronous processor

Author: Fawaz Khodor Ahmad
Publication venue: The University of Edinburgh
Publication date: 25/06/2012
Field of study

The main design requirements for today's mobile applications are: · high throughput performance. · high energy efficiency. · high programmability. Until now, the choice of platform has often been limited to Application-Specific Integrated Circuits (ASICs), due to their best-of-breed performance and power consumption. The economies of scale possible with these high-volume markets have traditionally been able to hide the high Non-Recurring Engineering (NRE) costs required for designing and fabricating new ASICs. However, with the NREs and design time escalating with each generation of mobile applications, this practice may be reaching its limit. Designers today are looking at programmable solutions, so that they can respond more rapidly to changes in the market and spread costs over several generations of mobile applications. However, there have been few feasible alternatives to ASICs: Digital Signals Processors (DSPs) and microprocessors cannot meet the throughput requirements, whereas Field-Programmable Gate Arrays (FPGAs) require too much area and power. Coarse-grained dynamically reconfigurable architectures offer better solutions for high throughput applications, when power and area considerations are taken into account. One promising example is the Reconfigurable Instruction Cell Array (RICA). RICA consists of an array of cells with an interconnect that can be dynamically reconfigured on every cycle. This allows quite complex datapaths to be rendered onto the fabric and executed in a single configuration - making these architectures particularly suitable to stream processing. Furthermore, RICA can be programmed from C, making it a good fit with existing design methodologies. However the RICA architecture has a drawback: poor scalability in terms of area and power. As the core gets bigger, the number of sequential elements in the array must be increased significantly to maintain the ability to achieve high throughputs through pipelining. As a result, a larger clock tree is required to synchronise the increased number of sequential elements. The clock tree therefore takes up a larger percentage of the area and power consumption of the core. This thesis presents a novel Dynamically Reconfigurable Asynchronous Processor (DRAP), aimed at high-throughput mobile applications. DRAP is based on the RICA architecture, but uses asynchronous design techniques - methods of designing digital systems without clocks. The absence of a global clock signal makes DRAP more scalable in terms of power and area overhead than its synchronous counterpart. The DRAP architecture maintains most of the benefits of custom asynchronous design, whilst also providing programmability via conventional high-level languages. Results show that the DRAP processor delivers considerably lower power consumption when compared to a market-leading Very Long Instruction Word (VLIW) processor and a low-power ARM processor. For example, DRAP resulted in a reduction in power consumption of 20 times compared to the ARM7 processor, and 29 times compared to the TIC64x VLIW, when running the same benchmark capped to the same throughput and for the same process technology (0.13μm). When compared to an equivalent RICA design, DRAP was up to 22% larger than RICA but resulted in a power reduction of up to 1.9 times. It was also capable of achieving up to 2.8 times higher throughputs than RICA for the same benchmarks

Edinburgh Research Archive

FPGA Frequency Domain Based Gps Coarse Acquisition Processor using FFT

Author: Sajabi Cyprian D.
Publication venue: CORE Scholar
Publication date: 01/01/2006
Field of study

The Global Positioning System or GPS is a satellite based technology that has gained widespread use worldwide in civilian and military applications. Direct Sequence Spread spectrum (DSSS) is the method whereby the data transmitted by the satellite and received by user is kept secure, low power and relatively noise-immune. The first step required in the GPS operation is to perform a lock on the incoming signal, both with respect to time synchronization and frequency resolution. Because of the need for reduced time to lock and also reduced hardware, algorithms based in the frequency domain have been developed. These algorithms take advantage of the time to frequency matrix operation known as the fast Fourier transform or FFT. For this thesis, a Direct Sequence Spread Spectrum Coarse Acquisition code processor based on the FFT was implemented in VHDL and targeted to a Xilinx Virtex –II Pro Field Programmable Gate Array (FPGA). The use of the FFT allows simultaneous lock on coarse acquisition (C/A) code and carrier frequency. Because of hardware limitations, a novel technique of sub-sampling is used in this system to obtain data block sizes that match hardware limitations. In addition, design challenges related to scheduling and timing were addressed, allowing a system with 19 pipeline stages to be built. The system, which fits on a Xilinx Virtex-II pro XC2VP70 FPGA, uses 10 ms of data to perform the lock with 5.5 ms of processing time at 100 MHz and theoretically can operate on signals 20 db below the noise floor

OhioLINK Electronic Thesis and Dissertation Center

CORE

On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors

Author: Christian Plessl
Mariusz Grad
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2012
Field of study

Reconfigurable instruction set processors provide the possibility of tailor the instruction set of a CPU to a particular application. While this customization process could be performed during runtime in order to adapt the CPU to the currently executed workload, this use case has been hardly investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation of the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes for custom instruction identification and hardware generation. These overheads will be compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significant execution times in the order of days to amortize the overheads

Crossref

Directory of Open Access Journals

Modeling of a hardware VLSI placement system: Accelerating the Simulated Annealing algorithm

Author: Batts William Merle, Jr.
Publication venue: RIT Scholar Works
Publication date: 01/07/2005
Field of study

An essential step in the automation of electronic design is the placement of the physical components on the target semiconductor die. The placement step presents the opportunity to reduce costs in terms of wire length and performance degradation; however it is compute intensive and is NP-complete in terms of obtaining an optimal solution. As designs have grown in complexity and gate count, obtaining an optimal solution is not feasible due to time to market constraints or sheer compute effort required. Heuristic algorithms allow for efficient but sub-optimal designs to be produced with a reduction in processing time. A widely used algorithm is Simulated Annealing (SA). The goal of this work was to develop a model that would enable an analysis into the feasibility of developing a hardware accelerated placement system which uses SA at its core. The SA heuristic was analyzed for possible improvements in efficiency with focus given to targeting the system for hardware. A solution implementing parallel computing with specialized hardware configurations inside a field programmable gate array (FPGA) was investigated as having the possibility to improve the efficiency of the SA-based algorithm. All supporting subsystems were also described for a hardware accelerated model. A large speedup was analytically shown from both accelerating the critical path of the SA algorithm as well as novel methods of improving SA\u27s efficiency. As data throughput requirements were not included in this work, the results presented may be optimistic for an overall system speedup. However, the results clearly show that future work is warranted in studying the concept of a hardware accelerated placement system

RIT Scholar Works

FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

Author: Anderson Jason H.
Brothers John
Grady Brett
Kim Jin Hee
Lian Ruolong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/07/2018
Field of study

A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision, and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS

arXiv.org e-Print Archive

Crossref