13 research outputs found
Viterbi Accelerator for Embedded Processor Datapaths
We present a novel architecture for a lightweight Viterbi accelerator that can be tightly integrated inside an embedded processor. We investigate the accelerator’s impact on processor performance by using the EEMBC Viterbi benchmark and the in-house Viterbi Branch Metric kernel. Our evaluation based on the EEMBC benchmark shows that an accelerated 65-nm 2.7-ns processor datapath is 20% larger but 90% more cycle efficient than a datapath lacking the Viterbi accelerator, leading to an 87% overall energy reduction and a data throughput of 3.52 Mbit/s.
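As a rough illustration of the kind of computation such an accelerator offloads, the following C sketch computes hard-decision branch metrics for a rate-1/2 convolutional code. The constraint length and generator polynomials (K = 7, 0x6D/0x4F) are illustrative assumptions; the paper does not specify the kernel's parameters.

```c
#include <stdint.h>

/* Sketch of a Viterbi branch-metric kernel for a rate-1/2 convolutional
 * code (hypothetical K=7 encoder, generator polynomials 0x6D/0x4F; the
 * paper's actual kernel parameters are not given). The branch metric is
 * the Hamming distance between the received hard-decision symbol pair
 * and the pair the encoder would emit for a (state, input-bit) edge. */
static uint8_t expected_symbol(uint8_t state, uint8_t bit)
{
    uint8_t reg = (uint8_t)((state << 1) | bit);  /* shift in the new bit   */
    uint8_t g0 = __builtin_parity(reg & 0x6D);    /* generator G0 (GCC/Clang
                                                     parity builtin)        */
    uint8_t g1 = __builtin_parity(reg & 0x4F);    /* generator G1           */
    return (uint8_t)((g0 << 1) | g1);
}

/* Number of bit positions in which the received 2-bit symbol differs
 * from the expected 2-bit symbol. */
uint8_t branch_metric(uint8_t received, uint8_t state, uint8_t bit)
{
    uint8_t diff = (uint8_t)((received ^ expected_symbol(state, bit)) & 0x3);
    return (uint8_t)((diff & 1) + (diff >> 1));
}
```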
High-Speed and Low-Power Multipliers Using the Baugh-Wooley Algorithm and HPM Reduction Tree
The modified-Booth algorithm is extensively used for high-speed multiplier circuits. In the days of array multipliers, the reduced number of generated partial products significantly improved multiplier performance. In designs based on reduction trees with logarithmic logic depth, however, the reduced number of partial products has a limited impact on overall performance. The Baugh-Wooley algorithm is a different scheme for signed multiplication, but it is not so widely adopted because it can be complicated to deploy on irregular reduction trees. We use the Baugh-Wooley algorithm in our High Performance Multiplier (HPM) tree, which combines a regular layout with a logarithmic logic depth. We show for a range of operator bit-widths that, when implemented in 130-nm and 65-nm process technologies, Baugh-Wooley multipliers exhibit delay comparable to that of modified-Booth multipliers, while dissipating less power and occupying a smaller area footprint.
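For readers unfamiliar with the scheme, the Baugh-Wooley algorithm handles two's-complement operands by inverting the partial products that involve a sign bit and adding two correction constants, so all terms can be summed without sign extension. The following self-contained C model (our own sketch, not the paper's implementation) builds the product from the Baugh-Wooley partial-product array for 8-bit operands and checks it exhaustively against ordinary signed multiplication.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define N 8  /* operand bit-width */

/* Bit i of x. */
static unsigned bit(unsigned x, int i) { return (x >> i) & 1u; }

/* Baugh-Wooley signed multiply, modeled at the partial-product level:
 * plain AND terms for the magnitude bits, inverted terms where a sign
 * bit is involved, plus the +2^n and +2^(2n-1) correction constants. */
uint16_t bw_multiply(int8_t a, int8_t b)
{
    unsigned ua = (uint8_t)a, ub = (uint8_t)b;
    uint32_t sum = 0;

    for (int i = 0; i < N - 1; i++)          /* magnitude x magnitude */
        for (int j = 0; j < N - 1; j++)
            sum += (uint32_t)(bit(ua, i) & bit(ub, j)) << (i + j);

    sum += (uint32_t)(bit(ua, N - 1) & bit(ub, N - 1)) << (2 * N - 2);

    for (int j = 0; j < N - 1; j++)          /* inverted sign terms   */
        sum += (uint32_t)(1u ^ (bit(ua, N - 1) & bit(ub, j))) << (N - 1 + j);
    for (int i = 0; i < N - 1; i++)
        sum += (uint32_t)(1u ^ (bit(ua, i) & bit(ub, N - 1))) << (N - 1 + i);

    sum += 1u << N;            /* +2^n correction      */
    sum += 1u << (2 * N - 1);  /* +2^(2n-1) correction */

    return (uint16_t)sum;      /* product modulo 2^(2n) */
}

int main(void)
{
    for (int a = -128; a < 128; a++)
        for (int b = -128; b < 128; b++)
            assert(bw_multiply((int8_t)a, (int8_t)b) == (uint16_t)(a * b));
    puts("Baugh-Wooley model matches two's-complement multiplication");
    return 0;
}
```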
The Case for HPM-Based Baugh-Wooley Multipliers
The modified-Booth algorithm is extensively used for high-speed multiplier circuits. In the days of array multipliers, the reduced number of generated partial products significantly improved multiplier performance. In designs based on reduction trees with logarithmic logic depth, however, the reduced number of partial products has a limited impact on overall performance. The Baugh-Wooley algorithm is a different scheme for signed multiplication, but it is not so widely adopted because it can be complicated to deploy on irregular reduction trees. We use the Baugh-Wooley algorithm in our High Performance Multiplier (HPM) tree, which combines a regular layout with a logarithmic logic depth. We show for a range of operator bit-widths that, when implemented in 130-nm and 65-nm process technologies, Baugh-Wooley multipliers exhibit delay comparable to that of modified-Booth multipliers, while dissipating less power and occupying a smaller area footprint.
A Flexible Datapath Interconnect for Embedded Applications
We investigate the effects of introducing a flexible interconnect into an exposed datapath. We define an exposed datapath as a traditional GPP datapath that has its normal control removed, leading to the exposure of a wide control word. For an FFT benchmark, the introduction of a flexible interconnect reduces the total execution time by 16%. Compared to a traditional GPP, the execution time for an exposed datapath using a flexible interconnect is 32% shorter, whereas the energy dissipation is 29% lower. Our investigation is based on a cycle-accurate architectural simulator, and figures on delay, power, and area are obtained from placed-and-routed layouts in a commercial 0.13-µm technology. The results from our case studies indicate that, by utilizing a flexible interconnect, significant performance gains can be achieved for generic applications.
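To make the idea of an exposed control word concrete, here is a hypothetical C encoding of one such word: every functional-unit input port carries its own source selector, so the flexible interconnect can route any producing port to any consuming port each cycle. All field names and widths are illustrative; this is not the paper's actual control-word format.

```c
#include <stdint.h>

/* Producing ports on the datapath; the selector fields below take
 * these values (hypothetical port list). */
enum source_port {
    SRC_ALU_OUT, SRC_LSU_OUT, SRC_RF_A, SRC_RF_B, SRC_IMM
};

/* One wide control word for an exposed datapath with a flexible
 * interconnect: operation fields plus one route selector per
 * consuming port. */
typedef struct {
    unsigned alu_op    : 4;  /* ALU operation select               */
    unsigned alu_src_a : 3;  /* interconnect route to ALU input A  */
    unsigned alu_src_b : 3;  /* interconnect route to ALU input B  */
    unsigned lsu_op    : 2;  /* load/store unit operation          */
    unsigned lsu_src   : 3;  /* route to LSU address/data input    */
    unsigned rf_we     : 1;  /* register-file write enable         */
    unsigned rf_waddr  : 5;  /* register-file write address        */
    unsigned rf_wsrc   : 3;  /* route to register-file write port  */
    uint16_t imm;            /* inline immediate                   */
} control_word;
```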
Reconfigurable Instruction Decoding for a Wide-Control-Word Processor
Fine-grained control through the use of a wide control word can lead to high instruction-level parallelism, but unless compressed, the words require a large memory footprint. A reconfigurable fixed-length decoding scheme can be created by taking advantage of the fact that an application only uses a subset of the datapath for its execution. We present the first complete implementation of the FlexCore processor, integrating a wide-control-word datapath with a run-time reconfigurable instruction decompressor. Our evaluation, using three different EEMBC benchmarks, shows that it is possible to reach up to a 35% speedup compared to a five-stage pipelined MIPS processor, assuming the same datapath units. In addition, our VLSI implementations show that this FlexCore processor offers up to 24% higher energy efficiency than the MIPS reference processor.
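A minimal sketch of such a decompressor, under the assumption of a table-based scheme: each fixed-length instruction indexes a decode table that is reloaded per application with only the wide-control-word patterns that application actually uses. Table size, word width, and the inline-immediate patching below are illustrative, not the FlexCore encoding.

```c
#include <stdint.h>
#include <string.h>

#define CW_BYTES   16   /* width of the native control word (illustrative) */
#define TABLE_SIZE 256  /* number of decode-table entries (illustrative)   */

/* One expanded wide control word. */
typedef struct { uint8_t bits[CW_BYTES]; } control_word;

/* Reconfigurable decode table, filled at application load time. */
static control_word decode_table[TABLE_SIZE];

/* Load the table for the current application. */
void configure_decoder(const control_word *patterns, int n)
{
    memcpy(decode_table, patterns, (size_t)n * sizeof(control_word));
}

/* Decompress one fixed-length instruction: the opcode field indexes
 * the table; an immediate field is patched into the expanded word. */
control_word decode(uint16_t insn)
{
    control_word cw = decode_table[insn >> 8]; /* 8-bit table index    */
    cw.bits[0] = (uint8_t)(insn & 0xFF);       /* 8-bit inline payload */
    return cw;
}
```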
Double Throughput Multiply-Accumulate Unit for FlexCore Processor Enhancements
As a simple five-stage General-Purpose Processor (GPP), the baseline FlexCore processor has a limited set of datapath units. By utilizing a flexible datapath interconnect and a wide control word, a FlexCore processor is explicitly designed to support the integration of special units that, on demand, can accelerate certain data-intensive applications. In this paper, we propose the integration of a novel Double Throughput Multiply-Accumulate (DTMAC) unit, whose different operating modes allow for on-the-fly optimization of computational precision. For the two EEMBC benchmarks considered, the FlexCore processor performance is significantly enhanced when one DTMAC accelerator is included, translating into reduced execution time and energy dissipation. In comparison to the 32-bit GPP reference, the accelerated 32-bit FlexCore processor shows a 4.37x improvement in execution time and a 3.92x reduction in energy dissipation for a benchmark with many consecutive 16-bit MAC operations.
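A behavioral C sketch of the double-throughput idea, assuming a SIMD-style packed mode: in full-precision mode the unit performs one 32x32 MAC, while in double-throughput mode each 32-bit operand is treated as two packed signed 16-bit halves and two independent 16x16 MACs complete per cycle. The interface below is our illustration, not the DTMAC's actual port list.

```c
#include <stdint.h>

/* Full-precision mode: one 32x32 MAC step into a wide accumulator. */
int64_t mac32(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* Double-throughput mode: two independent 16x16 MAC steps per cycle
 * on operands packed as {high half, low half}. */
void mac16x2(int64_t acc[2], uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(a & 0xFFFF), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFF), b_hi = (int16_t)(b >> 16);
    acc[0] += (int32_t)a_lo * b_lo;   /* lower lane */
    acc[1] += (int32_t)a_hi * b_hi;   /* upper lane */
}
```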
Declarative, SAT-solver-based Scheduling for an Embedded Architecture with a Flexible Datapath
Much like VLIW, statically scheduled architectures that expose all control signals to the compiler offer much potential for highly parallel, energy-efficient performance. Bau is a novel compilation infrastructure that leverages the LLVM compilation tools and the MiniSAT solver to generate efficient code for one such exposed architecture. We first build a compiler construction library that allows scheduling and resource constraints to be expressed declaratively in a domain-specific language, and then use this library to implement a compiler that generates programs that are 1.2 to 1.5 times more compact than those produced by either a baseline MIPS R2K compiler or a basic-block-based, sequentially phased scheduler.
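To give a flavor of the approach, the sketch below emits, in the DIMACS format consumed by MiniSAT, the kind of clauses a SAT-based scheduler generates: variable v(i,t) is true iff instruction i issues in cycle t, each instruction issues exactly once, and a data dependence forces one instruction strictly after another. This is a toy encoding of the general technique, not Bau's actual constraint language.

```c
#include <stdio.h>

#define N 3  /* instructions */
#define T 4  /* schedule length in cycles */

/* DIMACS variables are 1-based: v(i,t) <-> instruction i in cycle t. */
static int v(int i, int t) { return i * T + t + 1; }

int main(void)
{
    printf("p cnf %d %d\n", N * T, 0); /* clause count omitted in sketch */

    /* Each instruction issues in at least one cycle. */
    for (int i = 0; i < N; i++) {
        for (int t = 0; t < T; t++) printf("%d ", v(i, t));
        printf("0\n");
    }
    /* ...and in at most one cycle (pairwise exclusion). */
    for (int i = 0; i < N; i++)
        for (int t = 0; t < T; t++)
            for (int u = t + 1; u < T; u++)
                printf("-%d -%d 0\n", v(i, t), v(i, u));
    /* Dependence: instruction 1 must issue strictly after instruction 0:
     * if v(0,t) holds, forbid v(1,u) for all u <= t. */
    for (int t = 0; t < T; t++)
        for (int u = 0; u <= t; u++)
            printf("-%d -%d 0\n", v(0, t), v(1, u));
    return 0;
}
```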
CREEP: Chalmers RTL-based Energy Evaluation of Pipelines
Energy estimation at the architectural level is vital, since early design decisions have the greatest impact on the final implementation of an electronic system. It is, however, a particular challenge to perform energy evaluations for processors: while the software presents the processor designer with methodological problems related to, e.g., the choice of benchmarks, technology scaling has made implementation properties depend strongly on, e.g., circuit optimizations such as those used during timing closure. However tempting it is to modularize the hardware, the common method of using decoupled pipeline building blocks for energy estimation is bound to neglect implementation and integration aspects that are increasingly important. We introduce CREEP, an energy-evaluation framework for processor pipelines, which at its core has an accurate 65-nm CMOS implementation model of different configurations of a MIPS-I-like pipeline, including level-1 caches. While CREEP by default uses already existing estimated post-layout data, it is also possible for an advanced user to modify the pipeline RTL code or retarget the RTL code to a different process technology. We describe the CREEP evaluation flow, the components and tools used, and demonstrate the framework by analyzing a few different processor configurations in terms of energy and performance.
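The sketch below shows the general shape of the activity-based energy roll-up such a framework supports: per-unit post-layout energy numbers combined with activation counts and leakage per cycle. All unit names and energy values are placeholders, not CREEP's calibrated data.

```c
#include <stdio.h>

/* Total energy = sum over units of (activations * energy-per-access)
 * plus (cycles * leakage-per-cycle). Numbers are placeholders. */
struct unit {
    const char *name;
    double e_access_pj;   /* dynamic energy per activation (pJ) */
    double e_leak_pj;     /* leakage energy per cycle (pJ)      */
    long   activations;
};

int main(void)
{
    struct unit pipeline[] = {
        { "fetch+decode",  1.8, 0.05, 1000000 },
        { "register file", 1.1, 0.03,  900000 },
        { "ALU",           0.9, 0.04,  700000 },
        { "L1-I cache",    8.5, 0.40, 1000000 },
        { "L1-D cache",    9.2, 0.45,  300000 },
    };
    long cycles = 1200000;   /* benchmark cycle count (placeholder) */
    double total_pj = 0.0;

    for (unsigned i = 0; i < sizeof pipeline / sizeof pipeline[0]; i++)
        total_pj += pipeline[i].activations * pipeline[i].e_access_pj
                  + cycles * pipeline[i].e_leak_pj;

    printf("estimated pipeline energy: %.1f uJ\n", total_pj * 1e-6);
    return 0;
}
```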
Reducing set-associative L1 data cache energy by early load data dependence detection (ELD³)
Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependence detection (ELD³) technique for in-order pipelines. This technique makes it possible to detect if a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are performed sequentially so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD³ technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
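A behavioral sketch of the dependence test, under assumed encodings: when a load sits in an early pipeline stage, the instructions inside the load-to-use window are checked for a read of the load's destination register, and the L1 DC access mode is chosen accordingly. Window depth and instruction fields are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOAD_TO_USE_WINDOW 2   /* cycles before the data is needed (assumed) */

typedef struct {
    uint8_t src1, src2;  /* source register numbers */
} insn_t;

/* True if any of the next `window` instructions reads register rd. */
static bool has_dependence(uint8_t rd, const insn_t *next, int window)
{
    for (int k = 0; k < window; k++)
        if (next[k].src1 == rd || next[k].src2 == rd)
            return true;
    return false;
}

enum l1_access { L1_PARALLEL, L1_SEQUENTIAL };

/* Access-mode decision made while the load is in an early pipeline stage. */
enum l1_access choose_access(uint8_t load_rd, const insn_t *next)
{
    return has_dependence(load_rd, next, LOAD_TO_USE_WINDOW)
         ? L1_PARALLEL      /* nearby use: all ways in parallel, no stall */
         : L1_SEQUENTIAL;   /* no nearby use: tag first, one data way     */
}
```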