Search CORE

227 research outputs found

Fast HUB Floating-point Adder for FPGA

Author: Gonzalez-Navarro Sonia
Hormigo-Aguilar Javier
Villalba-Moreno Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/10/2018
Field of study

Several previous publications have shown the area and delay reduction when implementing real number computation using HUB formats for both floating-point and fixed-point. In this paper, we present a HUB floating-point adder for FPGA which greatly improves the speed of previous proposed HUB designs for these devices. Our architecture is based on the double path technique which reduces the execution time since each path works in parallel. We also deal with the implementation of unbiased rounding in the proposed adder. Experimental results are presented showing the goodness of the new HUB adder for FPGA.TIN2016- 80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147

Crossref

Repositorio Institucional Universidad de Málaga

Measuring Improvement when Using HUB Formats to Implement Floating-Point Systems under Round-to-Nearest

Author: Hormigo-Aguilar Javier
Villalba-Moreno Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

MEC bajo TIN2013-42253-PThis paper analyzes the benefits of using HUB formats to implement floating-point arithmetic under round-tonearest mode from a quantitative point of view. Using HUB formats to represent numbers allows the removal of the rounding logic of arithmetic units, including sticky-bit computation. This is shown for floating-point adders, multipliers, and converters. Experimental analysis demonstrates that HUB formats and the corresponding arithmetic units maintain the same accuracy as conventional ones. On the other hand, the implementation of these units, based on basic architectures, shows that HUB formats simultaneously improve area, speed, and power consumption. Specifically, based on data obtained from the synthesis, a HUB single-precision adder is about 14% faster but consumes 38% less area and 26% less power than the conventional adder. Similarly, a HUB single-precision multiplier is 17% faster, uses 22% less area, and consumes slightly less power than conventional multiplier. At the same speed, the adder and multiplier achieve area and power reductions of up to 50% and 40%, respectively

Repositorio Institucional Universidad de Málaga

Efficient floating-point givens rotation unit

Author: Hormigo-Aguilar Javier
Muñoz Sergio
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2021
Field of study

This is a post-peer-review, pre-copyedit version of an article published in Circuits, Systems, and Signal Processing.High-throughput QR decomposition is a key operation in many advanced signal processing and communication applications. For some of these applications, using floating-point computation is becoming almost compulsory. However, there are scarce works in hardware implementations of floating-point QR decomposition for embedded systems. In this paper, we propose a very efficient high-throughput floating-point Givens rotation unit for QR decomposition. Moreover, the initial proposed design for conventional number formats is enhanced by using the new Half-Unit Biased format. The provided error analysis shows the effectiveness of our proposals and the trade-off of different implementation parameters. We also present FPGA implementation results and a thorough comparison between both approaches. These implementation results also reveal outstanding improvements compared to other previous similar designs in terms of area, latency, and throughput.This work was supported in part by following Spanish projects: TIN2016-80920-R, and JA2012 P12-TIC-169

Repositorio Institucional Universidad de Málaga

Run time reconfigurable DSP parallel processing system using dynamic FPGAs

Author: Murphy CW
Publication venue
Publication date
Field of study

LJMU Research Online (Liverpool John Moores University)

PRODUCTIVELY SCALING HARDWARE DESIGNS OVER INCREASING RESOURCES USING A SYSTEMATIC DESIGN ANALYSIS APPROACH

Author: NC DOCKS at The University of North Carolina at Charlotte
Schmidt Andrew Gregory
Publication venue
Publication date: 01/01/2011
Field of study

As processor development shifts from strict single core frequency scaling to het- erogeneous resource scaling two important considerations require evaluation. First, how to design systems with an increasing amount of heterogeneous resources, and second, how to maintain a designer’s productivity as the number of possible con- figurations grows. Therefore, it is necessary to determine what useful information can be gathered from existing designs to help predict or identify a design’s potential scalability, as well as, identifying which routine tasks can be automated to improve a designer’s productivity. Moreover, once this information is collected, how can this information be conveyed to the designer such that it can be used to increase overall productivity when implementing the design over increasing amounts of resources? This research looks at various approaches to analyze designs and attempts to distribute an application efficiently across a heterogeneous cluster of computing re- sources through the use of a Systematic Design Analysis flow and an assortment of productivity tools. These tools provide the designer with projections on the amount of resources needed to scale an existing design to a specified performance, as well as, projecting the performance based on a specified amount of resources. This is accomplished through the combination of static HDL profiling, component synthesis resource utilization, and runtime performance monitoring. For evaluation, four case studies are presented to demonstrate the proposed flow’s scalability on a small scale cluster of FPGAs. The results are highly favorable, providing orders of magnitude speedup with minimal intervention from the designer

The University of North Carolina at Greensboro

OpenCL acceleration on FPGA vs CUDA on GPU

Author: Quist R.
Publication venue
Publication date: 29/04/2019
Field of study

Pure OAI Repository

A new floating-point adder FPGA-based implementation using RN-coding of numbers

Author: Araujo Túlio
Arias-Garcia Janier
Cardoso Matheus B.R.
Llanos Carlos H.
Nepomuceno Erivelton
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021
Field of study

A well-known problem in the computer science area is related to numerical data representation, which directly affects adder circuits’ design and a reason to have different formats: IEEE Std. 754, Half-Unit-Biased (HUB), and Round-to-Nearest (RN). RN has an advantage that rounding to nearest is equivalent to a word truncation. It avoids double rounding errors and intermediate rounding steps with an exact conversion between formats, making it applicable to general problems. However, there is a lack of research on the hardware implementation of the RN representation. In this work, we propose hardware architectures for binary and floating-point adders, analyzing for the latter its performance in terms of error and resource consumption in FPGAs. To accomplish this, we have developed a one-bit RN-based adder that allows modular designs, considering an efficient signal propagation to obtain new architectures for both binary and floating-point single-precision adders. The results open new perspectives for further applications

MURAL - Maynooth University Research Archive Library

Datacenter Design for Future Cloud Radio Access Network.

Author: Zheng Qi
Publication venue
Publication date: 01/01/2015
Field of study

Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

Deep Blue Documents at the University of Michigan