227 research outputs found
Fast HUB Floating-point Adder for FPGA
Several previous publications have shown the area
and delay reduction when implementing real number computation
using HUB formats for both floating-point and fixed-point.
In this paper, we present a HUB floating-point adder for FPGA
which greatly improves the speed of previous proposed HUB
designs for these devices. Our architecture is based on the double
path technique which reduces the execution time since each
path works in parallel. We also deal with the implementation of
unbiased rounding in the proposed adder. Experimental results
are presented showing the goodness of the new HUB adder for
FPGA.TIN2016- 80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147
Measuring Improvement when Using HUB Formats to Implement Floating-Point Systems under Round-to-Nearest
MEC bajo TIN2013-42253-PThis paper analyzes the benefits of using HUB
formats to implement floating-point arithmetic under round-tonearest
mode from a quantitative point of view. Using HUB
formats to represent numbers allows the removal of the rounding
logic of arithmetic units, including sticky-bit computation. This
is shown for floating-point adders, multipliers, and converters.
Experimental analysis demonstrates that HUB formats and the
corresponding arithmetic units maintain the same accuracy as
conventional ones. On the other hand, the implementation of
these units, based on basic architectures, shows that HUB formats
simultaneously improve area, speed, and power consumption.
Specifically, based on data obtained from the synthesis, a HUB
single-precision adder is about 14% faster but consumes 38% less
area and 26% less power than the conventional adder. Similarly, a
HUB single-precision multiplier is 17% faster, uses 22% less area,
and consumes slightly less power than conventional multiplier. At
the same speed, the adder and multiplier achieve area and power
reductions of up to 50% and 40%, respectively
Efficient floating-point givens rotation unit
This is a post-peer-review, pre-copyedit version of an article published in Circuits, Systems, and Signal Processing.High-throughput QR decomposition is a key operation in many advanced signal processing and communication applications. For some of these applications, using floating-point computation is becoming almost compulsory. However, there are scarce works in hardware implementations of floating-point QR decomposition for embedded systems. In this paper, we propose a very efficient high-throughput floating-point Givens rotation unit for QR decomposition. Moreover, the initial proposed design for conventional number formats is enhanced by using the new Half-Unit Biased format. The provided error
analysis shows the effectiveness of our proposals and the trade-off of different implementation parameters. We also present FPGA implementation results and a thorough comparison between both approaches. These implementation results also reveal outstanding improvements compared to other previous similar designs in terms of area, latency, and throughput.This work was supported in part by following Spanish projects:
TIN2016-80920-R, and JA2012 P12-TIC-169
PRODUCTIVELY SCALING HARDWARE DESIGNS OVER INCREASING RESOURCES USING A SYSTEMATIC DESIGN ANALYSIS APPROACH
As processor development shifts from strict single core frequency scaling to het- erogeneous resource scaling two important considerations require evaluation. First, how to design systems with an increasing amount of heterogeneous resources, and second, how to maintain a designer’s productivity as the number of possible con- figurations grows. Therefore, it is necessary to determine what useful information can be gathered from existing designs to help predict or identify a design’s potential scalability, as well as, identifying which routine tasks can be automated to improve a designer’s productivity. Moreover, once this information is collected, how can this information be conveyed to the designer such that it can be used to increase overall productivity when implementing the design over increasing amounts of resources?
This research looks at various approaches to analyze designs and attempts to distribute an application efficiently across a heterogeneous cluster of computing re- sources through the use of a Systematic Design Analysis flow and an assortment of productivity tools. These tools provide the designer with projections on the amount of resources needed to scale an existing design to a specified performance, as well as, projecting the performance based on a specified amount of resources. This is accomplished through the combination of static HDL profiling, component synthesis resource utilization, and runtime performance monitoring. For evaluation, four case studies are presented to demonstrate the proposed flow’s scalability on a small scale cluster of FPGAs. The results are highly favorable, providing orders of magnitude speedup with minimal intervention from the designer
A new floating-point adder FPGA-based implementation using RN-coding of numbers
A well-known problem in the computer science area is related to numerical data representation,
which directly affects adder circuits’ design and a reason to have different formats: IEEE Std.
754, Half-Unit-Biased (HUB), and Round-to-Nearest (RN). RN has an advantage that rounding
to nearest is equivalent to a word truncation. It avoids double rounding errors and intermediate
rounding steps with an exact conversion between formats, making it applicable to general
problems. However, there is a lack of research on the hardware implementation of the RN
representation. In this work, we propose hardware architectures for binary and floating-point
adders, analyzing for the latter its performance in terms of error and resource consumption
in FPGAs. To accomplish this, we have developed a one-bit RN-based adder that allows
modular designs, considering an efficient signal propagation to obtain new architectures for
both binary and floating-point single-precision adders. The results open new perspectives for
further applications
Datacenter Design for Future Cloud Radio Access Network.
Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers.
I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding.
Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4Ă— to 16Ă— fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13Ă— more energy and 6Ă— more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform.
Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
- …