MORA - an architecture and programming model for a resource efficient coarse grained reconfigurable processor
This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator are presented
Fast, area-efficient 32-bit LNS for computer arithmetic operations
PhD Thesis
The logarithmic number system has been proposed as an alternative to floating-point.
Multiplication, division and square-root operations are accomplished with fixed-point
arithmetic, but addition and subtraction are considerably more challenging.
Recent work has demonstrated that these operations too can be done with similar
speed and accuracy to their floating-point equivalents, but the necessary circuitry is
complex. In particular, it is dominated by the need for large lookup tables for the
storage of a non-linear function.
This thesis describes the architectures required to implement a newly designed
approach for producing a fast and area-efficient 32-bit LNS arithmetic unit. The
designs are structured around two different algorithms. First, a new cotransformation
procedure is introduced for the singularity region encountered when performing
subtractions; the technique requires less total storage than the cotransformation
method in the previous LNS architecture. Second, an improvement to
an existing interpolation process is proposed, which also reduces the total table size to an
extent that allows the tables to be synthesised easily in logic. Consequently, the total delays in the
system can be significantly reduced.
A comparison with the previous best LNS design and
floating-point units shows that the new LNS architecture offers
significantly better speed while keeping its accuracy within floating-point limits.
In addition, its implementation is more economical than the previous best LNS system
and almost equivalent to existing floating-point arithmetic units.
University Malaysia Perlis:
Ministry of Higher Education, Malaysia
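The contrast drawn in the abstract above (multiplication is cheap in LNS, addition and subtraction are not) can be illustrated with a minimal Python sketch. It assumes a simplified, sign-free LNS in which a positive value x is stored as e = log2(x); the function names are hypothetical, and the calls to `math.log2` stand in for the lookup tables, interpolation and cotransformation that a hardware design would use. It is not the thesis's architecture.

```python
import math

def to_lns(x: float) -> float:
    """Encode a positive real as its base-2 logarithm (sign handling omitted)."""
    return math.log2(x)

def from_lns(e: float) -> float:
    """Decode an LNS exponent back to a real value."""
    return 2.0 ** e

def lns_mul(a: float, b: float) -> float:
    """Multiplication is a plain addition of exponents."""
    return a + b

def lns_div(a: float, b: float) -> float:
    """Division is a plain subtraction of exponents."""
    return a - b

def lns_sqrt(a: float) -> float:
    """Square root halves the exponent (a shift in fixed-point hardware)."""
    return a / 2.0

def lns_add(a: float, b: float) -> float:
    """Addition needs the non-linear function log2(1 + 2**r), r <= 0."""
    hi, lo = max(a, b), min(a, b)
    r = lo - hi
    return hi + math.log2(1.0 + 2.0 ** r)

def lns_sub(a: float, b: float) -> float:
    """Subtraction (requires a > b) needs log2(1 - 2**r), r < 0.

    This function diverges as r approaches 0, which is the singularity
    region that makes table-based subtraction expensive.
    """
    r = b - a
    return a + math.log2(1.0 - 2.0 ** r)
```

For example, `from_lns(lns_add(to_lns(6.0), to_lns(7.0)))` recovers 13 up to floating-point rounding, while `lns_mul` never evaluates a logarithm at all.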
PGPG: An Automatic Generator of Pipeline Design for Programmable GRAPE Systems
We have developed PGPG (Pipeline Generator for Programmable GRAPE), a
software which generates the low-level design of the pipeline processor and
communication software for FPGA-based computing engines (FBCEs). An FBCE
typically consists of one or multiple FPGA (Field-Programmable Gate Array)
chips and local memory. Here, the term "Field-Programmable" means that one can
rewrite the logic implemented to the chip after the hardware is completed, and
therefore a single FBCE can be used for calculation of various functions, for
example pipeline processors for gravity, SPH interaction, or image processing.
The main problem with FBCEs is that the user needs to develop the detailed
hardware design for the processor to be implemented on FPGA chips. In addition,
she or he has to write the control logic for the processor, the communication and
data conversion library on the host processor, and the application program that
uses the developed processor. These require detailed knowledge of hardware
design, a hardware description language such as VHDL, the operating system and
the application, and the amount of human work is huge. A relatively simple design
would require 1 person-year or more. The PGPG software generates all necessary
design descriptions, except for the application software itself, from a
high-level design description of the pipeline processor in the PGPG language.
The PGPG language is a simple language, specialized to the description of
pipeline processors. Thus, the design of pipeline processor in PGPG language is
much easier than the traditional design. For real applications such as the
pipeline for gravitational interaction, the pipeline processor generated by
PGPG achieved performance similar to that of hand-written code. In this
paper we present a detailed description of PGPG version 1.0.
Comment: 24 pages, 6 figures, accepted PASJ 2005 July 2
A low cost reconfigurable soft processor for multimedia applications: design synthesis and programming model
This paper presents an FPGA implementation of a low cost 8 bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. This paper presents an investigation of the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model which allows low level programming. Throughput results for popular benchmarks coded using the programming model and cycle accurate simulator are presented
Improved MDLNS Number System Addition and Subtraction by Use of the Novel Co-Transformation
Multi-Dimensional Logarithmic Number System (MDLNS) is a generalized version of the Logarithmic Number System (LNS) which has multiple dimensions or bases. These generalizations can increase accuracy and hardware efficiency. However, addition and subtraction are the major obstacle in all logarithmic number system circuits, and a fair amount of research has been done to find practical techniques for implementing these operations efficiently in LNS without the need for large tables. To this end, several methods such as interpolation, multipartite tables, and co-transformation have been introduced to decrease cost and complexity. One of the most recent works is Novel Co-transformation. This thesis investigates the application of the Novel Co-Transformation to MDLNS. The goal is to reduce the table sizes relative to a previously published method, which uses a different address decoder on its tables and requires greater overhead. The results show that the table sizes are reduced significantly when a minimal error is allowed. Other common LNS techniques for table reduction may be applied to obtain better results.
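As a rough illustration of what "multiple dimensions or bases" means, the following Python sketch models a two-dimensional representation in which a value is a sign times a product of two bases raised to integer exponents. The class name `MDLNS2`, the helper names, and the choice of 3 as the second base are assumptions made for this example only; they are not taken from the thesis. As in ordinary LNS, multiplication reduces to exponent addition, while addition and subtraction would still need tables or conversion.

```python
from typing import NamedTuple

SECOND_BASE = 3.0  # assumed second base, chosen only for illustration

class MDLNS2(NamedTuple):
    """Two-dimensional logarithmic value: sign * 2**a * SECOND_BASE**b."""
    sign: int  # -1, 0 or +1
    a: int     # exponent of base 2
    b: int     # exponent of the second base

def value(x: MDLNS2) -> float:
    """Convert the representation back to a real number."""
    return x.sign * (2.0 ** x.a) * (SECOND_BASE ** x.b)

def mul(x: MDLNS2, y: MDLNS2) -> MDLNS2:
    """Multiply by adding exponents in each dimension independently."""
    return MDLNS2(x.sign * y.sign, x.a + y.a, x.b + y.b)

def div(x: MDLNS2, y: MDLNS2) -> MDLNS2:
    """Divide (y must be non-zero) by subtracting exponents per dimension."""
    return MDLNS2(x.sign * y.sign, x.a - y.a, x.b - y.b)
```

With these definitions, `mul(MDLNS2(1, 3, -1), MDLNS2(-1, -2, 2))` yields the exponent pair (1, 1) with a negative sign, i.e. the value -6.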
Application-Specific Number Representation
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), enable
application-specific number representations. Well-known number formats include fixed-point,
floating-point, logarithmic number system (LNS), and residue number system (RNS). Such different
number representations lead to different arithmetic designs and error behaviours, thus
producing implementations with different performance, accuracy, and cost.
To investigate the design options in number representations, the first part of this thesis presents
a platform that enables automated exploration of the number representation design space. The
second part of the thesis shows case studies that optimise the designs for area, latency or
throughput from the perspective of number representations.
Automated design space exploration in the first part addresses the following two major issues:
- Automation requires arithmetic unit generation. This thesis provides optimised
arithmetic library generators for logarithmic and residue arithmetic units, which support
a wide range of bit widths and achieve significant improvement over previous designs.
- Generation of arithmetic units requires specifying the bit widths for each
variable. This thesis describes an automatic bit-width optimisation tool called R-Tool,
which combines dynamic and static analysis methods, and supports different number
systems (fixed-point, floating-point, and LNS numbers).
Putting it all together, the second part explores the effects of application-specific number
representation on practical benchmarks, such as radiative Monte Carlo simulation, and seismic
imaging computations. Experimental results show that customising the number representations
brings benefits to hardware implementations: by selecting a more appropriate number format,
we can reduce the area cost by up to 73.5% and improve the throughput by 14.2% to 34.1%; by
performing the bit-width optimisation, we can further reduce the area cost by 9.7% to 17.3%.
On the performance side, hardware implementations with customised number formats achieve
5 to potentially over 40 times speedup over software implementations
Energy-efficient design and implementation of turbo codes for wireless sensor network
The objective of this thesis is to apply near Shannon limit Error-Correcting Codes (ECCs), particularly the turbo-like codes, to energy-constrained wireless devices, for the purpose of extending their lifetime. Conventionally, sophisticated ECCs are applied to applications, such as mobile telephone networks or satellite television networks, to facilitate long range and high throughput wireless communication. For low power applications, such as Wireless Sensor Networks (WSNs), these ECCs were not considered due to their high decoder complexities. In particular, the energy efficiency of the sensor nodes in WSNs is one of the most important factors in their design. The processing energy consumption required by high complexity ECC decoders is a significant drawback, which impacts upon the overall energy consumption of the system. However, as Integrated Circuit (IC) processing technology is scaled down, the processing energy consumed by hardware resources reduces exponentially. As a result, near Shannon limit ECCs have recently begun to be considered for use in WSNs to reduce the transmission energy consumption [1,2]. However, to ensure that the transmission energy consumption reduction granted by the employed ECC makes a positive improvement on the overall energy efficiency of the system, the processing energy consumption must still be carefully considered.
The main subject of this thesis is to optimise the design of turbo codes at both an algorithmic and a hardware implementation level for WSN scenarios. The communication requirements of the target WSN applications, such as communication distance, channel throughput, network scale, transmission frequency, network topology, etc., are investigated. Those requirements are important factors for designing a channel coding system. Especially when energy resources are limited, the trade-off between the requirements placed on different parameters must be carefully considered, in order to minimise the overall energy consumption.
Moreover, based on this investigation, the advantages of employing near Shannon limit ECCs in WSNs are discussed. Low complexity and energy-efficient hardware implementations of the ECC decoders are essential for the target applications
Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision
Modern graphics processing units (GPUs) provide impressive computing
resources, which can be accessed conveniently through the CUDA programming
interface. We describe how GPUs can be used to considerably speed up molecular
dynamics (MD) simulations for system sizes ranging up to about 1 million
particles. Particular emphasis is put on the numerical long-time stability in
terms of energy and momentum conservation, and caveats on limited
floating-point precision are issued. Strict energy conservation over 10^8 MD
steps is obtained by double-single emulation of the floating-point arithmetic
in accuracy-critical parts of the algorithm. For the slow dynamics of a
supercooled binary Lennard-Jones mixture, we demonstrate that the use of
single-precision floating-point arithmetic may result in quantitatively and even
physically wrong results. For simulations of a Lennard-Jones fluid, the
described implementation shows speedup factors of up to 80 compared to a serial
implementation for the CPU, and a single GPU was found to compare with a
parallelised MD simulation using 64 distributed cores.
Comment: 12 pages, 7 figures, to appear in Comp. Phys. Comm., HALMD package
licensed under the GPL, see http://research.colberg.org/projects/halm
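The "double-single emulation" mentioned in the abstract above represents one high-precision number as an unevaluated sum of two single-precision values. The Python sketch below shows the general idea using Knuth's error-free two-sum; it is an illustration under stated assumptions, not the HALMD implementation. Single precision is emulated by rounding through `struct`, and the names `f32`, `two_sum`, `ds_add` and `accumulate` are hypothetical.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a: float, b: float):
    """Error-free float32 addition: returns (s, e) with s + e == a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_add(x, y):
    """Add two double-single numbers, each given as a (hi, lo) pair of float32 values."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # renormalise so that lo stays small relative to hi
    return hi, lo

def accumulate(n: int, step: float = 0.1):
    """Sum `step` n times in plain float32 and in double-single arithmetic."""
    step = f32(step)
    plain = 0.0
    ds = (0.0, 0.0)
    for _ in range(n):
        plain = f32(plain + step)
        ds = ds_add(ds, (step, 0.0))
    return plain, ds[0] + ds[1]
```

Calling `accumulate(100_000)` shows the effect the abstract warns about on a small scale: the plain single-precision sum drifts visibly away from `100_000 * f32(0.1)`, while the double-single sum stays close to it. Long integrations that accumulate many small increments are sensitive to the same kind of rounding drift.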