Search CORE

118 research outputs found

A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM

Author: Grass Eckhard
Jagdhold Ulrich
Maharatna Koushik
Publication venue
Publication date: 01/03/2004
Field of study

In this article, we present a novel fixed-point 16-bit word-width 64-point FFT/IFFT processor developed primarily for the application in the OFDM based IEEE 802.11a Wireless LAN (WLAN) baseband processor. The 64-point FFT is realized by decomposing it into a 2-D structure of 8-point FFTs. This approach reduces the number of required complex multiplications compared to the conventional radix-2 64-point FFT algorithm. The complex multiplication operations are realized using shift-and-add operations. Thus, the processor does not use any 2-input digital multiplier. It also does not need any RAM or ROM for internal storage of coefficients. The proposed 64-point FFT/IFFT processor has been fabricated and tested successfully using our in-house 0.25 ?m BiCMOS technology. The core area of this chip is 6.8 mm2. The average dynamic power consumption is 41 mW @ 20 MHz operating frequency and 1.8 V supply voltage. The processor completes one parallel-to-parallel (i. e., when all input data are available in parallel and all output data are generated in parallel) 64-point FFT computation in 23 cycles. These features show that though it has been developed primarily for application in the IEEE 802.11a standard, it can be used for any application that requires fast operation as well as low power consumption

Southampton (e-Prints Soton)

Explore Bristol Research

Design of testbed and emulation tools

Author: Flynn M. J.
Lundstrom S. F.
Publication venue
Publication date
Field of study

The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems

NASA Technical Reports Server

A Streaming High-Throughput Linear Sorter System with Contention Buffering

Author: David Andrews
Jorge Ortiz
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2011
Field of study

Popular sorting algorithms do not translate well into hardware implementations. Instead, hardware-based solutions like sorting networks, systolic sorters, and linear sorters exploit parallelism to increase sorting efficiency. Linear sorters, built from identical nodes with simple control, have less area and latency than sorting networks, but they are limited in their throughput. We present a system composed of multiple linear sorters acting in parallel to increase overall throughput. Interleaving is used to increase bandwidth and allow sorting of multiple values per clock cycle, and the amount of interleaving and depth of the linear sorters can be adapted to suit specific applications. Contention for available linear sorters in the system is solved through the use of buffers that accumulate conflicting requests, dispatching them in bulk to reduce latency penalties. Implementation of this system into a field programmable gate array (FPGA) results in a speedup of 68 compared to a MicroBlaze processor running quicksort

Crossref

Directory of Open Access Journals

Zooplankton visualization system: design and real-time lossless image compression

Author: Tetala Satya Surya Dattatreya Reddy
Publication venue: LSU Digital Commons
Publication date: 01/01/2004
Field of study

In this thesis, I present a design of a small, self-contained, underwater plankton imaging system. I base the imaging system’s design on an embedded PC architecture based on PC/104-Plus standards to meet the compact size and low power requirements. I developed a simple graphical user interface to run on a real-time operating system to control the imaging system. I also address how a real-time image compression scheme implemented on an FPGA chip speeds up image transfer speeds of the imaging system. Since lossless compression of the image is required in order to retain all image details, I began with an established compression scheme like SPIHT, and latter proposed a new compression scheme that suits the imaging system’s requirements. I provide an estimate of the total amount of resources required and propose suitable FPGA chips to implement the compression scheme. Finally, I present various parallel designs by which the FPGA chip can be integrated into the imaging system

Louisiana State University

A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM

Author: E. Grass
K. Maharatna
U. Jagdhold
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

A low-power, high-performance speech recognition accelerator

Author: Arnau Montañés José María
González Colás Antonio María
Yazdani Reza
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Hardware acceleration of the trace transform for vision applications

Author: Fahmy Suhaib A.
Publication venue: Department of Electrical and Electronic Engineering, Imperial College London
Publication date: 01/01/2008
Field of study

Computer Vision is a rapidly developing field in which machines process visual data to extract meaningful information. Digitised images in their pixels and bits serve no purpose of their own. It is only by interpreting the data, and extracting higher level information that a scene can be understood. The algorithms that enable this process are often complex, and data-intensive, limiting the processing rate when implemented in software. Hardware-accelerated implementations provide a significant performance boost that can enable real- time processing. The Trace Transform is a newly proposed algorithm that has been proven effective in image categorisation and recognition tasks. It is flexibly defined allowing the mathematical details to be tailored to the target application. However, it is highly computationally intensive, which limits its applications. Modern heterogeneous FPGAs provide an ideal platform for accelerating the Trace transform for real-time performance, while also allowing an element of flexibility, which highly suits the generality of the Trace transform. This thesis details the implementation of an extensible Trace transform architecture for vision applications, before extending this architecture to a full flexible platform suited to the exploration of Trace transform applications. As part of the work presented, a general set of architectures for large-windowed median and weighted median filters are presented as required for a number of Trace transform implementations. Finally an acceleration of Pseudo 2-Dimensional Hidden Markov Model decoding, usable in a person detection system, is presented. Such a system can be used to extract frames of interest from a video sequence, to be subsequently processed by the Trace transform. All these architectures emphasise the need for considered, platform-driven design in achieving maximum performance through hardware acceleration

Spiral - Imperial College Digital Repository

Recommended from our members

The analysis and synthesis of a parallel sorting engine

Author: Ahn Byoungchul
Publication venue: 'Oregon State University'
Publication date
Field of study

This thesis is concerned with the development of a unique parallel sort-merge system suitable for implementation in VLSI. Two new sorting subsystems, a high performance VLSI sorter and a four-way merger, were also realized during the development process. In addition, the analysis of several existing parallel sorting architectures and algorithms was carried out. Algorithmic time complexity, VLSI processor performance, and chip area requirements for the existing sorting systems were evaluated. The rebound sorting algorithm was determined to be the most efficient among those considered. The rebound sorter algorithm was implemented in hardware as a systolic array with external expansion capability. The second phase of the research involved analyzing several parallel merge algorithms and their buffer management schemes. The dominant considerations for this phase of the research were the achievement of minimum VLSI chip area, design complexity, and logic delay. It was determined that the proposed merger architecture could be implemented in several ways. Selecting the appropriate microarchitecture for the merger, given the constraints of chip area and performance, was the major problem. The tradeoffs associated with this process are outlined. Finally, a pipelined sort-merge system was implemented in VLSI by combining a rebound sorter and a four-way merger on a single chip. The final chip size was 416 mils by 432 mils. Two micron CMOS technology was utilized in this chip realization. An overall throughput rate of 10M bytes/sec was achieved. The prototype system developed is capable of sorting thirty two 2-byte keys during each merge phase. If extended, this system is capable of economically sorting files of 100M bytes or more in size. In order to sort larger files, this design should be incorporated in a disk-based sort-merge system. A simplified disk I/O access model for such a system was studied. In this study the sort-merge system was assumed to be part of a disk controller subsystem

ScholarsArchive@OSU

Reconfigurable-Hardware Accelerated Stream Aggregation

Author: Ramakrishnan Geethakumari Prajith
Publication venue
Publication date: 01/01/2022
Field of study

High throughput and low latency stream aggregation is essential for many applications that analyze massive volumes of data in real-time. Incoming data need to be stored in a single sliding-window before processing, in cases where incremental aggregations are wasteful or not possible at all. However, storing all incoming values in a single-window puts tremendous pressure on the memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory but existing approaches employ only on-chip memory and therefore, can only support small problem sizes (i.e. small sliding windows and number of keys) due to the limited capacity. This thesis addresses the above limitations of stream processing systems by proposing techniques for accelerating single sliding-window stream aggregation using FPGAs to achieve line-rate processing throughput and ultra low latency. It does so first by building accelerators using FPGAs and second, by alleviating the memory pressure posed by single-window stream aggregation. The initial part of this thesis presents the accelerators for both windowing policies, namely, tuple- and time-\ua0based, using Maxeler\u27s DataFlow Engines\ua0(DFEs) which have a direct feed of incoming data from the network as well as direct access to off-chip DRAM. Compared to state-of-the-art stream processing software system, the DFEs offer 1-2 orders of magnitude higher processing throughput and 4 orders of magnitude lower latency. The later part of this thesis focuses on alleviating the memory pressure due to the various steps in single-window stream aggregation. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. In order to bridge this gap, this thesis introduces a specialized memory hierarchy for stream aggregation. It employs Multi-Level Queues (MLQs) spanning across multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported at line-rate performance, outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher throughput. Finally, this thesis aims to alleviate the memory pressure due to the window-aggregation step. Although window-updates can be supported efficiently using MLQs, frequent window-aggregations remain a performance bottleneck. This thesis addresses this problem by introducing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding-windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to fixed- as well as floating- point numbers. Compared to designs using MLQs, StreamZip lossless and lossy designs achieve up to 7.5x and 22x higher throughput, while improving the effective memory capacity by up to 5x and 23x, respectively

Chalmers Research

A versatile nondestructive evaluation imaging workstation

Author: Butler David W.
Chern E. James
Publication venue
Publication date
Field of study

Ultrasonic C-scan and eddy current imaging systems are of the pointwise type evaluation systems that rely on a mechanical scanner to physically maneuver a probe relative to the specimen point by point in order to acquire data and generate images. Since the ultrasonic C-scan and eddy current imaging systems are based on the same mechanical scanning mechanisms, the two systems can be combined using the same PC platform with a common mechanical manipulation subsystem and integrated data acquisition software. Based on this concept, we have developed an IBM PC-based combined ultrasonic C-scan and eddy current imaging system. The system is modularized and provides capacity for future hardware and software expansions. Advantages associated with the combined system are: (1) eliminated duplication of the computer and mechanical hardware, (2) unified data acquisition, processing and storage software, (3) reduced setup time for repetitious ultrasonic and eddy current scans, and (4) improved system efficiency. The concept can be adapted to many engineering systems by integrating related PC-based instruments into one multipurpose workstation such as dispensing, machining, packaging, sorting, and other industrial applications

NASA Technical Reports Server