973 research outputs found
Pruned Bit-Reversal Permutations: Mathematical Characterization, Fast Algorithms and Architectures
A mathematical characterization of serially-pruned permutations (SPPs)
employed in variable-length permuters and their associated fast pruning
algorithms and architectures are proposed. Permuters are used in many signal
processing systems for shuffling data and in communication systems as an
adjunct to coding for error correction. Typically only a small set of discrete
permuter lengths are supported. Serial pruning is a simple technique to alter
the length of a permutation to support a wider range of lengths, but results in
a serial processing bottleneck. In this paper, parallelizing SPPs is formulated
in terms of recursively computing sums involving integer floor and related
functions using integer operations, in a fashion analogous to evaluating
Dedekind sums. A mathematical treatment for bit-reversal permutations (BRPs) is
presented, and closed-form expressions for BRP statistics are derived. It is
shown that BRP sequences have weak correlation properties. A new statistic
called permutation inliers that characterizes the pruning gap of pruned
interleavers is proposed. Using this statistic, a recursive algorithm that
computes the minimum inliers count of a pruned BR interleaver (PBRI) in
logarithmic time complexity is presented. This algorithm enables parallelizing
a serial PBRI algorithm by any desired parallelism factor by computing the
pruning gap in lookahead rather than a serial fashion, resulting in significant
reduction in interleaving latency and memory overhead. Extensions to 2-D block
and stream interleavers, as well as applications to pruned fast Fourier
transforms and LTE turbo interleavers, are also presented. Moreover,
hardware-efficient architectures for the proposed algorithms are developed.
Simulation results demonstrate 3 to 4 orders of magnitude improvement in
interleaving time compared to existing approaches.Comment: 31 page
A Hybrid Decomposition Parallel Implementation of the Car-Parrinello Method
We have developed a flexible hybrid decomposition parallel implementation of
the first-principles molecular dynamics algorithm of Car and Parrinello. The
code allows the problem to be decomposed either spatially, over the electronic
orbitals, or any combination of the two. Performance statistics for 32, 64, 128
and 512 Si atom runs on the Touchstone Delta and Intel Paragon parallel
supercomputers and comparison with the performance of an optimized code running
the smaller systems on the Cray Y-MP and C90 are presented.Comment: Accepted by Computer Physics Communications, latex, 34 pages without
figures, 15 figures available in PostScript form via WWW at
http://www-theory.chem.washington.edu/~wiggs/hyb_figures.htm
An Architecture for On board Frequency Domain Analysis of Launch Vehicle Vibration Signals
The dynamic properties of the airborne structures plays a crucial role in the stability of the vehicle during
flight. Modal and spectral behaviour of the structures are simulated and analysed. Ground tests are carried out with environmental conditions close to the flight conditions, with some assumptions. Subsequently, based on the flight telemetered data, the on-board mission algorithm and the auto-pilot filter coefficients are fine tuned. An attempt is made in this paper to design a novel architecture for analysing the modal and spectral random vibration signals on-board the flight vehicle and to identify the dominant frequencies. Based on the analysed results, the mission mode algorithm and the filter coefficients can be fine tuned on-board for better effectiveness in control and providing more stability. Three types of windows viz. Hann, Hamming and Blackman-Harris are configured with a generalised equation using FIR filter structure. The overlapping of the input signal data for better inclusiveness of the real-time data is implemented with BRAM. The domain conversion of the data from time domain to frequency domain is carried out with FFT using Radix-2 BF architecture. The FFT output data are processed for calculating the power spectral density. The dominant frequency is identified using the array search method and Goldschmidt algorithm is utilised for the averaging of the PSDs for better precision. The proposed architecture is synthesised, implemented and tested with both Synthetic and doppler signal of 300 Hz spot frequency padded with Gaussian white noise. The results are highly satisfactory in identifying the spot frequency and generating the PSD array
All Digital, Background Calibration for Time-Interleaved and Successive Approximation Register Analog-to-Digital Converters
The growth of digital systems underscores the need to convert analog information to the digital domain at high speeds and with great accuracy. Analog-to-Digital Converter (ADC) calibration is often a limiting factor, requiring longer calibration times to achieve higher accuracy. The goal of this dissertation is to perform a fully digital background calibration using an arbitrary input signal for A/D converters. The work presented here adapts the cyclic Split-ADC calibration method to the time interleaved (TI) and successive approximation register (SAR) architectures. The TI architecture has three types of linear mismatch errors: offset, gain and aperture time delay. By correcting all three mismatch errors in the digital domain, each converter is capable of operating at the fastest speed allowed by the process technology. The total number of correction parameters required for calibration is dependent on the interleaving ratio, M. To adapt the Split-ADC method to a TI system, 2M+1 half-sized converters are required to estimate 3(2M+1) correction parameters. This thesis presents a 4:1 Split-TI converter that achieves full convergence in less than 400,000 samples. The SAR architecture employs a binary weight capacitor array to convert analog inputs into digital output codes. Mismatch in the capacitor weights results in non-linear distortion error. By adding redundant bits and dividing the array into individual unit capacitors, the Split-SAR method can estimate the mismatch and correct the digital output code. The results from this work show a reduction in the non-linear distortion with the ability to converge in less than 750,000 samples
A serial bus architecture for parallel processing systems.
One of the most serious deterrants to the development of multiple processor
architectures has been the problem of providing adequate communication between the
discrete processing elements. This paper examines two communications-based
constraints.
The first constraint is related to the physical structure of the VLSI chip. The
wider the communication path the more pins are needed to effect the data transfer. As
Integrated Circuits grow in computational power, more communication capacity is
needed, pushing designs closer to the pin limitations of the packaging technology.
The second constraint, somewhat related to the first, is the limited speed with
which data can be transmitted via internal channels. Typical speeds one can achieve
on a single wire are on the order of 1 Gbps. The recent development of an
Optoelectronic Multiplexer may allow VLSI chips to communicate at rates up to 7
Gbps. An architecture for a parallel processing computer which takes advantage of
this new capability is presented. The feasibility of a single-chip parallel-processor
based on the Optoelectronic Multiplexer is examined by projecting current trends in
processor speed, power, and transistor count into estimates of throughput for a
multi-processor IC.http://hdl.handle.net/10945/22094http://archive.org/details/serialbusarchite00delaLieutenant, United States NavyApproved for public release; distribution is unlimited
Castell: a heterogeneous cmp architecture scalable to hundreds of processors
Technology improvements and power constrains have taken multicore architectures to dominate
microprocessor designs over uniprocessors. At the same time, accelerator based architectures
have shown that heterogeneous multicores are very efficient and can provide high throughput for
parallel applications, but with a high-programming effort. We propose Castell a scalable chip
multiprocessor architecture that can be programmed as uniprocessors, and provides the high
throughput of accelerator-based architectures.
Castell relies on task-based programming models that simplify software development. These
models use a runtime system that dynamically finds, schedules, and adds hardware-specific features
to parallel tasks. One of these features is DMA transfers to overlap computation and data
movement, which is known as double buffering. This feature allows applications on Castell
to tolerate large memory latencies and lets us design the memory system focusing on memory
bandwidth.
In addition to provide programmability and the design of the memory system, we have used
a hierarchical NoC and added a synchronization module. The NoC design distributes memory
traffic efficiently to allow the architecture to scale. The synchronization module is a consequence
of the large performance degradation of application for large synchronization latencies.
Castell is mainly an architecture framework that enables the definition of domain-specific
implementations, fine-tuned to a particular problem or application. So far, Castell has been
successfully used to propose heterogeneous multicore architectures for scientific kernels, video
decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW).
It has also been used to explore a number of architecture optimizations such as enhanced DMA
controllers, and architecture support for task-based programming models.
ii
Fast Fourier Transform algorithm design and tradeoffs
The Fast Fourier Transform (FFT) is a mainstay of certain numerical techniques for solving fluid dynamics problems. The Connection Machine CM-2 is the target for an investigation into the design of multidimensional Single Instruction Stream/Multiple Data (SIMD) parallel FFT algorithms for high performance. Critical algorithm design issues are discussed, necessary machine performance measurements are identified and made, and the performance of the developed FFT programs are measured. Fast Fourier Transform programs are compared to the currently best Cray-2 FFT program
- …