    Pruned Bit-Reversal Permutations: Mathematical Characterization, Fast Algorithms and Architectures

    A mathematical characterization of serially-pruned permutations (SPPs) employed in variable-length permuters and their associated fast pruning algorithms and architectures are proposed. Permuters are used in many signal processing systems for shuffling data and in communication systems as an adjunct to coding for error correction. Typically only a small set of discrete permuter lengths is supported. Serial pruning is a simple technique to alter the length of a permutation to support a wider range of lengths, but results in a serial processing bottleneck. In this paper, parallelizing SPPs is formulated in terms of recursively computing sums involving integer floor and related functions using integer operations, in a fashion analogous to evaluating Dedekind sums. A mathematical treatment for bit-reversal permutations (BRPs) is presented, and closed-form expressions for BRP statistics are derived. It is shown that BRP sequences have weak correlation properties. A new statistic called permutation inliers that characterizes the pruning gap of pruned interleavers is proposed. Using this statistic, a recursive algorithm that computes the minimum inliers count of a pruned BR interleaver (PBRI) in logarithmic time complexity is presented. This algorithm enables parallelizing a serial PBRI algorithm by any desired parallelism factor by computing the pruning gap in a lookahead rather than a serial fashion, resulting in a significant reduction in interleaving latency and memory overhead. Extensions to 2-D block and stream interleavers, as well as applications to pruned fast Fourier transforms and LTE turbo interleavers, are also presented. Moreover, hardware-efficient architectures for the proposed algorithms are developed. Simulation results demonstrate 3 to 4 orders of magnitude improvement in interleaving time compared to existing approaches. Comment: 31 pages
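
    The serial pruning that the abstract above identifies as a bottleneck can be sketched as follows. This is only an illustration of the serial baseline, not the paper's lookahead algorithm; the function names `bit_reverse` and `pruned_bit_reversal` are hypothetical, and the mother permutation is assumed to be the smallest power-of-two bit reversal that covers the requested length.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pruned_bit_reversal(length):
    """Serially pruned bit-reversal permutation of 0..length-1.

    Walks the full power-of-two bit-reversal sequence in order and drops
    every address that falls outside the requested length. Each output
    position depends on how many addresses were dropped before it, which
    is why this naive form is inherently serial.
    """
    if length < 1:
        raise ValueError("length must be positive")
    width = max(1, (length - 1).bit_length())  # assumed mother size 2**width
    kept = []
    for index in range(1 << width):
        address = bit_reverse(index, width)
        if address < length:  # prune out-of-range addresses
            kept.append(address)
    return kept


print(pruned_bit_reversal(5))  # [0, 4, 2, 1, 3]
```

    For a length of 5 the 3-bit sequence 0, 4, 2, 6, 1, 5, 3, 7 is scanned and 6, 5 and 7 are discarded, so three of eight steps produce no output.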

    A Hybrid Decomposition Parallel Implementation of the Car-Parrinello Method

    We have developed a flexible hybrid decomposition parallel implementation of the first-principles molecular dynamics algorithm of Car and Parrinello. The code allows the problem to be decomposed either spatially, over the electronic orbitals, or any combination of the two. Performance statistics for 32, 64, 128 and 512 Si atom runs on the Touchstone Delta and Intel Paragon parallel supercomputers and comparison with the performance of an optimized code running the smaller systems on the Cray Y-MP and C90 are presented. Comment: Accepted by Computer Physics Communications, LaTeX, 34 pages without figures, 15 figures available in PostScript form via WWW at http://www-theory.chem.washington.edu/~wiggs/hyb_figures.htm

    An Architecture for On board Frequency Domain Analysis of Launch Vehicle Vibration Signals

    The dynamic properties of airborne structures play a crucial role in the stability of the vehicle during flight. The modal and spectral behaviour of the structures is simulated and analysed. Ground tests are carried out with environmental conditions close to the flight conditions, with some assumptions. Subsequently, based on the flight telemetered data, the on-board mission algorithm and the auto-pilot filter coefficients are fine-tuned. An attempt is made in this paper to design a novel architecture for analysing the modal and spectral random vibration signals on-board the flight vehicle and to identify the dominant frequencies. Based on the analysed results, the mission mode algorithm and the filter coefficients can be fine-tuned on-board for more effective control and greater stability. Three types of windows, viz. Hann, Hamming and Blackman-Harris, are configured with a generalised equation using an FIR filter structure. The overlapping of the input signal data for better inclusiveness of the real-time data is implemented with BRAM. The conversion of the data from the time domain to the frequency domain is carried out with an FFT using a Radix-2 BF architecture. The FFT output data are processed to calculate the power spectral density (PSD). The dominant frequency is identified using the array search method, and the Goldschmidt algorithm is utilised for the averaging of the PSDs for better precision. The proposed architecture is synthesised, implemented and tested with both synthetic and Doppler signals of 300 Hz spot frequency padded with Gaussian white noise. The results are highly satisfactory in identifying the spot frequency and generating the PSD array.
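
    The processing chain named in the abstract above (window, radix-2 FFT, PSD, search for the dominant bin) can be sketched in software as below. This is a floating-point illustration under stated assumptions, not the paper's hardware architecture: the generalised window is taken to be the standard cosine-sum form with the usual textbook coefficients, the 4800 Hz sample rate and 256-point length are invented for the example, and overlap, PSD averaging and noise are omitted.

```python
import cmath
import math

# Cosine-sum coefficients a_k; the window is sum_k (-1)^k a_k cos(2*pi*k*n/(N-1)).
WINDOW_COEFFS = {
    "hann": (0.5, 0.5),
    "hamming": (0.54, 0.46),
    "blackman-harris": (0.35875, 0.48829, 0.14128, 0.01168),
}


def window(name, n_points):
    """Evaluate one generalised cosine-sum window of n_points samples."""
    coeffs = WINDOW_COEFFS[name]
    return [
        sum((-1) ** k * a * math.cos(2 * math.pi * k * n / (n_points - 1))
            for k, a in enumerate(coeffs))
        for n in range(n_points)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]  # twiddle factor times odd half
        out[k] = even[k] + t                           # butterfly, upper output
        out[k + n // 2] = even[k] - t                  # butterfly, lower output
    return out


def dominant_frequency(samples, sample_rate, window_name="hann"):
    """Return the frequency (Hz) of the largest PSD bin, ignoring DC."""
    n = len(samples)
    w = window(window_name, n)
    spectrum = fft_radix2([s * c for s, c in zip(samples, w)])
    scale = sample_rate * sum(c * c for c in w)  # periodogram normalisation
    psd = [abs(spectrum[k]) ** 2 / scale for k in range(n // 2)]
    peak = max(range(1, n // 2), key=lambda k: psd[k])  # linear array search
    return peak * sample_rate / n


SAMPLE_RATE = 4800.0  # assumed for illustration
N_POINTS = 256        # assumed for illustration
tone = [math.sin(2 * math.pi * 300.0 * n / SAMPLE_RATE) for n in range(N_POINTS)]
print(dominant_frequency(tone, SAMPLE_RATE, "blackman-harris"))
```

    With these assumed parameters the 300 Hz tone falls exactly on bin 16, so the search returns 300.0 for all three windows; a tone between bins would be reported at the nearest bin centre.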

    All Digital, Background Calibration for Time-Interleaved and Successive Approximation Register Analog-to-Digital Converters

    The growth of digital systems underscores the need to convert analog information to the digital domain at high speeds and with great accuracy. Analog-to-Digital Converter (ADC) calibration is often a limiting factor, requiring longer calibration times to achieve higher accuracy. The goal of this dissertation is to perform a fully digital background calibration using an arbitrary input signal for A/D converters. The work presented here adapts the cyclic Split-ADC calibration method to the time-interleaved (TI) and successive approximation register (SAR) architectures. The TI architecture has three types of linear mismatch errors: offset, gain and aperture time delay. By correcting all three mismatch errors in the digital domain, each converter is capable of operating at the fastest speed allowed by the process technology. The total number of correction parameters required for calibration is dependent on the interleaving ratio, M. To adapt the Split-ADC method to a TI system, 2M+1 half-sized converters are required to estimate 3(2M+1) correction parameters. This thesis presents a 4:1 Split-TI converter that achieves full convergence in less than 400,000 samples. The SAR architecture employs a binary-weighted capacitor array to convert analog inputs into digital output codes. Mismatch in the capacitor weights results in non-linear distortion error. By adding redundant bits and dividing the array into individual unit capacitors, the Split-SAR method can estimate the mismatch and correct the digital output code. The results from this work show a reduction in the non-linear distortion with the ability to converge in less than 750,000 samples.
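
    To make the digital-domain correction of offset and gain mismatch concrete, a minimal sketch follows. It assumes the per-channel gains and offsets are already known, whereas the dissertation estimates them in the background with the Split-ADC method, which is not reproduced here; aperture delay correction is also left out. All names and the mismatch values are hypothetical.

```python
import math


def interleaved_capture(ideal, gains, offsets):
    """Model an M-way time-interleaved ADC: sample n is taken by channel n % M."""
    m = len(gains)
    return [gains[n % m] * x + offsets[n % m] for n, x in enumerate(ideal)]


def correct_mismatch(raw, gains, offsets):
    """Undo per-channel offset and gain errors in the digital domain."""
    m = len(gains)
    return [(y - offsets[n % m]) / gains[n % m] for n, y in enumerate(raw)]


# Hypothetical mismatch values for a 4:1 interleaved converter.
GAINS = (1.00, 1.02, 0.97, 1.01)
OFFSETS = (0.000, 0.010, -0.020, 0.005)

ideal = [math.sin(2 * math.pi * n / 37) for n in range(64)]
raw = interleaved_capture(ideal, GAINS, OFFSETS)
restored = correct_mismatch(raw, GAINS, OFFSETS)

print(max(abs(a - b) for a, b in zip(ideal, raw)))       # error before correction
print(max(abs(a - b) for a, b in zip(ideal, restored)))  # error after correction
```

    Because the error pattern repeats every M samples, uncorrected mismatch shows up as spurious tones tied to the interleaving rate, which is why each channel needs its own correction parameters.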

    A serial bus architecture for parallel processing systems.

    One of the most serious deterrents to the development of multiple processor architectures has been the problem of providing adequate communication between the discrete processing elements. This paper examines two communications-based constraints. The first constraint is related to the physical structure of the VLSI chip. The wider the communication path, the more pins are needed to effect the data transfer. As integrated circuits grow in computational power, more communication capacity is needed, pushing designs closer to the pin limitations of the packaging technology. The second constraint, somewhat related to the first, is the limited speed with which data can be transmitted via internal channels. Typical speeds one can achieve on a single wire are on the order of 1 Gbps. The recent development of an Optoelectronic Multiplexer may allow VLSI chips to communicate at rates up to 7 Gbps. An architecture for a parallel processing computer which takes advantage of this new capability is presented. The feasibility of a single-chip parallel processor based on the Optoelectronic Multiplexer is examined by projecting current trends in processor speed, power, and transistor count into estimates of throughput for a multi-processor IC. http://hdl.handle.net/10945/22094 http://archive.org/details/serialbusarchite00dela Lieutenant, United States Navy. Approved for public release; distribution is unlimited.

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but at the cost of high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is the use of DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to providing programmability and designing the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation that applications suffer under large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and ClustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers and architecture support for task-based programming models.

    Fast Fourier Transform algorithm design and tradeoffs

    The Fast Fourier Transform (FFT) is a mainstay of certain numerical techniques for solving fluid dynamics problems. The Connection Machine CM-2 is the target for an investigation into the design of multidimensional Single Instruction Stream/Multiple Data (SIMD) parallel FFT algorithms for high performance. Critical algorithm design issues are discussed, necessary machine performance measurements are identified and made, and the performance of the developed FFT programs is measured. The Fast Fourier Transform programs are compared to the currently best Cray-2 FFT program.