
    Pipelined Two-Operand Modular Adders

    The pipelined two-operand modular adder (TOMA) is one of the basic components of digital signal processing (DSP) systems that use the residue number system (RNS). Such modular adders are used in binary/residue and residue/binary converters, in residue multipliers and scalers, and within residue processing channels. Pipelined TOMAs are usually designed by inserting an appropriate number of latch layers into a nonpipelined TOMA structure. Their area is therefore also determined by the number of latches, and their delay by the number of latch layers. In this paper we propose a new pipelined TOMA based on a new TOMA structure that has smaller area and smaller delay than other known structures. Comparisons are made using data from a very-large-scale integration (VLSI) standard cell library.
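As a point of reference, the sketch below models the behavior of a conventional (nonpipelined) two-operand modular adder in Python: one adder forms a + b, a second forms a + b - m, and the sign of the second result drives a multiplexer. This is the textbook structure that latch layers are typically inserted into, not the new TOMA proposed in the paper; the function name `toma` is hypothetical.

```python
def toma(a: int, b: int, m: int) -> int:
    """Functional model of a conventional two-operand modular adder.

    Computes (a + b) mod m for operands already reduced to [0, m).
    In hardware the two sums are produced by parallel adders and a
    multiplexer selects the result; here that is modeled with integers.
    """
    assert 0 <= a < m and 0 <= b < m, "operands must be residues modulo m"
    s1 = a + b       # first adder: plain binary sum
    s2 = a + b - m   # second adder: sum minus the modulus
    # Multiplexer: a non-negative s2 means a + b >= m, so s2 is the residue.
    return s2 if s2 >= 0 else s1


print(toma(5, 4, 7))  # (5 + 4) mod 7
```

Because the operands are already residues, a + b never exceeds 2m - 2, so a single conditional subtraction of m is always sufficient.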

    Maximizing resource utilization by slicing of superscalar architecture

    Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors use several processing resources to achieve this throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to the execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage. This thesis proposes a dynamic scheme, called block slicing, to increase the efficiency of the execution stage. Implementing this concept in a wide superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required to implement the proposed scheme is designed and assessed in terms of cost and delay. Speed-up, throughput, and efficiency are evaluated and analyzed for the resulting pipeline.
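To make the three performance measures concrete, the following Python sketch computes them with the standard textbook formulas for an ideal, stall-free k-stage scalar pipeline processing n instructions with clock period tau. These formulas are an assumption used for illustration; the thesis's own evaluation of the sliced superscalar pipeline may use a different model, and `pipeline_metrics` is a hypothetical name.

```python
def pipeline_metrics(k: int, n: int, tau: float):
    """Ideal k-stage pipeline, n instructions, clock period tau (seconds).

    Returns (speed-up over a nonpipelined unit, efficiency, throughput).
    Assumes no stalls: the first instruction needs k cycles and each
    subsequent instruction completes one cycle later.
    """
    cycles = k + n - 1
    speedup = n * k / cycles          # nonpipelined time n*k*tau over pipelined time
    efficiency = speedup / k          # fraction of the ideal k-fold speed-up achieved
    throughput = n / (cycles * tau)   # instructions completed per second
    return speedup, efficiency, throughput


s, e, t = pipeline_metrics(k=5, n=100, tau=1e-9)
print(f"speed-up={s:.3f} efficiency={e:.3f} throughput={t:.3e} instr/s")
```

As n grows, speed-up approaches k and efficiency approaches 1; stalls in the execution stage, which block slicing aims to reduce, push both below these ideal values.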

    VLSI Architecture for Configurable and Low-Complexity Design of Hard-Decision Viterbi Decoding Algorithm

    Convolutional encoding and decoding are fundamental processes in convolutional error correction. One of the most popular decoding methods is the Viterbi algorithm, which is extensively implemented in digital communication applications. Its VLSI design challenges concern area, speed, power, complexity, and configurability. In this research, we propose a VLSI architecture for a configurable and low-complexity design of a hard-decision Viterbi decoding algorithm. Configurability and low complexity are achieved by designing a generic VLSI architecture, optimizing each processing element (PE) at the logical operation level, and designing a conditional adapter. The proposed design can be configured for any predefined number of trace-backs simply by changing the trace-back parameter value. Its computation requires a latency of only N + 2 clock cycles, where N is the number of trace-backs. Its configurability has been verified for N = 8, N = 16, N = 32, and N = 64. Furthermore, the proposed design was synthesized and evaluated on Xilinx and Altera FPGA target boards for area consumption and speed performance.
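For readers less familiar with hard-decision Viterbi decoding, the Python sketch below shows the add-compare-select and trace-back steps in software. It assumes a rate-1/2, constraint-length-3 code with generators 7 and 5 (octal), a common textbook choice that the abstract does not specify. It also traces back over the whole zero-terminated block rather than over a fixed depth of N trace-backs as the proposed hardware does. All names are illustrative.

```python
G1, G2 = 0b111, 0b101      # assumed generator polynomials (octal 7 and 5)
K = 3                      # assumed constraint length
N_STATES = 1 << (K - 1)    # 4 trellis states


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder with K-1 zero tail bits."""
    state, out = 0, []
    for u in list(bits) + [0] * (K - 1):
        reg = (u << (K - 1)) | state          # [u_t, u_{t-1}, u_{t-2}]
        out += [parity(reg & G1), parity(reg & G2)]
        state = reg >> 1                      # shift in the new bit
    return out


def viterbi_decode(rx):
    """Hard-decision Viterbi decoder using Hamming branch metrics."""
    INF = float("inf")
    metric = [0] + [INF] * (N_STATES - 1)     # encoder starts in state 0
    history = []                              # survivors: (prev_state, bit)
    for t in range(0, len(rx), 2):
        r1, r2 = rx[t], rx[t + 1]
        new_metric = [INF] * N_STATES
        survivors = [None] * N_STATES
        for s in range(N_STATES):
            if metric[s] == INF:
                continue
            for u in (0, 1):
                reg = (u << (K - 1)) | s
                ns = reg >> 1
                # Add: Hamming distance between received and expected pair.
                bm = (r1 ^ parity(reg & G1)) + (r2 ^ parity(reg & G2))
                # Compare-select: keep the smaller accumulated metric.
                if metric[s] + bm < new_metric[ns]:
                    new_metric[ns] = metric[s] + bm
                    survivors[ns] = (s, u)
        metric = new_metric
        history.append(survivors)
    # Trace back from state 0, which the zero tail guarantees.
    state, decoded = 0, []
    for survivors in reversed(history):
        prev, u = survivors[state]
        decoded.append(u)
        state = prev
    decoded.reverse()
    return decoded[: len(decoded) - (K - 1)]  # drop tail bits


msg = [1, 0, 1, 1, 0, 0, 1, 0]
rx = encode(msg)
rx[3] ^= 1                  # inject a single channel bit error
print(viterbi_decode(rx))   # recovers msg
```

A hardware decoder instead keeps only the last N survivor columns and starts trace-back from the best state, which bounds memory and latency at the cost of a small performance loss for small N.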

    A New RTL Design Approach for a DCT/IDCT-Based Image Compression Architecture using the mCBE Algorithm

    In the literature, several approaches to designing a DCT/IDCT-based image compression system have been proposed. In this paper, we present a new RTL design approach whose main focus is developing a DCT/IDCT-based image compression architecture using a self-created algorithm. This algorithm efficiently minimizes the number of shifter-adders used to substitute multipliers. We call this new algorithm the multiplication from Common Binary Expression (mCBE) Algorithm. Besides this algorithm, we propose alternative quantization numbers, which can be implemented simply as shifters in digital hardware. In most cases, these numbers retain good compressed-image quality compared to the JPEG recommendations. These ideas lead to a design that is small in circuit area, multiplierless, and low in complexity. The proposed 8-point 1D-DCT design has only six stages, while the 8-point 1D-IDCT design has only seven stages (one stage being defined as the delay of one shifter or one 2-input adder). By using pipelining, we achieve a high-speed architecture, with latency as the trade-off. The synthesized design reaches a critical path delay of 1.41 ns (709.22 MHz).
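The general idea of replacing multipliers with shifters and adders, and of sharing a common binary subexpression, can be illustrated in Python. The sketch assumes the DCT constant cos(π/4) is approximated as 181/256 in fixed point (an illustrative choice, not taken from the paper). It compares a naive shift-add expansion with a version that reuses the shared term 5x = x + (x << 2). This is generic common-subexpression sharing, not the mCBE algorithm itself, and the function names are hypothetical.

```python
def shift_add_multiply(x: int, c: int) -> int:
    """Multiply x by a non-negative constant c using only shifts and adds:
    one shifted copy of x per '1' bit of c (popcount(c) - 1 adders)."""
    acc, shift = 0, 0
    while c:
        if c & 1:
            acc += x << shift
        c >>= 1
        shift += 1
    return acc


def mul181_shared(x: int) -> int:
    """x * 181 with a shared subexpression.

    181 = 0b10110101 = 160 + 5 + 16, and 160 = 5 << 5, so the term
    t = 5x is computed once and reused: 3 adders instead of 4.
    """
    t = x + (x << 2)                 # shared term: 5x (1 adder)
    return (t << 5) + t + (x << 4)   # 160x + 5x + 16x (2 adders)


def scale_cos_pi_4(x: int) -> int:
    """Fixed-point x * cos(pi/4), assuming the 8-bit approximation 181/256.
    The final division by 256 is a right shift (floor division)."""
    return mul181_shared(x) >> 8


print(scale_cos_pi_4(1000))  # approximately 1000 * 0.7071
```

The same reasoning motivates the quantization numbers mentioned above: if a quantization step is a power of two, dividing by it reduces to a right shift, so no divider is needed.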

    Baseband Processing for 5G and Beyond: Algorithms, VLSI Architectures, and Co-design

    In recent years, the number of connected devices and the demand for high data rates have increased significantly. This growth has been further accelerated by the introduction of the Internet of Things (IoT), in which many devices are interconnected to exchange data for applications such as smart homes and smart cities. Moreover, new applications such as eHealth, autonomous vehicles, and connected ambulances place new demands on the reliability, latency, and data rate of wireless communication systems, pushing technology development forward. Massive multiple-input multiple-output (MIMO), a technology employed in the 5G standard, offers a way to fulfill these requirements. In massive MIMO systems, the base station (BS) is equipped with a very large number of antennas and serves several user equipments (UEs) simultaneously in the same time and frequency resource. The high spatial multiplexing in massive MIMO systems improves the data rate, the energy and spectral efficiencies, and the link reliability of wireless communication systems. Link reliability can be further improved by employing channel coding. Spatially coupled serially concatenated codes (SC-SCCs) are promising channel coding schemes that can meet the high-reliability demands of wireless communication systems beyond 5G (B5G). Given their close-to-capacity error correction performance and the potential for a high-throughput decoder implementation, this class of codes is a good candidate for B5G wireless systems. Achieving these advantages requires sophisticated algorithms, which pose challenges for baseband signal processing. In massive MIMO systems, the processing is much more computationally intensive than in conventional MIMO systems, and the memory required to store channel data is significantly larger, owing to the large size of the channel state information (CSI) matrix.
In addition to the high computational complexity, meeting latency requirements is also crucial. Similarly, the decoding-performance gain of SC-SCCs comes at the expense of increased implementation complexity. Moreover, choosing the design parameters, decoding algorithm, and architecture is challenging, since spatial coupling provides new degrees of freedom in code design and the design space therefore becomes huge. The focus of this thesis is to perform co-optimization across different design levels to address these challenges and requirements. To this end, we exploit system-level characteristics to develop efficient algorithms and architectures for the following functional blocks of digital baseband processing. First, we present a fast Fourier transform (FFT), an inverse FFT (IFFT), and a corresponding reordering scheme, which significantly reduce the latency of orthogonal frequency-division multiplexing (OFDM) demodulation and modulation as well as the size of the reordering memory. The corresponding VLSI architectures are introduced, along with application-specific integrated circuit (ASIC) implementation results in a 28 nm CMOS technology. For a 2048-point FFT/IFFT, the proposed design reduces both the latency and the size of the reordering memory by 42%. Second, we propose a low-complexity massive MIMO detection scheme. The key idea is to exploit channel sparsity to reduce the size of the CSI matrix and then perform linear detection followed by non-linear post-processing in the angular domain using the compressed CSI matrix. The VLSI architecture for a massive MIMO system with 128 BS antennas and 16 UEs is presented, along with synthesis results in a 28 nm technology. The proposed scheme reduces complexity and required memory by 35%–73% compared to traditional detectors while achieving better detection performance.
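Why an FFT needs a reordering step at all can be seen in a conventional radix-2 decimation-in-time FFT. Its butterflies consume samples in bit-reversed index order. In a streaming hardware FFT this permutation typically requires a buffer memory, which is the kind of reordering memory and latency the thesis targets. The Python sketch below shows only the conventional approach (bit-reversal followed by in-place butterflies) for power-of-two lengths; it is not the proposed reordering scheme, and the function names are illustrative.

```python
import cmath


def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index i (e.g. 0b001 -> 0b100)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT for len(x) a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "length must be a power of two"
    # Reordering step: load samples in bit-reversed order.
    a = [x[bit_reverse(i, bits)] for i in range(n)]
    size = 2
    while size <= n:
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle increment
        half = size // 2
        for start in range(0, n, size):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v          # butterfly upper output
                a[start + k + half] = u - v   # butterfly lower output
                w *= w_step
        size *= 2
    return a  # outputs in natural frequency order


print([bit_reverse(i, 3) for i in range(8)])  # input order consumed by stage 1
```

Hardware pipelined FFTs often move this permutation to the output instead (decimation in frequency), but either way a full-length reordering buffer is the conventional cost for a 2048-point transform.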
Finally, we perform a comprehensive design space exploration for SC-SCCs to investigate the effect of different design parameters on decoding performance, latency, complexity, and hardware cost. We then develop different decoding algorithms for SC-SCCs and discuss the associated decoding performance and complexity. Several high-level VLSI architectures are also presented, along with the corresponding synthesis results in a 12 nm process, and various design trade-offs are discussed for these decoding schemes.

    A System for Compressive Sensing Signal Reconstruction

    An architecture for the hardware realization of a sparse signal reconstruction system is presented. The threshold-based reconstruction method is considered and further modified in this paper to reduce system complexity and make hardware realization easier. Instead of using the partial random Fourier transform matrix, the minimization problem is reformulated using only the triangular R matrix from the QR decomposition. The triangular R matrix can be implemented efficiently in hardware without calculating the orthogonal Q matrix. A flexible and scalable realization of the matrix R is proposed, such that the size of R changes with the number of available samples and the sparsity level. Comment: 6 pages
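One way to see why Q is not needed: if A = QR with A of full column rank, then AᵀA = RᵀR. A least-squares problem min ‖Ax − y‖ can therefore be solved from R alone with two triangular solves, RᵀRx = Aᵀy (the so-called semi-normal equations). The Python sketch below assumes a real-valued, full-column-rank A standing in for the measurement matrix restricted to a detected support, and it obtains R by Cholesky factorization of AᵀA. This is a standard technique shown for illustration, not necessarily the reformulation used in the paper, and the function names are hypothetical.

```python
import math


def cholesky_upper(m):
    """Upper-triangular R with m = R^T R, for symmetric positive definite m.
    For m = A^T A this R equals the QR factor of A (positive diagonal)."""
    n = len(m)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r[i][i] = math.sqrt(m[i][i] - sum(r[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            r[i][j] = (m[i][j] - sum(r[k][i] * r[k][j] for k in range(i))) / r[i][i]
    return r


def least_squares_r_only(a, y):
    """Solve min ||a x - y|| using only the triangular factor R (no Q)."""
    n = len(a[0])
    ata = [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]
    aty = [sum(row[i] * yi for row, yi in zip(a, y)) for i in range(n)]
    r = cholesky_upper(ata)
    # Forward substitution: R^T z = A^T y (R^T is lower triangular).
    z = [0.0] * n
    for i in range(n):
        z[i] = (aty[i] - sum(r[k][i] * z[k] for k in range(i))) / r[i][i]
    # Back substitution: R x = z.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(r[i][k] * x[k] for k in range(i + 1, n))) / r[i][i]
    return x


A = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [2, 1, 0, -1]                  # generated by x = [2, -1]
print(least_squares_r_only(A, y))  # approximately [2.0, -1.0]
```

Semi-normal equations are less robust than methods that use Q when A is ill-conditioned, which is worth checking for the matrix sizes and sparsity levels a flexible hardware realization must support.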

    A Low-Complexity Decision Feedforward Equalizer Architecture for High-Speed Receivers on Highly Dispersive Channels

    This paper presents an improved decision feedforward equalizer (DFFE) for high-speed receivers operating over highly dispersive channels. This decision-aided equalization technique has recently been proposed for multigigabit communication receivers, where parallel processing is mandatory. Well-known parallel architectures for the typical decision feedback equalizer (DFE) have a complexity that grows exponentially with the channel memory. In contrast, the new DFFE avoids this exponential increase in complexity by using tentative decisions to cancel the intersymbol interference (ISI) iteratively. Here, we demonstrate that the DFFE not only achieves performance similar to that of the typical DFE but also reduces complexity in channels with long memory. Additionally, we propose a theoretical approximation for the error probability in each iteration. In fact, as the number of iterations increases, the error probability of the DFFE approaches that of the DFE. These benefits make the DFFE an excellent choice for the next generation of high-speed receivers.
    Fil: Pola, Ariel Luis. Universidad Nacional de Cordoba. Facultad de Cs.exactas Fisicas y Naturales. Departamento de Electronica. Laboratorio de Comunicaciones; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Fil: Cousseau, Juan Edmundo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Bahía Blanca. Instituto de Investigación En Ingeniería Eléctrica; Argentina. Universidad Nacional del Sur; Argentina
    Fil: Agazzi, Oscar E.. Irvine Center Drive. ClariPhy Communications; Estados Unidos
    Fil: Hueda, Mario Rafael. Universidad Nacional de Cordoba. Facultad de Cs.exactas Fisicas y Naturales. Departamento de Electronica. Laboratorio de Comunicaciones; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
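The difference between decision feedback and decision feedforward cancellation can be shown with a toy model. A DFE subtracts ISI using its own previous final decisions, which creates a serial recursion. A DFFE-style pass instead subtracts ISI using the previous pass's tentative decisions, so every symbol in a pass can be computed in parallel. The Python sketch below assumes binary ±1 symbols, a known 3-tap channel h = [1.0, 0.6, 0.5], and no noise, all chosen for illustration. It is a simplified iterative canceller, not the paper's DFFE architecture.

```python
def slicer(v: float) -> int:
    """Binary decision device for +/-1 symbols."""
    return 1 if v >= 0 else -1


def channel(s, h):
    """Noiseless ISI channel: y[n] = sum_k h[k] * s[n-k], with s[n] = 0 for n < 0."""
    return [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(s))]


def dfe(y, h):
    """DFE: each decision depends on earlier decisions (serial recursion)."""
    d = []
    for n, yn in enumerate(y):
        isi = sum(h[k] * d[n - k] for k in range(1, len(h)) if n - k >= 0)
        d.append(slicer(yn - isi))
    return d


def dffe(y, h, iterations):
    """Iterative feedforward cancellation using tentative decisions.

    Each pass uses only the previous pass's decisions, so all symbols in
    a pass are independent and could be processed in parallel hardware.
    """
    d = [slicer(v) for v in y]          # initial tentative decisions, no cancellation
    for _ in range(iterations):
        prev = d
        d = [slicer(y[n] - sum(h[k] * prev[n - k]
                               for k in range(1, len(h)) if n - k >= 0))
             for n in range(len(y))]
    return d


h = [1.0, 0.6, 0.5]
s = [1, -1, -1, 1, 1, 1, -1, 1]
y = channel(s, h)
print(dfe(y, h) == s, dffe(y, h, 0) == s, dffe(y, h, len(y)) == s)
```

In this noiseless toy case, the initial tentative decisions contain an error (the ISI at n = 3 outweighs the symbol). Each extra pass fixes at least one more leading symbol, so with enough iterations the DFFE output coincides with the DFE output, mirroring the convergence behavior described in the abstract.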

    On the Distribution of Control in Asynchronous Processor Architectures

    Institute for Computing Systems Architecture
    The effective performance of computer systems is to a large extent determined by the synergy between the processor architecture, the instruction set, and the compiler. In the past, the sequencing of information within processor architectures has normally been synchronous: controlled centrally by a clock. However, this global signal may limit the performance gains that future improvements in implementation technology could otherwise deliver. This thesis investigates the effects of relaxing this strict synchrony by distributing control within processor architectures through a novel asynchronous design model known as a micronet. The impact of asynchronous control on the performance of a RISC-style processor is explored at different levels. First, improvements in the performance of individual instructions are demonstrated by exploiting actual run-time behaviours. Second, it is shown that micronets can efficiently exploit further (both spatial and temporal) instruction-level parallelism (ILP) by distributing control to datapath resources. Finally, exposing fine-grain concurrency within a datapath benefits a computer system only if the compiler can exploit it easily. Although compilers for micronet-based asynchronous processors may be considered more complex than their synchronous counterparts, it is shown that the variable execution time of an instruction does not adversely affect the compiler's ability to schedule code efficiently. In conclusion, modelling a processor's datapath as a micronet permits the exploitation of both fine-grain ILP and actual run-time delays, leading to efficient utilisation of functional units and, in turn, an improvement in overall system performance.
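The benefit of exploiting actual run-time delays instead of a worst-case clock can be illustrated with a toy timing model in Python. Under a global clock, every stage takes one period sized for the slowest stage. Under self-timed control, a stage starts an item as soon as its input is ready and it has finished the previous item. The delay values, the unbounded buffering between stages, and the absence of handshake overhead are all hypothetical simplifications; this is not the micronet model from the thesis.

```python
def sync_time(delays):
    """Global clock: period = worst-case stage delay; every stage takes one period.
    delays[i][j] is the (hypothetical) delay of item i in stage j."""
    n_items, n_stages = len(delays), len(delays[0])
    period = max(max(row) for row in delays)
    return (n_stages + n_items - 1) * period


def async_time(delays):
    """Self-timed pipeline: stage j starts item i once stage j-1 has finished
    item i and stage j has finished item i-1 (unbounded buffering, no
    handshake overhead assumed)."""
    n_items, n_stages = len(delays), len(delays[0])
    finish = [[0.0] * n_stages for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_stages):
            input_ready = finish[i][j - 1] if j > 0 else 0.0
            stage_free = finish[i - 1][j] if i > 0 else 0.0
            finish[i][j] = max(input_ready, stage_free) + delays[i][j]
    return finish[-1][-1]


# Three instructions through three stages, with illustrative delays.
d = [[1, 2, 1],
     [1, 1, 1],
     [2, 1, 1]]
print(sync_time(d), async_time(d))
```

In this model the self-timed completion time never exceeds the clocked one, because each finish time is bounded by (i + j + 1) times the worst-case delay. Real asynchronous designs pay for handshaking and completion detection, which this sketch ignores.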