320 research outputs found
Pipelined Two-Operand Modular Adders
Pipelined two-operand modular adder (TOMA) is one of basic components used in digital signal processing (DSP) systems that use the residue number system (RNS). Such modular adders are used in binary/residue and residue/binary converters, residue multipliers and scalers as well as within residue processing channels. The design of pipelined TOMAs is usually obtained by inserting an appriopriate number of latch layers inside a nonpipelined TOMA structure. Hence their area is also determined by the number of latches and the delay by the number of latch layers. In this paper we propose a new pipelined TOMA that is based on a new TOMA, that has the smaller area and smaller delay than other known structures. Comparisons are made using data from the very large scale of integration (VLSI) standard cell library
Maximizing resource utilization by slicing of superscalar architecture
Superscalar architectural techniques increase instruction throughput from one instruction per cycle to more than one instruction per cycle. Modern processors make use of several processing resources to achieve this kind of throughput. Control units perform various functions to minimize stalls and to ensure a continuous feed of instructions to execution units. It is vital to ensure that instructions ready for execution do not encounter a bottleneck in the execution stage; This thesis work proposes a dynamic scheme to increase efficiency of execution stage by a methodology called block slicing. Implementing this concept in a wide, superscalar pipelined architecture introduces minimal additional hardware and delay in the pipeline. The hardware required for the implementation of the proposed scheme is designed and assessed in terms of cost and delay. Performance measures of speed-up, throughput and efficiency have been evaluated for the resulting pipeline and analyzed
VLSI Architecture for Configurable and Low-Complexity Design of Hard-Decision Viterbi Decoding Algorithm
Convolutional encoding and data decoding are fundamental processes in convolutional error correction. One of the most popular error correction methods in decoding is the Viterbi algorithm. It is extensively implemented in many digital communication applications. Its VLSI design challenges are about area, speed, power, complexity and configurability. In this research, we specifically propose a VLSI architecture for a configurable and low-complexity design of a hard-decision Viterbi decoding algorithm. The configurable and low-complexity design is achieved by designing a generic VLSI architecture, optimizing each processing element (PE) at the logical operation level and designing a conditional adapter. The proposed design can be configured for any predefined number of trace-backs, only by changing the trace-back parameter value. Its computational process only needs N + 2 clock cycles latency, with N is the number of trace-backs. Its configurability function has been proven for N = 8, N = 16, N = 32 and N = 64. Furthermore, the proposed design was synthesized and evaluated in Xilinx and Altera FPGA target boards for area consumption and speed performance
A New RTL Design Approach for a DCT/IDCT-Based Image Compression Architecture using the mCBE Algorithm
In the literature, several approaches of designing a DCT/IDCT-based image compression system have been proposed. In this paper, we present a new RTL design approach with as main focus developing a DCT/IDCT-based image compression architecture using a self-created algorithm. This algorithm can efficiently minimize the amount of shifter -adders to substitute multiplier s. We call this new algorithm the multiplication from Common Binary Expression (mCBE) Algorithm. Besides this algorithm, we propose alternative quantization numbers, which can be implemented simply as shifters in digital hardware. Mostly, these numbers can retain a good compressed-image quality compared to JPEG recommendations. These ideas lead to our design being small in circuit area, multiplierless, and low in complexity. The proposed 8-point 1D-DCT design has only six stages, while the 8-point 1D-IDCT design has only seven stages (one stage being defined as equal to the delay of one shifter or 2-input adder). By using the pipelining method, we can achieve a high-speed architecture with latency as a trade-off consideration. The design has been synthesized and can reach a speed of up to 1.41ns critical path delay (709.22MHz).
Baseband Processing for 5G and Beyond: Algorithms, VLSI Architectures, and Co-design
In recent years the number of connected devices and the demand for high data-rates have been significantly increased. This enormous growth is more pronounced by the introduction of the Internet of things (IoT) in which several devices are interconnected to exchange data for various applications like smart homes and smart cities. Moreover, new applications such as eHealth, autonomous vehicles, and connected ambulances set new demands on the reliability, latency, and data-rate of wireless communication systems, pushing forward technology developments. Massive multiple-input multiple-output (MIMO) is a technology, which is employed in the 5G standard, offering the benefits to fulfill these requirements. In massive MIMO systems, base station (BS) is equipped with a very large number of antennas, serving several users equipments (UEs) simultaneously in the same time and frequency resource. The high spatial multiplexing in massive MIMO systems, improves the data rate, energy and spectral efficiencies as well as the link reliability of wireless communication systems. The link reliability can be further improved by employing channel coding technique. Spatially coupled serially concatenated codes (SC-SCCs) are promising channel coding schemes, which can meet the high-reliability demands of wireless communication systems beyond 5G (B5G). Given the close-to-capacity error correction performance and the potential to implement a high-throughput decoder, this class of code can be a good candidate for wireless systems B5G. In order to achieve the above-mentioned advantages, sophisticated algorithms are required, which impose challenges on the baseband signal processing. In case of massive MIMO systems, the processing is much more computationally intensive and the size of required memory to store channel data is increased significantly compared to conventional MIMO systems, which are due to the large size of the channel state information (CSI) matrix. In addition to the high computational complexity, meeting latency requirements is also crucial. Similarly, the decoding-performance gain of SC-SCCs also do come at the expense of increased implementation complexity. Moreover, selecting the proper choice of design parameters, decoding algorithm, and architecture will be challenging, since spatial coupling provides new degrees of freedom in code design, and therefore the design space becomes huge. The focus of this thesis is to perform co-optimization in different design levels to address the aforementioned challenges/requirements. To this end, we employ system-level characteristics to develop efficient algorithms and architectures for the following functional blocks of digital baseband processing. First, we present a fast Fourier transform (FFT), an inverse FFT (IFFT), and corresponding reordering scheme, which can significantly reduce the latency of orthogonal frequency-division multiplexing (OFDM) demodulation and modulation as well as the size of reordering memory. The corresponding VLSI architectures along with the application specific integrated circuit (ASIC) implementation results in a 28 nm CMOS technology are introduced. In case of a 2048-point FFT/IFFT, the proposed design leads to 42% reduction in the latency and size of reordering memory. Second, we propose a low-complexity massive MIMO detection scheme. The key idea is to exploit channel sparsity to reduce the size of CSI matrix and eventually perform linear detection followed by a non-linear post-processing in angular domain using the compressed CSI matrix. The VLSI architecture for a massive MIMO with 128 BS antennas and 16 UEs along with the synthesis results in a 28 nm technology are presented. As a result, the proposed scheme reduces the complexity and required memory by 35%–73% compared to traditional detectors while it has better detection performance. Finally, we perform a comprehensive design space exploration for the SC-SCCs to investigate the effect of different design parameters on decoding performance, latency, complexity, and hardware cost. Then, we develop different decoding algorithms for the SC-SCCs and discuss the associated decoding performance and complexity. Also, several high-level VLSI architectures along with the corresponding synthesis results in a 12 nm process are presented, and various design tradeoffs are provided for these decoding schemes
A System for Compressive Sensing Signal Reconstruction
An architecture for hardware realization of a system for sparse signal
reconstruction is presented. The threshold based reconstruction method is
considered, which is further modified in this paper to reduce the system
complexity in order to provide easier hardware realization. Instead of using
the partial random Fourier transform matrix, the minimization problem is
reformulated using only the triangular R matrix from the QR decomposition. The
triangular R matrix can be efficiently implemented in hardware without
calculating the orthogonal Q matrix. A flexible and scalable realization of
matrix R is proposed, such that the size of R changes with the number of
available samples and sparsity level.Comment: 6 page
A Low-Complexity Decision Feedforward Equalizer Architecture for High-Speed Receivers on Highly Dispersive Channels
This paper presents an improved decision feedforward equalizer (DFFE) for high speed receivers in the presence of highly dispersive channels. This decision-aided equalizer technique has been recently proposed for multigigabit communication receivers, where the use of parallel processing is mandatory. Well-known parallel architectures for the typical decision feedback equalizer (DFE) have a complexity that grows exponentially with the channel memory. Instead, the new DFFE avoids that exponential increase in complexity by using tentative decisions to cancel iteratively the intersymbol interference (ISI). Here, we demostrate that the DFFE not only allows to obtain a similar performance to the typical DFE but it also reduces the compelxity in channels with large memory. Additionally, we propose a theoretical approximation for the error probability in each iteration. In fact, when the number of iteration increases, the error probability in the DFFE tends to approach the DFE. These benefits make the DFFE an excellent choice for the next generation of high-speed receivers.Fil: Pola, Ariel Luis. Universidad Nacional de Cordoba. Facultad de Cs.exactas Fisicas y Naturales. Departamento de Electronica. Laboratorio de Comunicaciones; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; ArgentinaFil: Cousseau, Juan Edmundo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Bahía Blanca. Instituto de Investigación En Ingeniería Eléctrica; Argentina. Universidad Nacional del Sur; ArgentinaFil: Agazzi, Oscar E.. Irvine Center Drive. ClariPhy Communications; Estados UnidosFil: Hueda, Mario Rafael. Universidad Nacional de Cordoba. Facultad de Cs.exactas Fisicas y Naturales. Departamento de Electronica. Laboratorio de Comunicaciones; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentin
On the Distribution of Control in Asynchronous Processor Architectures
Institute for Computing Systems ArchitectureThe effective performance of computer systems is to a large measure
determined by the synergy between the processor architecture, the
instruction set and the compiler. In the past, the sequencing of
information within processor architectures has normally been
synchronous: controlled centrally by a clock. However, this global
signal could possibly limit the future gains in performance that can
potentially be achieved through improvements in implementation
technology.
This thesis investigates the effects of relaxing this strict synchrony
by distributing control within processor architectures through the use
of a novel asynchronous design model known as a micronet. The impact
of asynchronous control on the performance of a RISC-style processor
is explored at different levels. Firstly, improvements in the
performance of individual instructions by exploiting actual run-time
behaviours are demonstrated. Secondly, it is shown that micronets are
able to exploit further (both spatial and temporal) instructionlevel
parallelism (ILP) efficiently through the distribution of control to
datapath resources. Finally, exposing fine-grain concurrency within a
datapath can only be of benefit to a computer system if it can easily
be exploited by the compiler. Although compilers for micronet-based
asynchronous processors may be considered to be more complex than
their synchronous counterparts, it is shown that the variable
execution time of an instruction does not adversely affect the
compiler's ability to schedule code efficiently. In conclusion, the
modelling of a processor's datapath as a micronet permits the
exploitation of both finegrain ILP and actual run-time delays, thus
leading to the efficient utilisation of functional units and in turn
resulting in an improvement in overall system performance
- …