25 research outputs found

    Modeling and Energy Optimization of LDPC Decoder Circuits with Timing Violations

    Full text link
    This paper proposes a "quasi-synchronous" design approach for signal processing circuits, in which timing violations are permitted, but without the need for a hardware compensation mechanism. The case of a low-density parity-check (LDPC) decoder is studied, and a method for accurately modeling the effect of timing violations at a high level of abstraction is presented. The error-correction performance of code ensembles is then evaluated using density evolution while taking into account the effect of timing faults. Following this, several quasi-synchronous LDPC decoder circuits based on the offset min-sum algorithm are optimized, providing a 23%-40% reduction in energy consumption or energy-delay product, while achieving the same performance and occupying the same area as conventional synchronous circuits.Comment: To appear in IEEE Transactions on Communication

    Algorithm Development and VLSI Implementation of Energy Efficient Decoders of Polar Codes

    Get PDF
    With its low error-floor performance, polar codes attract significant attention as the potential standard error correction code (ECC) for future communication and data storage. However, the VLSI implementation complexity of polar codes decoders is largely influenced by its nature of in-series decoding. This dissertation is dedicated to presenting optimal decoder architectures for polar codes. This dissertation addresses several structural properties of polar codes and key properties of decoding algorithms that are not dealt with in the prior researches. The underlying concept of the proposed architectures is a paradigm that simplifies and schedules the computations such that hardware is simplified, latency is minimized and bandwidth is maximized. In pursuit of the above, throughput centric successive cancellation (TCSC) and overlapping path list successive cancellation (OPLSC) VLSI architectures and express journey BP (XJBP) decoders for the polar codes are presented. An arbitrary polar code can be decomposed by a set of shorter polar codes with special characteristics, those shorter polar codes are referred to as constituent polar codes. By exploiting the homogeneousness between decoding processes of different constituent polar codes, TCSC reduces the decoding latency of the SC decoder by 60% for codes with length n = 1024. The error correction performance of SC decoding is inferior to that of list successive cancellation decoding. The LSC decoding algorithm delivers the most reliable decoding results; however, it consumes most hardware resources and decoding cycles. Instead of using multiple instances of decoding cores in the LSC decoders, a single SC decoder is used in the OPLSC architecture. The computations of each path in the LSC are arranged to occupy the decoder hardware stages serially in a streamlined fashion. This yields a significant reduction of hardware complexity. The OPLSC decoder has achieved about 1.4 times hardware efficiency improvement compared with traditional LSC decoders. The hardware efficient VLSI architectures for TCSC and OPLSC polar codes decoders are also introduced. Decoders based on SC or LSC algorithms suffer from high latency and limited throughput due to their serial decoding natures. An alternative approach to decode the polar codes is belief propagation (BP) based algorithm. In BP algorithm, a graph is set up to guide the beliefs propagated and refined, which is usually referred to as factor graph. BP decoding algorithm allows decoding in parallel to achieve much higher throughput. XJBP decoder facilitates belief propagation by utilizing the specific constituent codes that exist in the conventional factor graph, which results in an express journey (XJ) decoder. Compared with the conventional BP decoding algorithm for polar codes, the proposed decoder reduces the computational complexity by about 40.6%. This enables an energy-efficient hardware implementation. To further explore the hardware consumption of the proposed XJBP decoder, the computations scheduling is modeled and analyzed in this dissertation. With discussions on different hardware scenarios, the optimal scheduling plans are developed. A novel memory-distributed micro-architecture of the XJBP decoder is proposed and analyzed to solve the potential memory access problems of the proposed scheduling strategy. The register-transfer level (RTL) models of the XJBP decoder are set up for comparisons with other state-of-the-art BP decoders. The results show that the power efficiency of BP decoders is improved by about 3 times

    Algorithm Development and VLSI Implementation of Energy Efficient Decoders of Polar Codes

    Get PDF
    With its low error-floor performance, polar codes attract significant attention as the potential standard error correction code (ECC) for future communication and data storage. However, the VLSI implementation complexity of polar codes decoders is largely influenced by its nature of in-series decoding. This dissertation is dedicated to presenting optimal decoder architectures for polar codes. This dissertation addresses several structural properties of polar codes and key properties of decoding algorithms that are not dealt with in the prior researches. The underlying concept of the proposed architectures is a paradigm that simplifies and schedules the computations such that hardware is simplified, latency is minimized and bandwidth is maximized. In pursuit of the above, throughput centric successive cancellation (TCSC) and overlapping path list successive cancellation (OPLSC) VLSI architectures and express journey BP (XJBP) decoders for the polar codes are presented. An arbitrary polar code can be decomposed by a set of shorter polar codes with special characteristics, those shorter polar codes are referred to as constituent polar codes. By exploiting the homogeneousness between decoding processes of different constituent polar codes, TCSC reduces the decoding latency of the SC decoder by 60% for codes with length n = 1024. The error correction performance of SC decoding is inferior to that of list successive cancellation decoding. The LSC decoding algorithm delivers the most reliable decoding results; however, it consumes most hardware resources and decoding cycles. Instead of using multiple instances of decoding cores in the LSC decoders, a single SC decoder is used in the OPLSC architecture. The computations of each path in the LSC are arranged to occupy the decoder hardware stages serially in a streamlined fashion. This yields a significant reduction of hardware complexity. The OPLSC decoder has achieved about 1.4 times hardware efficiency improvement compared with traditional LSC decoders. The hardware efficient VLSI architectures for TCSC and OPLSC polar codes decoders are also introduced. Decoders based on SC or LSC algorithms suffer from high latency and limited throughput due to their serial decoding natures. An alternative approach to decode the polar codes is belief propagation (BP) based algorithm. In BP algorithm, a graph is set up to guide the beliefs propagated and refined, which is usually referred to as factor graph. BP decoding algorithm allows decoding in parallel to achieve much higher throughput. XJBP decoder facilitates belief propagation by utilizing the specific constituent codes that exist in the conventional factor graph, which results in an express journey (XJ) decoder. Compared with the conventional BP decoding algorithm for polar codes, the proposed decoder reduces the computational complexity by about 40.6%. This enables an energy-efficient hardware implementation. To further explore the hardware consumption of the proposed XJBP decoder, the computations scheduling is modeled and analyzed in this dissertation. With discussions on different hardware scenarios, the optimal scheduling plans are developed. A novel memory-distributed micro-architecture of the XJBP decoder is proposed and analyzed to solve the potential memory access problems of the proposed scheduling strategy. The register-transfer level (RTL) models of the XJBP decoder are set up for comparisons with other state-of-the-art BP decoders. The results show that the power efficiency of BP decoders is improved by about 3 times

    Energy-Efficient Computing for Mobile Signal Processing

    Full text link
    Mobile devices have rapidly proliferated, and deployment of handheld devices continues to increase at a spectacular rate. As today's devices not only support advanced signal processing of wireless communication data but also provide rich sets of applications, contemporary mobile computing requires both demanding computation and efficiency. Most mobile processors combine general-purpose processors, digital signal processors, and hardwired application-specific integrated circuits to satisfy their high-performance and low-power requirements. However, such a heterogeneous platform is inefficient in area, power and programmability. Improving the efficiency of programmable mobile systems is a critical challenge and an active area of computer systems research. SIMD (single instruction multiple data) architectures are very effective for data-level-parallelism intense algorithms in mobile signal processing. However, new characteristics of advanced wireless/multimedia algorithms require architectural re-evaluation to achieve better energy efficiency. Therefore, fourth generation wireless protocol and high definition mobile video algorithms are analyzed to enhance a wide-SIMD architecture. The key enhancements include 1) programmable crossbar to support complex data alignment, 2) SIMD partitioning to support fine-grain SIMD computation, and 3) fused operation to support accelerating frequently used instruction pairs. Near-threshold computation has been attractive in low-power architecture research because it balances performance and power. To further improve energy efficiency in mobile computing, near-threshold computation is applied to a wide SIMD architecture. This proposed near-threshold wide SIMD architecture-Diet SODA-presents interesting architectural design decisions such as 1) very wide SIMD datapath to compensate for degraded performance induced by near-threshold computation and 2) scatter-gather data prefetcher to exploit large latency gap between memory and the SIMD datapath. Although near-threshold computation provides excellent energy efficiency, it suffers from increased delay variations. A systematic study of delay variations in near-threshold computing is performed and simple techniques-structural duplication and voltage/frequency margining-are explored to tolerate and mitigate the delay variations in near-threshold wide SIMD architectures. This dissertation analyzes representative wireless/multimedia mobile signal processing algorithms, proposes an energy-efficient programmable platform, and evaluates performance and power. A main theme of this dissertation is that the performance and efficiency of programmable embedded systems can be significantly improved with a combination of parallel SIMD and near-threshold computations.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/86356/1/swseo_1.pd

    Combined Time, Frecuency and Space Diversity in Multimedia Mobile Broadcasting Systems

    Full text link
    El uso combinado de diversidad en el dominio temporal, frecuencial y espacial constituye una valiosa herramienta para mejorar la recepción de servicios de difusión móviles. Gracias a la mejora conseguida por las técnicas de diversidad es posible extender la cobertura de los servicios móviles además de reducir la infraestructura de red. La presente tesis investiga el uso de técnicas de diversidad para la provisión de servicios móviles en la familia europea de sistemas de difusión terrestres estandarizada por el prpoyecto DVB (Digital Video Broadcasting). Esto incluye la primera y segunda generación de sistemas DVB-T (Terrestrial), DVB-NGH (Handheld), y DVB-T2 (Terrestrial 2nd generation), así como el sistema de siguiente generación DVB-NGH. No obstante, el estudio llevado a cabo en la tesis es genérico y puede aplicarse a futuras evoluciones de estándares como el japonés ISDB-T o el americano ATSC. Las investigaciones realizadas dentro del contexto de DVB-T, DVB-H y DVBT2 tienen como objetivo la transmisión simultánea de servicios fijos y móviles en redes terrestres. Esta Convergencia puede facilitar la introducción de servicios móviles de TB debido a la reutilización de espectro, contenido e infraestructura. De acuerdo a los resultados, la incorporación de entrelazado temporal en la capa física para diversidad temporal, y de single-input multiple-output (SIMO) para diversidad espacial, son esenciales para el rendimiento de sistemas móviles de difusión. A pesar de que las técnicas upper later FEC (UL-FEC) pueden propocionar diversidad temporal en sistemas de primera generación como DVB-T y DVB-H, requieren la transmisión de paridad adicional y no son útiles para la recepción estática. El análisis en t�ñerminos de link budjget revela que las técnicas de diversidad noson suficientes para facilitar la provision de servicios móviles en redes DVB-T y DVB-T2 planificadas para recepción fija. Sin embargo, el uso de diversidad en redes planificadas para recepción portableGozálvez Serrano, D. (2012). Combined Time, Frecuency and Space Diversity in Multimedia Mobile Broadcasting Systems [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16273Palanci

    Rapid Industrial Prototyping and SoC Design of 3G/4G Wireless Systems Using an HLS Methodology

    Get PDF
    Many very-high-complexity signal processing algorithms are required in future wireless systems, giving tremendous challenges to real-time implementations. In this paper, we present our industrial rapid prototyping experiences on 3G/4G wireless systems using advanced signal processing algorithms in MIMO-CDMA and MIMO-OFDM systems. Core system design issues are studied and advanced receiver algorithms suitable for implementation are proposed for synchronization, MIMO equalization, and detection. We then present VLSI-oriented complexity reduction schemes and demonstrate how to interact these high-complexity algorithms with an HLS-based methodology for extensive design space exploration. This is achieved by abstracting the main effort from hardware iterations to the algorithmic C/C++ fixed-point design. We also analyze the advantages and limitations of the methodology. Our industrial design experience demonstrates that it is possible to enable an extensive architectural analysis in a short-time frame using HLS methodology, which significantly shortens the time to market for wireless systems.National Science Foundatio

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Multicarrier Faster-than-Nyquist Signaling Transceivers: From Theory to Practice

    Get PDF
    The demand for spectrum resources in cellular systems worldwide has seen a tremendous escalation in the recent past. The mobile phones of today are capable of being cameras taking pictures and videos, able to browse the Internet, do video calling and much more than an yesteryear computer. Due to the variety and the amount of information that is being transmitted the demand for spectrum resources is continuously increasing. Efficient use of bandwidth resources has hence become a key parameter in the design and realization of wireless communication systems. Faster-than-Nyquist (FTN) signaling is one such technique that achieves bandwidth efficiency by making better use of the available spectrum resources at the expense of higher processing complexity in the transceiver. This thesis addresses the challenges and design trade offs arising during the hardware realization of Faster-than-Nyquist signaling transceivers. The FTN system has been evaluated for its achievable performance compared to the processing overhead in the transmitter and the receiver. Coexistence with OFDM systems, a more popular multicarrier scheme in existing and upcoming wireless standards, has been considered by designing FTN specific processing blocks as add-ons to the conventional transceiver chain. A multicarrier system capable of operating under both orthogonal and FTN signaling has been developed. The performance of the receiver was evaluated for AWGN and fading channels. The FTN system was able to achieve 2x improvement in bandwidth usage with similar performance as that of an OFDM system. The extra processing in the receiver was in terms of an iterative decoder for the decoding of FTN modulated signals. An efficient hardware architecture for the iterative decoder reusing the FTN specific processing blocks and realize different functionality has been designed. An ASIC implementation of this decoder was implemented in a 65nm CMOS technology and the implemented chip has been successfully verified for its functionality

    Hardware realization of discrete wavelet transform cauchy Reed Solomon minimal instruction set computer architecture for wireless visual sensor networks

    Get PDF
    Large amount of image data transmitting across the Wireless Visual Sensor Networks (WVSNs) increases the data transmission rate thus increases the power transmission. This would inevitably decreases the operating lifespan of the sensor nodes and affecting the overall operation of WVSNs. Limiting power consumption to prolong battery lifespan is one of the most important goals in WVSNs. To achieve this goal, this thesis presents a novel low complexity Discrete Wavelet Transform (DWT) Cauchy Reed Solomon (CRS) Minimal Instruction Set Computer (MISC) architecture that performs data compression and data encoding (encryption) in a single architecture. There are four different programme instructions were developed to programme the MISC processor, which are Subtract and Branch if Negative (SBN), Galois Field Multiplier (GF MULT), XOR and 11TO8 instructions. With the use of these programme instructions, the developed DWT CRS MISC were programmed to perform DWT image compression to reduce the image size and then encode the DWT coefficients with CRS code to ensure data security and reliability. Both compression and CRS encoding were performed by a single architecture rather than in two separate modules which require a lot of hardware resources (logic slices). By reducing the number of logic slices, the power consumption can be subsequently reduced. Results show that the proposed new DWT CRS MISC architecture implementation requires 142 Slices (Xilinx Virtex-II), 129 slices (Xilinx Spartan-3E), 144 Slices (Xilinx Spartan-3L) and 66 Slices (Xilinx Spartan-6). The developed DWT CRS MISC architecture has lower hardware complexity as compared to other existing systems, such as Crypto-Processor in Xilinx Spartan-6 (4828 Slices), Low-Density Parity-Check in Xilinx Virtex-II (870 slices) and ECBC in Xilinx Spartan-3E (1691 Slices). With the use of RC10 development board, the developed DWT CRS MISC architecture can be implemented onto the Xilinx Spartan-3L FPGA to simulate an actual visual sensor node. This is to verify the feasibility of developing a joint compression, encryption and error correction processing framework in WVSNs

    Hardware realization of discrete wavelet transform cauchy Reed Solomon minimal instruction set computer architecture for wireless visual sensor networks

    Get PDF
    Large amount of image data transmitting across the Wireless Visual Sensor Networks (WVSNs) increases the data transmission rate thus increases the power transmission. This would inevitably decreases the operating lifespan of the sensor nodes and affecting the overall operation of WVSNs. Limiting power consumption to prolong battery lifespan is one of the most important goals in WVSNs. To achieve this goal, this thesis presents a novel low complexity Discrete Wavelet Transform (DWT) Cauchy Reed Solomon (CRS) Minimal Instruction Set Computer (MISC) architecture that performs data compression and data encoding (encryption) in a single architecture. There are four different programme instructions were developed to programme the MISC processor, which are Subtract and Branch if Negative (SBN), Galois Field Multiplier (GF MULT), XOR and 11TO8 instructions. With the use of these programme instructions, the developed DWT CRS MISC were programmed to perform DWT image compression to reduce the image size and then encode the DWT coefficients with CRS code to ensure data security and reliability. Both compression and CRS encoding were performed by a single architecture rather than in two separate modules which require a lot of hardware resources (logic slices). By reducing the number of logic slices, the power consumption can be subsequently reduced. Results show that the proposed new DWT CRS MISC architecture implementation requires 142 Slices (Xilinx Virtex-II), 129 slices (Xilinx Spartan-3E), 144 Slices (Xilinx Spartan-3L) and 66 Slices (Xilinx Spartan-6). The developed DWT CRS MISC architecture has lower hardware complexity as compared to other existing systems, such as Crypto-Processor in Xilinx Spartan-6 (4828 Slices), Low-Density Parity-Check in Xilinx Virtex-II (870 slices) and ECBC in Xilinx Spartan-3E (1691 Slices). With the use of RC10 development board, the developed DWT CRS MISC architecture can be implemented onto the Xilinx Spartan-3L FPGA to simulate an actual visual sensor node. This is to verify the feasibility of developing a joint compression, encryption and error correction processing framework in WVSNs
    corecore