1,552 research outputs found

    Scalable Energy-Recovery Architectures.

    Full text link
    Energy efficiency is a critical challenge for today's integrated circuits, especially for high-end digital signal processing and communications that require both high throughput and low energy dissipation for extended battery life. Charge-recovery logic recovers and reuses charge using inductive elements and has the potential to achieve order-of-magnitude improvement in energy efficiency while maintaining high performance. However, the lack of large-scale high-speed silicon demonstrations and inductor area overheads are two major concerns. This dissertation focuses on scalable charge-recovery designs. We present a semi-automated design flow to enable the design of large-scale charge-recovery chips. We also present a new architecture that uses in-package inductors, eliminating the area overheads caused by the use of integrated inductors in high-performance charge-recovery chips. To demonstrate our semi-automated flow, which uses custom-designed standard-cell-like dynamic cells, we have designed a 576-bit charge-recovery low-density parity-check (LDPC) decoder chip. Functioning correctly at clock speeds above 1 GHz, this prototype is the first-ever demonstration of a GHz-speed charge-recovery chip of significant complexity. In terms of energy consumption, this chip improves over recent state-of-the-art LDPCs by at least 1.3 times with comparable or better area efficiency. To demonstrate our architecture for eliminating inductor overheads, we have designed a charge-recovery LDPC decoder chip with in-package inductors. This test-chip has been fabricated in a 65nm CMOS flip-chip process. A custom 6-layer FC-BGA package substrate has been designed with 16 inductors embedded in the fifth layer of the package substrate, yielding higher Q and significantly improving area efficiency and energy efficiency compared to their on-chip counterparts. From measurements, this chip achieves at least 2.3 times lower energy consumption with better area efficiency over state-of-the-art published designs.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/116653/1/terryou_1.pd

    System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing

    Get PDF
    This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications. Hardware implementation of Low-Density Parity-Check decoders is approached at both the algorithmic and the architecture level. Low-Density Parity-Check codes are a promising coding scheme for future communication standards due to their outstanding error correction performance. This work proposes a methodology for analyzing effects of finite precision arithmetic on error correction performance and hardware complexity. The methodology is throughout employed for co-designing the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cells library. Results demonstrate implementation loss below 0.2 dB down to BER of 10^{-8} and a saving in complexity up to 59% with respect to other works in recent literature. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed requiring down to 20% of memory with respect to the traditional two-phase flooding decoding. Additionally, throughput is doubled and logic complexity reduced of 12%. These advantages are traded-off with error correction performance, thus making the solution attractive only for long codes, as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to those codes not specifically conceived for this technique. Proposed architectures exhibit complexity savings in the order of 40% for both area and power consumption figures, while implementation loss is smaller than 0.05 dB. Most modern communication standards employ Orthogonal Frequency Division Multiplexing as part of their physical layer. The core of OFDM is the Fast Fourier Transform and its inverse in charge of symbols (de)modulation. Requirements on throughput and energy efficiency call for FFT hardware implementation, while ubiquity of FFT suggests the design of parametric, re-configurable and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operands bit-width and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point) using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communications standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementations results are presented for two deep sub-micron standard-cells libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and same system level performance (throughput, transform size and numerical accuracy). The final part of this dissertation focuses on the Network-on-Chip design paradigm whose goal is building scalable communication infrastructures connecting hundreds of core. A low-complexity link architecture for mesochronous on-chip communication is discussed. The link enables skew constraint looseness in the clock tree synthesis, frequency speed-up, power consumption reduction and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz on 65 nm low-leakage CMOS standard-cells library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption. Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction technology independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an Object Oriented framework, integrated within a commercial tool for IP packaging (Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the configurations to verify up to three orders of magnitude

    Baseband analog front-end and digital back-end for reconfigurable multi-standard terminals

    Get PDF
    Multimedia applications are driving wireless network operators to add high-speed data services such as Edge (E-GPRS), WCDMA (UMTS) and WLAN (IEEE 802.11a,b,g) to the existing GSM network. This creates the need for multi-mode cellular handsets that support a wide range of communication standards, each with a different RF frequency, signal bandwidth, modulation scheme etc. This in turn generates several design challenges for the analog and digital building blocks of the physical layer. In addition to the above-mentioned protocols, mobile devices often include Bluetooth, GPS, FM-radio and TV services that can work concurrently with data and voice communication. Multi-mode, multi-band, and multi-standard mobile terminals must satisfy all these different requirements. Sharing and/or switching transceiver building blocks in these handsets is mandatory in order to extend battery life and/or reduce cost. Only adaptive circuits that are able to reconfigure themselves within the handover time can meet the design requirements of a single receiver or transmitter covering all the different standards while ensuring seamless inter-interoperability. This paper presents analog and digital base-band circuits that are able to support GSM (with Edge), WCDMA (UMTS), WLAN and Bluetooth using reconfigurable building blocks. The blocks can trade off power consumption for performance on the fly, depending on the standard to be supported and the required QoS (Quality of Service) leve

    Novel Front-end Electronics for Time Projection Chamber Detectors

    Full text link
    Este trabajo ha sido realizado en la OrganizaciĆ³n Europea para la InvestigaciĆ³n Nuclear (CERN) y forma parte del proyecto de investigaciĆ³n Europeo para futuros aceleradores lineales (EUDET). En fĆ­sica de partĆ­culas existen diferentes categorĆ­as de detectores de partĆ­culas. El diseƱo presentado esta centrado en un tipo particular de detector de trayectoria de partĆ­culas denominado TPC (Time Projection Chamber) que proporciona una imagen en tres dimensiones de las partĆ­culas elĆ©ctricamente cargadas que atraviesan su volumen gaseoso. La tesis incluye un estudio de los objetivos para futuros detectores, resumiendo los parĆ”metros que un sistema de adquisiciĆ³n de datos debe cumplir en esos casos. AdemĆ”s, estos requisitos son comparados con los actuales sistemas de lectura utilizados en diferentes detectores TPC. Se concluye que ninguno de los sistemas cumple las restrictivas condiciones. Algunos de los principales objetivos para futuros detectores TPC son un altĆ­simo nivel de integraciĆ³n, incremento del nĆŗmero de canales, electrĆ³nica mĆ”s rĆ”pida y muy baja potencia. El principal inconveniente del estado del arte de los sistemas anteriores es la utilizaciĆ³n de varios circuitos integrados en la cadena de adquisiciĆ³n. Este hecho hace imposible alcanzar el altĆ­simo nivel de integraciĆ³n requerido para futuros detectores. AdemĆ”s, un aumento del nĆŗmero de canales y frecuencia de muestreo harĆ­a incrementar hasta valores no permitidos la potencia utilizada. Y en consecuencia, incrementar la refrigeraciĆ³n necesaria (en caso de ser posible). Una de las novedades presentadas es la integraciĆ³n de toda la cadena de adquisiciĆ³n (filtros analĆ³gicos de entrada, conversor analĆ³gico-digital (ADC) y procesado de seƱal digital) en un Ćŗnico circuito integrado en tecnologĆ­a de 130nm. Este chip es el primero que realiza esta altĆ­sima integraciĆ³n para detectores TPC. Por otro lado, se presenta un anĆ”lisis detallado de los filtros de procesado de seƱal. Los objetivos mĆ”s importantes es la reducciĆ³GarcĆ­a GarcĆ­a, EJ. (2012). Novel Front-end Electronics for Time Projection Chamber Detectors [Tesis doctoral no publicada]. Universitat PolitĆØcnica de ValĆØncia. https://doi.org/10.4995/Thesis/10251/16980Palanci

    FPGAs in Industrial Control Applications

    Get PDF
    The aim of this paper is to review the state-of-the-art of Field Programmable Gate Array (FPGA) technologies and their contribution to industrial control applications. Authors start by addressing various research fields which can exploit the advantages of FPGAs. The features of these devices are then presented, followed by their corresponding design tools. To illustrate the benefits of using FPGAs in the case of complex control applications, a sensorless motor controller has been treated. This controller is based on the Extended Kalman Filter. Its development has been made according to a dedicated design methodology, which is also discussed. The use of FPGAs to implement artificial intelligence-based industrial controllers is then briefly reviewed. The final section presents two short case studies of Neural Network control systems designs targeting FPGAs

    Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

    Full text link
    DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini generates a wide design-space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks. * https://github.com/ucb-bar/gemminiComment: To appear at the 58th IEEE/ACM Design Automation Conference (DAC), December 2021, San Francisco, CA, US

    High Peformance and Low Power On-Die Interconnect Fabrics.

    Full text link
    Increasing power density with technology scaling has caused stagnation in operating frequency of modern day microprocessors. This has led designers to prefer multicore architectures over complex monolithic processors to keep up with the demand for rising computing throughput. Although processing units are getting smaller and simpler, the dramatic rise of their count on a single die has made the fabric that connects these processing units increasingly complex. These interconnect fabrics have become a bottleneck in improving overall system effciency. As a result, the design paradigm for multi-core chips is gradually shifting from a core-centric architecture towards an interconnect-centric architecture, where system efficiency is limited by the fabric rather than the processing ability of any individual core. This dissertation introduces three novel and synergistic circuit techniques to improve scalability of switch fabrics to make on-die integration of hundreds to thousands of cores feasible. 1) A matrix topology is proposed for designing a fully connected switch fabric that re-uses output buses for programming, and stores shue congurations at cross points. This significantly reduces routing congestion, lowers area/power, and improves per- formance. Silicon measurements demonstrate 47% energy savings in a 64-lane SIMD processor fabricated in 65nm CMOS over a conventional implementation. 2) A novel approach to handle high radix arbitration along with data routing is proposed. It optimally uses existing cross-bar interconnect resources without requiring any additional overhead. Bandwidth exceeding 2Tb/s is recorded in a test prototype fabricated in 65nm. 3) Building on the later, a new circuit topology to manage and update priority adaptively within the switch fabric without incurring additional delay or area is then proposed. Several assist circuit techniques, such as a thyristor based sense amplifier and self regenerating bi-directional repeaters are proposed for high speed energy efficient signaling to and from the switch fabric to improve overall routing efficiency. Using these techniques a 64 x 64 switch fabric with 128b data bus fabricated in 45nm achieves a throughput of 4.5Tb/s at single cycle latency while operating at 559MHz.Ph.D.Electrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/91506/1/sudhirks_1.pd
    • ā€¦
    corecore