827 research outputs found

    Design of multimedia processor based on metric computation

    Get PDF
    Media-processing applications, such as signal processing, 2D and 3D graphics rendering, and image compression, are the dominant workloads in many embedded systems today. The real-time constraints of those media applications have taxing demands on today's processor performances with low cost, low power and reduced design delay. To satisfy those challenges, a fast and efficient strategy consists in upgrading a low cost general purpose processor core. This approach is based on the personalization of a general RISC processor core according the target multimedia application requirements. Thus, if the extra cost is justified, the general purpose processor GPP core can be enforced with instruction level coprocessors, coarse grain dedicated hardware, ad hoc memories or new GPP cores. In this way the final design solution is tailored to the application requirements. The proposed approach is based on three main steps: the first one is the analysis of the targeted application using efficient metrics. The second step is the selection of the appropriate architecture template according to the first step results and recommendations. The third step is the architecture generation. This approach is experimented using various image and video algorithms showing its feasibility

    Exploiting parallelism within multidimensional multirate digital signal processing systems

    Get PDF
    The intense requirements for high processing rates of multidimensional Digital Signal Processing systems in practical applications justify the Application Specific Integrated Circuits designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies and architectures in designing high-performance VLSI implementations for general multidimensional multirate Digital Signal Processing systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within the multidimensional multirate DSP algorithms, we develop novel transformations including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. These transformations are applied to the algorithms leading to systematic methodologies in high-performance architectural designs. With the novel design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results have shown that those architectures are much more efficient in terms of execution time and/or hardware cost compared with existing hardware implementations

    Interconnect-aware scheduling and resource allocation for high-level synthesis

    Get PDF
    A high-level architectural synthesis can be described as the process of transforming a behavioral description into a structural description. The scheduling, processor allocation, and register binding are the most important tasks in the high-level synthesis. In the past, it has been possible to focus simply on the delays of the processing units in a high-level synthesis and neglect the wire delays, since the overall delay of a digital system was dominated by the delay of the logic gates. However, with the process technology being scaled down to deep-submicron region, the global interconnect delays can no longer be neglected in VLSI designs. It is, therefore, imperative to include in high-level synthesis the delays on wires and buses used to communicate data between the processing units i.e., inter-processor communication delays. Furthermore, the way the process of register binding is performed also has an impact on the complexity of the interconnect paths required to transfer data between the processing units. Hence, the register binding can no longer ignore its effect on the wiring complexity of resulting designs. The objective of this thesis is to develop techniques for an interconnect-aware high-level synthesis. Under this common theme, this thesis has two distinct focuses. The first focus of this thesis is on developing a new high-level synthesis framework while taking the inter-processor communication delay into consideration. The second focus of this thesis is on the developing of a technique to carry out the register binding and a scheme to reduce the number of registers while taking the complexity of the interconnects into consideration. A novel scheduling and processor allocation technique taking into consideration the inter-processor communication delay is presented. In the proposed technique, the communication delay between a pair of nodes of different types is treated as a non-computing node, whereas that between a pair of nodes of the same type is taken into account by re-adjusting the firing times of the appropriate nodes of the data flow graph (DFG). Another technique for the integration of the placement process into the scheduling and processor allocation in order to determine the actual positions of the processing units in the placement space is developed. The proposed technique makes use of a hybrid library of functional units, which includes both operation-specific and reconfigurable multiple-operation functional units, to maximize the local data transfer. A technique for register binding that results in a reduced number of registers and interconnects is developed by appropriately dividing the lifetime of a token into multiple segments and then binding those having the same source and/or destination into a single register. A node regeneration scheme, in which the idle processing units are utilized to generate multiple copies of the nodes in a given DFG, is devised to reduce the number of registers and interconnects even further. The techniques and schemes developed in this thesis are applied to the synthesis of architectures for a number of benchmark DSP problems and compared with various other commonly used synthesis methods in order to assess their effectiveness. It is shown that the proposed techniques provide superior performance in terms of the iteration period, placement area, and the numbers of the processing units, registers and interconnects in the synthesized architectur

    System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing

    Get PDF
    This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications. Hardware implementation of Low-Density Parity-Check decoders is approached at both the algorithmic and the architecture level. Low-Density Parity-Check codes are a promising coding scheme for future communication standards due to their outstanding error correction performance. This work proposes a methodology for analyzing effects of finite precision arithmetic on error correction performance and hardware complexity. The methodology is throughout employed for co-designing the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cells library. Results demonstrate implementation loss below 0.2 dB down to BER of 10^{-8} and a saving in complexity up to 59% with respect to other works in recent literature. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed requiring down to 20% of memory with respect to the traditional two-phase flooding decoding. Additionally, throughput is doubled and logic complexity reduced of 12%. These advantages are traded-off with error correction performance, thus making the solution attractive only for long codes, as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to those codes not specifically conceived for this technique. Proposed architectures exhibit complexity savings in the order of 40% for both area and power consumption figures, while implementation loss is smaller than 0.05 dB. Most modern communication standards employ Orthogonal Frequency Division Multiplexing as part of their physical layer. The core of OFDM is the Fast Fourier Transform and its inverse in charge of symbols (de)modulation. Requirements on throughput and energy efficiency call for FFT hardware implementation, while ubiquity of FFT suggests the design of parametric, re-configurable and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operands bit-width and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point) using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communications standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementations results are presented for two deep sub-micron standard-cells libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and same system level performance (throughput, transform size and numerical accuracy). The final part of this dissertation focuses on the Network-on-Chip design paradigm whose goal is building scalable communication infrastructures connecting hundreds of core. A low-complexity link architecture for mesochronous on-chip communication is discussed. The link enables skew constraint looseness in the clock tree synthesis, frequency speed-up, power consumption reduction and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz on 65 nm low-leakage CMOS standard-cells library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption. Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction technology independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an Object Oriented framework, integrated within a commercial tool for IP packaging (Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the configurations to verify up to three orders of magnitude

    A VLSI DSP DESIGN AND IMPLEMENTATION OF ALL POLE LATTICE FILTER USING RETIMING METHODOLOGY

    Get PDF
    All pole lattice fil ters are used in a variety of signal processing applications that is speech processing, adaptive filters and various other applications. The implementation of lattice f i l t e r requires more clock period hence low speed. There are various transformation technique pr es ent for design of high-speed or low-area or lowpower implementations. This paper presents design of high-speed (smaller clock period) implementation of 8th order all pole lattice filter using the methodology named as Retiming. Retiming reduces the clock period of the circuit, reducing the number of registers in the circuit, reducing the power consumption of the circuit. Therefore, retiming has been used to reduce the clock period of all pole lattice filters and it increases the speed of the system

    EPICURE: A partitioning and co-design framework for reconfigurable computing

    Get PDF
    This paper presents a new design methodology able to bridge the gap between an abstract specification and a heterogeneous reconfigurable architecture. The EPICURE contribution is the result of a joint study on abstraction/refinement methods and a smart reconfigurable architecture within the formal Esterel design tools suite. The original points of this work are: (i) a generic HW/SW interface model, (ii) a specification methodology that handles the control, and includes efficient verification and HW/SW synthesis capabilities, (iii) a method for parallelism exploration based on abstract resources/performance estimation expressed in terms of area/delay tradeoffs, (iv) a HW/SW partitioning approach that refines the specification into explicit HW configurations and the associated SW control. The EPICURE framework shows how a cooperation of complementary methodologies and CAD tools associated with a relevant architecture can signficantly improve the designer productivity, especially in the context of reconfigurable architectures

    Automatic mapping of graphical programming applications to microelectronic technologies

    Get PDF
    Adaptive computing systems (ACSs) and application-specific integrated circuits (ASICs) can serve as flexible hardware accelerators for applications in domains such as image processing and digital signal processing. However, the mapping of applications onto ACSs and ASICs using the traditional methods can take months for a hardware engineer to develop and debug. In this dissertation, a new approach for automatic mapping of software applications onto ACSs and ASICs has been developed, implemented and validated. This dissertation presents the design flow of the software environment called CHAMPION, which is being developed at the University of Tennessee. This environment permits high-level design entry using the Cantata graphical programming software fromKRI. Using Cantata as the design entry, CHAMPION hides from the user the low-level details of the hardware architecture and the finer issues of application mapping onto the hardware. Validation of the CHAMPION environment was performed using multiple applications of moderate complexity. In one case, theapplication mapping time which required six weeks to perform manually took only six minutes for CHAMPION, yet comparable results were produced. Furthermore, the CHAMPION environment was constructed such that retargeting to a new adaptive computing system could be accomplished in just a few hours as opposed to weeks using manual methods. Thus, CHAMPION permits both ACSs and ASICs to be utilized by a wider audience and application development accomplished in less time
    • …
    corecore