11 research outputs found

    An efficient incremental algorithm for min-area retiming

    As one of the most effective sequential optimization techniques, retiming is a structural transformation that relocates flip-flops in a circuit without changing its functionality. The min-area retiming problem seeks a solution with the minimum flip-flop area (or number) under a given clock period. Although they have polynomial runtime, the best existing algorithms for this problem still need to first construct a dense path graph and then find a min-cost network flow on it, and thus incur huge storage and time costs for large circuits. Recently, provable incremental algorithms have been discovered for min-period retiming, and heuristic incremental algorithms have been proposed for min-area retiming. However, given the complexity of the problem, min-area retiming has so far resisted an efficient provable incremental algorithm. In this paper, we fill the gap by presenting an efficient algorithm that solves the min-area retiming problem incrementally and optimally. Contrary to existing approaches, no dense path graph is constructed; only the active timing constraints are dynamically generated in the algorithm. Experimental results show that, over all benchmarks, our algorithm is at least 60× faster in total runtime than the best existing approach.
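    For context, a minimal sketch of the classical path-graph-based formulation that such approaches build on is shown below. The four-gate example circuit, its delays, the target period, and the use of SciPy's LP solver are illustrative assumptions, not the paper's algorithm; the sketch deliberately enumerates the dense W/D timing constraints whose construction the paper's incremental method is designed to avoid.

```python
# Sketch of the classical Leiserson-Saxe min-area retiming LP on a
# hypothetical four-gate circuit (delays, wiring, and period made up).
from itertools import product

from scipy.optimize import linprog

delay = {"h": 0, "a": 2, "b": 2, "c": 2}   # "h" models the host/environment
edges = [("h", "a", 0), ("a", "b", 1), ("a", "c", 1),
         ("b", "h", 0), ("c", "h", 0)]     # (u, v, registers on wire u->v)
period = 4                                 # target clock period

nodes = sorted(delay)
idx = {v: i for i, v in enumerate(nodes)}
INF = float("inf")

# W(u,v): minimum register count over all u->v paths.
# D(u,v): maximum path delay among those register-minimal paths.
# Both come from Floyd-Warshall on the lexicographic weight (w(e), -delay(u)).
dist = {(u, v): (INF, INF) for u, v in product(nodes, repeat=2)}
for u, v, w in edges:
    dist[u, v] = min(dist[u, v], (w, -delay[u]))
for k, u, v in product(nodes, repeat=3):
    if dist[u, k][0] < INF and dist[k, v][0] < INF:
        cand = (dist[u, k][0] + dist[k, v][0], dist[u, k][1] + dist[k, v][1])
        dist[u, v] = min(dist[u, v], cand)

# LP over retiming labels r(v): minimize the change in total register count,
# sum over edges of (r(v) - r(u)), subject to legality and period constraints.
cost = [0.0] * len(nodes)
A_ub, b_ub = [], []

def add_le(u, v, rhs):                     # encode r(u) - r(v) <= rhs
    row = [0.0] * len(nodes)
    row[idx[u]] += 1.0
    row[idx[v]] -= 1.0
    A_ub.append(row)
    b_ub.append(float(rhs))

for u, v, w in edges:
    cost[idx[v]] += 1.0                    # objective coefficient = indeg - outdeg
    cost[idx[u]] -= 1.0
    add_le(u, v, w)                        # legality: w(e) + r(v) - r(u) >= 0
for u, v in product(nodes, repeat=2):
    W, y = dist[u, v]
    if W < INF and delay[v] - y > period:
        add_le(u, v, W - 1)                # force a register onto every too-slow path

bounds = [(0, 0) if v == "h" else (None, None) for v in nodes]   # pin the host
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs-ds")
r = {v: int(round(res.x[idx[v]])) for v in nodes}  # TU constraints => integral vertex
print("retiming labels:", r)
print("registers per wire:", {(u, v): w + r[v] - r[u] for u, v, w in edges})
```

    Even on this toy example, every node pair can contribute a period constraint; that quadratic blow-up is exactly what dynamically generating only the active constraints sidesteps.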

    Fast algorithms for retiming large digital circuits

    The increasing complexity of VLSI systems and shrinking time-to-market requirements demand good optimization tools capable of handling large circuits. Retiming is a powerful transformation that preserves functionality and can be used to optimize sequential circuits for a wide range of objective functions by judiciously relocating the memory elements. Leiserson and Saxe, who introduced the concept, presented algorithms for period optimization (minperiod retiming) and area optimization (minarea retiming). The ASTRA algorithm proposed an alternative view of retiming using the equivalence between retiming and clock skew optimization. The first part of this thesis defines the relationship between the Leiserson-Saxe and the ASTRA approaches and utilizes it for efficient minarea retiming of large circuits. The new algorithm, Minaret, uses the same linear program formulation as the Leiserson-Saxe approach. The underlying philosophy of the ASTRA approach is incorporated to reduce the number of variables and constraints in this linear program. This allows minarea retiming of circuits with over 56,000 gates in under fifteen minutes. The movement of flip-flops in control logic changes the state encoding of finite state machines, requiring the preservation of initial (reset) states. In the next part of this work, the problem of minimizing the number of flip-flops in control logic, subject to a specified clock period and with the guarantee of an equivalent initial state, is formulated as a mixed integer linear program. Bounds on the retiming variables are used to guarantee an equivalent initial state in the retimed circuit. These bounds lead to a simple method for calculating an equivalent initial state for the retimed circuit. The transparent nature of level-sensitive latches enables level-clocked circuits to operate faster and require less area. However, this transparency makes the operation of level-clocked circuits very complex, and optimization of level-clocked circuits is a difficult task. This thesis also presents efficient algorithms for retiming large level-clocked circuits. The relationship between retiming and clock skew optimization for level-clocked circuits is defined and utilized to develop efficient retiming algorithms for period and area optimization. Using these algorithms, a circuit with 56,000 gates could be retimed for minimum period in under twenty seconds and for minimum area in under 1.5 hours.
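    As an aside, a self-contained sketch of the feasibility test that min-period retiming reduces to is given below; it is illustrative only (not Minaret), and the node names and constraint list are made up. Both the Leiserson-Saxe and ASTRA views ultimately rest on difference constraints of the form r(u) - r(v) <= b, which Bellman-Ford either satisfies or proves infeasible; a binary search over the clock period on top of such a test then yields the minimum period.

```python
# Illustrative difference-constraint solver (not Minaret): retiming legality
# and period constraints all have the shape r(u) - r(v) <= b.

def solve_difference_constraints(nodes, constraints):
    """constraints: iterable of (u, v, b) meaning r(u) - r(v) <= b.
    Returns a satisfying assignment {node: r}, or None if infeasible."""
    # Shortest paths from a virtual source with 0-weight edges to every node;
    # the constraint r(u) - r(v) <= b relaxes like an edge v -> u of weight b.
    dist = {v: 0 for v in nodes}
    for _ in range(len(nodes) + 1):
        changed = False
        for u, v, b in constraints:
            if dist[v] + b < dist[u]:
                dist[u] = dist[v] + b
                changed = True
        if not changed:
            return dist
    return None        # still relaxing after |V|+1 passes: negative cycle

# Made-up example: legality constraints r(u) - r(v) <= w(e) for each wire,
# plus one period constraint of the form r(u) - r(v) <= W(u,v) - 1.
nodes = ["h", "a", "b"]
constraints = [("h", "a", 0), ("a", "b", 1), ("b", "h", 0),  # legality
               ("b", "a", 0)]                                # period constraint
print(solve_difference_constraints(nodes, constraints))
```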

    Rewired retiming for flip-flop reduction and low power without delay penalty.

    Jiang, Mingqi. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves [49]-51); abstract also in Chinese. Contents: Abstract; Acknowledgement; Chapter 1: Introduction; Chapter 2: Rewiring Background (2.1 REWIRE, 2.2 GBAW); Chapter 3: Retiming (3.1 Min-Clock Period Retiming, 3.2 Min-Area Retiming, 3.3 Retiming for Low Power, 3.4 Retiming with Interconnect Delay); Chapter 4: Rewired Retiming for Flip-flop Reduction (4.1 Motivation and Problem Formulation, 4.2 Retiming Indication, 4.3 Target Wire Selection, 4.4 Incremental Placement Update, 4.5 Optimization Flow, 4.6 Experimental Results); Chapter 5: Power Analysis for Rewired Retiming (5.1 Power Model, 5.2 Experimental Results); Chapter 6: Conclusion; Bibliography.

    Exploiting parallelism within multidimensional multirate digital signal processing systems

    The intense requirements for high processing rates of multidimensional Digital Signal Processing (DSP) systems in practical applications justify Application-Specific Integrated Circuit (ASIC) designs and parallel processing implementations. In this dissertation, we propose novel theories, methodologies and architectures for designing high-performance VLSI implementations of general multidimensional multirate DSP systems by exploiting the parallelism within those applications. To systematically exploit the parallelism within multidimensional multirate DSP algorithms, we develop novel transformations, including (1) nonlinear I/O data space transforms, (2) intercalation transforms, and (3) multidimensional multirate unfolding transforms. Applying these transformations to the algorithms leads to systematic methodologies for high-performance architectural design. With these design methodologies, we develop several architectures with parallel and distributed processing features for implementing multidimensional multirate applications. Experimental results show that those architectures are much more efficient in terms of execution time and/or hardware cost than existing hardware implementations.
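    The dissertation's own transforms are not reproduced here; as a generic illustration of the kind of parallelism multirate structures expose, the sketch below (with made-up filter taps and input data) rewrites a decimate-by-M FIR filter as M independent polyphase branches that each run at the low rate and can be mapped to parallel hardware.

```python
# Generic polyphase-decimation sketch (illustrative data, not the
# dissertation's transforms): M short sub-filters at the low rate
# replace one long filter at the high rate.
import numpy as np

M = 3                                   # decimation factor (illustrative)
rng = np.random.default_rng(0)
x = rng.standard_normal(60)             # input samples
h = rng.standard_normal(10)             # FIR taps

# Reference: filter at the input rate, then keep every M-th output.
y_ref = np.convolve(h, x)[::M]

# Polyphase form: split h and x into M phases; each branch is an
# independent convolution at 1/M of the rate and can run in parallel.
y = np.zeros_like(y_ref)
for p in range(M):
    e_p = h[p::M]                                        # phase-p taps
    x_p = x[::M] if p == 0 else np.concatenate(([0.0], x[M - p::M]))
    branch = np.convolve(e_p, x_p)[:len(y_ref)]
    y[:len(branch)] += branch

assert np.allclose(y, y_ref)            # same output, M-way parallel work
```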

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools that address the challenges of designing DNNs that are efficient in time, resources and power consumption. We first present Aegis and SPGC, which address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling; empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm; common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x on Polybench/C, and POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges of heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
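    SPGC's permutation algorithm is not reproduced here; the short PyTorch sketch below, with made-up layer sizes, only illustrates the structural idea stated in the abstract: a group convolution keeps kernels dense and regular while using roughly 1/g of the weights of the standard convolution it replaces, and SPGC's contribution is choosing the channel permutation that makes this swap cheap in accuracy.

```python
# Illustration of the dense-vs-grouped convolution trade-off (layer sizes
# are hypothetical; this is not SPGC's permutation algorithm).
import torch
import torch.nn as nn

c_in, c_out, k, groups = 64, 128, 3, 4           # illustrative sizes
dense = nn.Conv2d(c_in, c_out, k, padding=1)
grouped = nn.Conv2d(c_in, c_out, k, padding=1, groups=groups)

n_dense = sum(p.numel() for p in dense.parameters())
n_grouped = sum(p.numel() for p in grouped.parameters())
print(f"dense: {n_dense} params, grouped (g={groups}): {n_grouped} params")

x = torch.randn(1, c_in, 32, 32)
assert grouped(x).shape == dense(x).shape        # same output shape, ~1/g the weights
```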
