93 research outputs found

    Power and memory optimization techniques in embedded systems design

    Get PDF
    Embedded systems incur tight constraints on power consumption and memory (which impacts size) in addition to other constraints such as weight and cost. This dissertation addresses two key factors in embedded system design, namely minimization of power consumption and memory requirement. The first part of this dissertation considers the problem of optimizing power consumption (peak power as well as average power) in high-level synthesis (HLS). The second part deals with memory usage optimization mainly targeting a restricted class of computations expressed as loops accessing large data arrays that arises in scientific computing such as the coupled cluster and configuration interaction methods in quantum chemistry. First, a mixed-integer linear programming (MILP) formulation is presented for the scheduling problem in HLS using multiple supply-voltages in order to optimize peak power as well as average power and energy consumptions. For large designs, the MILP formulation may not be suitable; therefore, a two-phase iterative linear programming formulation and a power-resource-saving heuristic are presented to solve this problem. In addition, a new heuristic that uses an adaptation of the well-known force-directed scheduling heuristic is presented for the same problem. Next, this work considers the problem of module selection simultaneously with scheduling for minimizing peak and average power consumption. Then, the problem of power consumption (peak and average) in synchronous sequential designs is addressed. A solution integrating basic retiming and multiple-voltage scheduling (MVS) is proposed and evaluated. A two-stage algorithm namely power-oriented retiming followed by a MVS technique for peak and/or average power optimization is presented. Memory optimization is addressed next. Dynamic memory usage optimization during the evaluation of a special class of interdependent large data arrays is considered. Finally, this dissertation develops a novel integer-linear programming (ILP) formulation for static memory optimization using the well-known fusion technique by encoding of legality rules for loop fusion of a special class of loops using logical constraints over binary decision variables and a highly effective approximation of memory usage

    Fast algorithms for retiming large digital circuits

    Get PDF
    The increasing complexity of VLSI systems and shrinking time to market requirements demand good optimization tools capable of handling large circuits. Retiming is a powerful transformation that preserves functionality, and can be used to optimize sequential circuits for a wide range of objective functions by judiciously relocating the memory elements. Leiserson and Saxe, who introduced the concept, presented algorithms for period optimization (minperiod retiming) and area optimization (minarea retiming). The ASTRA algorithm proposed an alternative view of retiming using the equivalence between retiming and clock skew optimization;The first part of this thesis defines the relationship between the Leiserson-Saxe and the ASTRA approaches and utilizes it for efficient minarea retiming of large circuits. The new algorithm, Minaret, uses the same linear program formulation as the Leiserson-Saxe approach. The underlying philosophy of the ASTRA approach is incorporated to reduce the number of variables and constraints in this linear program. This allows minarea retiming of circuits with over 56,000 gates in under fifteen minutes;The movement of flip-flops in control logic changes the state encoding of finite state machines, requiring the preservation of initial (reset) states. In the next part of this work the problem of minimizing the number of flip-flops in control logic subject to a specified clock period and with the guarantee of an equivalent initial state, is formulated as a mixed integer linear program. Bounds on the retiming variables are used to guarantee an equivalent initial state in the retimed circuit. These bounds lead to a simple method for calculating an equivalent initial state for the retimed circuit;The transparent nature of level sensitive latches enables level-clocked circuits to operate faster and require less area. However, this transparency makes the operation of level-clocked circuits very complex, and optimization of level-clocked circuits is a difficult task. This thesis also presents efficient algorithms for retiming large level-clocked circuits. The relationship between retiming and clock skew optimization for level-clocked circuits is defined and utilized to develop efficient retiming algorithms for period and area optimization. Using these algorithms a circuit with 56,000 gates could be retimed for minimum period in under twenty seconds and for minimum area in under 1.5 hours

    Advanced Timing and Synchronization Methodologies for Digital VLSI Integrated Circuits

    Get PDF
    This dissertation addresses timing and synchronization methodologies that are critical to the design, analysis and optimization of high-performance, integrated digital VLSI systems. As process sizes shrink and design complexities increase, achieving timing closure for digital VLSI circuits becomes a significant bottleneck in the integrated circuit design flow. Circuit designers are motivated to investigate and employ alternative methods to satisfy the timing and physical design performance targets. Such novel methods for the timing and synchronization of complex circuitry are developed in this dissertation and analyzed for performance and applicability.Mainstream integrated circuit design flow is normally tuned for zero clock skew, edge-triggered circuit design. Non-zero clock skew or multi-phase clock synchronization is seldom used because the lack of design automation tools increases the length and cost of the design cycle. For similar reasons, level-sensitive registers have not become an industry standard despite their superior size, speed and power consumption characteristics compared to conventional edge-triggered flip-flops.In this dissertation, novel design and analysis techniques that fully automate the design and analysis of non-zero clock skew circuits are presented. Clock skew scheduling of both edge-triggered and level-sensitive circuits are investigated in order to exploit maximum circuit performances. The effects of multi-phase clocking on non-zero clock skew, level-sensitive circuits are investigated leading to advanced synchronization methodologies. Improvements in the scalability of the computational timing analysis process with clock skew scheduling are explored through partitioning and parallelization.The integration of the proposed design and analysis methods to the physical design flow of integrated circuits synchronized with a next-generation clocking technology-resonant rotary clocking technology-is also presented. Based on the design and analysis methods presented in this dissertation, a computer-aided design tool for the design of rotary clock synchronized integrated circuits is developed

    Energy-Efficient Digital Circuit Design using Threshold Logic Gates

    Get PDF
    abstract: Improving energy efficiency has always been the prime objective of the custom and automated digital circuit design techniques. As a result, a multitude of methods to reduce power without sacrificing performance have been proposed. However, as the field of design automation has matured over the last few decades, there have been no new automated design techniques, that can provide considerable improvements in circuit power, leakage and area. Although emerging nano-devices are expected to replace the existing MOSFET devices, they are far from being as mature as semiconductor devices and their full potential and promises are many years away from being practical. The research described in this dissertation consists of four main parts. First is a new circuit architecture of a differential threshold logic flipflop called PNAND. The PNAND gate is an edge-triggered multi-input sequential cell whose next state function is a threshold function of its inputs. Second a new approach, called hybridization, that replaces flipflops and parts of their logic cones with PNAND cells is described. The resulting \hybrid circuit, which consists of conventional logic cells and PNANDs, is shown to have significantly less power consumption, smaller area, less standby power and less power variation. Third, a new architecture of a field programmable array, called field programmable threshold logic array (FPTLA), in which the standard lookup table (LUT) is replaced by a PNAND is described. The FPTLA is shown to have as much as 50% lower energy-delay product compared to conventional FPGA using well known FPGA modeling tool called VPR. Fourth, a novel clock skewing technique that makes use of the completion detection feature of the differential mode flipflops is described. This clock skewing method improves the area and power of the ASIC circuits by increasing slack on timing paths. An additional advantage of this method is the elimination of hold time violation on given short paths. Several circuit design methodologies such as retiming and asynchronous circuit design can use the proposed threshold logic gate effectively. Therefore, the use of threshold logic flipflops in conventional design methodologies opens new avenues of research towards more energy-efficient circuits.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Efficient Block Scheduling to Minimize Context Switching Time for Programmable Embedded Processors

    Full text link
    Scheduling is one of the most often addressed optimization problems in DSP compilation, behavioral synthesis, and system-level synthesis research. With the rapid pace of changes in modern DSP applications requirements and implementation technologies, however, new types of scheduling challenges arise. This paper is concerned with the problem of scheduling blocks of computations in order to optimize the efficiency of their execution on programmable embedded systems under a realistic timing model of their processors. We describe an effective scheme for scheduling the blocks of any computation on a given system architecture and with a specified algorithm implementing each block. We also present algorithmic techniques for performing optimal block scheduling simultaneously with optimal architecture and algorithm selection. Our techniques address the block scheduling problem for both single- and multiple-processor system platforms and for a variety of optimization objectives including throughput, cost, and power dissipation. We demonstrate the practical effectiveness of our techniques on numerous designs and synthetic examples.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/44804/1/10617_2004_Article_239764.pd

    Enhancing Power Efficient Design Techniques in Deep Submicron Era

    Get PDF
    Excessive power dissipation has been one of the major bottlenecks for design and manufacture in the past couple of decades. Power efficient design has become more and more challenging when technology scales down to the deep submicron era that features the dominance of leakage, the manufacture variation, the on-chip temperature variation and higher reliability requirements, among others. Most of the computer aided design (CAD) tools and algorithms currently used in industry were developed in the pre deep submicron era and did not consider the new features explicitly and adequately. Recent research advances in deep submicron design, such as the mechanisms of leakage, the source and characterization of manufacture variation, the cause and models of on-chip temperature variation, provide us the opportunity to incorporate these important issues in power efficient design. We explore this opportunity in this dissertation by demonstrating that significant power reduction can be achieved with only minor modification to the existing CAD tools and algorithms. First, we consider peak current, which has become critical for circuit's reliability in deep submicron design. Traditional low power design techniques focus on the reduction of average power. We propose to reduce peak current while keeping the overhead on average power as small as possible. Second, dual Vt technique and gate sizing have been used simultaneously for leakage savings. However, this approach becomes less effective in deep submicron design. We propose to use the newly developed process-induced mechanical stress to enhance its performance. Finally, in deep submicron design, the impact of on-chip temperature variation on leakage and performance becomes more and more significant. We propose a temperature-aware dual Vt approach to alleviate hot spots and achieve further leakage reduction. We also consider this leakage-temperature dependency in the dynamic voltage scaling approach and discover that a commonly accepted result is incorrect for the current technology. We conduct extensive experiments with popular design benchmarks, using the latest industry CAD tools and design libraries. The results show that our proposed enhancements are promising in power saving and are practical to solve the low power design challenges in deep submicron era

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Linearization of The Timing Analysis and Optimization of Level-Sensitive Circuits

    Get PDF
    This thesis describes a linear programming (LP) formulation applicable to the static timing analysis of large scale synchronous circuits with level-sensitive latches. The automatic timing analysis procedure presented here is composed of deriving the connectivity information, constructing the LP model and solving the clock period minimization problem of synchronous digital VLSI circuits. In synchronous circuits with level-sensitive latches, operation at a reduced clock period (higher clock frequency) is possible by takingadvantage of both non-zero clock skew scheduling and time borrowing. Clock skew schedulingis performed in order to exploit the benefits of nonidentical clock signal delays on circuit timing. The time borrowing property of level-sensitive circuits permits higher operating frequencies compared to edge-sensitivecircuits. Considering time borrowing in the timing analysis, however, introduces non-linearity in this timing analysis. The modified big M (MBM) method is defined in order to transform the non-linear constraints arising in the problem formulation into solvable linear constraints. Equivalent LP model problemsfor single-phase clock synchronization of the ISCAS'89 benchmark circuits are generated and these problems are solved by the industrial LP solver CPLEX. Through the simultaneous application of time borrowing and clock skew scheduling, up to 63% improvements are demonstrated in minimum clock period with respect to zero-skew edge-sensitive synchronous circuits. The timing constraints governing thelevel-sensitive synchronous circuit operation not only solve the clock period minimization problem but also provide a common framework for the general timing analysis of such circuits. The inclusion of additional constraints into the problem formulation in order to meet the timing requirements imposed by specific applicationenvironments is discussed

    Voltage stacking for near/sub-threshold operation

    Get PDF

    ENERGY-AWARE OPTIMIZATION FOR EMBEDDED SYSTEMS WITH CHIP MULTIPROCESSOR AND PHASE-CHANGE MEMORY

    Get PDF
    Over the last two decades, functions of the embedded systems have evolved from simple real-time control and monitoring to more complicated services. Embedded systems equipped with powerful chips can provide the performance that computationally demanding information processing applications need. However, due to the power issue, the easy way to gain increasing performance by scaling up chip frequencies is no longer feasible. Recently, low-power architecture designs have been the main trend in embedded system designs. In this dissertation, we present our approaches to attack the energy-related issues in embedded system designs, such as thermal issues in the 3D chip multiprocessor (CMP), the endurance issue in the phase-change memory(PCM), the battery issue in the embedded system designs, the impact of inaccurate information in embedded system, and the cloud computing to move the workload to remote cloud computing facilities. We propose a real-time constrained task scheduling method to reduce peak temperature on a 3D CMP, including an online 3D CMP temperature prediction model and a set of algorithm for scheduling tasks to different cores in order to minimize the peak temperature on chip. To address the challenging issues in applying PCM in embedded systems, we propose a PCM main memory optimization mechanism through the utilization of the scratch pad memory (SPM). Furthermore, we propose an MLC/SLC configuration optimization algorithm to enhance the efficiency of the hybrid DRAM + PCM memory. We also propose an energy-aware task scheduling algorithm for parallel computing in mobile systems powered by batteries. When scheduling tasks in embedded systems, we make the scheduling decisions based on information, such as estimated execution time of tasks. Therefore, we design an evaluation method for impacts of inaccurate information on the resource allocation in embedded systems. Finally, in order to move workload from embedded systems to remote cloud computing facility, we present a resource optimization mechanism in heterogeneous federated multi-cloud systems. And we also propose two online dynamic algorithms for resource allocation and task scheduling. We consider the resource contention in the task scheduling
    • …
    corecore