156 research outputs found

    Low power digital signal processing

    Get PDF

    Memory Management for Emerging Memory Technologies

    Get PDF
    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving miss rate relative to conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from SPEC2006 suite on average by 32.9% while also decreasing misses on average by 4.7%. In a PCM based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in the OS Virtual Memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. In order to lower the cost of Main Memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with the less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high Flash devices access latency, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to gain about 50% of the lost performance due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system

    Memory Management for Emerging Memory Technologies

    Get PDF
    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management, optimizing the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving miss rate relative to conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from SPEC2006 suite on average by 32.9% while also decreasing misses on average by 4.7%. In a PCM based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in the OS Virtual Memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. In order to lower the cost of Main Memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with the less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high Flash devices access latency, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to gain about 50% of the lost performance due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system

    Low power data-dependent transform video and still image coding

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 139-144).This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.This work introduces the idea of data dependent video coding for low power. Algorithms for the Discrete Cosine Transform (DCT) and its inverse are introduced which exploit statistical properties of the input data in both the space and spatial frequency domains in order to minimize the total number of arithmetic operations. Two VLSI chips have been built as a proof-of-concept of data dependent processing implementing the DCT and its inverse (IDCT). The IDCT core processor exploits the presence of a large number of zerovalued spectral coefficients in the input stream when stimulated with MPEG-compressed video sequences. Adata-driven IDCT computation algorithm along with clock gating techniques are used to reduce the number of arithmetic operations for video inputs. The second chip is a DCT core processor that exhibits two innovative techniques for arithmetic operation reduction in the DCT computation context along with standard voltage scaling techniques such as pipelining and parallelism. The first method reduces the bitwidth of arithmetic operations in the presence of data spatial correlation. The second method trades off power dissipation and image compression quality (arithmetic precision.) Both chips are fully functional and exhibit the lowest switched capacitance per sample among past DCT/IDCT chips reported in the literature. Their power dissipation profile shows a strong dependence with certain statistical properties of the data that they operate on, according to the design goal.by Thucydides Xanthopoulos.Ph.D

    Design and analysis of SRAMs for energy harvesting systems

    Get PDF
    PhD ThesisAt present, the battery is employed as a power source for wide varieties of microelectronic systems ranging from biomedical implants and sensor net-works to portable devices. However, the battery has several limitations and incurs many challenges for the majority of these systems. For instance, the design considerations of implantable devices concern about the battery from two aspects, the toxic materials it contains and its lifetime since replacing the battery means a surgical operation. Another challenge appears in wire-less sensor networks, where hundreds or thousands of nodes are scattered around the monitored environment and the battery of each node should be maintained and replaced regularly, nonetheless, the batteries in these nodes do not all run out at the same time. Since the introduction of portable systems, the area of low power designs has witnessed extensive research, driven by the industrial needs, towards the aim of extending the lives of batteries. Coincidentally, the continuing innovations in the field of micro-generators made their outputs in the same range of several portable applications. This overlap creates a clear oppor-tunity to develop new generations of electronic systems that can be powered, or at least augmented, by energy harvesters. Such self-powered systems benefit applications where maintaining and replacing batteries are impossi-ble, inconvenient, costly, or hazardous, in addition to decreasing the adverse effects the battery has on the environment. The main goal of this research study is to investigate energy harvesting aware design techniques for computational logic in order to enable the capa- II bility of working under non-deterministic energy sources. As a case study, the research concentrates on a vital part of all computational loads, SRAM, which occupies more than 90% of the chip area according to the ITRS re-ports. Essentially, this research conducted experiments to find out the design met-ric of an SRAM that is the most vulnerable to unpredictable energy sources, which has been confirmed to be the timing. Accordingly, the study proposed a truly self-timed SRAM that is realized based on complete handshaking protocols in the 6T bit-cell regulated by a fully Speed Independent (SI) tim-ing circuitry. The study proved the functionality of the proposed design in real silicon. Finally, the project enhanced other performance metrics of the self-timed SRAM concentrating on the bit-line length and the minimum operational voltage by employing several additional design techniques.Umm Al-Qura University, the Ministry of Higher Education in the Kingdom of Saudi Arabia, and the Saudi Cultural Burea

    Voltage and Timing Adaptation for Variation and Aging Tolerance in Nanometer VLSI Circuits

    Get PDF
    Process variations and circuit aging continue to be main challenges to the power-efficiency of VLSI circuits, as considerable power budget must be allocated at design time to mitigate timing variations. Modern designs incorporate adaptive techniques for variation compensation to reduce the extra power consumption. The efficiency of existing adaptive approaches, however, is often significantly attenuated by the fine-grained nature of variations in nanometer technology such as random dopant fluctuation, litho-variation, and different rates of transistor degradation due to non-uniform activity factors. This dissertation addresses the limitations from existing adaptation techniques, and proposes new adaptive approaches to effectively compensate the fine-grained variations. Adaptive supply voltage (ASV) is one of the effective adaptation approaches for power-performance tuning. ASV has advantages on controlling dynamic and leakage power, while voltage generation and delivery overheads from conventional ASV systems make their application to mitigate fine-grained variations demanding. This dissertation presents a dual-level ASV system which provides ASV at both coarse-grained and fine-grained level, and has limited power routing overhead. Significant power reduction from our dual-ASV system demonstrates its superiority over existing approaches. Another novel technique on supply voltage adaptation for variation resilience in VLSI interconnects is proposed. A programmable boostable repeater design boosts switching speed by raising its internal voltage rail transiently and autonomously, and achieves fine-grained voltage adaptation without stand-alone voltage regulators or additional power grid. Since interconnect is a widely recognized bottleneck to chip performance and tremendous repeaters are employed on chip designs, boostable repeater has plenty of chances to improve system robustness. A low cost scheme for delay variation detection is essential to compose an efficient adaptation system. This dissertation presents an area-efficient built-in delay testing scheme which exploits BIST SCAN architecture and dynamic clock skew control. Using this built-in delay testing scheme, a fine-grained adaptation system composed of the proposed boostable repeater design and adaptive clock skew control is proposed, and demonstrated to mitigate process variation and aging induced timing degradations in a power as well as area efficient manner

    Optimization Schemes for Variability-Driven VLSI Design Automation

    Get PDF
    Today's IC design is facing several challenges due to increasing circuit complexity and decreasing feature size, as it pushes to extend Moore's law into nano-scale dimensions. Apart from the challenges that arise directly as a result of feature scaling (e.g., increasing leakage power, reliability issues), imperfections in the manufacturing process have recently turned into a major design hurdle, due to the variations they cause in the device and interconnect parameters from their target values. From an IC design automation perspective, a shift in paradigm, from deterministic to probabilistic, is needed to handle the unpredictable nature of these fabrication variations. In such a probabilistic paradigm, the varying circuit parameters such as leakage power or delay should be accurately modeled, and their correlations due to common sources of variations or physical location on the chip should be well captured. Moreover, variability-driven (probabilistic) design automation needs to efficiently generate a high quality solution. A particular challenge in variability-driven design automation is to define optimality measures among candidate solutions, which allow for inferior solutions to be removed from the solution space thus reducing the run-time complexity. In this dissertation, the superiority probability is introduced as such an optimality measure, and two methods are proposed to compute this probability: an accurate Conditional Monte Carlo simulation method, and an efficient moment-matching approximation method. The effectiveness of using the superiority probability is shown in the context of two important design automation applications: 1) the buffer insertion problem, 2) the dual-Vth leakage optimization problem. Another important task in variability-driven design automation is to develop optimization techniques that can provably generate the optimal solution in an efficient way. In this dissertation, the application of the gate sizing problem is explored to optimally reduce the loss due to fabrication variations in the presence of a timing constraint. The presented formulation, in contrast with the existing variability-driven approaches which are all based on heuristics, is provably optimal. Moreover, unlike existing approaches, it is independent of any assumption on the source and nature of variations

    CAD methodologies for low power and reliable 3D ICs

    Get PDF
    The main objective of this dissertation is to explore and develop computer-aided-design (CAD) methodologies and optimization techniques for reliability, timing performance, and power consumption of through-silicon-via(TSV)-based and monolithic 3D IC designs. The 3D IC technology is a promising answer to the device scaling and interconnect problems that industry faces today. Yet, since multiple dies are stacked vertically in 3D ICs, new problems arise such as thermal, power delivery, and so on. New physical design methodologies and optimization techniques should be developed to address the problems and exploit the design freedom in 3D ICs. Towards the objective, this dissertation includes four research projects. The first project is on the co-optimization of traditional design metrics and reliability metrics for 3D ICs. It is well known that heat removal and power delivery are two major reliability concerns in 3D ICs. To alleviate thermal problem, two possible solutions have been proposed: thermal-through-silicon-vias (T-TSVs) and micro-fluidic-channel (MFC) based cooling. For power delivery, a complex power distribution network is required to deliver currents reliably to all parts of the 3D IC while suppressing the power supply noise to an acceptable level. However, these thermal and power networks pose major challenges in signal routability and congestion. In this project, a co-optimization methodology for signal, power, and thermal interconnects in 3D ICs is presented. The goal of the proposed approach is to improve signal, thermal, and power noise metrics and to provide fast and accurate design space explorations for early design stages. The second project is a study on 3D IC partition. For a 3D IC, the target circuit needs to be partitioned into multiple parts then mapped onto the dies. The partition style impacts design quality such as footprint, wirelength, timing, and so on. In this project, the design methodologies of 3D ICs with different partition styles are demonstrated. For the LEON3 multi-core microprocessor, three partitioning styles are compared: core-level, block-level, and gate-level. The design methodologies for such partitioning styles and their implications on the physical layout are discussed. Then, to perform timing optimizations for 3D ICs, two timing constraint generation methods are demonstrated that lead to different design quality. The third project is on the buffer insertion for timing optimization of 3D ICs. For high performance 3D ICs, it is crucial to perform thorough timing optimizations. Among timing optimization techniques, buffer insertion is known to be the most effective way. The TSVs have a large parasitic capacitance that increases the signal slew and the delay on the downstream. In this project, a slew-aware buffer insertion algorithm is developed that handles full 3D nets and considers TSV parasitics and slew effects on delay. Compared with the well-known van Ginneken algorithm and a commercial tool, the proposed algorithm finds buffering solutions with lower delay values and acceptable runtime overhead. The last project is on the ultra-high-density logic designs for monolithic 3D ICs. The nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high-density device integration at the individual transistor-level. The benefits and challenges of monolithic 3D integration technology for logic designs are investigated. First, a 3D standard cell library for transistor-level monolithic 3D ICs is built and their timing and power behavior are characterized. Then, various interconnect options for monolithic 3D ICs that improve design quality are explored. Next, timing-closed, full-chip GDSII layouts are built and iso-performance power comparisons with 2D IC designs are performed. Important design metrics such as area, wirelength, timing, and power consumption are compared among transistor-level monolithic 3D, gate-level monolithic 3D, TSV-based 3D, and traditional 2D designs.PhDCommittee Chair: Lim, Sung Kyu; Committee Member: Bakir, Muhannad; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin; Committee Member: Mukhopadhyay, Saiba

    Dynamic Partial Reconfiguration for Dependable Systems

    Get PDF
    Moore’s law has served as goal and motivation for consumer electronics manufacturers in the last decades. The results in terms of processing power increase in the consumer electronics devices have been mainly achieved due to cost reduction and technology shrinking. However, reducing physical geometries mainly affects the electronic devices’ dependability, making them more sensitive to soft-errors like Single Event Transient (SET) of Single Event Upset (SEU) and hard (permanent) faults, e.g. due to aging effects. Accordingly, safety critical systems often rely on the adoption of old technology nodes, even if they introduce longer design time w.r.t. consumer electronics. In fact, functional safety requirements are increasingly pushing industry in developing innovative methodologies to design high-dependable systems with the required diagnostic coverage. On the other hand commercial off-the-shelf (COTS) devices adoption began to be considered for safety-related systems due to real-time requirements, the need for the implementation of computationally hungry algorithms and lower design costs. In this field FPGA market share is constantly increased, thanks to their flexibility and low non-recurrent engineering costs, making them suitable for a set of safety critical applications with low production volumes. The works presented in this thesis tries to face new dependability issues in modern reconfigurable systems, exploiting their special features to take proper counteractions with low impacton performances, namely Dynamic Partial Reconfiguration

    Source-synchronous I/O Links using Adaptive Interface Training for High Bandwidth Applications

    Get PDF
    Mobility is the key to the global business which requires people to be always connected to a central server. With the exponential increase in smart phones, tablets, laptops, mobile traffic will soon reach in the range of Exabytes per month by 2018. Applications like video streaming, on-demand-video, online gaming, social media applications will further increase the traffic load. Future application scenarios, such as Smart Cities, Industry 4.0, Machine-to-Machine (M2M) communications bring the concepts of Internet of Things (IoT) which requires high-speed low power communication infrastructures. Scientific applications, such as space exploration, oil exploration also require computing speed in the range of Exaflops/s by 2018 which means TB/s bandwidth at each memory node. To achieve such bandwidth, Input/Output (I/O) link speed between two devices needs to be increased to GB/s. The data at high speed between devices can be transferred serially using complex Clock-Data-Recovery (CDR) I/O links or parallely using simple source-synchronous I/O links. Even though CDR is more efficient than the source-synchronous method for single I/O link, but to achieve TB/s bandwidth from a single device, additional I/O links will be required and the source-synchronous method will be more advantageous in terms of area and power requirements as additional I/O links do not require extra hardware resources. At high speed, there are several non-idealities (Supply noise, crosstalk, Inter- Symbol-Interference (ISI), etc.) which create unwanted skew problem among parallel source-synchronous I/O links. To solve these problems, adaptive trainings are used in time domain to synchronize parallel source-synchronous I/O links irrespective of these non-idealities. In this thesis, two novel adaptive training architectures for source-synchronous I/O links are discussed which require significantly less silicon area and power in comparison to state-of-the-art architectures. First novel adaptive architecture is based on the unit delay concept to synchronize two parallel clocks by adjusting the phase of one clock in only one direction. Second novel adaptive architecture concept consists of Phase Interpolator (PI)-based Phase Locked Loop (PLL) which can adjust the phase in both direction and achieve faster synchronization at the expense of added complexity. With an increase in parallel I/O links, clock skew which is generated by the improper clock tree, also affects the timing margin. Incorrect duty cycle further reduces the timing margin mainly in Double Data Rate (DDR) systems which are generally used to increase the bandwidth of a high-speed communication system. To solve clock skew and duty cycle problems, a novel clock tree buffering algorithm and a novel duty cycle corrector are described which further reduce the power consumption of a source-synchronous system
    • …
    corecore