215 research outputs found

    Concertina: Squeezing in cache content to operate at near-threshold voltage

    Scaling the supply voltage to values near the threshold voltage allows a dramatic decrease in processor power consumption; however, the lower the voltage, the higher the sensitivity to process variation and, hence, the lower the reliability. Large SRAM structures, such as the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high-density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. Concertina matches the performance of an ideal system whose LLC does not suffer from parameter variation, with a modest storage overhead. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, for which over 90 percent of cache entries have defective cells, a notable improvement over previously proposed techniques.
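
    As an illustration of the compression idea, the sketch below (an assumed 8-byte subblock granularity and hypothetical helper names; the actual design operates in hardware at cache-fill time) keeps only the non-null subblocks of a block plus a bitmap, and checks whether the result fits in an entry with faulty subblocks:

```python
# Minimal sketch of null-subblock compression in the spirit of Concertina.
# A 64-byte block is split into 8-byte subblocks; a bitmap marks which
# subblocks are non-null, and only those are stored. Subblock size and
# function names are illustrative, not the paper's exact parameters.

SUBBLOCK = 8  # bytes per subblock (assumed)

def compress(block: bytes):
    """Return (bitmap, payload) keeping only non-null subblocks."""
    assert len(block) % SUBBLOCK == 0
    bitmap, payload = 0, bytearray()
    for i in range(len(block) // SUBBLOCK):
        sub = block[i * SUBBLOCK:(i + 1) * SUBBLOCK]
        if any(sub):                 # non-null subblock: keep it
            bitmap |= 1 << i
            payload += sub
    return bitmap, bytes(payload)

def fits(bitmap, faulty_subblocks: int, total_subblocks: int = 8) -> bool:
    """A compressed block fits if its non-null subblocks do not exceed
    the entry's functional (non-faulty) subblocks."""
    return bin(bitmap).count("1") <= total_subblocks - faulty_subblocks

block = bytes(16) + b"LIVE-DATA" + bytes(39)   # mostly-null 64-byte block
bm, data = compress(block)
print(bin(bm), len(data), fits(bm, faulty_subblocks=5))
```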

    Temperature Variation Aware Energy Optimization in Heterogeneous MPSoCs

    Thermal effects are rapidly gaining importance in nanometer heterogeneous integrated systems. Increased power density, coupled with spatio-temporal variability of chip workload, causes lateral and vertical temperature non-uniformities (variations) in the chip structure. Assuming a uniform temperature for a large circuit leads to inaccurate determination of key design parameters. To improve design quality, we need precise estimation of temperature at detailed spatial resolution, which is computationally very intensive. Consequently, thermal analysis of designs needs to be done at multiple levels of granularity. To further investigate chip/package thermal analysis, we exploit the Intel Single Chip Cloud Computer (SCC) and propose a methodology for calibrating its on-die temperature sensors. We also develop an infrastructure for online monitoring of SCC temperature sensor readings and SCC power consumption. With this thermal simulation tool in hand, we propose MiMAPT, an approach for analyzing delay, power, and temperature in digital integrated circuits. MiMAPT integrates seamlessly into industrial front-end and back-end chip design flows, accounting for temperature non-uniformities and self-heating during analysis. Furthermore, we extend the temperature-variation-aware analysis to 3D MPSoCs with Wide-I/O DRAM. We reduce DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. We develop an advanced virtual platform that models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. Moving towards real-world multi-core heterogeneous SoC designs, we exploit a reconfigurable heterogeneous platform (ZYNQ) to further study the performance and energy efficiency of various CPU-accelerator data-sharing methods in heterogeneous hardware architectures. A complete hardware accelerator featuring clusters of OpenRISC CPUs, with dynamic address remapping capability, is built and verified on real hardware.
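
    Since DRAM retention time falls roughly by half for every 10 °C of extra heat (a common rule of thumb, assumed here), the per-bank refresh adaptation can be sketched as follows; the reference temperature, base period, and bank temperatures are illustrative values, not the thesis's calibrated model:

```python
# Sketch of temperature-adaptive per-bank DRAM refresh, assuming the common
# rule of thumb that retention time roughly halves per 10 degC increase.
# Numbers (85 degC reference, 64 ms base period) follow typical DDR specs
# but illustrate the idea only, not the thesis's exact model.

BASE_PERIOD_MS = 64.0   # standard refresh period at the reference temperature
T_REF_C = 85.0          # reference (worst-case) temperature

def refresh_period_ms(bank_temp_c: float) -> float:
    """Cooler banks retain data longer, so they can be refreshed less often."""
    return BASE_PERIOD_MS * 2 ** ((T_REF_C - bank_temp_c) / 10.0)

# Vertical temperature variation in a 3D stack: the layer closest to the
# cores runs hotter than the top layer, so its banks refresh more often.
for bank, temp in {"layer0_bank0": 83.0, "layer3_bank0": 61.0}.items():
    print(f"{bank}: {temp:.0f} degC -> refresh every "
          f"{refresh_period_ms(temp):.0f} ms")
```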

    A fault-tolerant last level cache for CMPs operating at ultra-low voltage

    Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show behavior consistent with a hard fault. Block disabling is a micro-architectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the supply voltage. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling at negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo-instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates into performance improvements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
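
    A minimal sketch of such a fault-aware placement decision, assuming an inclusive hierarchy; the data structures and names are illustrative, since the paper's actual policy lives in the cache controller:

```python
# Sketch of a fault-aware LLC insertion policy in the spirit of the paper:
# in an inclusive hierarchy, a block that also lives in a private L1/L2 can
# tolerate being mapped to a faulty LLC entry (its data survives upstream),
# while "critical" blocks -- not in private caches and likely to be reused --
# deserve fully functional entries. Structures and names are illustrative.

from dataclasses import dataclass

@dataclass
class Entry:
    faulty: bool          # entry has defective cells at low Vdd
    tag: object = None
    reused: bool = False  # reuse-locality hint (e.g., a reuse bit)

def choose_victim(cache_set: list[Entry], block_in_private: bool) -> Entry:
    """Pick where to place an incoming block within one set."""
    if block_in_private:
        # A redundant copy exists upstream: prefer a faulty entry.
        for e in cache_set:
            if e.faulty:
                return e
    # Critical block: prefer a functional entry, evicting non-reused first.
    functional = [e for e in cache_set if not e.faulty]
    candidates = functional or cache_set
    return next((e for e in candidates if not e.reused), candidates[0])
```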

    Variation-Tolerant Non-Uniform 3D Cache Management in Memory Stacked Multi-Core Processors

    Process variations in integrated circuits have a significant impact on their performance, leakage, and stability. This is particularly evident in large, regular, and dense structures such as DRAMs. DRAMs are built using minimized transistors, with presumably uniform speed, in an organized array structure. Process variation can introduce latency disparity among different memory arrays. With the proliferation of 3D stacking technology, DRAMs have become a favorable choice for stacking on top of a multi-core processor as a last-level cache offering large capacity, high bandwidth, and low power. Hence, variations in bank speed create a unique problem of non-uniform cache accesses in the 3D space. In this thesis, we investigate cache management techniques for tolerating process variation in a 3D DRAM stacked onto a multi-core processor. We model the process variation in a 4-layer DRAM memory to characterize the latency variations among different banks. As a result, whether a bank is fast or slow from a core's standpoint is no longer determined by its physical distance from the core, but by the bank latency differences due to process variation. We develop cache migration schemes that utilize fast banks while limiting the cost of migration. Our experiments show a substantial performance benefit in exploiting fast memory banks through migration. On average, variation-aware management improves workload performance over the baseline (where the speed of the slowest bank is assumed for all banks) by 17.8%, and comes within 0.45% of the performance of an ideal memory with no process variation.
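
    The migration trade-off can be sketched as a simple cost test: move a block to a variation-fast bank only if its expected access savings amortize the move. All latencies and the epoch structure below are assumed for illustration:

```python
# Sketch of variation-aware block migration between stacked-DRAM cache banks:
# hot data is moved to banks that process variation left fast, but only when
# the expected latency saving outweighs the one-off migration cost.
# Latencies, threshold, and epoch structure are illustrative assumptions.

def should_migrate(accesses_per_epoch: int, slow_lat_ns: float,
                   fast_lat_ns: float, migration_cost_ns: float) -> bool:
    """Migrate if the per-epoch latency saving amortizes the migration cost."""
    saving = accesses_per_epoch * (slow_lat_ns - fast_lat_ns)
    return saving > migration_cost_ns

# A bank's speed comes from variation-induced latency, not physical distance.
print(should_migrate(accesses_per_epoch=120, slow_lat_ns=48.0,
                     fast_lat_ns=36.0, migration_cost_ns=900.0))  # True
```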

    COMMERCIALIZATION AND OPTIMIZATION OF THE PIXEL ROUTER

    The Pixel Router was developed at the University of Kentucky to support multi-projector displays by combining the scalability of commercial software solutions with the flexibility of commercial hardware solutions. This custom hardware solution uses a look-up table for arbitrary input-to-output pixel mapping, but suffers from high memory latencies due to random SDRAM accesses. For the device to achieve marketability, the image interpolation method also needed improvement. The previous design used nearest-neighbor interpolation, which produces poor-looking results but requires the fewest memory accesses. A cache was implemented to support bilinear interpolation, simultaneously increasing the output frame rate and image quality. A number of software simulations were conducted to test and refine the cache design, and the results were verified by testing the implementation on hardware. The frame rate was improved by a factor of 6 versus bilinear interpolation on the previous design, and by as much as 50% versus nearest-neighbor on the previous design. The Pixel Router was also certified for FCC conducted and radiated emissions compliance, and potential commercial market areas were explored.
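
    For context, the sketch below shows the bilinear sampling that the cache makes affordable: each output pixel blends the four input pixels around its fractional source coordinate (versus one read for nearest-neighbor), so neighboring output pixels reuse fetched data. A toy grayscale image stands in for the real framebuffer:

```python
# Sketch of the bilinear sampling the Pixel Router's cache accelerates:
# each output pixel blends the four input pixels around a fractional source
# coordinate, versus the single pixel nearest-neighbor reads. A grayscale
# image as a 2D list keeps the example short.

def bilinear(img, x: float, y: float) -> float:
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)   # clamp at the image border
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy    # four reads -> good cache reuse

img = [[0, 100], [50, 150]]
print(bilinear(img, 0.5, 0.5))  # 75.0: average of the four neighbors
```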

    Urban Agriculture and Small Farm Water Use: Case Studies and Trends from Cache Valley, Utah

    The landscape of water in Utah is changing due to population growth, conversion of agricultural land to urban development, and increasing awareness of water scarcity. At the same time, Utah has a growing number of urban and small farms, but knowledge of water use in this sector is limited. A better understanding of what occurs at the field level on urban and small farms can aid state water use estimates and conservation efforts, and assist farmers in moving towards wiser water management. For the 2015 growing season, we performed irrigation evaluations on 24 urban and small farms in Cache Valley, Utah; we explore the results through case studies and identify trends relating gross irrigation depth to field variables including field size, irrigation method, application uniformity, and scheduling practices. The results show a great degree of heterogeneity in irrigation methods, equipment used, and management practices. The beneficial consumed fraction of irrigation water ranged from 0.06 to 1.0. Small fields had lower application uniformities and greater irrigation depths than large fields. Surface-irrigated fields had higher irrigation depths than sprinkle- and drip-irrigated fields. Additionally, fields on a fixed irrigation schedule had higher depths than fields irrigated inconsistently due to other factors. The results show that urban and small-farm irrigators need improved knowledge of proper irrigation management. Irrigators, university extension services, and state water authorities working in this sector need to recognize the link between proper management and total water use, and focus more effort on improving management: specifically, 1) using low-cost methods to measure flow rates, 2) using simple irrigation scheduling tools, and 3) improving application uniformity.
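
    For reference, the sketch below computes two standard field metrics of the kind such evaluations rely on: lower-quartile distribution uniformity from catch-can depths, and gross irrigation depth from flow, duration, and area. The formulas are standard; whether the study used exactly these variants is an assumption:

```python
# Sketch of two standard irrigation field metrics: lower-quartile
# distribution uniformity (DU_lq) from catch-can depths, and gross
# irrigation depth from flow rate, run time, and field area. These are
# textbook definitions, assumed to match the study's evaluations.

def du_lq(depths_mm: list[float]) -> float:
    """Mean of the lowest quarter of measured depths over the overall mean."""
    d = sorted(depths_mm)
    low_quarter = d[:max(1, len(d) // 4)]
    return (sum(low_quarter) / len(low_quarter)) / (sum(d) / len(d))

def gross_depth_mm(flow_lps: float, hours: float, area_m2: float) -> float:
    """Applied volume spread over the field, in millimetres."""
    litres = flow_lps * hours * 3600
    return litres / area_m2   # 1 L/m^2 = 1 mm of depth

print(du_lq([10, 12, 14, 15, 16, 18, 20, 22]))              # ~0.69
print(gross_depth_mm(flow_lps=2.0, hours=3.0, area_m2=800))  # 27.0 mm
```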

    Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

    Efficient parallel implementation of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis is challenging. It requires exploiting the data-parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Achieving performance on accelerators for these applications is extremely challenging, because many of them employ irregular algorithms that exhibit data-dependent control flow and irregular memory accesses. Furthermore, these applications are often iterative with dependencies between steps, making it hard to parallelize across steps; as a result, parallelism is often limited to a single step. Numerical simulation of charged particle beam dynamics is one such application, where the distribution of work and the memory access pattern at each time step are irregular. Applications with these properties tend to present significant branch and memory divergence, load imbalance between processor cores, and poor compute and memory utilization. Prior research on parallelizing such irregular applications has focused on optimizing the irregular, data-dependent memory accesses and control flow within a single step of the application, independent of the other steps, under the assumption that these patterns are completely unpredictable. We observed that the structure of computation leading to control-flow divergence and irregular memory accesses in one step is similar to that in the next step, so the structure of the current step can be predicted by observing the computation structure of previous steps. In this dissertation, we present novel machine-learning-based optimization techniques to address the parallel implementation challenges of such irregular applications on different HPC architectures. In particular, we use supervised learning to predict the computation structure and use it to address the control-flow and memory access irregularities in parallel implementations on GPUs, Xeon Phis, and heterogeneous architectures composed of multi-core CPUs with GPUs or Xeon Phis. We use numerical simulation of charged particle beam dynamics as a motivating example throughout the dissertation, though the techniques should be equally applicable to a wide range of irregular applications. The machine learning approach presented here uses predictive analytics and forecasting techniques to adaptively model and track the irregular memory access pattern at each time step of the simulation and anticipate the future access pattern. Access pattern forecasts can then be used to make optimization decisions during application execution, improving the application's performance at a future time step based on observations from earlier time steps. In heterogeneous architectures, forecasts can also be used to improve the memory performance and resource utilization of all the processing units to deliver good aggregate performance.
    We used these optimization techniques and this anticipation strategy to design a cache-aware, memory-efficient parallel algorithm that addresses the irregularities in the parallel implementation of charged particle beam dynamics simulation on different HPC architectures. Experimental results using a diverse mix of HPC architectures show that our anticipation strategy is effective in maximizing data reuse, ensuring workload balance, minimizing branch and memory divergence, and improving resource utilization.
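
    A minimal sketch of the anticipation idea, with linear extrapolation standing in for the dissertation's learned predictor (the binning scheme and all names are illustrative): forecast which memory bin each particle touches next, then process particles in forecast-bin order so accesses stay contiguous:

```python
# Sketch of the anticipation strategy: the bin (memory region) a particle
# touches at step t+1 is forecast from its trajectory over previous steps,
# and the particle array is reordered by predicted bin so adjacent threads
# read adjacent memory. Linear extrapolation stands in for the learned
# predictor; all names here are illustrative.

def predict_next_bins(bins_prev: list[int], bins_cur: list[int],
                      n_bins: int) -> list[int]:
    """Forecast each particle's next bin by extrapolating its drift."""
    return [min(n_bins - 1, max(0, b1 + (b1 - b0)))
            for b0, b1 in zip(bins_prev, bins_cur)]

def locality_order(predicted: list[int]) -> list[int]:
    """Particle indices sorted by forecast bin: contiguous, low-divergence
    memory accesses when processed in this order."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i])

prev, cur = [3, 7, 1, 4], [4, 6, 1, 6]
nxt = predict_next_bins(prev, cur, n_bins=8)   # [5, 5, 1, 7]: drift continues
print(nxt, locality_order(nxt))                # process particle 2 first
```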

    Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing

    Power density is currently the primary design constraint across most computing segments and the main performance-limiting factor. For years, industry kept power density constant while increasing frequency and lowering transistor supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current is exponentially related to it. Transistor count and integration density keep doubling every process generation (Moore's Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon. With each new generation there are more resources available, but we cannot fully exploit their performance potential. In recent years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage. The resulting power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors. When scaling Vdd down to near-threshold values, manufacturing-induced parameter variations affect the functionality of SRAM cells, which eventually become unreliable. A large class of emerging applications, on the other hand, features intrinsic error resilience, tolerating a certain amount of noise. In this context, approximate computing exploits the gap between the level of accuracy required by the application and the level of accuracy delivered by the computation, provided that reducing the accuracy translates into an energy gain. However, deciding which instructions and data, and which techniques, are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation for effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The first arises from the replication of blocks in multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements between 2% and 34% with respect to block disabling, a technique of similar complexity, while incurring no additional storage overhead. The latter, content-level redundancy, arises from the redundancy of data in real-world applications. We exploit it by compressing cache blocks to fit them into partially functional cache entries. At the cost of a slight overhead increase, we obtain performance within 2% of that of a cache built with fault-free cells, even when more than 90% of the cache entries have at least one faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA).
    Exploiting this ISA redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential of reducing the complexity of the ISA and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: breadth (i.e., simplifying instructions) and depth (i.e., dropping instructions). This proof-of-concept shows that energy can be reduced by 20.6% on average, at around 14.9% accuracy loss.
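
    A software analogue of the two dimensions, as a hedged sketch (the dissertation applies this at the ISA/hardware level; the reduced-precision multiply and skipped iterations below merely illustrate the accuracy-for-energy trade):

```python
# Software analogue of shrinking an ISA in two dimensions, for illustration:
# "breadth" simplifies an instruction (a reduced-precision multiply stands
# in for a simplified FP unit) and "depth" drops instructions entirely
# (here, skipping a fraction of multiply-accumulates). Purely illustrative;
# the dissertation evaluates this at the ISA level, not in Python.

def approx_mul(a: float, b: float, keep_bits: int = 8) -> float:
    """Breadth: multiply with operands truncated to keep_bits fractional bits."""
    scale = 1 << keep_bits
    trunc = lambda x: int(x * scale) / scale
    return trunc(a) * trunc(b)

def dot(xs, ys, drop_every: int = 0):
    """Depth: optionally drop every drop_every-th multiply-accumulate."""
    acc = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if drop_every and i % drop_every == 0:
            continue                      # dropped "instruction"
        acc += approx_mul(x, y)
    return acc

xs, ys = [0.1 * i for i in range(10)], [0.3] * 10
exact = sum(x * y for x, y in zip(xs, ys))
print(exact, dot(xs, ys), dot(xs, ys, drop_every=4))  # accuracy/energy trade
```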

    Development of a 3D structural MRI preprocessing pipeline based on advanced normalization tools (ANTs)

    To distinguish normal brain aging from pathological aging, MRI scans are often used to study the brain's structure. These scans, however, frequently present several problems that make it harder to draw conclusions, so they must undergo a series of image processing techniques, mostly related to normalization, to improve their usefulness. This thesis studies the use of Advanced Normalization Tools (ANTs) as the core of the processing pipeline, complemented by a quality-check system to assess its outputs and a method to monitor the computational resource usage of the whole process, in addition to extracting structural data to feed a machine learning brain-age prediction system. The outcomes of this thesis include: the optimal way to use ANTs in this context, a volume correlation system to evaluate its results, a simple Linux-command-based resource usage summarisation program, and the results of feeding the extracted data to an XGBoost brain-age prediction system.
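
    A minimal sketch of what such a Linux-command-based monitor might look like (the thesis's actual script is not shown; the `ps` fields, sampling interval, and wrapper below are assumptions):

```python
# Minimal sketch of a Linux-command-based resource monitor like the one the
# thesis describes: poll `ps` for a pipeline process's CPU and memory and
# report the peak resident set size. Fields and interval are assumed.

import subprocess, time

def monitor(cmd: list[str], interval_s: float = 5.0) -> int:
    """Run cmd, sampling %CPU and RSS via `ps` until it exits;
    return peak RSS in KiB."""
    proc = subprocess.Popen(cmd)
    peak_rss_kib = 0
    while proc.poll() is None:
        out = subprocess.run(
            ["ps", "-o", "%cpu=,rss=", "-p", str(proc.pid)],
            capture_output=True, text=True).stdout.split()
        if len(out) == 2:
            cpu, rss = float(out[0]), int(out[1])
            peak_rss_kib = max(peak_rss_kib, rss)
            print(f"cpu={cpu:.1f}% rss={rss} KiB")
        time.sleep(interval_s)
    return peak_rss_kib

# e.g. wrap one pipeline stage: monitor(["antsRegistration", ...])
print(monitor(["sleep", "2"], interval_s=0.5))
```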