183 research outputs found
Informed microarchitecture design space exploration using workload dynamics
Program runtime characteristics exhibit significant variation. As microprocessor architectures become more complex, their efficiency depends on the capability of adapting with workload dynamics. Moreover, with the approaching billion-transistor microprocessor era, it is not always economical or feasible to design processors with thermal cooling and reliability redundancy capabilities that target an application’s worst case scenario. Therefore, analyzing complex workload dynamics early, at the microarchitecture design stage, is crucial to forecast workload runtime behavior across architecture design alternatives and evaluate the efficiency of workload scenariobased architecture optimizations. Existing methods focus exclusively on predicting aggregated workload behavior. In this paper, we propose accurate and efficient techniques and models to reason about workload dynamics across the microarchitecture design space without using detailed cyclelevel simulations. Our proposed techniques employ waveletbased multiresolution decomposition and neural network based non-linear regression modeling. We extensively evaluate the efficiency of our predictive models in forecasting performance, power and reliability domain workload dynamics that the SPEC CPU 2000 benchmarks manifest on high-performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in revealing workload dynamic behavior across a large microarchitecture design space. We also demonstrate that the proposed techniques can be used to efficiently explore workload scenario-driven architecture optimizations. 1
Mechanistic modeling of architectural vulnerability factor
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF
Contextual Bandit Modeling for Dynamic Runtime Control in Computer Systems
Modern operating systems and microarchitectures provide a myriad of mechanisms for monitoring and affecting system operation and resource utilization at runtime. Dynamic runtime control of these mechanisms can tailor system operation to the characteristics and behavior of the current workload, resulting in improved performance. However, developing effective models for system control can be challenging. Existing methods often require extensive manual effort, computation time, and domain knowledge to identify relevant low-level performance metrics, relate low-level performance metrics and high-level control decisions to workload performance, and to evaluate the resulting control models.
This dissertation develops a general framework, based on the contextual bandit, for describing and learning effective models for runtime system control. Random profiling is used to characterize the relationship between workload behavior, system configuration, and performance. The framework is evaluated in the context of two applications of progressive complexity; first, the selection of paging modes (Shadow Paging, Hardware-Assisted Page) in the Xen virtual machine memory manager; second, the utilization of hardware memory prefetching for multi-core, multi-tenant workloads with cross-core contention for shared memory resources, such as the last-level cache and memory bandwidth. The resulting models for both applications are competitive in comparison to existing runtime control approaches. For paging mode selection, the resulting model provides equivalent performance to the state of the art while substantially reducing the computation requirements of profiling. For hardware memory prefetcher utilization, the resulting models are the first to provide dynamic control for hardware prefetchers using workload statistics. Finally, a correlation-based feature selection method is evaluated for identifying relevant low-level performance metrics related to hardware memory prefetching
Analyzing and Predicting Processor Vulnerability to Soft Errors Using Statistical Techniques
The shrinking processor feature size, lower threshold voltage and increasing on-chip transistor density make current processors highly vulnerable to soft errors. Architectural Vulnerability Factor (AVF) reflects the probability that a raw soft error eventually causes a visible error in the program output, indicating the processor’s susceptibility to soft errors at architectural level. The awareness of the AVF, both at the early design stage and during program runtime, is greatly useful for designing reliable processors. However, measuring the AVF is extremely costly, resulting in large overheads in hardware, computation, and power. The situation is further exacerbated in a multi-threaded processor environment where resource contention and data sharing exist among different threads. Consequently, predicting the AVF from other easily-measured metrics becomes extraordinarily attractive to computer designers. We propose a series of AVF modeling and prediction works via using advanced statistical techniques. First, we utilize the Boosted Regression Trees (BRT) scheme to dynamically predict the AVF during program execution from a variety of performance metrics. This correlation is generalized to be across different workloads, program phases, and processor configurations on a single-threaded superscalar processor. Second, the AVF prediction is extended to multi-threaded processors where the inter-thread resource contention shows significant and non-uniform impacts on different programs; we propose a two-level predictive mechanism using BRT as building blocks to characterize the contention behavior. Finally, we employ a rule search strategy named Patient Rule Induction Method (PRIM) to explore a large processor design space at the early design stage. We are capable of generating selective rules on important configuration parameters. These rules quantify the design space subregion yielding lowest values of the response, thereby providing useful guidelines for designing reliable processors while achieving high performance
Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors
abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.
Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.
Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.
In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201
Cross layer reliability estimation for digital systems
Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. %Worst-case design with large margins to guarantee reliable operation has been employed for long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost.
One of the open challenges for future technologies is building ``dependable'' systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unreliability are rapidly becoming unacceptable in today's environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern.
Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully optimize the system by exploiting the inner capability of each layer to mask lower level faults.
For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability.
This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains
An Intelligent Framework for Energy-Aware Mobile Computing Subject to Stochastic System Dynamics
abstract: User satisfaction is pivotal to the success of mobile applications. At the same time, it is imperative to maximize the energy efficiency of the mobile device to ensure optimal usage of the limited energy source available to mobile devices while maintaining the necessary levels of user satisfaction. However, this is complicated due to user interactions, numerous shared resources, and network conditions that produce substantial uncertainty to the mobile device's performance and power characteristics. In this dissertation, a new approach is presented to characterize and control mobile devices that accurately models these uncertainties. The proposed modeling framework is a completely data-driven approach to predicting power and performance. The approach makes no assumptions on the distributions of the underlying sources of uncertainty and is capable of predicting power and performance with over 93% accuracy.
Using this data-driven prediction framework, a closed-loop solution to the DEM problem is derived to maximize the energy efficiency of the mobile device subject to various thermal, reliability and deadline constraints. The design of the controller imposes minimal operational overhead and is able to tune the performance and power prediction models to changing system conditions. The proposed controller is implemented on a real mobile platform, the Google Pixel smartphone, and demonstrates a 19% improvement in energy efficiency over the standard frequency governor implemented on all Android devices.Dissertation/ThesisDoctoral Dissertation Computer Engineering 201
Recommended from our members
Hybrid Analog-Digital Co-Processing for Scientific Computation
In the past 10 years computer architecture research has moved to more heterogeneity and less adherence to conventional abstractions. Scientists and engineers hold an unshakable belief that computing holds keys to unlocking humanity's Grand Challenges. Acting on that belief they have looked deeper into computer architecture to find specialized support for their applications. Likewise, computer architects have looked deeper into circuits and devices in search of untapped performance and efficiency. The lines between computer architecture layers---applications, algorithms, architectures, microarchitectures, circuits and devices---have blurred. Against this backdrop, a menagerie of computer architectures are on the horizon, ones that forgo basic assumptions about computer hardware, and require new thinking of how such hardware supports problems and algorithms.
This thesis is about revisiting hybrid analog-digital computing in support of diverse modern workloads. Hybrid computing had extensive applications in early computing history, and has been revisited for small-scale applications in embedded systems. But architectural support for using hybrid computing in modern workloads, at scale and with high accuracy solutions, has been lacking.
I demonstrate solving a variety of scientific computing problems, including stochastic ODEs, partial differential equations, linear algebra, and nonlinear systems of equations, as case studies in hybrid computing. I solve these problems on a system of multiple prototype analog accelerator chips built by a team at Columbia University. On that team I made contributions toward programming the chips, building the digital interface, and validating the chips' functionality. The analog accelerator chip is intended for use in conjunction with a conventional digital host computer.
The appeal and motivation for using an analog accelerator is efficiency and performance, but it comes with limitations in accuracy and problem sizes that we have to work around.
The first problem is how to do problems in this unconventional computation model. Scientific computing phrases problems as differential equations and algebraic equations. Differential equations are a continuous view of the world, while algebraic equations are a discrete one. Prior work in analog computing mostly focused on differential equations; algebraic equations played a minor role in prior work in analog computing. The secret to using the analog accelerator to support modern workloads on conventional computers is that these two viewpoints are interchangeable. The algebraic equations that underlie most workloads can be solved as differential equations,
and differential equations are naturally solvable in the analog accelerator chip. A hybrid analog-digital computer architecture can focus on solving linear and nonlinear algebra problems to support many workloads.
The second problem is how to get accurate solutions using hybrid analog-digital computing. The reason that the analog computation model gives less accurate solutions is it gives up representing numbers as digital binary numbers, and instead uses the full range of analog voltage and current to represent real numbers. Prior work has established that encoding data in analog signals gives an energy efficiency advantage as long as the analog data precision is limited. While the analog accelerator alone may be useful for energy-constrained applications where inputs and outputs are imprecise, we are more interested in using analog in conjunction with digital for precise solutions. This thesis gives novel insight that the trick to do so is to solve nonlinear problems where low-precision guesses are useful for conventional digital algorithms.
The third problem is how to solve large problems using hybrid analog-digital computing. The reason the analog computation model can't handle large problems is it gives up step-by-step discrete-time operation, instead allowing variables to evolve smoothly in continuous time. To make that happen the analog accelerator works by chaining hardware for mathematical operations end-to-end. During computation analog data flows through the hardware with no overheads in control logic and memory accesses. The downside is then the needed hardware size grows alongside problem sizes. While scientific computing researchers have for a long time split large problems into smaller subproblems to fit in digital computer constraints, this thesis is a first attempt to consider these divide-and-conquer algorithms as an essential tool in using the analog model of computation.
As we enter the post-Moore’s law era of computing, unconventional architectures will offer specialized models of computation that uniquely support specific problem types. Two prominent examples are deep neural networks and quantum computers. Recent trends in computer science research show these unconventional architectures will soon have broad adoption. In this thesis I show another specialized, unconventional architecture is to use analog accelerators to solve problems in scientific computing. Computer architecture researchers will discover other important models of computation in the future. This thesis is an example of the discovery process, implementation, and evaluation of how an unconventional architecture supports specialized workloads
Thermal Management for Dependable On-Chip Systems
This thesis addresses the dependability issues in on-chip systems from a thermal perspective. This includes an explanation and analysis of models to show the relationship between dependability and tempature. Additionally, multiple novel methods for on-chip thermal management are introduced aiming to optimize thermal properties. Analysis of the methods is done through simulation and through infrared thermal camera measurements
- …