78 research outputs found

    Analyzing and Predicting Processor Vulnerability to Soft Errors Using Statistical Techniques

    Get PDF
    The shrinking processor feature size, lower threshold voltage and increasing on-chip transistor density make current processors highly vulnerable to soft errors. Architectural Vulnerability Factor (AVF) reflects the probability that a raw soft error eventually causes a visible error in the program output, indicating the processor’s susceptibility to soft errors at architectural level. The awareness of the AVF, both at the early design stage and during program runtime, is greatly useful for designing reliable processors. However, measuring the AVF is extremely costly, resulting in large overheads in hardware, computation, and power. The situation is further exacerbated in a multi-threaded processor environment where resource contention and data sharing exist among different threads. Consequently, predicting the AVF from other easily-measured metrics becomes extraordinarily attractive to computer designers. We propose a series of AVF modeling and prediction works via using advanced statistical techniques. First, we utilize the Boosted Regression Trees (BRT) scheme to dynamically predict the AVF during program execution from a variety of performance metrics. This correlation is generalized to be across different workloads, program phases, and processor configurations on a single-threaded superscalar processor. Second, the AVF prediction is extended to multi-threaded processors where the inter-thread resource contention shows significant and non-uniform impacts on different programs; we propose a two-level predictive mechanism using BRT as building blocks to characterize the contention behavior. Finally, we employ a rule search strategy named Patient Rule Induction Method (PRIM) to explore a large processor design space at the early design stage. We are capable of generating selective rules on important configuration parameters. These rules quantify the design space subregion yielding lowest values of the response, thereby providing useful guidelines for designing reliable processors while achieving high performance

    Adaptive thread and memory access schelduling in chip multiprocessors

    Get PDF
    Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013.Thesis (Master's) -- Bilkent University, 2013.Includes bibliographical references leaves 76-81.The full potential of chip multiprocessors remains unexploited due to architecture oblivious thread schedulers used in operating systems, and thread-oblivious memory access schedulers used in off-chip main memory controllers. For the thread scheduling, we introduce an adaptive cache-hierarchy-aware scheduler that tries to schedule threads in a way that inter-thread contention is minimized. A novel multi-metric scoring scheme is used that specifies the L1 cache access characteristics of a thread. The scheduling decisions are made based on multi-metric scores of threads. For the memory access scheduling, we introduce an adaptive compute-phase prediction and thread prioritization scheme that efficiently categorize threads based on execution characteristics and provides fine-grained prioritization that allows to differentiate threads and prioritize their memory access requests accordingly.AktĂĽrk, Ä°smailM.S

    Thread Scheduling Mechanisms for Multiple-Context Parallel Processors

    Get PDF
    Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and shows how thread prioritization can both maintain high processor utilization, and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth limited applications, thread prioritization must be extended to prioritize memory requests. We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies

    Building Evolvable Networks:Flexible and Predictable Packet Processing

    Get PDF
    Software packet-processing platforms-€-”network devices running on general-purpose servers--€”are emerging as a compelling alternative to the traditional high-end switches and routers running on specialized hardware. Their promise is to enable the fast deployment of new, sophisticated kinds of packet processing without the need to buy and deploy expensive new equipment. This would allow us to transform the current Internet into a programmable network, a network that can evolve over time and provide a better service for the users. In order to become a credible alternative to the hardware platforms, software packet processing needs to offer not just flexibility, but also a competitive level of performance and, equally important, predictability. Recent works have demonstrated high performance for software platforms, but this was shown only for simple, conventional workloads like packet forwarding and IP routing. And this was achieved for systems where all the processing cores handle the same type/amount of traffic and run identical code, a critical simplifying assumption. One challenge is to achieve high and predictable performance in the context of software platforms running a diverse set of applications and serving multiple clients with different needs. Another challenge is to offer such flexibility without the risk of disrupting the network by introducing bugs, unpredictable performance, or security vulnerabilities. In this thesis we focus on how to design software packet-processing systems so as to achieve both high performance and predictability, while maintaining the ease of programmability. First, we identify the main factors that affect packet-processing performance on a modern multicore server--€”cache misses, cache contention, load-balancing across processing cores--€”and show how to parallelize the functionality across the available cores in order to maximize the throughput. Second, we analyze the way contention for shared resources--€”caches, memory controllers, buses--€”affects performance in a system that runs a diverse set of packet-processing applications. The key observation is that contention for the last-level cache represents the dominant contention factor and the performance drop suffered by a given application is mostly determined by the number of cache references/second performed by the competing applications. We leverage this observation and we show that such a system is able to provide predictable performance in the face of resource contention. Third, we present the result of working iteratively on two tasks: designing a domain-specific verification tool for packet-processing software, while trying to identify a minimal set of restrictions that packet-processing software must satisfy in order to be verification-friendly. The main insight is that packet-processing software is a good fit for verification because it typically consists of distinct pieces of code that share limited mutable state and we can leverage this domain specificity to sidestep fundamental scalability challenges in software verification. We demonstrate that it is feasible to automatically prove useful properties of software dataplanes to ensure a smooth network operation. Based on our results, we conclude that we can design software network equipment that combines both flexibility and predictability

    Building Evolvable Networks:Flexible and Predictable Packet Processing

    Get PDF
    Software packet-processing platforms--network devices running on general-purpose servers--are emerging as a compelling alternative to the traditional high-end switches and routers running on specialized hardware. Their promise is to enable the fast deployment of new, sophisticated kinds of packet processing without the need to buy and deploy expensive new equipment. This would allow us to transform the current Internet into a programmable network, a network that can evolve over time and provide a better service for the users. In order to become a credible alternative to the hardware platforms, software packet processing needs to offer not just flexibility, but also a competitive level of performance and, equally important, predictability. Recent works have demonstrated high performance for software platforms, but this was shown only for simple, conventional workloads like packet forwarding and IP routing. And this was achieved for systems where all the processing cores handle the same type/amount of traffic and run identical code, a critical simplifying assumption. One challenge is to achieve high and predictable performance in the context of software platforms running a diverse set of applications and serving multiple clients with different needs. Another challenge is to offer such flexibility without the risk of disrupting the network by introducing bugs, unpredictable performance, or security vulnerabilities. In this thesis we focus on how to design software packet-processing systems so as to achieve both high performance and predictability, while maintaining the ease of programmability. First, we identify the main factors that affect packet-processing performance on a modern multicore server--cache misses, cache contention, load-balancing across processing cores--and show how to parallelize the functionality across the available cores in order to maximize the throughput. Second, we analyze the way contention for shared resources--caches, memory controllers, buses--affects performance in a system that runs a diverse set of packet-processing applications. The key observation is that contention for the last-level cache represents the dominant contention factor and the performance drop suffered by a given application is mostly determined by the number of cache references/second performed by the competing applications. We leverage this observation and we show that such a system is able to provide predictable performance in the face of resource contention. Third, we present the result of working iteratively on two tasks: designing a domain-specific verification tool for packet-processing software, while trying to identify a minimal set of restrictions that packet-processing software must satisfy in order to be verification-friendly. The main insight is that packet-processing software is a good fit for verification because it typically consists of distinct pieces of code that share limited mutable state and we can leverage this domain specificity to sidestep fundamental scalability challenges in software verification. We demonstrate that it is feasible to automatically prove useful properties of software dataplanes to ensure a smooth network operation. Based on our results, we conclude that we can design software network equipment that combines both flexibility and predictability

    Harnessing Checker Hierarchy for Reliable Microprocessors

    Get PDF
    Traditional fault-tolerant multi-threading architectures provide good fault tolerance by re-executing all the computations. However, such a full re-execution significantly increases the demand on the processor resources, resulting in severe performance degradation. To address this problem, this dissertation presents Active Verification Management (AVM) approaches that utilize a checker hierarchy to increase its performance with a minimal effect on the overall reliability. Based on a simplified queueing model, AVM employs a filter checker which prioritizes the verification candidates to selectively do verification. This dissertation proposes three filter checkers - based on (1) result usage, (2) result bitwidth, and (3) result anomaly - that exploit correctness-criticality metrics and anomaly speculation. Binary Correctness Criticality (BCC) and Likelihood of Correctness Criticality (LoCC) are metrics that quantify whether an instruction is important for reliability or how likely an instruction is correctness-critical, respectively. Based on the BCC, a result-usage-based filter checker mitigates the verification workload by bypassing instructions that are unnecessary for correct execution. Computing the LoCC is accomplished by exploiting information redundancy of compressing computationally useful data bits. Numerical significance hints let the result-bitwidth-based filter checker guide a verification priority effectively before the re-execution process starts. A result-anomaly-based filter checker exploits a value similarity property, which is defined by a frequent occurrence of partially identical values. Based on the biased distribution of similarity distance measure, this dissertation further investigates another application to exploit similar values for soft error tolerance with anomaly speculation. Extensive measurements show that the majority of instructions produce values that are different from the previous result value only in a few bits. Experimental results show that the proposed schemes accelerate the processor to be 180% faster than traditional fully-fault-tolerant processor, with a minimal impact on the overall soft error rate. With no AVM, congestion at the checker badly affects performance, to the tune of 57%, when compared to that of a non-fault-tolerant processor. These results explain that the proposed AVM has the potential to solve the verification congestion problem when perfect fault coverage is not needed

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    Get PDF
    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications

    A Modern Primer on Processing in Memory

    Full text link
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398
    • …
    corecore