138 research outputs found

    Energy-efficient and high-performance lock speculation hardware for embedded multicore systems

    Full text link
    Embedded systems are becoming increasingly common in everyday life and like their general-purpose counterparts, they have shifted towards shared memory multicore architectures. However, they are much more resource constrained, and as they often run on batteries, energy efficiency becomes critically important. In such systems, achieving high concurrency is a key demand for delivering satisfactory performance at low energy cost. In order to achieve this high concurrency, consistency across the shared memory hierarchy must be accomplished in a cost-effective manner in terms of performance, energy, and implementation complexity. In this article, we propose Embedded-Spec, a hardware solution for supporting transparent lock speculation, without the requirement for special supporting instructions. Using this approach, we evaluate the energy consumption and performance of a suite of benchmarks, exploring a range of contention management and retry policies. We conclude that for resource-constrained platforms, lock speculation can provide real benefits in terms of improved concurrency and energy efficiency, as long as the underlying hardware support is carefully configured.This work is supported in part by NSF under Grants CCF-0903384, CCF-0903295, CNS-1319495, and CNS-1319095 as well the Semiconductor Research Corporation under grant number 1983.001. (CCF-0903384 - NSF; CCF-0903295 - NSF; CNS-1319495 - NSF; CNS-1319095 - NSF; 1983.001 - Semiconductor Research Corporation

    Measurement, Modeling, and Characterization for Power-Aware Computing

    Get PDF
    Society’s increasing dependence on information technology has resulted in the deployment of vast compute resources. The energy costs of operating these resources coupled with environmental concerns have made power-aware computingone of the primary challenges for the IT sector. Making energy-efficient computing a rule rather than an exception requires that researchers and system designers use the right set of techniques and tools. These involve measuring,modeling, and characterizing the energy consumption of computers at varying degrees of granularity.In this thesis, we present techniques to measure power consumption of computer systems at various levels. We compare them for accuracy and sensitivityand discuss their effectiveness. We test Intel’s hardware power model for estimation accuracy and show that it is fairly accurate for estimating energy consumption when sampled at the temporal granularity of more than tens ofmilliseconds.We present a methodology to estimate per-core processor power consumption using performance counter and temperature-based power modeling and validate it across multiple platforms. We show our model exhibits negligible computationoverhead, and the median estimation errors ranges from 0.3% to 10.1% for applications from SPEC2006, SPEC-OMP and NAS benchmarks. We test the usefulness of the model in a meta-scheduler to enforce power constraint on a system.Finally, we perform a detailed performance and energy characterization of Intel’s Restricted Transactional Memory (RTM). We use TinySTM software transactional memory (STM) system to benchmark RTM’s performance against competing STM alternatives. We use microbenchmarks and STAMP benchmarksuite to compare RTM versus STM performance and energy behavior. We quantify the RTM hardware limitations that affect its success rate. We show that RTM performs better than TinySTM when working-set fits inside the cache and that RTM is better at handling high contention workloads

    Analysis, classification and comparison of scheduling techniques for software transactional memories

    Get PDF
    Transactional Memory (TM) is a practical programming paradigm for developing concurrent applications. Performance is a critical factor for TM implementations, and various studies demonstrated that specialised transaction/thread scheduling support is essential for implementing performance-effective TM systems. After one decade of research, this article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories. Based on peculiarities and differences of the adopted scheduling strategies, we propose a classification of the existing techniques, and we discuss the specific characteristics of each technique. Also, we analyse the results of previous evaluation and comparison studies, and we present the results of a new experimental study encompassing techniques based on different scheduling strategies. Finally, we identify potential strengths and weaknesses of the different techniques, as well as the issues that require to be further investigated

    Measurement, Modeling, and Characterization for Energy-Efficient Computing

    Get PDF
    The ever-increasing ecological footprint of Information Technology (IT) sector coupled with adverse effects of high power consumption on electronic circuits has increased the significance of energy-efficient computing in the last decade. Making energy-efficient computing a norm rather than an exception requires that system designers and programmers understand the energy implications of their design and implementation choices. This necessitates a detailed view of system’s energy expenditure and/or power consumption. We explore this aspect of energy-efficient computing in this thesis through power measurement, power modeling, and energy characterization.First, we present a quantitative comparison between power measurement data collected for computer systems using four techniques: a power meter at wall outlet, currenttransducers at ATX power rails, CPU voltage regulator’s current monitor, and Intel’s proprietary RAPL (Running Average Power Limit) interface. We compare them for accuracy, sensitivity and accessibility.Second, we present two different methodologies to model processor power consumption. The first model estimates power consumption at the granularity of individualcores using per-core performance events and temperature sensors. We validate the methodology on six different platforms and show that our model estimates power consumption with high accuracy across all platforms consistently. To understand the energy expenditure trends across different frequencies and different degrees of parallelism, we need to model power at a much finer granularity. The second power model addresses this issue by estimating static and dynamic power consumption for individual cores and the uncore. We validate this model on Intel’s Haswell platform for single-threaded and multi-threaded benchmarks. We use this power model to characterize energy efficiency of frequency scaling on Haswell microarchitecture and use the insights to implementa low overhead DVFS scheduler. We also characterize the energy efficiency of thread scaling using the power model and demonstrate how different communication parametersand microarchitectural traits affect application’s energy when it scales.Finally, we perform detailed performance and energy characterization of Intel’s RestrictedTransactional Memory (RTM).We use TinySTM software transactional memory(STM) system to benchmark RTM’s performance against competing STM alternatives.We use microbenchmarks and STAMP benchmark suite to compare RTM an STM performanceand energy behavior. We quantify the RTM hardware limitations and identifyconditions required for RTM to outperform STM

    Improving Performance of Transactional Applications through Adaptive Transactional Memory

    Get PDF
    With the rise of chip multiprocessors (CMPs), it is necessary to use parallel programming to exploit computational power of CMPs. Traditionally, lock-based mechanisms have been used to synchronize shared variables in parallel programs. However, with the complexity associated with locks, writing a correct parallel program is a huge burden for programmers. As an alternative, Transactional Memory (TM) is gaining momentum as a parallel programming model for multi--‐core processors. TM provides programmers with an atomic construct (transaction), which can be used to guarantee atomicity of accesses to shared variables, as the synchronization is handled through the underlying system. Transactional memory comes in two variants: Software transaction memory (STM) and Hardware transaction memory (HTM). Both STM and HTM systems have advantages and disadvantages that either enhance or penalize performance in transactional applications. In this thesis, the focus is on implementing an adaptive system that exploits both STM and HTM at transaction granularity. The goal is to achieve performance gain by incorporating the benefits of both TM systems. A synchronization technique is developed to seamlessly switch between HTM and STM based on the characteristics of a transaction. We exploit decision tree to predict the optimum system for each transaction in a given application. The decision tree is a form of supervised machine learning to classify transactions based on parameters such as transaction size, transaction write ratio, etc. From the evaluations using STAMP, NAS, and DiscoPoP benchmark suites, the proposed adaptive system is able to improve speed of transactional applications by 20.82% on average

    Concurrent Copying Garbage Collection with Hardware Transactional Memory

    Get PDF
    Many applications, such as video-based or transaction-based ones, are latency-critical. Any additional latency may greatly degrade the user experience, inflicting significant financial loss on the vendor. Recently, an increasing number of these applications are written in managed languages, such as C#, Java, JavaScript, and PHP, for productivity and reliability. Garbage collection (GC) provides automatic memory management to managed languages. However, GC can also induce pauses in the application, greatly affecting the user experience. This thesis explores the challenges of minimizing GC pauses. Concurrent GC reduces pauses by working concurrently with the application (the mutator). Copying GC improves the mutator locality and reduces the heap fragmentation. Concurrent copying GC achieves both, but requires heavyweight synchronization to ensure that the concurrently executing mutator has a consistent view of the heap while the collector changes it. Existing implementations of concurrent copying GC use read barriers or page protections to prevent the mutator from using stale references. Unfortunately, these synchronization mechanisms introduce high overhead to the mutator. My thesis is that, by using hardware transactional memory (HTM), mutators can execute transactionally during concurrent copying, achieving a consistent view of the heap, but with lower overhead than read barriers or page protection. The contributions of this thesis are twofold. (1) I implement and evaluate a novel algorithm of using HTM to reduce the mutator overhead of concurrent copying GC. (2) I conduct a detailed analysis of HTM capacity, filling a significant gap in the literature, and informing the design of our HTM-based algorithm. I then use the insights on HTM capacity to implement several optimizations to improve the algorithm. Using the Intel Transactional Synchronization Extension (TSX) as a case study, I measure the transaction capacity on this popular HTM implementation, and cross-validate the results with the literature and fill a gap in the literature, resolving ostensibly contradictory results. I have also explored different factors that may affect the effective capacity of transactions, which have not yet been reported in the literature (to the best of my knowledge). I implement the algorithm in MMTk, a framework for the design and implementation of GC. The implementation is evaluated on Intel TSX using several test programs. The results suggest that performing concurrent copying GC using HTM is viable. This work deepens the understanding of HTM, its strengths and weaknesses, in the research community. Strategies using this work to fully exploit the capabilities of HTM can be generalized and applied to other applications of HTM. Finally, this work enables the design and implementation of concurrent copying GC with lower mutator overhead with similar hardware support

    Hardware-Assisted Dependable Systems

    Get PDF
    Unpredictable hardware faults and software bugs lead to application crashes, incorrect computations, unavailability of internet services, data losses, malfunctioning components, and consequently financial losses or even death of people. In particular, faults in microprocessors (CPUs) and memory corruption bugs are among the major unresolved issues of today. CPU faults may result in benign crashes and, more problematically, in silent data corruptions that can lead to catastrophic consequences, silently propagating from component to component and finally shutting down the whole system. Similarly, memory corruption bugs (memory-safety vulnerabilities) may result in a benign application crash but may also be exploited by a malicious hacker to gain control over the system or leak confidential data. Both these classes of errors are notoriously hard to detect and tolerate. Usual mitigation strategy is to apply ad-hoc local patches: checksums to protect specific computations against hardware faults and bug fixes to protect programs against known vulnerabilities. This strategy is unsatisfactory since it is prone to errors, requires significant manual effort, and protects only against anticipated faults. On the other extreme, Byzantine Fault Tolerance solutions defend against all kinds of hardware and software errors, but are inadequately expensive in terms of resources and performance overhead. In this thesis, we examine and propose five techniques to protect against hardware CPU faults and software memory-corruption bugs. All these techniques are hardware-assisted: they use recent advancements in CPU designs and modern CPU extensions. Three of these techniques target hardware CPU faults and rely on specific CPU features: ∆-encoding efficiently utilizes instruction-level parallelism of modern CPUs, Elzar re-purposes Intel AVX extensions, and HAFT builds on Intel TSX instructions. The rest two target software bugs: SGXBounds detects vulnerabilities inside Intel SGX enclaves, and “MPX Explained” analyzes the recent Intel MPX extension to protect against buffer overflow bugs. Our techniques achieve three goals: transparency, practicality, and efficiency. All our systems are implemented as compiler passes which transparently harden unmodified applications against hardware faults and software bugs. They are practical since they rely on commodity CPUs and require no specialized hardware or operating system support. Finally, they are efficient because they use hardware assistance in the form of CPU extensions to lower performance overhead

    Advances in Engineering Software for Multicore Systems

    Get PDF
    The vast amounts of data to be processed by today’s applications demand higher computational power. To meet application requirements and achieve reasonable application performance, it becomes increasingly profitable, or even necessary, to exploit any available hardware parallelism. For both new and legacy applications, successful parallelization is often subject to high cost and price. This chapter proposes a set of methods that employ an optimistic semi-automatic approach, which enables programmers to exploit parallelism on modern hardware architectures. It provides a set of methods, including an LLVM-based tool, to help programmers identify the most promising parallelization targets and understand the key types of parallelism. The approach reduces the manual effort needed for parallelization. A contribution of this work is an efficient profiling method to determine the control and data dependences for performing parallelism discovery or other types of code analysis. Another contribution is a method for detecting code sections where parallel design patterns might be applicable and suggesting relevant code transformations. Our approach efficiently reports detailed runtime data dependences. It accurately identifies opportunities for parallelism and the appropriate type of parallelism to use as task-based or loop-based

    Jitsu: Just-in-time summoning of unikernel

    Get PDF
    Network latency is a problem for all cloud services. It can be mitigated by moving computation out of remote datacenters by rapidly instantiating local services near the user. This requires an embedded cloud platform on which to deploy multiple applications securely and quickly. We present Jitsu, a new Xen toolstack that satisfies the demands of secure multi-tenant isolation on resource-constrained embedded ARM devices. It does this by using unikernels: lightweight, compact, single address space, memory-safe virtual machines (VMs) written in a high-level language. Using fast shared memory channels, Jitsu provides a directory service that launches unikernels in response to network traffic and masks boot latency. Our evaluation shows Jitsu to be a power-efficient and responsive platform for hosting cloud services in the edge network while preserving the strong isolation guarantees of a type-1 hypervisor.The research leading to these results received funding from the European Union’s Seventh Framework Programme FP7/2007–2013 under the Trilogy 2 project (grant agreement no. 317756), and the User Centric Networking project, (grant agreement no. 611001), and the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contract FA8750-11-C-0249.This is the author accepted manuscript. The final version is available from USENIX via https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/madhavapedd

    Stretching the capacity of Hardware Transactional Memory in IBM POWER architectures

    Full text link
    The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable to many real-world workloads of in-memory databases. This paper proposes SI-HTM, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications. SI-HTM leverages the HTM implementation of the IBM POWER architecture with a software layer to offer a single-version implementation of Snapshot Isolation. When compared to HTM- and software-based concurrency control alternatives, SI-HTM exhibits improved scalability, achieving speedups of up to 300% relatively to HTM on in-memory database benchmarks
    corecore