68 research outputs found
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi-core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse-grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.
This thesis first combines the skeleton-based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.
Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton-based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton-driven
performance optimizations that exploits the application pattern and system information.
This thesis thus also presents a set of pattern-oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these pattern-oriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2%, on average, of a static oracle for a 16-core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32-core NUMA (Non-Uniform
Memory Access) platform.
Finally, this thesis also investigates skeleton-driven system-oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system-oriented optimizations can achieve performance improvements of up to 88%, with an average of 46%,
over a baseline version for a 16-core UMA platform and up to 162%, with an average
of 91%, for a 32-core NUMA platform
Transactions Chasing Scalability and Instruction Locality on Multicores
For several decades, online transaction processing (OLTP) has been one of the main server applications that drives innovations in the data management ecosystem, and in turn the database and computer architecture communities. Recent hardware trends oblige software to overcome two major challenges against systems scalability on modern multicore processors: (1) exploiting the abundant thread-level parallelism across cores and (2) taking advantage of the implicit parallelism within a core. The traditional design of the OLTP systems, however, faces inherent scalability problems due to its tightly coupled components. In addition, OLTP cannot exploit the full capability of the micro-architectural resources of modern processors because of the conventional scheduling decisions that ignore the cache locality for transactions. As a result, today’s commonly used server hardware remains largely underutilized leading to a huge waste of hardware resources and energy. .... In this thesis, we first identify the unbounded critical sections of traditional OLTP systems as the main enemy of thread-level parallelism. We design an alternative shared-everything system based on physiological partitioning (PLP) to eliminate the unbounded critical sections while providing an infrastructure for low-cost dynamic repartitioning and without introducing high-cost distributed transactions. Then, we demonstrate that L1 instruction cache stalls are the dominant factor leading to underutilization in the commodity servers. However, we also observe that independently of their high-level functionality, transactions running in parallel on a multicore system share significant amount of common instructions. By adaptively spreading the execution of a transaction over multiple cores through thread migration or multiplexing transactions on one core, we enable both an ample L1 instruction cache capacity for a transaction and reuse of common instructions across concurrent transactions. .... As the hardware demands more from the software to exploit the complexity and parallelism it offers in the multicore era, this work would change the way we traditionally schedule transactions. Instead of viewing a transaction as a single big task, we split it into smaller parts that can exploit data and instruction locality through careful dynamic scheduling decisions. The methods this thesis presents are not only specific to OLTP systems, but they can also benefit other types of applications that have concurrent requests executing a series of actions from a predefined set and face similar scalability problems on emerging hardware
Cache Attacks and Defenses
In the digital age, as our daily lives depend heavily on interconnected computing devices, information security has become a crucial concern. The continuous exchange of data between devices over the Internet exposes our information vulnerable to potential security breaches. Yet, even with measures in place to protect devices, computing equipment inadvertently leaks information through side-channels, which emerge as byproducts of computational activities. One particular source of such side channels is the cache, a vital component of modern processors that enhances computational speed by storing frequently accessed data from random access memory (RAM). Due to their limited capacity, caches often need to be shared among concurrently running applications, resulting in vulnerabilities. Cache side-channel attacks, which exploit such vulnerabilities, have received significant attention due to their ability to stealthily compromise information confidentiality and the challenge in detecting and countering them. Consequently, numerous defense strategies have been proposed to mitigate these attacks. This thesis explores these defense strategies against cache side-channels, assesses their effectiveness, and identifies any potential vulnerabilities that could be used to undermine the effectiveness of these defense strategies. The first contribution of this thesis is a software framework to assess the security of secure cache designs. We show that while most secure caches are protected from eviction-set-based attacks, they are vulnerable to occupancybased attacks, which works just as well as eviction-set-based attacks, and therefore should be taken into account when designing and evaluating secure caches. Our second contribution presents a method that utilizes speculative execution to enable high-resolution attacks on low-resolution timers, a common cache attack countermeasure adopted by web browsers. We demonstrate that our technique not only allows for high-resolution attacks to be performed on low-resolution timers, but is also Turing-complete and is capable of performing robust calculations on cache states. Through this research, we uncover a new attack vector on low-resolution timers. By exposing this vulnerability, we hope to prompt the necessary measures to address the issue and enhance the security of systems in the future. Our third contribution is a survey, paired with experimental assessment of cache side-channel attack detection techniques using hardware performance counters. We show that, despite numerous claims regarding their efficacy, most detection techniques fail to perform proper evaluation of their performance, leaving them vulnerable to more advanced attacks. We identify and outline these shortcomings, and furnish experimental evidence to corroborate our findings. Furthermore, we demonstrate a new attack that is capable of compromising these detection methods. Our aim is to bring attention to these shortcomings and provide insights that can aid in the development of more robust cache side-channel attack detection techniques. This thesis contributes to a deeper comprehension of cache side-channel attacks and their potential effects on information security. Furthermore, it offers valuable insights into the efficacy of existing mitigation approaches and detection methods, while identifying areas for future research and development to better safeguard our computing devices and data from these insidious attacks.Thesis (MPhil) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
Cooperative Data and Computation Partitioning for Decentralized Architectures.
Scalability of future wide-issue processor designs is severely hampered by the
use of centralized resources such as register files, memories and interconnect
networks. While the use of centralized resources eases both hardware design and
compiler code generation efforts, they can become performance bottlenecks as
access latencies increase with larger designs. The natural solution to this
problem is to adapt the architecture to use smaller, decentralized resources.
Decentralized architectures use smaller, faster components and exploit
distributed instruction-level parallelism across the resources. A multicluster
architecture is an example of such a decentralized processor, where subsets of
smaller register files, functional units, and memories are grouped together in a
tightly coupled unit, forming a cluster. These clusters can then be replicated
and connected together to form a scalable, high-performance architecture.
The main difficulty with decentralized architectures resides in compiler code
generation. In a centralized Very Long Instruction Word (VLIW) processor, the
compiler must statically schedule each operation to both a functional unit and a
time slot for execution. In contrast, for a decentralized multicluster VLIW,
the compiler must consider the additional effects of cluster assignment,
recognizing that communication between clusters will result in a delay penalty.
In addition, if the multicluster processor also has partitioned data memories,
the compiler has the additional task of assigning data objects to their
respective memories. Each decision, of cluster, functional unit, memory, and
time slot, are highly interrelated and can have dramatic effects on the best
choice for every other decision.
This dissertation addresses the issues of extracting and exploiting inherent
parallelism across decentralized resources through compiler analysis and code
generation techniques. First, a static analysis technique to partition data
objects is presented, which maps data objects to decentralized scratchpad
memories. Second, an alternative profile-guided technique for memory
partitioning is presented which can effectively map data access operations onto
distributed caches. Finally, a detailed, resource-aware partitioning algorithm
is presented which can effectively split computation operations of an
application across a set of processing elements. These partitioners work in
tandem to create a high-performance partition assignment of both memory and
computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd
Techniques to improve concurrency in hardware transactional memory
Transactional Memory (TM) aims to make shared memory parallel programming easier by abstracting away the complexity of managing shared data. The programmer defines sections of code, called transactions, which the TM system guarantees that will execute atomically and in isolation from the rest of the system. The programmer is not required to implement such behaviour, as happens in traditional mutual exclusion techniques like locks - that responsibility is delegated to the underlying TM system. In addition, transactions can exploit parallelism that would not be available in mutual exclusion techniques; this is achieved by allowing optimistic execution assuming no other transaction operates concurrently on the same data. If that assumption is true the transaction commits its updates to shared memory by the end of its execution, otherwise, a conflict occurs and the TM system may abort one of the conflicting transactions to guarantee correctness; the aborted transaction would roll-back its local updates and be re-executed. Hardware and software implementations of TM have been studied in detail. However, large-scale adoption of software-only approaches have been hindered for long due to severe performance limitations.
In this thesis, we focus on identifying and solving hardware transactional memory (HTM) issues in order to improve concurrency and scalability. Two key dimensions determine the HTM design space: conflict detection and speculative version management. The first determines how conflicts are detected between concurrent transactions and how to resolve them. The latter defines where transactional updates are stored and how the system deals with two versions of the same logical data. This thesis proposes a flexible mechanism that allows efficient storage and access to two versions of the same logical data, improving overall system performance and energy efficiency.
Additionally, in this thesis we explore two solutions to reduce system contention - circumstances where transactions abort due to data dependencies - in order to improve concurrency of HTM systems. The first mechanism provides a suitable design to apply prefetching to speed-up transaction executions, lowering the window of time in which such transactions can experience contention. The second is an accurate abort prediction mechanism able to identify, before a transaction's execution, potential conflicts with running transactions. This mechanism uses past behaviour of transactions and locality in memory references to infer predictions, adapting to variations in workload characteristics. We demonstrate that this mechanism is able to manage contention efficiently in single-application and multi-application scenarios.
Finally, this thesis also analyses initial real-world HTM protocols that recently appeared in market products. These protocols have been designed to be simple and easy to incorporate in existing chip-multiprocessors. However, this simplicity comes at the cost of severe performance degradation due to transient and persistent livelock conditions, potentially preventing forward progress. We show that existing techniques are unable to mitigate this degradation effectively. To deal with this issue we propose a set of techniques that retain the simplicity of the protocol while providing improved performance and forward progress guarantees in a wide variety of transactional workloads
An input centric paradigm for program dynamic optimizations and lifetime evolvement
Accurately predicting program behaviors (e.g., memory locality, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited the use of program inputs, a deciding factor of program behaviors, to help in program dynamic optimizations. Triggered by the strong and predictive correlations between program inputs and program behaviors that recent studies have uncovered, the dissertation work aims to bring program inputs into the focus of program behavior analysis and program dynamic optimization, cultivating a new paradigm named input-centric program behavior analysis and dynamic optimization.;The new optimization paradigm consists of three components, forming a three-layer pyramid. at the base is program input characterization, a component for resolving the complexity in program raw inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program\u27s execution as soon as the execution starts. The top layer is input-centric adaptation, which capitalizes on the novel opportunities created by the first two components to facilitate proactive adaptation for program optimizations.;This dissertation aims to develop this paradigm in two stages. In the first stage, we concentrate on exploring the implications of program inputs for program behaviors and dynamic optimization. We construct the basic input-centric optimization framework based on of line training to realize the basic functionalities of the three major components of the paradigm. For the second stage, we focus on making the paradigm practical by addressing multi-facet issues in handling input complexities, transparent training data collection, predictive model evolvement across production runs. The techniques proposed in this stage together cultivate a lifelong continuous optimization scheme with cross-input adaptivity.;Fundamentally the new optimization paradigm provides a brand new solution for program dynamic optimization. The techniques proposed in the dissertation together resolve the adaptivity-proactivity dilemma that has been limiting the effectiveness of existing optimization techniques. its benefits are demonstrated through proactive dynamic optimizations in Jikes RVM and version selection using IBM XL C Compiler, yielding significant performance improvement on a set of Java and C/C++ programs. It may open new opportunities for a broad range of runtime optimizations and adaptations. The evaluation results on both Java and C/C++ applications demonstrate the new paradigm is promising in advancing the current state of program optimizations
- …