Complementing user-level coarse-grain parallelism with implicit speculative parallelism
Multi-core and many-core systems are the norm in contemporary processor technology
and are expected to remain so for the foreseeable future. Parallel programming
is, thus, here to stay, and programmers have to embrace it if they are to exploit such
systems for their applications. Programs using parallel programming primitives like
PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good
trade-off between programming effort and performance gain. Some parallel applications,
however, show limited or no scaling beyond a certain number of cores. Given the
abundance of cores expected in future many-cores, many cores would then remain idle
while execution performance stagnates. This thesis proposes using cores that do
not contribute to performance improvement for running implicit fine-grain speculative
threads. In particular, we present a many-core architecture and protocols that allow
applications with coarse-grain explicit parallelism to further exploit implicit speculative
parallelism within each thread. We show that complementing parallel programs
with implicit speculative mechanisms offers significant performance improvements for
a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees
the programmer from the additional effort to explicitly partition the work into finer
and properly synchronized tasks. Our results show that, for a many-core comprising
128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance
improves beyond the highest scalability point by 44% on average for the
4-core cluster and by 31% on average for the 2-core cluster. We also show that this
approach often leads to better performance and energy efficiency compared to existing
alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic
mechanism to choose the number of explicit and implicit threads, which performs
within 6% of the static oracle selection of threads.
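The scaling stagnation the abstract describes can be illustrated with a simple Amdahl's-law sketch. The formula, the `min_gain` threshold, and all names below are assumptions for exposition only; the thesis's own numbers come from simulating its architecture, not from this model.

```python
# Illustrative Amdahl's-law sketch of scaling stagnation (the formula and
# thresholds are assumptions for exposition, not the thesis's methodology).

def amdahl_speedup(n_threads, serial_fraction):
    """Speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

def best_explicit_threads(total_cores, serial_fraction, min_gain=1.01):
    """Stop adding explicit threads once the marginal speedup per extra
    thread drops below min_gain; that is the 'highest scalability point'."""
    n = 1
    while n < total_cores:
        gain = amdahl_speedup(n + 1, serial_fraction) / amdahl_speedup(n, serial_fraction)
        if gain < min_gain:
            break
        n += 1
    return n

cores = 128
n = best_explicit_threads(cores, serial_fraction=0.05)
# The cores - n leftover cores are where implicit speculative threads could
# run, e.g. grouped into clusters of 2 or 4 around each explicit thread.
print(n, cores - n)
```

Under this toy model, a modest serial fraction already leaves the majority of a 128-core machine idle at the best explicit thread count, which is the headroom the implicit speculative threads target.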
To improve energy efficiency, processors support Dynamic Voltage and Frequency
Scaling (DVFS), which enables changing their performance and power consumption
on the fly. We evaluate the amenability of the proposed explicit-plus-implicit-threads
scheme to traditional power management techniques for multithreaded applications
and identify room for improvement. We thus augment prior schemes and introduce
a novel multithreaded power management scheme that accounts for implicit threads
and aims to minimize the Energy-Delay² product (ED²). Our scheme comprises two
components: a “local” component that tries to adapt to the different program phases
on a per explicit thread basis, taking into account implicit thread behavior, and a
“global” component that augments the local components with information regarding
inter-thread synchronization. Experimental results show an 8% reduction in ED²
compared to having no power management, with an average power reduction of
15% at a minimal performance loss of less than 3% on average.
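To make the ED² objective concrete, the sketch below picks a DVFS operating point by minimizing energy × delay² under a simple analytical model. The model and all constants are assumptions for exposition (the thesis adapts to measured per-phase behavior instead). Note that with purely dynamic power and voltage scaling with frequency, ED² is frequency-invariant, so the model includes static power and a frequency-independent memory stall time, which is where the real trade-off comes from.

```python
# Illustrative ED²-driven frequency selection (model and constants are
# assumptions for exposition, not the thesis's measurement infrastructure).

def ed2(freq, work=10.0, mem_time=2.0, k_dyn=1.0, p_static=1.0):
    """ED² = energy * delay² for one program phase at a given frequency."""
    delay = work / freq + mem_time        # compute time scales with f; stalls don't
    power = k_dyn * freq ** 3 + p_static  # dynamic ~ V^2 * f with V ~ f, plus leakage
    energy = power * delay
    return energy * delay ** 2

def best_frequency(freqs):
    """Pick the operating point minimizing ED² (a 'local' per-phase decision)."""
    return min(freqs, key=ed2)

levels = [0.8, 1.2, 1.6, 2.0, 2.4]  # hypothetical DVFS operating points (GHz)
print(best_frequency(levels))
```

Running at the highest frequency wastes energy during memory stalls, while the lowest frequency inflates delay cubically in the objective, so an intermediate point wins.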
A scalable architecture for ordered parallelism
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits.
We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51–122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3–18×.
National Science Foundation (U.S.) (Award CAREER-145299
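Swarm's execution model can be sketched in software as follows: programs are decomposed into short tasks tagged with timestamps, and a task may create child tasks with later timestamps. The sketch simply runs tasks in strict timestamp order from a priority queue; Swarm's contribution is executing them speculatively and out of order in hardware while committing in timestamp order. All names below are illustrative, not Swarm's API.

```python
import heapq

# Sequential sketch of timestamp-ordered task execution (Swarm executes
# these speculatively in hardware; this software analogue runs in order).

def run_timestamped_tasks(initial):
    """Execute (timestamp, task) pairs in timestamp order.
    Each task is called as task(spawn, ts) and may call spawn(child_ts, child)."""
    pq, seq = [], 0
    def spawn(ts, task):
        nonlocal seq
        heapq.heappush(pq, (ts, seq, task))  # seq breaks ties so tasks never compare
        seq += 1
    for ts, task in initial:
        spawn(ts, task)
    while pq:
        ts, _, task = heapq.heappop(pq)
        task(spawn, ts)

def shortest_paths(graph, source):
    """Dijkstra-style SSSP phrased as timestamped tasks: visiting node v at
    timestamp d records dist[v] = d the first time v is dequeued."""
    dist = {}
    def visit(node):
        def task(spawn, ts):
            if node not in dist:             # earliest-timestamp visit wins
                dist[node] = ts
                for nbr, weight in graph[node]:
                    spawn(ts + weight, visit(nbr))
        return task
    run_timestamped_tasks([(0, visit(source))])
    return dist

g = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
print(shortest_paths(g, "a"))
```

Shortest-path search is a canonical example of ordered irregular parallelism: many tasks are in flight at nearby timestamps, but correctness depends on honoring the timestamp order, which is exactly what Swarm's speculation preserves at scale.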
A pattern language for parallelizing irregular algorithms
Dissertation presented at the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Mestre em Engenharia Informática.
In irregular algorithms, the data sets' dependences and distributions cannot be statically predicted.
This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control in multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature.
This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms.
Patterns capture formal representations of a tangible solution to a problem that arises in a well-defined context within a specific domain.
We document these patterns in a pattern language, i.e., a set of interdependent patterns that compose well-documented template solutions that can be reused whenever a certain problem arises in a well-known context.
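The "operator plus worklist" formulation that underlies Galois-style irregular algorithms can be sketched as follows (the names and the example are illustrative, not Galois's actual API): an operator is applied to one active element at a time, and applying it may activate neighbors, so the available parallelism changes dynamically as data dependences evolve.

```python
from collections import deque

# Worklist sketch of an irregular algorithm (illustrative, not Galois's API):
# connected components by label propagation, where applying the operator to
# one node may re-activate its neighbors.

def label_propagation(graph):
    """Every node ends with the minimum node id reachable from it."""
    label = {v: v for v in graph}
    work = deque(graph)                  # initially every node is active
    while work:
        v = work.popleft()
        for u in graph[v]:
            if label[v] < label[u]:      # operator: push the smaller label
                label[u] = label[v]
                work.append(u)           # u becomes active again
    return label

g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(label_propagation(g))
```

The worklist is the dynamic element: which nodes are active, and hence which applications of the operator can run in parallel, is only discoverable at run time, which is why such algorithms resist static parallelization.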
Adaptive Transactional Memories: Performance and Energy Consumption Tradeoffs
Energy efficiency is becoming a pressing issue, especially in large data centers, where it entails, at the same time, a non-negligible management cost, an increase in hardware fault probability, and a significant environmental footprint. In this paper, we study how Software Transactional Memories (STM) can provide benefits for both power saving and overall application execution performance. This is related to the fact that encapsulating shared-data accesses within transactions gives the STM middleware the freedom both to ensure consistency and to reduce the actual data contention, the latter having been shown to affect the overall power needed to complete the application's execution.
We have selected a set of self-adaptive extensions to existing STM middlewares (namely, TinySTM and R-STM) to show how self-adaptive computation can better capture the actual degree of parallelism and/or logical contention on shared data, further enhancing the intrinsic benefits provided by STM. Of course, this benefit comes at a cost, namely the execution time required by the proposed approaches to precisely tune the execution parameters for reducing power consumption and enhancing execution performance. Nevertheless, the results provided here show that adaptivity is a strictly necessary requirement for reducing energy consumption in STM systems: without it, it is not possible to reach any acceptable level of energy efficiency at all.
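The optimistic read-validate-commit pattern at the heart of an STM can be sketched minimally as below. This is illustrative only: TinySTM and R-STM are real C libraries with far richer designs (time-based validation, multi-word read/write sets, contention managers), and the class and method names here are invented.

```python
import threading

# Minimal sketch of optimistic read-validate-commit (illustrative only;
# not the API of TinySTM or R-STM).

class VersionedCell:
    """A shared cell updated through atomic read-compute-commit transactions."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()

    def transact(self, fn):
        """Apply fn(old) -> new atomically; fn must be side-effect free,
        because it is re-executed whenever validation fails."""
        while True:
            snap_version, snap = self.version, self.value   # read phase
            new_value = fn(snap)                            # compute outside the lock
            with self._lock:                                # commit phase
                if self.version == snap_version:            # validate: no concurrent commit
                    self.value = new_value
                    self.version += 1
                    return new_value
            # Conflict detected: another transaction committed in between; retry.

cell = VersionedCell(0)
workers = [threading.Thread(target=lambda: [cell.transact(lambda v: v + 1)
                                            for _ in range(1000)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(cell.value)  # 4 threads x 1000 committed increments each
```

The retry loop is where both the performance and the power cost of contention show up: under high logical contention, transactions repeatedly recompute and abort, which is precisely the behavior the paper's self-adaptive extensions aim to detect and mitigate.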