Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers
We analyze the caching overhead incurred by a class of multithreaded
algorithms when scheduled by an arbitrary scheduler. We obtain bounds that
match or improve upon the well-known O(Q + S·M/B) caching cost for the
randomized work stealing (RWS) scheduler, where S is the number of steals,
Q is the sequential caching cost, and M and B are the cache size and
block (or cache line) size, respectively.
Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and
Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates,
including a missing citation and the replacement of some big-Oh terms with
precise constants.
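The RWS bound above can be illustrated numerically. The sketch below is only an evaluation of the bound's dominant terms with an illustrative constant of 1, not the paper's actual analysis:

```python
def rws_cache_miss_bound(Q, S, M, B):
    """Upper bound on cache misses under randomized work stealing:
    O(Q + S * M/B), where Q is the sequential caching cost, S the
    number of steals, M the cache size and B the block (cache line)
    size. The constant factor of 1 is illustrative only."""
    return Q + S * (M // B)

# Example: 10^6 sequential misses, 100 steals, 1 MiB cache, 64 B lines
print(rws_cache_miss_bound(10**6, 100, 2**20, 64))  # → 2638400
```

The S·M/B term reflects that each steal may force the thief to refetch up to a cache's worth of blocks.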
Scheduling threads for constructive cache sharing on CMPs
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors. Copyright 2007 ACM
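The grain-size selection above rests on measuring working-set sizes. A toy illustration of working-set profiling (all names and the 64-byte block size are assumptions, not the paper's algorithm) counts distinct cache blocks touched per window of an address trace:

```python
def working_set_size(trace, window):
    """Distinct 64-byte cache blocks touched in each consecutive
    window of an address trace: a toy working-set profile. Larger
    values mean the window's threads need more cache capacity."""
    blocks = [addr // 64 for addr in trace]
    return [len(set(blocks[i:i + window]))
            for i in range(0, len(blocks), window)]

trace = [0, 64, 128, 0, 64, 4096, 4160, 4096, 64, 128]
print(working_set_size(trace, 5))  # → [3, 4]
```

A grain-size selector could shrink task granularity until the per-window working set fits the shared cache.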
Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping
Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded (SMT) architectures, because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set-adaptive cache that relies on temporal locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles, and low-locality phases with useless block migration. To counteract both issues, we show that switching to serial access reduces energy without harming performance, and we propose a machine-learning Adaptive Drop Rate (ADR) controller that minimizes replacement and migration when locality is low.
This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA energy consumption by 22.7% for 1SMT; with interthread cache contention in 2SMT, the savings rise to 29%. Compared with a conventional organization, energy-delay improves by 20.8% and 25% for 1- and 2SMT benchmarks, and in 65% of the 2SMT mixes the gains exceed 20%.
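The serial-versus-parallel access trade-off above can be sketched with a first-order energy model. The numbers and function below are purely illustrative assumptions, not the paper's measurements:

```python
def access_energy(n_ways, e_tag, e_data, serial):
    """Toy per-access energy of an n-way set-associative cache lookup.
    Parallel access reads all tag and all data ways at once; serial
    access reads the tags first, then only the single matching data
    way, trading one extra cycle for energy."""
    if serial:
        return n_ways * e_tag + 1 * e_data
    return n_ways * e_tag + n_ways * e_data

# Illustrative units: tag read = 1.0, data-way read = 5.0
print(access_energy(4, 1.0, 5.0, serial=False))  # → 24.0
print(access_energy(4, 1.0, 5.0, serial=True))   # → 9.0
```

Under these assumed costs, serializing the lookup cuts per-access energy by more than half, at the price of added latency that the paper reports does not harm overall performance.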
Case for holistic query evaluation
In this thesis we present the holistic query evaluation model. We propose a novel
query engine design that exploits the characteristics of modern processors when queries
execute inside main memory. The holistic model (a) is based on template-based code
generation for each executed query, (b) uses multithreading to adapt to multicore processor
architectures and (c) addresses the optimization problem of scheduling multiple
threads for intra-query parallelism.
Main-memory query execution is a common operation in modern database servers
equipped with tens or hundreds of gigabytes of RAM. In such an execution environment,
the query engine needs to adapt to the CPU characteristics to boost performance.
For this purpose, holistic query evaluation applies customized code generation
to database query evaluation. The idea is to use a collection of highly efficient code
templates and dynamically instantiate them to create query- and hardware-specific
source code. The source code is compiled and dynamically linked to the database
server for processing. Code generation diminishes the bloat of higher-level programming
abstractions necessary for implementing generic, interpreted, SQL query engines.
At the same time, the generated code is customized for the hardware it will run on. The
holistic model supports the most frequently used query processing algorithms, namely
sorting, partitioning, join evaluation, and aggregation, thus allowing the efficient evaluation
of complex DSS or OLAP queries.
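The template-instantiation idea above can be sketched in miniature. The following is a hypothetical toy, not the thesis's engine: a filter template is instantiated with a query-specific predicate and compiled at runtime, so the hot loop contains no interpretation overhead:

```python
# Toy template-based code generation for a selection operator.
# The predicate is inlined into the generated source, then compiled
# with exec() -- a stand-in for the thesis's compile-and-link step.
TEMPLATE = """
def query(rows):
    out = []
    for row in rows:
        if row[{col}] {op} {const}:   # predicate inlined, no dispatch
            out.append(row)
    return out
"""

def instantiate(col, op, const):
    """Fill the template with query-specific values and compile it."""
    namespace = {}
    exec(TEMPLATE.format(col=col, op=op, const=const), namespace)
    return namespace["query"]

q = instantiate(col=1, op=">", const=10)
print(q([(1, 5), (2, 15), (3, 30)]))  # → [(2, 15), (3, 30)]
```

A generic interpreted engine would instead walk an expression tree per row; generating source lets the compiler specialize the loop for both the query and the hardware.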
Modern CPUs follow multicore designs with multiple threads running in parallel.
The dataflow of query engine algorithms needs to be adapted to exploit such designs.
We identify memory accesses and thread synchronization as the main bottlenecks in
a multicore execution environment. We extend the holistic query evaluation model
and propose techniques to mitigate the impact of these bottlenecks on multithreaded
query evaluation. We analytically model the expected performance and scalability of
the proposed algorithms according to the hardware specifications. The analytical performance
expressions can be used by the optimizer to statically estimate the speedup
of multithreaded query execution.
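An analytical speedup estimate of the kind described above might, in its simplest form, look like the Amdahl-style sketch below. This is an illustrative stand-in for the thesis's model; the non-scaling fraction is an assumed parameter covering memory and synchronization bottlenecks:

```python
def multithreaded_speedup(n_threads, serial_frac):
    """Amdahl-style speedup estimate: serial_frac models the
    memory-bound and synchronization work that does not scale
    with thread count (an assumed, profiled parameter)."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n_threads)

# With 10% non-scaling work, 8 threads yield well under 8x speedup.
print(round(multithreaded_speedup(8, 0.1), 2))  # → 4.71
```

An optimizer could use such an expression to decide statically whether spawning more threads for an operator is worthwhile.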
Finally, we examine the problem of thread scheduling in the context of multithreaded
query evaluation on multicore CPUs. The search space for possible operator
execution schedules grows rapidly, precluding exhaustive search. We
model intra-query parallelism on multicore systems and present scheduling heuristics
that result in different degrees of schedule quality and optimization cost. We identify
cases where each of our proposed algorithms, or combinations of them, are expected
to generate schedules of high quality at an acceptable running cost.
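One classic family of heuristics for such scheduling problems is greedy list scheduling. The sketch below (longest-processing-time-first, a textbook heuristic, not necessarily one of the thesis's algorithms) assigns operator threads to cores to balance load:

```python
import heapq

def greedy_schedule(op_costs, n_cores):
    """LPT heuristic: place each operator, largest estimated cost
    first, on the currently least-loaded core. Returns a list of
    (load, core_id, assigned_costs) tuples sorted by load."""
    cores = [(0, i, []) for i in range(n_cores)]
    heapq.heapify(cores)
    for cost in sorted(op_costs, reverse=True):
        load, core, ops = heapq.heappop(cores)
        ops.append(cost)
        heapq.heappush(cores, (load + cost, core, ops))
    return sorted(cores)

sched = greedy_schedule([7, 5, 4, 3, 2, 2], 2)
print([(core, load) for load, core, _ in sched])  # → [(1, 11), (0, 12)]
```

LPT runs in O(n log n) and is within 4/3 of the optimal makespan, which is the kind of quality/cost trade-off such heuristics offer against exhaustive search.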
Data Oblivious Algorithms for Multicores
As secure processors such as Intel SGX (with hyperthreading) become widely adopted, there is a growing appetite for private analytics on big data. Most prior works on data-oblivious algorithms adopt the classical PRAM model to capture parallelism. However, it is widely understood that PRAM does not best capture realistic multicore processors, nor does it reflect
parallel programming models adopted in practice.
In this paper, we initiate the study of parallel data oblivious algorithms on realistic multicores, best captured by the binary fork-join model of computation. We first show that data-oblivious sorting can be accomplished by a binary fork-join algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(log n log log n) span (i.e., parallel time) that matches the best-known insecure algorithm. Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency. We also present results for several applications including list ranking, Euler tour, tree contraction, connected components, and minimum spanning forest. For a subset of these applications, our data-oblivious algorithms asymptotically outperform the best known insecure algorithms. For other applications, we show data oblivious algorithms whose performance bounds match the best known insecure algorithms.
Complementing these asymptotically efficient results, we present a practical variant of our sorting algorithm that is self-contained and potentially implementable. It has optimal caching cost, and it is only a log log n factor off from optimal work and about a log n factor off in terms of span; moreover, it achieves small constant factors in its bounds.
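The defining property of data-oblivious sorting is that the sequence of memory accesses depends only on the input length, never on the values. A minimal sequential illustration of that property (odd-even transposition sort, an O(n^2) toy, not the paper's algorithm) looks like:

```python
def oblivious_sort(a):
    """Odd-even transposition sort: the schedule of compare-exchange
    positions depends only on len(a), and both elements are always
    rewritten, so the access pattern reveals nothing about the data."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            lo, hi = min(a[i], a[i + 1]), max(a[i], a[i + 1])
            a[i], a[i + 1] = lo, hi   # unconditional write (oblivious)
    return a

print(oblivious_sort([5, 2, 9, 1, 7]))  # → [1, 2, 5, 7, 9]
```

Practical oblivious sorts such as the paper's instead use sorting-network-style structures with polylogarithmic depth, but the fixed, input-independent compare-exchange schedule is the same core idea.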