7,966 research outputs found
Pervasive Parallel And Distributed Computing In A Liberal Arts College Curriculum
We present a model for incorporating parallel and distributed computing (PDC) throughout an undergraduate CS curriculum. Our curriculum is designed to introduce students early to parallel and distributed computing topics and to expose students to these topics repeatedly in the context of a wide variety of CS courses. The key to our approach is the development of a required intermediate-level course that serves as a introduction to computer systems and parallel computing. It serves as a requirement for every CS major and minor and is a prerequisite to upper-level courses that expand on parallel and distributed computing topics in different contexts. With the addition of this new course, we are able to easily make room in upper-level courses to add and expand parallel and distributed computing topics. The goal of our curricular design is to ensure that every graduating CS major has exposure to parallel and distributed computing, with both a breadth and depth of coverage. Our curriculum is particularly designed for the constraints of a small liberal arts college, however, much of its ideas and its design are applicable to any undergraduate CS curriculum
RPPM : Rapid Performance Prediction of Multithreaded workloads on multicore processors
Analytical performance modeling is a useful complement to detailed cycle-level simulation to quickly explore the design space in an early design stage. Mechanistic analytical modeling is particularly interesting as it provides deep insight and does not require expensive offline profiling as empirical modeling. Previous work in mechanistic analytical modeling, unfortunately, is limited to single-threaded applications running on single-core processors.
This work proposes RPPM, a mechanistic analytical performance model for multi-threaded applications on multicore hardware. RPPM collects microarchitecture-independent characteristics of a multi-threaded workload to predict performance on a previously unseen multicore architecture. The profile needs to be collected only once to predict a range of processor architectures. We evaluate RPPM's accuracy against simulation and report a performance prediction error of 11.2% on average (23% max). We demonstrate RPPM's usefulness for conducting design space exploration experiments as well as for analyzing parallel application performance
Boosting Multi-Core Reachability Performance with Shared Hash Tables
This paper focuses on data structures for multi-core reachability, which is a
key component in model checking algorithms and other verification methods. A
cornerstone of an efficient solution is the storage of visited states. In
related work, static partitioning of the state space was combined with
thread-local storage and resulted in reasonable speedups, but left open whether
improvements are possible. In this paper, we present a scaling solution for
shared state storage which is based on a lockless hash table implementation.
The solution is specifically designed for the cache architecture of modern
CPUs. Because model checking algorithms impose loose requirements on the hash
table operations, their design can be streamlined substantially compared to
related work on lockless hash tables. Still, an implementation of the hash
table presented here has dozens of sensitive performance parameters (bucket
size, cache line size, data layout, probing sequence, etc.). We analyzed their
impact and compared the resulting speedups with related tools. Our
implementation outperforms two state-of-the-art multi-core model checkers (SPIN
and DiVinE) by a substantial margin, while placing fewer constraints on the
load balancing and search algorithms.Comment: preliminary repor
A Wait-free Multi-word Atomic (1,N) Register for Large-scale Data Sharing on Multi-core Machines
We present a multi-word atomic (1,N) register for multi-core machines
exploiting Read-Modify-Write (RMW) instructions to coordinate the writer and
the readers in a wait-free manner. Our proposal, called Anonymous Readers
Counting (ARC), enables large-scale data sharing by admitting up to
concurrent readers on off-the-shelf 64-bits machines, as opposed to the most
advanced RMW-based approach which is limited to 58 readers. Further, ARC avoids
multiple copies of the register content when accessing it---this affects
classical register's algorithms based on atomic read/write operations on single
words. Thus it allows for higher scalability with respect to the register size.
Moreover, ARC explicitly reduces improves performance via a proper limitation
of RMW instructions in case of read operations, and by supporting constant time
for read operations and amortized constant time for write operations. A proof
of correctness of our register algorithm is also provided, together with
experimental data for a comparison with literature proposals. Beyond assessing
ARC on physical platforms, we carry out as well an experimentation on
virtualized infrastructures, which shows the resilience of wait-free
synchronization as provided by ARC with respect to CPU-steal times, proper of
more modern paradigms such as cloud computing.Comment: non
- …