562 research outputs found
Cache-affinity scheduling for fine grain multithreading
Cache utilisation is often very poor in multithreaded applications, due to the loss of data access locality incurred by frequent context switching. This problem is compounded on shared memory multiprocessors when dynamic load balancing is introduced and thread migration disrupts cache content. In this paper, we present a technique, which we refer to as "batching", for reducing the negative impact of fine grain multithreading on cache performance. Prototype schedulers running on uniprocessors and shared memory multiprocessors are described, and finally experimental results which illustrate the improvements observed after applying our techniques are presented.
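The abstract does not describe the scheduler internals; purely as an illustration of the batching idea, the sketch below groups runnable threads by the data region they touch and drains one group at a time, so that a group's working set stays cache-resident across its threads' time slices. The `Thread`/`BatchingScheduler` types and the `batch_id` tag are hypothetical, not the paper's actual design.

```cpp
#include <deque>
#include <unordered_map>

// Hypothetical illustration of "batching": threads are tagged with the data
// region (batch) they work on, and the scheduler drains one batch at a time
// so the batch's working set stays in cache between context switches.
struct Thread {
    int id;
    int batch_id;   // identifies the shared data region this thread touches
};

class BatchingScheduler {
    std::unordered_map<int, std::deque<Thread>> batches_;  // batch_id -> runnable threads
    std::deque<int> batch_order_;                           // FIFO over batches
public:
    void make_runnable(const Thread& t) {
        if (batches_[t.batch_id].empty())
            batch_order_.push_back(t.batch_id);
        batches_[t.batch_id].push_back(t);
    }

    // Keep taking threads from the current (front) batch until it is exhausted,
    // only then move on to the next batch. A plain round-robin scheduler would
    // instead alternate across batches and evict each batch's cache lines on
    // every switch.
    bool pick_next(Thread& out) {
        while (!batch_order_.empty()) {
            auto& q = batches_[batch_order_.front()];
            if (q.empty()) { batch_order_.pop_front(); continue; }
            out = q.front();
            q.pop_front();
            return true;
        }
        return false;  // nothing runnable
    }
};
```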
Parallel Finger Search Structures
In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within an O((log p)^2) factor of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS.
The work done by FS is bounded by the finger bound F_L (for some linearization L of D), i.e. each operation on an item with distance r from a finger takes O(log r + 1) amortized work. Running P using the simpler version takes O((T_1 + F_L)/p + T_infty + d * ((log p)^2 + log n)) time on a greedy scheduler, where T_1 and T_infty are the size and span of D respectively, n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to O((T_1 + F_L)/p + T_infty + d * (log p)^2 + s_L) time, where s_L is the weighted span of D in which each call to FS is weighted by its cost according to F_L. FS can be extended to a fixed number of movable fingers.
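For readability, the two running-time bounds stated above can be written in display form, using the abstract's own symbols (T_1 and T_infty for the size and span of D, F_L for the finger bound, n for the maximum size of FS, d for the maximum number of FS calls on any path in D, and s_L for the weighted span):

```latex
% Simpler version, on a greedy scheduler:
T_{\mathrm{simple}} = O\!\left(\frac{T_1 + F_L}{p} + T_\infty + d\left((\log p)^2 + \log n\right)\right)

% Faster version:
T_{\mathrm{fast}} = O\!\left(\frac{T_1 + F_L}{p} + T_\infty + d\,(\log p)^2 + s_L\right)
```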
The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.
Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms
EXCESS deliverable D2.3. More information at http://www.excess-project.eu/ This deliverable reports the results of the power models, energy models and libraries for energy-efficient concurrent data structures and algorithms as available by project month 30 of Work Package 2 (WP2). It reports i) the latest results of Tasks 2.2-2.4 on providing programming abstractions and libraries for developing energy-efficient data structures and algorithms and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and Movidius Myriad platforms.
Performance Enhancement of Multicore Architecture
Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherent diverse requirements of different applications. The permanent need to enhance the performance of multicore architecture motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in the fetch stage. Moreover, this paper presents three new fetch stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on the Ordinary Least Squares (OLS) regression statistical method. These new fetch policies differ in thread selection time, which is represented by instruction count and window size. Furthermore, the multicore simulation tool, Multi2Sim, is adapted to cope with the dynamic design of multicore processors by adding a dynamic feature to the thread-selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, has been used to validate the proposed adaptation of Multi2Sim. Intensive simulation experiments have been conducted, and the obtained results show remarkable performance enhancements in terms of execution time and instructions per second, with fewer broadcast operations compared to the typical algorithm.
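The abstract does not spell out how the OLS model drives fetch-stage selection, so the following is only an illustrative sketch under assumed features: a linear model (whose coefficients would be fit offline with ordinary least squares) scores each hardware thread from its recent instruction count and instruction-window occupancy, and the fetch stage picks the highest-scoring thread. The feature names, the `OlsModel` struct, and `select_thread_to_fetch` are illustrative, not the paper's EACH_LOOP_FETCH, INC-FETCH or WZ-FETCH policies.

```cpp
#include <vector>

// Illustrative only: a linear model (fit offline with ordinary least squares)
// scores each hardware thread from two assumed features -- recent instruction
// count and instruction-window occupancy -- and the fetch stage picks the
// thread with the highest predicted score. Coefficients are placeholders.
struct OlsModel {
    double intercept, w_insn_count, w_window_size;
    double predict(double insn_count, double window_size) const {
        return intercept + w_insn_count * insn_count + w_window_size * window_size;
    }
};

struct ThreadStats {
    int tid;
    double recent_insn_count;   // instructions committed in the last interval
    double window_occupancy;    // entries this thread holds in the issue window
};

// Fetch-stage selection: return the thread id the model ranks highest.
int select_thread_to_fetch(const std::vector<ThreadStats>& threads,
                           const OlsModel& model) {
    int best_tid = -1;
    double best_score = -1e300;
    for (const auto& t : threads) {
        double score = model.predict(t.recent_insn_count, t.window_occupancy);
        if (score > best_score) { best_score = score; best_tid = t.tid; }
    }
    return best_tid;
}
```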
Agile Development of Linux Schedulers with Ekiben
Kernel task scheduling is important for application performance, adaptability to new hardware, and complex user requirements. However, developing, testing, and debugging new scheduling algorithms in Linux, the most widely used cloud operating system, is slow and difficult. We developed Ekiben, a framework for high-velocity development of Linux kernel schedulers. Ekiben schedulers are written in safe Rust, and the system supports live upgrade of new scheduling policies into the kernel, userspace debugging, and bidirectional communication with applications. A scheduler implemented with Ekiben achieved near-identical performance (within 1% on average) to the default Linux scheduler CFS on a wide range of benchmarks. Ekiben is also able to support a range of research schedulers, specifically the Shinjuku scheduler, a locality-aware scheduler, and the Arachne core arbiter, with good performance.
Comment: 13 pages, 5 figures, submitted to Eurosys 202
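Ekiben's actual interface is not given in this abstract (and the framework itself is written in safe Rust and loads policies into the kernel); the following userspace C++ sketch only illustrates the general idea of a pluggable scheduling policy that can be swapped at runtime without stopping dispatch. All names here (`SchedPolicy`, `Dispatcher`, `upgrade`) are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Conceptual illustration only: a dispatch loop whose policy object can be
// replaced ("live upgraded") between scheduling decisions.
struct Task { int id; long vruntime; };

struct SchedPolicy {
    virtual ~SchedPolicy() = default;
    // Pick the index of the next task to run (assumes runnable is non-empty).
    virtual std::size_t pick_next(const std::vector<Task>& runnable) = 0;
};

struct FifoPolicy : SchedPolicy {
    std::size_t pick_next(const std::vector<Task>&) override { return 0; }
};

struct LowestVruntimePolicy : SchedPolicy {        // CFS-like choice
    std::size_t pick_next(const std::vector<Task>& runnable) override {
        std::size_t best = 0;
        for (std::size_t i = 1; i < runnable.size(); ++i)
            if (runnable[i].vruntime < runnable[best].vruntime) best = i;
        return best;
    }
};

class Dispatcher {
    std::mutex mu_;
    std::unique_ptr<SchedPolicy> policy_;
public:
    explicit Dispatcher(std::unique_ptr<SchedPolicy> p) : policy_(std::move(p)) {}
    // "Live upgrade": swap in a new policy without stopping the dispatch loop.
    void upgrade(std::unique_ptr<SchedPolicy> p) {
        std::lock_guard<std::mutex> lock(mu_);
        policy_ = std::move(p);
    }
    std::size_t dispatch(const std::vector<Task>& runnable) {
        std::lock_guard<std::mutex> lock(mu_);
        return policy_->pick_next(runnable);
    }
};
```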
AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing
Distributed Stream Processing Systems (DSPSs) are among the most rapidly emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The major market players in this domain are clearly represented by Apache Spark and Flink, which provide a variety of frontend APIs for SQL, statistical inference, machine learning, stream processing, and many others. Yet rather few details are reported on the integration of these engines into the underlying High-Performance Computing (HPC) infrastructure and the communication protocols they use. Spark and Flink, for example, are implemented in Java and still rely on a dedicated master node for managing their control flow among the worker nodes in a compute cluster.
In this paper, we describe the architecture of our AIR engine, which is designed from scratch in C++ using the Message Passing Interface (MPI) and pthreads for multithreading, and is deployed directly on top of a common HPC workload manager such as SLURM. AIR implements a light-weight, dynamic sharding protocol (referred to as "Asynchronous Iterative Routing"), which facilitates direct and asynchronous communication among all client nodes and thereby completely avoids the overhead induced by the control flow with a master node that may otherwise form a performance bottleneck. Our experiments over a variety of benchmark settings confirm that AIR outperforms Spark and Flink in terms of latency and throughput by a factor of up to 15; moreover, we demonstrate that AIR scales out much better than existing DSPSs to clusters consisting of up to 8 nodes and 224 cores.
Comment: 16 pages, 6 figures, 15 plot
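The abstract names the Asynchronous Iterative Routing protocol but gives no code; the sketch below is only a rough illustration of the master-free pattern it describes: each worker routes records directly to the rank that owns their key via a non-blocking MPI send and polls for incoming records with MPI_Iprobe, so no dedicated master node sits on the data path. The `owner_rank` hashing and the message format are assumptions, not AIR's actual protocol.

```cpp
#include <mpi.h>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch (not AIR's implementation): every worker both produces
// and consumes records; a record is shipped directly to the rank that owns
// its key, with no master node coordinating the exchange.
static int owner_rank(const std::string& key, int world_size) {
    return static_cast<int>(std::hash<std::string>{}(key) % world_size);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Send one record asynchronously to whichever rank owns its key.
    std::string record = "key" + std::to_string(rank) + ":payload";
    std::string key = record.substr(0, record.find(':'));
    int dest = owner_rank(key, size);
    MPI_Request req;
    MPI_Isend(record.data(), static_cast<int>(record.size()), MPI_CHAR,
              dest, /*tag=*/0, MPI_COMM_WORLD, &req);

    // Poll for incoming records without blocking the worker's own pipeline.
    int flag = 0;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
    if (flag) {
        int len;
        MPI_Get_count(&status, MPI_CHAR, &len);
        std::vector<char> buf(len);
        MPI_Recv(buf.data(), len, MPI_CHAR, status.MPI_SOURCE, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // ... process the received shard of the stream here ...
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}
```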
Event Stream Processing with Multiple Threads
Current runtime verification tools seldom make use of multi-threading to speed up the evaluation of a property on a large event trace. In this paper, we present an extension to the BeepBeep 3 event stream engine that allows the use of multiple threads during the evaluation of a query. Various parallelization strategies are presented and described on simple examples. The implementation of these strategies is then evaluated empirically on a sample of problems. Compared to the previous, single-threaded version of the BeepBeep engine, the allocation of just a few threads to specific portions of a query provides dramatic improvement in terms of running time.
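The parallelization strategies themselves are not detailed in the abstract; one generic strategy for trace evaluation is to split the event trace into slices, evaluate each slice on its own thread, and merge the partial results. The C++ sketch below illustrates only that generic pattern (BeepBeep 3 is a Java engine, and this is not its API), for a property that decomposes over slices; stateful properties would need a different strategy.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Generic illustration of one parallelization strategy for event-trace
// evaluation: partition the trace, evaluate each slice on its own thread,
// then combine the per-thread results.
struct Event { int value; };

static std::size_t count_matching(const std::vector<Event>& trace,
                                  std::size_t begin, std::size_t end) {
    std::size_t count = 0;
    for (std::size_t i = begin; i < end; ++i)
        if (trace[i].value > 0) ++count;   // stand-in for a per-event property
    return count;
}

std::size_t parallel_count(const std::vector<Event>& trace, unsigned num_threads) {
    std::vector<std::size_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    std::size_t chunk = (trace.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = std::min<std::size_t>(t * chunk, trace.size());
        std::size_t end   = std::min<std::size_t>(begin + chunk, trace.size());
        workers.emplace_back([&, t, begin, end] {
            partial[t] = count_matching(trace, begin, end);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::size_t{0});
}
```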