Tuning the Level of Concurrency in Software Transactional Memory: An Overview of Recent Analytical, Machine Learning and Mixed Approaches
Synchronization transparency offered by Software Transactional Memory (STM) must not come at the expense of run-time efficiency, which requires the STM designer to include mechanisms oriented to performance and other quality indexes. In particular, one core issue in STM is exploiting parallelism while avoiding thrashing due to excessive transaction rollbacks, which are caused by high levels of contention on logical resources, namely concurrently accessed data portions. One means of addressing run-time efficiency is to dynamically determine the best-suited level of concurrency (number of threads) for running the application (or specific application phases) on top of the STM layer. If the level of concurrency is too low, parallelism is hampered. Conversely, over-dimensioning the concurrency level may give rise to the aforementioned thrashing caused by excessive data contention, which also reduces energy efficiency. In this chapter we overview a set of recent techniques aimed at building “application-specific” performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value. Although they share some base concepts in modeling system performance versus the degree of concurrency, these techniques rely on disparate methods, such as machine learning or analytic methods (or combinations of the two), and achieve different tradeoffs between the precision of the performance model and the latency of model instantiation. Implications of the different tradeoffs in real-life scenarios are also discussed.
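The tradeoff described in this abstract, where too few threads waste parallelism and too many cause rollback thrashing, can be illustrated with a minimal sketch. The model below is a toy assumption made for illustration only and is not taken from the chapter: it supposes that each concurrently running transaction independently conflicts with a given transaction with a fixed probability. The names `predicted_throughput`, `best_concurrency`, `work_time`, and `conflict_prob` are hypothetical.

```python
def predicted_throughput(threads, work_time, conflict_prob):
    """Toy estimate of committed transactions per unit time.

    Assumption (not from the source): each of the other (threads - 1)
    concurrent transactions independently causes an abort with
    probability conflict_prob, so a single run commits with
    probability (1 - conflict_prob) ** (threads - 1).
    """
    commit_prob = (1.0 - conflict_prob) ** (threads - 1)
    # Each thread completes one run every work_time units; only the
    # committed fraction of those runs counts as useful throughput.
    return threads * commit_prob / work_time


def best_concurrency(max_threads, work_time, conflict_prob):
    """Return the thread count in 1..max_threads with the highest estimate."""
    return max(
        range(1, max_threads + 1),
        key=lambda n: predicted_throughput(n, work_time, conflict_prob),
    )


if __name__ == "__main__":
    for n in (1, 4, 6, 12, 16):
        print(n, round(predicted_throughput(n, 1.0, 0.15), 3))
    print("selected:", best_concurrency(16, 1.0, 0.15))
```

Under this toy model the estimate first rises with the thread count and then falls once aborts dominate, which is the shape a concurrency regulator would exploit. A real regulator would replace `predicted_throughput` with a model fitted to the running application.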
Analytical/ML Mixed Approach for Concurrency Regulation in Software Transactional Memory
In this article we exploit a combination of analytical and Machine Learning (ML) techniques to build a performance model that allows dynamically tuning the level of concurrency of applications based on Software Transactional Memory (STM). Our mixed approach has the advantage of reducing the training time of pure machine learning methods while avoiding the approximation errors that typically affect pure analytical approaches. Hence it allows very fast construction of highly reliable performance models, which can be promptly and effectively exploited to optimize actual application runs. We also present a real implementation of a concurrency regulation architecture, based on the mixed modeling approach, which has been integrated with the open source TinySTM package, together with experimental data from runs of applications taken from the STAMP benchmark suite that demonstrate the effectiveness of our proposal. © 2014 IEEE
Reconfigurable interconnects in DSM systems: a focus on context switch behavior
Recent advances in the development of reconfigurable optical interconnect technologies allow for the fabrication of low-cost, run-time adaptable interconnects in large distributed shared-memory (DSM) multiprocessor machines. This enables adaptable interconnection networks that alleviate the large bottleneck caused by the gap between processing speed and memory access time over the network. In this paper we study the scheduling of tasks by the operating system (OS) kernel and its influence on communication between the processing nodes of the system, focusing on the traffic generated just after a context switch. We aim to use these results as a basis for proposing a network reconfiguration that could provide a significant speedup.
Performance Reproduction and Prediction of Selected Dynamic Loop Scheduling Experiments
Scientific applications are complex, large, and often exhibit irregular and
stochastic behavior. The use of efficient loop scheduling techniques in
computationally-intensive applications is crucial for improving their
performance on high-performance computing (HPC) platforms. A number of dynamic
loop scheduling (DLS) techniques have been proposed between the late 1980s and
early 2000s, and efficiently used in scientific applications. In most cases,
the computing systems on which they have been tested and validated are no
longer available. This work is concerned with the minimization of the sources
of uncertainty in the implementation of DLS techniques to avoid unnecessary
influences on the performance of scientific applications. Therefore, it is
important to ensure that the DLS techniques employed in scientific applications
today adhere to their original design goals and specifications. The goal of
this work is to attain and increase the trust in the implementation of DLS
techniques in present studies. To achieve this goal, the performance of a
selection of scheduling experiments from the 1992 original work that introduced
factoring is reproduced and predicted via both simulative and native
experimentation. The experiments show that the simulation reproduces the
performance achieved on the past computing platform and accurately predicts the
performance achieved on the present computing platform. The performance
reproduction and prediction confirm that the present implementation of the DLS
techniques considered, both in simulation and natively, adheres to their
original description. The results confirm the hypothesis that reproducing
experiments of identical scheduling scenarios on past and modern hardware leads
to behavior entirely different from what is expected.
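As a rough illustration of the factoring technique this abstract refers to, the sketch below computes chunk sizes under the commonly described practical rule, in which each batch hands out half of the remaining iterations, split evenly among the workers. This rule is an assumption made here for illustration; the original 1992 formulation and the implementation evaluated in the paper may derive the batch fraction differently. The function name `factoring_chunks` is hypothetical.

```python
import math


def factoring_chunks(total_iterations, workers):
    """Return the chunk sizes handed out, in scheduling order.

    Assumed rule (illustrative): at the start of each batch the chunk
    size is ceil(remaining / (2 * workers)), and up to `workers`
    chunks of that size are issued before the size is recomputed.
    """
    chunks = []
    remaining = total_iterations
    while remaining > 0:
        size = max(1, math.ceil(remaining / (2 * workers)))
        for _ in range(workers):
            if remaining == 0:
                break
            chunk = min(size, remaining)  # never exceed what is left
            chunks.append(chunk)
            remaining -= chunk
    return chunks


if __name__ == "__main__":
    print(factoring_chunks(100, 4))
```

Chunk sizes shrink from batch to batch, so early chunks keep scheduling overhead low while the small late chunks even out load imbalance between workers.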
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability.