878 research outputs found
Garbage collection auto-tuning for Java MapReduce on Multi-Cores
MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests
Memory and Parallelism Analysis Using a Platform-Independent Approach
Emerging computing architectures such as near-memory computing (NMC) promise
improved performance for applications by reducing the data movement between CPU
and memory. However, detecting such applications is not a trivial task. In this
ongoing work, we extend the state-of-the-art platform-independent software
analysis tool with NMC related metrics such as memory entropy, spatial
locality, data-level, and basic-block-level parallelism. These metrics help to
identify the applications more suitable for NMC architectures.Comment: 22nd ACM International Workshop on Software and Compilers for
Embedded Systems (SCOPES '19), May 201
Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling
High-performance, multi-core processors are the key to accelerating workloads
in several application domains. To continue to scale performance at the limit
of Moore's Law and Dennard scaling, software and hardware designers have turned
to dynamic solutions that adapt to the needs of applications in a transparent,
automatic way. For example, modern hardware improves its performance and power
efficiency by changing the hardware configuration, like the frequency and
voltage of cores, according to a number of parameters such as the technology
used, the workload running, etc. With this level of dynamism, it is essential
to simulate next-generation multi-core processors in a way that can both
respond to system changes and accurately determine system performance metrics.
Currently, no sampled simulation platform can achieve these goals of dynamic,
fast, and accurate simulation of multi-threaded workloads.
In this work, we propose a solution that allows for fast, accurate simulation
in the presence of both hardware and software dynamism. To accomplish this
goal, we present Pac-Sim, a novel sampled simulation methodology for fast,
accurate sampled simulation that requires no upfront analysis of the workload.
With our proposed methodology, it is now possible to simulate long-running
dynamically scheduled multi-threaded programs with significant simulation
speedups even in the presence of dynamic hardware events. We evaluate Pac-Sim
using the multi-threaded SPEC CPU2017, NPB, and PARSEC benchmarks with both
static and dynamic thread scheduling. The experimental results show that
Pac-Sim achieves a very low sampling error of 1.63% and 3.81% on average for
statically and dynamically scheduled benchmarks, respectively. Pac-Sim also
demonstrates significant simulation speedups as high as 523.5
(210.3 on average) for the train input set of SPEC CPU2017.Comment: 14 pages, 14 figure
Modeling and scheduling heterogeneous multi-core architectures
Om de prestatie van toekomstige processors en processorarchitecturen te evalueren wordt vaak gebruik gemaakt van een simulator die het gedrag en de prestatie van de processor modelleert. De prestatie bepalen van de uitvoering van een computerprogramma op een gegeven processorarchitectuur m.b.v. een simulator duurt echter vele grootteordes langer dan de werkelijke uitvoeringstijd. Dit beperkt in belangrijke mate de hoeveelheid experimenten die gedaan kunnen worden.
In dit doctoraatswerk werd het Multi-Program Performance Model (MPPM) ontwikkeld, een innovatief alternatief voor traditionele simulatie, dat het mogelijk maakt om tot 100.000x sneller een processorconfiguratie te evalueren. MPPM laat ons toe om nooit geziene exploraties te doen. Gebruik makend van dit raamwerk hebben we aangetoond dat de taakplanning cruciaal is om heterogene meerkernige processors optimaal te benutten.
Vervolgens werd een nieuwe manier voorgesteld om op een schaalbare manier de taakplanning uit te voeren, namelijk Performance Impact Estimation (PIE). Tijdens de uitvoering van een draad op een gegeven processorkern schatten we de prestatie op een ander type kern op basis van eenvoudig op te meten prestatiemetrieken. Zo beschikken we op elk moment over alle nodige informatie om een efficiƫnte taakplanning te doen. Dit laat ons bovendien toe te optimaliseren voor verschillende criteria zoals uitvoeringstijd, doorvoersnelheid of fairness
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has
placed a huge burden on the programmers by requiring explicit parallelization
for performance. Implicit Parallelization is an alternative that could ease the
burden on programmers by parallelizing applications ???under the covers??? while
maintaining sequential semantics externally. This thesis develops a novel
approach for thinking about parallelism, by casting the problem of
parallelization in terms of instruction criticality. Using this approach,
parallelism in a program region is readily identified when certain conditions
about fetch-criticality are satisfied by the region. The thesis formalizes this
approach by developing a criticality-driven model of task-based
parallelization. The model can accurately predict the parallelism that would be
exposed by potential task choices by capturing a wide set of sources of
parallelism as well as costs to parallelization.
The criticality-driven model enables the development of two key components for
Implicit Parallelization: a task selection policy, and a bottleneck analysis
tool. The task selection policy can partition a single-threaded program into
tasks that will profitably execute concurrently on a multicore architecture in
spite of the costs associated with enforcing data-dependences and with
task-related actions. The bottleneck analysis tool gives feedback to the
programmers about data-dependences that limit parallelism. In particular, there
are several ???accidental dependences??? that can be easily removed with large
improvements in parallelism. These tools combine into a systematic methodology
for performance tuning in Implicit Parallelization. Finally, armed with the
criticality-driven model, the thesis revisits several architectural design
decisions, and finds several encouraging ways forward to increase the scope of
Implicit Parallelization.unpublishednot peer reviewe
Accelerating the pace of protein functional annotation with intel xeon phi coprocessors
Ā© 2002-2011 IEEE. Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of {\mmb e}FindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of {\mmb e}FindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of {\mmb e}FindSite is freely available to the academic community at www.brylinski.org/efindsite
Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving
In this paper, we investigate how to push the performance limits of serving
Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe
that while intra-operator parallelism across multiple threads is an effective
way to reduce inference latency, it provides diminishing returns. Our primary
insight is that instead of running a single instance of a model with all
available threads on a server, running multiple instances each with smaller
batch sizes and fewer threads for intra-op parallelism can provide lower
inference latency. However, the right configuration is hard to determine
manually since it is workload- (DNN model and batch size used by the serving
system) and deployment-dependent (number of CPU cores on server). We present
Packrat, a new serving system for online inference that given a model and batch
size () algorithmically picks the optimal number of instances (), the
number of threads each should be allocated (), and the batch sizes each
should operate on () that minimizes latency. Packrat is built as an
extension to TorchServe and supports online reconfigurations to avoid serving
downtime. Averaged across a range of batch sizes, Packrat improves inference
latency by 1.43 to 1.83 on a range of commonly used DNNs
- ā¦