1,354 research outputs found
Run Time Approximation of Non-blocking Service Rates for Streaming Systems
Stream processing is a compute paradigm that promises safe and efficient
parallelism. Modern big-data problems are often well suited for stream
processing's throughput-oriented nature. Realization of efficient stream
processing requires monitoring and optimization of multiple communications
links. Most techniques to optimize these links use queueing network models or
network flow models, which require some idea of the actual execution rate of
each independent compute kernel within the system. What we want to know is how
fast can each kernel process data independent of other communicating kernels.
This is known as the "service rate" of the kernel within the queueing
literature. Current approaches to divining service rates are static. Modern
workloads, however, are often dynamic. Shared cloud systems also present
applications with highly dynamic execution environments (multiple users,
hardware migration, etc.). It is therefore desirable to continuously re-tune an
application during run time (online) in response to changing conditions. Our
approach enables online service rate monitoring under most conditions,
obviating the need for reliance on steady state predictions for what are
probably non-steady state phenomena. First, some of the difficulties associated
with online service rate determination are examined. Second, the algorithm to
approximate the online non-blocking service rate is described. Lastly, the
algorithm is implemented within the open source RaftLib framework for
validation using a simple microbenchmark as well as two full streaming
applications.Comment: technical repor
ActiveMonitor: Asynchronous Monitor Framework for Scalability and Multi-Object Synchronization
Monitor objects are used extensively for thread-safety and synchronization in shared memory parallel programs. They provide ease of use, and enable straightforward correctness analysis. However, they inhibit parallelism by enforcing serial executions of critical sections, and thus the performance of parallel programs with monitors scales poorly with number of processes. Their current design and implementation is also ill-suited for thread synchronization across multiple thread-safe objects. We present ActiveMonitor - a framework that allows multi-object synchronization without global locks, and improves parallelism by exploiting asynchronous execution of critical sections. We evaluate the performance of Java based implementation of ActiveMonitor on micro-benchmarks involving light and heavy critical sections, as well as on single-source-shortest-path problem in directed graphs. Our results show that on most of these problems, ActiveMonitor based programs outperform programs implemented using Java\u27s reentrant-lock and condition constructs
Actors vs Shared Memory: two models at work on Big Data application frameworks
This work aims at analyzing how two different concurrency models, namely the
shared memory model and the actor model, can influence the development of
applications that manage huge masses of data, distinctive of Big Data
applications. The paper compares the two models by analyzing a couple of
concrete projects based on the MapReduce and Bulk Synchronous Parallel
algorithmic schemes. Both projects are doubly implemented on two concrete
platforms: Akka Cluster and Managed X10. The result is both a conceptual
comparison of models in the Big Data Analytics scenario, and an experimental
analysis based on concrete executions on a cluster platform
The HPCG benchmark: analysis, shared memory preliminary improvements and evaluation on an Arm-based platform
The High-Performance Conjugate Gradient (HPCG) benchmark complements the LINPACK benchmark in the performance evaluation coverage of large High-Performance Computing (HPC) systems. Due to its lower arithmetic intensity and higher memory pressure, HPCG is recognized as a more representative benchmark for data-center and irregular memory access pattern workloads, therefore its popularity and acceptance is raising within the HPC community. As only a small fraction of the reference version of the HPCG benchmark is parallelized with shared memory techniques (OpenMP), we introduce in this report two OpenMP parallelization methods. Due to the increasing importance of Arm architecture in the HPC scenario, we evaluate our HPCG code at scale on a state-of-the-art HPC system based on Cavium ThunderX2 SoC. We consider our work as a contribution to the Arm ecosystem: along with this technical report, we plan in fact to release our code for boosting the tuning of the HPCG benchmark within the Arm community.Postprint (author's final draft
OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF
The actor model of computation has been designed for a seamless support of
concurrency and distribution. However, it remains unspecific about data
parallel program flows, while available processing power of modern many core
hardware such as graphics processing units (GPUs) or coprocessors increases the
relevance of data parallelism for general-purpose computation.
In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework
(CAF). This offers a high level interface for accessing any OpenCL device
without leaving the actor paradigm. The new type of actor is integrated into
the runtime environment of CAF and gives rise to transparent message passing in
distributed systems on heterogeneous hardware. Following the actor logic in
CAF, OpenCL kernels can be composed while encapsulated in C++ actors, hence
operate in a multi-stage fashion on data resident at the GPU. Developers are
thus enabled to build complex data parallel programs from primitives without
leaving the actor paradigm, nor sacrificing performance. Our evaluations on
commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear
scaling behavior when offloading larger workloads. For sub-second duties, the
efficiency of offloading was found to largely differ between devices. Moreover,
our findings indicate a negligible overhead over programming with the native
OpenCL API.Comment: 28 page
High performance cloud computing on multicore computers
The cloud has become a major computing platform, with virtualization being a key to allow applications to run and share the resources in the cloud. A wide spectrum of applications need to process large amounts of data at high speeds in the cloud, e.g., analyzing customer data to find out purchase behavior, processing location data to determine geographical trends, or mining social media data to assess brand sentiment. To achieve high performance, these applications create and use multiple threads running on multicore processors. However, existing virtualization technology cannot support the efficient execution of such applications on virtual machines, making them suffer poor and unstable performance in the cloud.
Targeting multi-threaded applications, the dissertation analyzes and diagnoses their performance issues on virtual machines, and designs practical solutions to improve their performance. The dissertation makes the following contributions. First, the dissertation conducts extensive experiments with standard multicore applications, in order to evaluate the performance overhead on virtualization systems and diagnose the causing factors. Second, focusing on one main source of the performance overhead, excessive spinning, the dissertation designs and evaluates a holistic solution to make effective utilization of the hardware virtualization support in processors to reduce excessive spinning with low cost. Third, focusing on application scalability, which is the most important performance feature for multi-threaded applications, the dissertation models application scalability in virtual machines and analyzes how application scalability changes with virtualization and resource sharing. Based on the modeling and analysis, the dissertation identifies key application features and system factors that have impacts on application scalability, and reveals possible approaches for improving scalability. Forth, the dissertation explores one approach to improving application scalability by making fully utilization of virtual resources of each virtual machine. The general idea is to match the workload distribution among the virtual CPUs in a virtual machine and the virtual CPU resource of the virtual machine manager
- âŠ