4,480 research outputs found
Asynchronous Execution of Python Code on Task Based Runtime Systems
Despite advancements in the areas of parallel and distributed computing, the
complexity of programming on High Performance Computing (HPC) resources has
deterred many domain experts, especially in the areas of machine learning and
artificial intelligence (AI), from utilizing performance benefits of such
systems. Researchers and scientists favor high-productivity languages to avoid
the inconvenience of programming in low-level languages and costs of acquiring
the necessary skills required for programming at this level. In recent years,
Python, with the support of linear algebra libraries like NumPy, has gained
popularity despite facing limitations which prevent this code from distributed
runs. Here we present a solution which maintains both high level programming
abstractions as well as parallel and distributed efficiency. Phylanx, is an
asynchronous array processing toolkit which transforms Python and NumPy
operations into code which can be executed in parallel on HPC resources by
mapping Python and NumPy functions and variables into a dependency tree
executed by HPX, a general purpose, parallel, task-based runtime system written
in C++. Phylanx additionally provides introspection and visualization
capabilities for debugging and performance analysis. We have tested the
foundations of our approach by comparing our implementation of widely used
machine learning algorithms to accepted NumPy standards
Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques
Maximizing parallelism level in applications can be achieved by minimizing
overheads due to load imbalances and waiting time due to memory latencies.
Compiler optimization is one of the most effective solutions to tackle this
problem. The compiler is able to detect the data dependencies in an application
and is able to analyze the specific sections of code for parallelization
potential. However, all of these techniques provided with a compiler are
usually applied at compile time, so they rely on static analysis, which is
insufficient for achieving maximum parallelism and producing desired
application scalability. One solution to address this challenge is the use of
runtime methods. This strategy can be implemented by delaying certain amount of
code analysis to be done at runtime. In this research, we improve the parallel
application performance generated by the OP2 compiler by leveraging HPX, a C++
runtime system, to provide runtime optimizations. These optimizations include
asynchronous tasking, loop interleaving, dynamic chunk sizing, and data
prefetching. The results of the research were evaluated using an Airfoil
application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed
Scientific and Engineering Computing (PDSEC 2017
Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.Comment: Appeared in EuroSys 2018. 14 pages, 16 figure
RootJS: Node.js Bindings for ROOT 6
We present rootJS, an interface making it possible to seamlessly integrate
ROOT 6 into applications written for Node.js, the JavaScript runtime platform
increasingly commonly used to create high-performance Web applications. ROOT
features can be called both directly from Node.js code and by JIT-compiling C++
macros. All rootJS methods are invoked asynchronously and support callback
functions, allowing non-blocking operation of Node.js applications using them.
Last but not least, our bindings have been designed to platform-independent and
should therefore work on all systems supporting both ROOT 6 and Node.js.
Thanks to rootJS it is now possible to create ROOT-aware Web applications
taking full advantage of the high performance and extensive capabilities of
Node.js. Examples include platforms for the quality assurance of acquired,
reconstructed or simulated data, book-keeping and e-log systems, and even Web
browser-based data visualisation and analysis.Comment: 7 pages, 1 figure. To appear in the Proceedings of the 22nd
International Conference on Computing in High Energy and Nuclear Physics
(CHEP 2016
- …