4,085 research outputs found
An occam Style Communications System for UNIX Networks
This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting, is made. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion
Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.Comment: Appeared in EuroSys 2018. 14 pages, 16 figure
Cost-effective HPC clustering for computer vision applications
We will present a cost-effective and flexible realization of high performance computing (HPC) clustering and its potential in solving computationally intensive problems in computer vision. The featured software foundation to support the parallel programming is the GNU parallel Knoppix package with message passing interface (MPI) based Octave, Python and C interface capabilities. The implementation is especially of interest in applications where the main objective is to reuse the existing hardware infrastructure and to maintain the overall budget cost. We will present the benchmark results and compare and contrast the performances of Octave and MATLAB
A Review on Software Architectures for Heterogeneous Platforms
The increasing demands for computing performance have been a reality
regardless of the requirements for smaller and more energy efficient devices.
Throughout the years, the strategy adopted by industry was to increase the
robustness of a single processor by increasing its clock frequency and mounting
more transistors so more calculations could be executed. However, it is known
that the physical limits of such processors are being reached, and one way to
fulfill such increasing computing demands has been to adopt a strategy based on
heterogeneous computing, i.e., using a heterogeneous platform containing more
than one type of processor. This way, different types of tasks can be executed
by processors that are specialized in them. Heterogeneous computing, however,
poses a number of challenges to software engineering, especially in the
architecture and deployment phases. In this paper, we conduct an empirical
study that aims at discovering the state-of-the-art in software architecture
for heterogeneous computing, with focus on deployment. We conduct a systematic
mapping study that retrieved 28 studies, which were critically assessed to
obtain an overview of the research field. We identified gaps and trends that
can be used by both researchers and practitioners as guides to further
investigate the topic
FooPar: A Functional Object Oriented Parallel Framework in Scala
We present FooPar, an extension for highly efficient Parallel Computing in
the multi-paradigm programming language Scala. Scala offers concise and clean
syntax and integrates functional programming features. Our framework FooPar
combines these features with parallel computing techniques. FooPar is designed
modular and supports easy access to different communication backends for
distributed memory architectures as well as high performance math libraries. In
this article we use it to parallelize matrix matrix multiplication and show its
scalability by a isoefficiency analysis. In addition, results based on a
empirical analysis on two supercomputers are given. We achieve close-to-optimal
performance wrt. theoretical peak performance. Based on this result we conclude
that FooPar allows to fully access Scala's design features without suffering
from performance drops when compared to implementations purely based on C and
MPI
- …