1,005 research outputs found
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find improved load balancing during
runtime and automatic parallelism discovery improving efficiency using the
advanced semantics for Exascale computing.Comment: 11 figure
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinter
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training
Compilation techniques for irregular problems on parallel machines
Massively parallel computers have ushered in the era of teraflop computing. Even though large and powerful machines are being built, they are used by only a fraction of the computing community. The fundamental reason for this situation is that parallel machines are difficult to program. Development of compilers that automatically parallelize programs will greatly increase the use of these machines.;A large class of scientific problems can be categorized as irregular computations. In this class of computation, the data access patterns are known only at runtime, creating significant difficulties for a parallelizing compiler to generate efficient parallel codes. Some compilers with very limited abilities to parallelize simple irregular computations exist, but the methods used by these compilers fail for any non-trivial applications code.;This research presents development of compiler transformation techniques that can be used to effectively parallelize an important class of irregular programs. A central aim of these transformation techniques is to generate codes that aggressively prefetch data. Program slicing methods are used as a part of the code generation process. In this approach, a program written in a data-parallel language, such as HPF, is transformed so that it can be executed on a distributed memory machine. An efficient compiler runtime support system has been developed that performs data movement and software caching
- …