27 research outputs found

    Iterative Computations with Ordered Read-Write Locks

    We introduce the framework of ordered read-write locks, ORWL, which is characterized by two main features: a strict FIFO policy for access and the attribution of access to lock-handles instead of processes or threads. These two properties allow applications to have controlled, pro-active access to resources and thereby to achieve a high degree of asynchronicity between different tasks of the same application. For the case of iterative computations with many parallel tasks that access their resources in a cyclic pattern, we provide a generic technique to implement them by means of ORWL. We show that the possible execution patterns for such a system correspond to a combinatorial lattice structure and that this lattice is finite if and only if the configuration contains a potential deadlock. In addition, we provide efficient algorithms: one that allows for a deadlock-free initialization of such a system, and another for the detection of deadlocks in an already initialized system.
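
    As a rough illustration of the access discipline described above, the following C++ sketch implements a strict-FIFO lock whose access is attributed to handles rather than to threads: a handle first requests a position in the queue without blocking and only later waits for its turn. The names (OrderedRWLock, Handle, request/acquire/release) are illustrative assumptions, not the ORWL library API, and the read/write distinction is omitted.

    // Minimal sketch, assuming ticket-based FIFO ordering; not the ORWL API.
    #include <atomic>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class OrderedRWLock {
     public:
      class Handle {
       public:
        explicit Handle(OrderedRWLock& lk) : lock_(lk) {}
        // Pro-active step: reserve a FIFO position without blocking.
        void request() { ticket_ = lock_.next_ticket_.fetch_add(1); }
        // Block until every earlier request has released the resource.
        void acquire() {
          std::unique_lock<std::mutex> g(lock_.m_);
          lock_.cv_.wait(g, [&] { return lock_.serving_ == ticket_; });
        }
        // Hand the resource to the next handle in FIFO order.
        void release() {
          std::lock_guard<std::mutex> g(lock_.m_);
          ++lock_.serving_;
          lock_.cv_.notify_all();
        }
       private:
        OrderedRWLock& lock_;
        std::uint64_t ticket_ = 0;
      };
     private:
      std::mutex m_;
      std::condition_variable cv_;
      std::atomic<std::uint64_t> next_ticket_{0};
      std::uint64_t serving_ = 0;
    };

    Because the FIFO position belongs to the handle, a task can hold one handle per resource and issue the request for its next iteration while still computing the current one, which is the pro-active, asynchronous access pattern the abstract refers to.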

    Theory and design of portable parallel programs for heterogeneous computing systems and networks

    A recurring problem with high-performance computing is that advanced architectures generally achieve only a small fraction of their peak performance on many portions of real application sets. The Amdahl's law corollary of this is that such architectures often spend most of their time on tasks (codes/algorithms and the data sets upon which they operate) for which they are unsuited. Heterogeneous Computing (HC) is needed in the mid-1990s and beyond due to ever-increasing super-speed requirements and the number of projects with these requirements. HC is defined as a special form of parallel and distributed computing that performs computations using a single autonomous computer operating in both SIMD and MIMD modes, or using a number of connected autonomous computers. Physical implementation of a heterogeneous network or system is currently possible due to existing technological advances in networking and supercomputing. Unfortunately, software solutions for heterogeneous computing are still in their infancy. Theoretical models, software tools, and intelligent resource-management schemes need to be developed to support heterogeneous computing efficiently. In this thesis, we present a heterogeneous model of computation which encapsulates all the essential parameters for designing efficient software and hardware for HC. We also study a portable parallel programming tool, called Cluster-M, which implements this model. Furthermore, we study and analyze the hardware and software requirements of HC and show that Cluster-M satisfies the requirements of HC environments.

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
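
    A minimal sketch of what such a graph specification might look like, assuming a C++ embedded DSL roughly in the spirit described above; the names (DataflowGraph, Statement) and the string-based iteration-domain notation are illustrative assumptions rather than the actual specification language.

    // Hypothetical specification of two statements and a fusion transformation.
    #include <string>
    #include <vector>

    struct Statement {
      std::string name;                 // e.g. "S0"
      std::string domain;               // polyhedral iteration domain
      std::string body;                 // computational statement
      std::vector<std::string> reads;   // data spaces read
      std::vector<std::string> writes;  // data spaces written
    };

    class DataflowGraph {
     public:
      void add(const Statement& s) { nodes_.push_back(s); }
      // Toy transformation: merging two statements over the same domain lets
      // the temporary they communicate through shrink toward a scalar.
      void fuse(const std::string& a, const std::string& b) { /* ... */ }
     private:
      std::vector<Statement> nodes_;
    };

    int main() {
      DataflowGraph g;
      g.add({"S0", "{[i] : 0 <= i < N}", "t[i] = a[i] * b[i]", {"a", "b"}, {"t"}});
      g.add({"S1", "{[i] : 0 <= i < N}", "c[i] = t[i] + 1.0", {"t"}, {"c"}});
      g.fuse("S0", "S1");  // guided by the storage requirements of t
      return 0;
    }

    A real backend would run polyhedral code generation over the domains; the point of the sketch is only that statements, their data mappings, and the applicable transformations are first-class objects the user can reason about.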

    Locality-driven checkpoint and recovery

    Checkpoint and recovery are important fault-tolerance techniques for distributed systems. The two categories of existing strategies incur unacceptable performance cost either at run time or upon failure recovery when applied to large-scale distributed systems. In particular, the large number of messages and processes in these systems causes either considerable checkpointing and logging overhead, or a catastrophic, global recovery effect. This thesis proposes a locality-driven strategy for efficiently checkpointing and recovering such systems with both affordable runtime cost and controllable failure recoverability. Messages establish dependencies between distributed processes, which can be either preserved by coordinated checkpoints or removed via logging. Existing strategies enforce a uniform handling policy for all message dependencies, and hence gain an advantage at one end but bear a disadvantage at the other. In this thesis, a generic theory of Quasi-Atomic Recovery has been formulated to accommodate message handling requirements of both kinds, and to allow using different message handling methods together. Quasi-atomicity of recovery blocks implies proper confinement of recoveries, and thus enables localization of checkpointing and recovery around such a block and, consequently, a hybrid strategy with combined advantages from both ends. A strategy of group checkpointing with selective logging has been proposed, based on the observation of message localization around 'locality regions' in distributed systems. In essence, a group-wise coordinated checkpoint is created around such a region and only the few inter-region messages are logged subsequently. Runtime overhead is optimized due to largely reduced logging effort, and recovery spread is localized to a region. Various protocols have been developed to provide trade-offs between flexibility and performance. Also proposed is the idea of process clones, which can be used to effectively remove program-order recovery dependencies among successive group checkpoints and thus to stop inter-group recovery spread. Distributed executions exhibit locality of message interactions. Such locality originates from resolving distributed dependencies via localized message passing, and appears as a hierarchical 'region-transition' pattern. A bottom-up approach has been proposed to identify those regions, by detecting popular recurrence patterns from individual processes as 'locality intervals', and then composing them into 'locality regions' based on their tight message coupling relations with each other. Experiments conducted on real-life applications have shown the existence of hierarchical locality regions and have justified the feasibility of this approach. Performance optimization of group checkpoint strategies has to do with their use of locality. An abstract performance measure has been proposed to properly integrate both runtime overhead and failure recoverability in a region-wise manner. Taking this measure as the optimization objective, a greedy heuristic has been introduced to decompose a given distributed execution into optimized regions. Analysis implies that an execution pattern with good locality leads to good optimized performance, and that the locality pattern itself can serve as a good candidate for the optimal decomposition. Consequently, checkpoint protocols have been developed to efficiently identify optimized regions in such an execution, with the assistance of either design-time or runtime knowledge.
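
    The sketch below illustrates the core of group checkpointing with selective logging under a simple assumed interface: dependencies inside a locality region are covered by that region's coordinated checkpoint, so only messages crossing a region boundary need to be logged. The names (RegionCheckpointer, onSend, groupCheckpoint) are hypothetical and not the protocols developed in the thesis.

    // Minimal sketch, assuming a static process-to-region map is known.
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Message { int src; int dst; std::string payload; };

    class RegionCheckpointer {
     public:
      explicit RegionCheckpointer(std::unordered_map<int, int> processToRegion)
          : region_(std::move(processToRegion)) {}

      // Called on every send: log only messages that cross a region boundary;
      // intra-region dependencies are preserved by the group checkpoint.
      void onSend(const Message& m) {
        if (region_.at(m.src) != region_.at(m.dst)) log_.push_back(m);
      }

      // Each region checkpoints on its own schedule, independently of others;
      // logged messages already covered by the checkpoint can then be dropped.
      void groupCheckpoint(int region) {
        // ... coordinate only the processes p with region_[p] == region ...
      }

     private:
      std::unordered_map<int, int> region_;
      std::vector<Message> log_;
    };

    In a scheme of this shape, a failure rolls back only the affected region to its last group checkpoint and replays the few logged inter-region messages, which is how both runtime overhead and recovery spread stay region-local.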

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Relaxing coherence for modern learning applications

    The main objective of this research is to efficiently execute learning (model training) for modern machine learning (ML) applications. The recent explosion in data has led to the emergence of data-intensive ML applications whose key phase is learning, which requires significant amounts of computation. A unique characteristic of learning is that it is iterative-convergent: a consistent view of memory does not always need to be guaranteed, so parallel workers are allowed to compute using stale values in intermediate computations, relaxing certain read-after-write data dependencies. While multiple workers read and modify shared model parameters many times during learning, incurring repeated data communication between workers, most of this communication is redundant due to the stale-value-tolerant characteristic. Relaxing coherence for these learning applications has the potential to provide extraordinary performance and energy benefits, but requires innovations across the system stack, from hardware to software. While considerable effort has exploited stale value tolerance in distributed learning, inefficient utilization of the full performance potential of this characteristic has caused modern ML applications to have low execution efficiency on state-of-the-art systems. The inefficiency mainly comes from the lack of architectural considerations and of a detailed understanding of the different stale value tolerance of different ML applications. Today's architectures, designed to cater to the needs of more traditional workloads, incur high and often unnecessary overhead. The lack of detailed understanding has led to ambiguity about stale value tolerance, thus failing to realize the full performance potential of this characteristic. This dissertation presents several innovations regarding this challenge. First, this dissertation proposes Bounded Staled Sync (BSSync), hardware support for the bounded staleness consistency model, which adds simple logic layers in the memory hierarchy to reduce atomic operation overhead on data-synchronization-intensive workloads. The long latency and serialization caused by atomic operations have a significant impact on performance. The proposed technique overlaps the long-latency atomic operations with the main computation. Compared to previous work that allows stale values for read operations, BSSync utilizes staleness for write operations, allowing stale writes. It reduces the inefficiency coming from the data movement between where data are stored and where they are processed. Second, this dissertation presents StaleLearn, a learning acceleration mechanism to reduce the memory divergence overhead of GPU learning with sparse data. Sparse data induces divergent memory accesses with low locality, thereby consuming a large fraction of total execution time on transferring data across the memory hierarchy. StaleLearn transforms the problem of divergent memory accesses into a synchronization problem by replicating the model, and reduces the synchronization overhead by asynchronous synchronization on Processor-in-Memory (PIM). The stale value tolerance makes it possible to clearly decompose tasks between the GPU and PIM, which can effectively exploit parallelism between PIM and GPU cores by overlapping PIM operations with the main computation on GPU cores. Finally, this dissertation provides a detailed understanding of the different stale value tolerance of different ML applications.
    While relaxing coherence can reduce the data communication overhead, its complicated impact on the progress of learning has not been well studied, leading to ambiguity for domain experts and modern systems. We define the stale value tolerance of ML training in terms of the effective learning rate. The effective learning rate can be defined by the implicit momentum hyperparameter, the update density, the activation function selection, RNN cell types, and learning rate adaptation. The findings of this work open further exploration of asynchronous learning, including improving on the results laid out in this dissertation.
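
    As a rough software analogue of the bounded staleness consistency model that BSSync supports in hardware, the sketch below lets a worker keep using a cached copy of the parameters as long as it lags the slowest writer by at most a fixed staleness bound. The names (BoundedStaleStore, cacheIsFresh) are illustrative assumptions, not the dissertation's mechanism.

    // Minimal sketch of a bounded-staleness parameter store.
    #include <cstddef>
    #include <vector>

    class BoundedStaleStore {
     public:
      BoundedStaleStore(std::size_t n, int staleness_bound)
          : params_(n, 0.0f), bound_(staleness_bound) {}

      // A worker's cached copy taken at `cached_clock` may still be used while
      // it is at most `bound_` iterations behind the slowest writer.
      bool cacheIsFresh(int cached_clock, int global_min_clock) const {
        return cached_clock >= global_min_clock - bound_;
      }

      // Updates are applied asynchronously; with stale writes (as in BSSync)
      // this accumulation can be overlapped with the next iteration's compute.
      void applyUpdate(std::size_t i, float grad, float lr) {
        params_[i] -= lr * grad;
      }

      float read(std::size_t i) const { return params_[i]; }

     private:
      std::vector<float> params_;
      int bound_;
    };

    Setting the bound to zero recovers fully synchronous training, while a larger bound trades parameter freshness for less communication and synchronization, which is the relaxation whose effect on learning progress the dissertation characterizes via the effective learning rate.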

    Progress Report: 1991-1994
