Search CORE

64 research outputs found

A Practical Hierarchial Model of Parallel Computation: The Model

Author: Heywood Todd
Ranka Sanjay
Publication venue: SURFACE at Syracuse University
Publication date: 01/02/1991
Field of study

We introduce a model of parallel computation that retains the ideal properties of the PRAM by using it as a sub-model, while simultaneously being more reflective of realistic parallel architectures by accounting for and providing abstract control over communication and synchronization costs. The Hierarchical PRAM (H-PRAM) model controls conceptual complexity in the face of asynchrony in two ways. First, by providing the simplifying assumption of synchronization to the design of algorithms, but allowing the algorithms to work asynchronously with each other; and organizing this control asynchrony via an implicit hierarchy relation. Second, by allowing the restriction of communication asynchrony in order to obtain determinate algorithms (thus greatly simplifying proofs of correctness). It is shown that the model is reflective of a variety of existing and proposed parallel architectures, particularly ones that can support massive parallelism. Relationships to programming languages are discussed. Since the PRAM is a sub-model, we can use PRAM algorithms as sub-algorithms in algorithms for the H-PRAM; thus results that have been established with respect to the PRAM are potentially transferable to this new model. The H-PRAM can be used as a flexible tool to investigate general degrees of locality (“neighborhoods of activity) in problems, considering communication and synchronization simultaneously. This gives the potential of obtaining algorithms that map more efficiently to architectures, and of increasing the number of processors that can efficiently be used on a problem (in comparison to a PRAM that charges for communication and synchronization). The model presents a framework in which to study the extent that general locality can be exploited in parallel computing. A companion paper demonstrates the usage of the H-PRAM via the design and analysis of various algorithms for computing the complete binary tree and the FFT/butterfly graph

Syracuse University Research Facility and Collaborative Environment

Runtime Coordinated Heterogeneous Tasks in Charm++

Author: Buch Ronak
Kale Laxmikant V.
Robson Michael P.
Publication venue: Smith ScholarWorks
Publication date: 24/01/2017
Field of study

Effective utilization of the increasingly heterogeneous hardware in modern supercomputers is a significant challenge. Many applications have seen performance gains by using GPUs, but many implementations leave CPUs sitting idle.In this paper, we describe a runtime managed system for coordinating heterogeneous execution. This system manages data transfers to and from GPU devices and schedules work across the computational resources of the system. The programmer need only tag methods and parameters to enable heterogeneous execution.Using this system, we observe improvements in programmer productivity and application performance. For selected benchmarks, when using heterogeneous execution we observe speedups of up to 3.09x relative to using only the host cores or only the device

Crossref

Smith College: Smith ScholarWorks

On expressing different concurrency paradigms on virtual execution environment

Author: DITTAMO CRISTIAN
Publication venue: 'Pisa University Press'
Publication date: 29/11/2010
Field of study

Virtual execution environments (VEE) such as the Java Virtual Machine (JVM) and the Microsoft Common Language Runtime (CLR) have been designed when the dominant computer architecture featured a Von-Neumann interface to programs: a single processor hiding all the complexity of parallel computations inside its design. Programs are expressed in an intermediate form that is executed by the VEE that defines an abstract computational model in which the concurrency model has been influenced by these design choices and it basically exposes the multi-threading model of the underlying operating system. Recently computer systems have introduced computational units in which concurrency is explicit and under program control. Relevant examples are the Graphical Processing Units (GPU such as Nvidia or AMD) and the Cell BE architecture which allow for explicit control of single processing unit, local memories and communication channels. Unfortunately programs designed for Virtual Machines cannot access to these resources since are not available through the abstractions provided by the VEE. A major redesign of VEEs seems to be necessary in order to bridge this gap. In this thesis we study the problem of exposing non-von Neumann computing resources within the Virtual Machine without need for a redesign of the whole execution infrastructure. In this work we express parallel computations relying on extensible meta-data and reflection to encode information. Meta-programming techniques are then used to rewrite the program into an equivalent one using the special purpose underlying architecture. We provide a case study in which this approach is applied to compiling Common Intermediate Language (CIL) methods to multi-core GPUs; we show that it is possible to access these non-standard computing resources without any change to the virtual machine design

Electronic Thesis and Dissertation Archive - Università di Pisa

Gunrock: GPU Graph Analytics

Author: Davidson Andrew
Liu Weitang
Osama Muhammad
Owens John D.
Pan Yuechao
Riffel Andy T.
Wang Leyuan
Wang Yangzihao
Wu Yuduo
Yang Carl
Yuan Chenshan
Publication venue
Publication date: 04/01/2017
Field of study

For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

arXiv.org e-Print Archive

eScholarship - University of California

FigShare

Recommended from our members

Divergence Reduction and Dependency Management in GPU Programs using Asynchronous Work Scheduling

Author: Cuneo Braxton
Publication venue: 'Oregon State University'
Publication date
Field of study

With continuing improvements in performance and capability, GPU processing has gained significant and growing interest across science and industry. With this interest, research has increasingly focused upon methods of processing algorithms with stochastic, non-uniform branching while maintaining low divergence. Central among these methods is thread-data remapping (TDR), whereby data is periodically labeled by what processing they require and re-assigned across threads to group data requiring similar processing into the same warp. In prior art, generalized TDR has been implemented as a phased series of exhaustive processing and sorting, with all data processed before the subsequent sorting phase can complete. This thesis discusses the drawbacks of this exhaustive approach and demonstrates a significant alternative for general divergence reduction: By organizing data requiring similar processing into warp-sized arrays, data can be efficiently re-mapped in shared memory on an opportunistic basis. Furthermore, by using a lock-free reservoir to store these warp sized arrays in global memory, load balancing can occur between work groups without the need for global synchronization. By abstracting this remapping process as scheduling in an asynchronous runtime, a general-purpose framework was developed to remap process across arbitrary processing patterns, allowing for easier programming, arbitrary synchronization through barriers, and other useful features

ScholarsArchive@OSU

Doctor of Philosophy in Computer Science

Author: Kopta Daniel
Publication venue: University of Utah
Publication date: 01/01/2016
Field of study

dissertationRay tracing is becoming more widely adopted in offline rendering systems due to its natural support for high quality lighting. Since quality is also a concern in most real time systems, we believe ray tracing would be a welcome change in the real time world, but is avoided due to insufficient performance. Since power consumption is one of the primary factors limiting the increase of processor performance, it must be addressed as a foremost concern in any future ray tracing system designs. This will require cooperating advances in both algorithms and architecture. In this dissertation I study ray tracing system designs from a data movement perspective, targeting the various memory resources that are the primary consumer of power on a modern processor. The result is high performance, low energy ray tracing architectures

The University of Utah: J. Willard Marriott Digital Library

A Migratable User-Level Process Package for PVM

Author: Kunuru Ravi
Otto Steve
Walpole Jonathan
Publication venue: PDXScholar
Publication date: 01/01/1997
Field of study

Shared, multi-user, workstation networks are characterized by unpredictable variability in system load. Further, the concept of workstation ownership is typically present. For efficient and unobtrusive computing in such environments, applications must not only overlap their computation with communication but also redistribute their computations adaptively based on changes in workstation availability and load. Managing these issues at application level leads to programs that are difficult to write and debug. In this paper, we present a system that manages this dynamic multi-processor environment while exporting a simple message-based programming model of a dedicated, distributed memory multiprocessor to applications. Programmers are thus insulated from the many complexities of the dynamic environment at the same time are able to achieve the benefits of multi-threading, adaptive load distribution and unobtrusive computing. To support the dedicated multi-processor model efficiently, the system defines a new kind of virtual processor called User-Level Process (ULP) that can be used to implement efficient multi-threading and application-transparent migration. The viability of ULPs is demonstrated through UPVM, a prototype implementation of the PVM message passing interface using ULPs. Typically, existing PVM programs written in Single Program Multiple Data (SPMD) style need only be re-compiled to use this package. The design of the package is presented and the performance analyzed with respect to both micro-benchmarks and some complete PVM applications. Finally, we discuss aspects of the ULP package that affect its portability and its support for heterogeneity, application transparency, and application debugging

PDXScholar (Portland State University)

Parallel and Distributed Computing

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

Directory of Open Access Books (DOAB)

Doctor of Philosophy

Author: Sun Weibin
Publication venue: University of Utah
Publication date: 01/08/2014
Field of study

dissertationAs the base of the software stack, system-level software is expected to provide ecient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction. All of these require substantial computing power. What's more, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To solve the rapidly increased computing power demand at the system level, this dissertation proposes using parallel graphics pro- cessing units (GPUs) in system software. GPUs excel at parallel computing, and also have a much faster development trend in parallel performance than central processing units (CPUs). However, system-level software has been originally designed to be latency-oriented. GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. Such mismatch makes it dicult to t the system-level software with the GPUs. This dissertation presents generic principles of system-level GPU computing developed during the process of creating our two general frameworks for integrating GPU computing in storage and network packet processing. The principles are generic design techniques and abstractions to deal with common system-level GPU computing challenges. Those principles have been evaluated in concrete cases including storage and network packet processing applications that have been augmented with GPU computing. The signicant performance improvement found in the evaluation shows the eectiveness and eciency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, to introduce the state of the art in both applications and techniques, and also their future potentials

The University of Utah: J. Willard Marriott Digital Library