11,415 research outputs found

Introducing Molly: Distributed Memory Parallelization with LLVM

Author: Kruse Michael
Publication date: 01/01/2013

Programming for distributed memory machines has always been a tedious task, but a necessary one because compilers have not been sufficiently able to optimize for such machines themselves. Molly is an extension to the LLVM compiler toolchain that is able to distribute and reorganize workload and data if the program is organized in statically determined loop control flows. These are represented as polyhedral integer-point sets, which allow program transformations to be applied to them. Memory distribution and layout can be declared by the programmer as needed, and the necessary asynchronous MPI communication is generated automatically. The primary motivation is to run Lattice QCD simulations on IBM Blue Gene/Q supercomputers, but since the implementation is not yet completed, this paper shows the capabilities on Conway's Game of Life.

arXiv.org e-Print Archive

HAL-CentraleSupelec

HAL - Lille 3

INRIA a CCSD electronic archive server

HAL-Rennes 1

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Author: Andión José M.
Arenaz Silva Manuel
Bodin François
Rodríguez Gabriel
Touriño Juan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

This is a post-peer-review, pre-copyedit version of an article published in International Journal of Parallel Programming. The final authenticated version is available online at: https://doi.org/10.1007/s10766-015-0362-9

[Abstract] The use of GPUs for general purpose computation has increased dramatically in the past years due to the rising demands of computing power and their tremendous computing capacity at low cost. Hence, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, complicating its exploitation. This paper presents a new technique to automatically rewrite sequential programs into a parallel counterpart targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.

Funding: Ministerio de Economía y Competitividad (TIN2010-16735, TIN2013-42148-P); Galicia, Consellería de Cultura, Educación e Ordenación Universitaria (GRC2013-055); Ministerio de Educación (AP2008-0101)

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Exploiting iteration-level parallelism in declarative programs

Author: Roy John M.A.
Publication venue: eScholarship, University of California
Publication date: 01/01/1991

In order to achieve viable parallel processing, three basic criteria must be met: (1) the system must provide a programming environment which hides the details of parallel processing from the programmer; (2) the system must execute efficiently on the given hardware; and (3) the system must be economically attractive. The first criterion can be met by providing the programmer with an implicit rather than explicit programming paradigm. In this way, all of the synchronization and distribution are handled automatically. To meet the second criterion, the system must perform synchronization and distribution in such a way that the available computing resources are used to their utmost. And to meet the third criterion, the system must not require esoteric or expensive hardware to achieve efficient utilization. This dissertation reports on the Process-Oriented Dataflow System (PODS), which meets all of the above criteria. PODS uses a hybrid von Neumann-Dataflow model of computation supported by an automatic partitioning and distribution scheme. The new partitioning and distribution algorithm is presented along with the underlying principles. Four new mechanisms for distribution are presented: (1) a distributed array allocation operator for data distribution; (2) a distributed L operator for code distribution; (3) a range filter for restricting index ranges for different PEs; and (4) a specialized apply operator for functional parallelism. Simulations show that PODS balances communication overhead with distributed processing to achieve efficient parallel execution on distributed memory multiprocessors. This is partially due to a new software array caching scheme, called remote caching, which greatly reduces the number of remote memory reads. PODS is designed to use off-the-shelf components, with no specialized hardware. In this way, a real PODS machine can be built quickly and cost-effectively. The system is currently being retargeted to the Intel iPSC/2 so that it can be run on commercially available equipment.

eScholarship - University of California

CoBe -- Coded Beacons for Localization, Object Tracking, and SLAM Augmentation

Author: Jubran Ibrahim
Kimmel Ron
Rabinovich Roman
Wetzler Aaron
Publication date: 21/04/2020

This paper presents a novel beacon light coding protocol, which enables fast and accurate identification of the beacons in an image. The protocol is provably robust to a predefined set of detection and decoding errors, and does not require any synchronization between the beacons themselves and the optical sensor. A detailed guide is then given for developing an optical tracking and localization system, which is based on the suggested protocol and readily available hardware. Such a system operates either as a standalone system for recovering the six degrees of freedom of fast-moving objects, or integrated with existing SLAM pipelines, providing them with error-free and easily identifiable landmarks. Based on this guide, we implemented a low-cost positional tracking system which can run in real time on an IoT board. We evaluate our system's accuracy and compare it to other popular methods which utilize the same optical hardware, in experiments where the ground truth is known. A companion video containing multiple real-world experiments demonstrates the accuracy, speed, and applicability of the proposed system in a wide range of environments and real-world tasks. Open source code is provided to encourage further development of low-cost localization systems integrating the suggested technology at their navigation core.

arXiv.org e-Print Archive

Crossref

Reverse-mode algorithmic differentiation of an OpenMP-parallel compressible flow solver

Author: Blazek J
Förster M
Giles MB
Jan Hückelheim
Jens-Dominik Müller
Michelle Mills Strout
Naumann U
Paul Hovland
Spalart P
Publication venue: 'SAGE Publications'
Publication date: 03/05/2017

Crossref

Queen Mary Research Online

Improved Distributed Estimation Method for Environmental time-variant Physical variables in Static Sensor Networks

Author: Khalid Dr. Haris M.
Mahmoud Professor Magdi S.
Sabih Mr. Muhammad
Publication date: 01/12/2011

In this paper, an improved distributed estimation scheme for static sensor networks is developed. The scheme is developed for environmental time-variant physical variables. The main contribution of this work is that the algorithm in [1]-[3] has been extended, and a filter has been designed with weights such that the variance of the estimation errors is minimized, thereby improving the filter design considerably, characterizing the performance limit of the filter, and tracking a time-varying signal. Moreover, certain parameter optimization is alleviated with the application of a particular finite impulse response (FIR) filter. Simulation results show the effectiveness of the developed estimation algorithm.

CogPrints Cognitive Sciences Eprint Archive