
    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces several techniques that reduce the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the use of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that are propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs by up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses, and by up to 6.3x for applications with irregular accesses.
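    To make the coalescing idea behind the inspector-executor technique concrete, here is a minimal, self-contained C sketch. It is not the paper's transformation: the "remote" array and coalesced_get are local stand-ins for shared UPC data and a runtime bulk-transfer primitive.

```c
#include <stdio.h>

#define N 1024

/* Simulated "remote" shared array; in UPC this would live in the
 * shared address space of another thread. */
static double remote_data[N];

/* Stand-in for a bulk transfer: fetches the listed elements in one
 * coalesced operation instead of many fine-grained reads. */
static void coalesced_get(double *dst, const int *idx, int count) {
    for (int i = 0; i < count; i++)
        dst[i] = remote_data[idx[i]];
}

int main(void) {
    int    idx[N];      /* index vector built by the inspector  */
    double local[N];    /* privatized copy used by the executor */
    int    count = 0;

    for (int i = 0; i < N; i++)
        remote_data[i] = (double)i;

    /* Inspector: record which remote elements the loop will touch. */
    for (int i = 0; i < N; i += 4)
        idx[count++] = i;

    /* Single coalesced transfer replaces many fine-grained accesses. */
    coalesced_get(local, idx, count);

    /* Executor: the original loop body now reads private memory. */
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += local[i];

    printf("sum = %.1f\n", sum);
    return 0;
}
```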

    Learning from the Success of MPI

    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.

    Parallel Processing of Large Graphs

    More and more large data collections are gathered worldwide in various IT systems. Many of them are networked in nature and need to be processed and analysed as graph structures. Due to their size, they very often require parallel processing for efficient computation. Three parallel techniques are compared in the paper: MapReduce, its map-side join extension, and Bulk Synchronous Parallel (BSP). They are implemented for two different graph problems: calculation of single source shortest paths (SSSP) and collective classification of graph nodes by means of relational influence propagation (RIP). The methods and algorithms are applied to several network datasets differing in size and structural profile, originating from three domains: telecommunication, multimedia, and microblogging. The results reveal that iterative graph processing with the BSP implementation always outperforms MapReduce significantly, by up to a factor of 10, especially for algorithms with many iterations and sparse communication. The MapReduce extension based on map-side join also usually provides noticeably better efficiency, although not as much as BSP. Nevertheless, MapReduce remains a good alternative for enormous networks whose data structures do not fit in local memory.
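    The BSP superstep structure used for SSSP can be sketched in plain C as follows. This is a single-process illustration over an assumed, hard-coded toy graph; the implementations compared in the paper distribute vertices across workers and exchange real messages between supersteps.

```c
#include <stdio.h>
#include <limits.h>

#define V 6
#define INF INT_MAX

/* Small directed graph as an adjacency matrix of edge weights (0 = no edge). */
static const int w[V][V] = {
    {0, 4, 1, 0, 0, 0},
    {0, 0, 0, 2, 0, 0},
    {0, 2, 0, 0, 5, 0},
    {0, 0, 0, 0, 0, 3},
    {0, 0, 0, 1, 0, 6},
    {0, 0, 0, 0, 0, 0},
};

int main(void) {
    int dist[V], msg[V], active[V];

    for (int v = 0; v < V; v++) { dist[v] = INF; active[v] = 0; }
    dist[0] = 0; active[0] = 1;            /* source vertex 0 */

    /* Each iteration is one BSP superstep: active vertices "send"
     * tentative distances to neighbours, then a barrier, then the
     * received messages are combined into the local state. */
    for (int step = 0; step < V; step++) {
        for (int v = 0; v < V; v++) msg[v] = INF;

        /* Computation and message generation. */
        for (int u = 0; u < V; u++) {
            if (!active[u]) continue;
            for (int v = 0; v < V; v++)
                if (w[u][v] && dist[u] + w[u][v] < msg[v])
                    msg[v] = dist[u] + w[u][v];
            active[u] = 0;
        }

        /* (Implicit barrier.) Message delivery for the next superstep. */
        for (int v = 0; v < V; v++)
            if (msg[v] < dist[v]) { dist[v] = msg[v]; active[v] = 1; }
    }

    for (int v = 0; v < V; v++)
        printf("dist[%d] = %d\n", v, dist[v]);
    return 0;
}
```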

    Foggy clouds and cloudy fogs: a real need for coordinated management of fog-to-cloud computing systems

    The recent advances in cloud services technology are fueling a plethora of information technology innovation, including networking, storage, and computing. Today, various flavors of IoT, cloud computing, and so-called fog computing have evolved, the latter referring to the capability of edge devices and users' clients to compute, store, and exchange data among each other and with the cloud. Although the rapid pace of this evolution was not easily foreseeable, today each piece of it facilitates and enables the deployment of what we commonly refer to as a smart scenario, including smart cities, smart transportation, and smart homes. As most current cloud, fog, and network services run simultaneously in each scenario, we observe that we are at the dawn of what may be the next big step in the cloud computing and networking evolution, whereby services might be executed at the network edge, both in parallel and in a coordinated fashion, supported by the unstoppable evolution of the technology. As edge devices become richer in functionality and smarter, embedding capabilities such as storage or processing as well as new functionalities such as decision making, data collection, forwarding, and sharing, a real need is emerging for coordinated management of fog-to-cloud (F2C) computing systems. This article introduces a layered F2C architecture, its benefits and strengths, and the open research challenges it raises, making the case for the need for coordinated F2C management. Our architecture, the illustrative use case presented, and a comparative performance analysis, albeit conceptual, all clearly show the way forward toward a new IoT scenario with a set of existing and unforeseen services provided on highly distributed and dynamic compute, storage, and networking resources, bringing together heterogeneous and commodity edge devices, emerging fogs, and conventional clouds.

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
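    The hashing motif mentioned above appears, for example, in k-mer counting during profiling. The small C sketch below is only illustrative: it counts the k-mers of one read in a local open-addressing table, whereas in the distributed setting described here each insertion would become an asynchronous update to a shared data structure.

```c
#include <stdio.h>
#include <string.h>

#define K     4          /* k-mer length                   */
#define SLOTS 1024       /* hash table size (power of two) */

struct slot { char kmer[K + 1]; int count; };
static struct slot table[SLOTS];

/* Simple FNV-1a hash over the first K characters. */
static unsigned hash(const char *s) {
    unsigned h = 2166136261u;
    for (int i = 0; i < K; i++) h = (h ^ (unsigned char)s[i]) * 16777619u;
    return h & (SLOTS - 1);
}

/* Insert or increment a k-mer using open addressing. */
static void add_kmer(const char *s) {
    unsigned h = hash(s);
    while (table[h].count && strncmp(table[h].kmer, s, K) != 0)
        h = (h + 1) & (SLOTS - 1);       /* linear probing */
    if (!table[h].count) memcpy(table[h].kmer, s, K);
    table[h].count++;
}

int main(void) {
    const char *read = "ACGTACGTGGACGT";   /* toy input read */
    for (size_t i = 0; i + K <= strlen(read); i++)
        add_kmer(read + i);

    for (int i = 0; i < SLOTS; i++)
        if (table[i].count)
            printf("%.*s : %d\n", K, table[i].kmer, table[i].count);
    return 0;
}
```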

    Scalable RDMA performance in PGAS languages

    Partitioned global address space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or clusters of SMPs. Users can program large-scale machines with easy-to-use, shared-memory paradigms. In order to exploit large-scale machines efficiently, PGAS language implementations and their runtime systems must be designed for scalability and performance. The IBM XLUPC compiler and runtime system provide a scalable design through the use of the shared variable directory (SVD). The SVD stores the meta-information needed to access shared data. It is dereferenced, in the worst case, for every shared memory access, thus exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that reduces the SVD access overhead and allows the exploitation of native remote direct memory access (RDMA). It results in a significant performance improvement while maintaining the runtime's portability and scalability.
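    A rough, purely illustrative C sketch of the idea: a directory lookup (the expensive path) is memoized in a small cache so that repeated accesses to the same shared variable can skip the SVD and, on a real system, go straight to RDMA. The structures and function names below are invented for this sketch and simulate the runtime behaviour locally.

```c
#include <stdio.h>

#define NCACHE 64

/* Simplified stand-ins for the runtime's structures: the shared
 * variable directory maps a variable handle to its base address,
 * and the cache memoizes resolved addresses so later accesses can
 * skip the directory lookup. */
struct svd_entry   { int handle; double *base; };
struct cache_entry { int handle; double *base; int valid; };

static struct svd_entry   svd[4];
static struct cache_entry cache[NCACHE];
static int svd_lookups;

static double *svd_resolve(int handle) {
    svd_lookups++;                       /* the expensive path */
    for (int i = 0; i < 4; i++)
        if (svd[i].handle == handle) return svd[i].base;
    return NULL;
}

static double *cached_resolve(int handle) {
    struct cache_entry *e = &cache[handle % NCACHE];
    if (e->valid && e->handle == handle) return e->base;   /* hit */
    e->handle = handle;
    e->base   = svd_resolve(handle);     /* miss: fall back to SVD */
    e->valid  = 1;
    return e->base;
}

int main(void) {
    static double data[8];
    svd[0].handle = 42;
    svd[0].base   = data;

    for (int i = 0; i < 1000; i++) {     /* 1000 accesses, 1 SVD lookup */
        double *p = cached_resolve(42);
        p[i % 8] += 1.0;
    }
    printf("SVD lookups: %d, data[0] = %.0f\n", svd_lookups, data[0]);
    return 0;
}
```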

    Design and Implementation of MapReduce using the PGAS Programming Model with UPC

    This is a post-peer-review, pre-copyedit version of an article published in the proceedings of the International Conference on Parallel and Distributed Systems. The final authenticated version is available online at: http://dx.doi.org/10.1109/ICPADS.2011.162
    MapReduce is a powerful tool for processing large data sets used by many applications running in distributed environments. However, despite the increasing number of computationally intensive problems that require low-latency communications, the adoption of MapReduce in High Performance Computing (HPC) is still emerging. Here, languages based on the Partitioned Global Address Space (PGAS) programming model have been shown to be a good choice for implementing parallel applications, in order to take advantage of the increasing number of cores per node and the programmability benefits achieved by their global memory view, such as transparent access to remote data. This paper presents the first PGAS-based MapReduce implementation that uses the Unified Parallel C (UPC) language, which (1) obtains programmability benefits in parallel programming, (2) offers advanced configuration options to define a customized load distribution for different codes, and (3) overcomes performance penalties and bottlenecks that have traditionally prevented the deployment of MapReduce applications in HPC. The performance evaluation of representative applications on shared and distributed memory environments assesses the scalability of the presented MapReduce framework, confirming its suitability.
    Ministerio de Ciencia e Innovación; TIN2010-1673
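    As a shape-only illustration of the interface such a framework exposes to the user (not the paper's UPC API), the sketch below shows user-supplied map and reduce callbacks driven by a sequential runner; in the UPC implementation the map and reduce phases would be distributed across UPC threads.

```c
#include <stdio.h>

/* A user of a MapReduce framework typically supplies only these two
 * callbacks; the framework decides how the input is partitioned and
 * where each piece runs. */
typedef int (*map_fn)(int);
typedef int (*reduce_fn)(int, int);

static int run_mapreduce(const int *in, int n, map_fn m, reduce_fn r, int init) {
    int acc = init;
    for (int i = 0; i < n; i++)      /* sequential stand-in: each map     */
        acc = r(acc, m(in[i]));      /* call could run on another thread  */
    return acc;
}

/* Example user code: sum of squares. */
static int square(int x)     { return x * x; }
static int add(int a, int b) { return a + b; }

int main(void) {
    int input[] = {1, 2, 3, 4, 5};
    printf("result = %d\n", run_mapreduce(input, 5, square, add, 0));
    return 0;
}
```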

    Applying autonomy to distributed satellite systems: Trends, challenges, and future prospects

    While monolithic satellite missions still pose significant advantages in terms of accuracy and operations, novel distributed architectures are promising improved flexibility, responsiveness, and adaptability to structural and functional changes. Large satellite swarms, opportunistic satellite networks, or heterogeneous constellations hybridizing small-spacecraft nodes with high-performance satellites are becoming feasible and advantageous alternatives requiring the adoption of new operation paradigms that enhance their autonomy. While autonomy is a notion that is gaining acceptance in monolithic satellite missions, it can also be deemed an integral characteristic of Distributed Satellite Systems (DSS). In this context, this paper focuses on the motivations for system-level autonomy in DSS and justifies its need as an enabler of system qualities. Autonomy is also presented as a necessary feature to bring new distributed Earth observation functions (which require coordination and collaboration mechanisms) and to allow for novel structural functions (e.g., opportunistic coalitions, exchange of resources, or in-orbit data services). Mission Planning and Scheduling (MPS) frameworks are then presented as a key component for implementing autonomous operations in satellite missions. An exhaustive knowledge classification explores the design aspects of MPS for DSS and conceptually groups them into: components and organizational paradigms; problem modeling and representation; optimization techniques and metaheuristics; execution and runtime characteristics; and the notions of tasks, resources, and constraints. This paper concludes by proposing future strands of work devoted to studying the trade-offs of autonomy in large-scale, highly dynamic, and heterogeneous networks through frameworks that consider some of the limitations of small spacecraft technologies.

    Automatic mapping of parallel applications on multicore architectures using the Servet benchmark suite

    This is a post-peer-review, pre-copyedit version of an article published in Computers & Electrical Engineering. The final authenticated version is available online at: https://doi.org/10.1016/j.compeleceng.2011.12.007
    Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proved to accurately detect cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, up to now the impact of using this information for application performance optimization had not been assessed. This paper presents a novel algorithm that automatically uses Servet to map parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message passing, shared memory, and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.
    Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Educación; FPU; AP2008-01578. Xunta de Galicia; INCITE08PXIB105161P
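    The kind of mapping policy the abstract refers to can be illustrated with a toy cost model: given inter-core communication costs (of the sort a tool like Servet can measure) and the communication volume between process pairs, a placement is scored by the weighted cost it induces, and a good policy prefers placements that keep heavily communicating processes on "close" cores. All numbers and names below are made up for illustration; this is not the paper's algorithm.

```c
#include <stdio.h>

#define NPROC 4   /* parallel processes to place */
#define NCORE 4   /* available cores             */

/* Illustrative inputs:
 * core_cost[i][j]  - relative communication cost between cores i and j
 * proc_comm[p][q]  - communication volume between processes p and q   */
static const int core_cost[NCORE][NCORE] = {
    {0, 1, 4, 4},
    {1, 0, 4, 4},
    {4, 4, 0, 1},
    {4, 4, 1, 0},
};
static const int proc_comm[NPROC][NPROC] = {
    {0, 9, 1, 1},
    {9, 0, 1, 1},
    {1, 1, 0, 9},
    {1, 1, 9, 0},
};

/* Total cost of a placement: sum over process pairs of
 * (communication volume) x (cost between their assigned cores). */
static int placement_cost(const int *core_of) {
    int cost = 0;
    for (int p = 0; p < NPROC; p++)
        for (int q = p + 1; q < NPROC; q++)
            cost += proc_comm[p][q] * core_cost[core_of[p]][core_of[q]];
    return cost;
}

int main(void) {
    int naive[NPROC] = {0, 2, 1, 3};   /* heavy pairs split across sockets */
    int tuned[NPROC] = {0, 1, 2, 3};   /* heavy pairs share the fast link  */
    printf("naive placement cost: %d\n", placement_cost(naive));
    printf("tuned placement cost: %d\n", placement_cost(tuned));
    return 0;
}
```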