39 research outputs found

A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel

Author: Cucchietti
Dagum
De Raedt
De Raedt
Fernando M. Cucchietti
Lewenstein
Peter Wittek
Poulin
Suzuki
Suzuki
Suzuki
Trotter
Publication venue: 'Elsevier BV'
Publication date: 12/08/2012

The Trotter-Suzuki approximation leads to an efficient algorithm for solving the time-dependent Schrödinger equation. Using existing highly optimized CPU and GPU kernels, we developed a distributed version of the algorithm that runs efficiently on a cluster. Our implementation also improves single-node performance and is able to use multiple GPUs within a node. The scaling is close to linear using the CPU kernels, whereas the efficiency of the GPU kernels improves with larger matrices. We also introduce a hybrid kernel that simultaneously uses multicore CPUs and GPUs in a distributed system. This kernel is shown to be efficient when the matrix would not fit in the GPU memory. Larger quantum systems scale especially well with a high number of nodes. The code is available under an open source license.

Comment: 11 pages, 10 figures

arXiv.org e-Print Archive

CiteSeerX

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

University of Borås

Digitala Vetenskapliga Arkivet - Academic Archive On-line

A visual programming model to implement coarse-grained DSP applications on parallel and heterogeneous clusters

Author: B. Bhattacharya
D.B. Kirk
E. Lee
J. Bueno
K. Parhi
L. Itti
M. Flynn
P.S. Pacheco
R. Chandra
Publication venue: HAL CCSD
Publication date: 01/01/2014

Digital signal processing (DSP) applications are among the biggest consumers of computing power. They process large volumes of data represented with high accuracy, use complex algorithms, and in most cases must satisfy time constraints. On the other hand, parallel and heterogeneous architectures are now necessary to speed up processing; the best examples are the supercomputers "Tianhe-2" and "Titan" from the TOP500 ranking. These architectures can contain several connected nodes, where each node includes a number of general-purpose (multi-core) processors and a number of (many-core) accelerators, allowing several levels of parallelism. However, it is still complicated for DSP programmers to exploit all these parallelism levels and reach good performance for their applications. They have to design their implementation to take advantage of all heterogeneous computing units, taking into account the architectural specificities of each of them: communication model, memory management, data management, job scheduling, synchronization, etc. In the present work, we characterize DSP applications and, based on their distinctive features, propose a high-level visual programming model and an execution model in order to ease their implementation while achieving the desired performance.

Crossref

Hal - Université Grenoble Alpes

Particle-in-cell simulation using asynchronous tasking

Author: Barreto João
Ceyrat Pedro
Fonseca Ricardo
Guidotti Nicolas
Martorell Bofill Xavier
Monteiro José
Peña Monferrer Antonio José
Rodrigues Rodrigo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well understood. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations, exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near-perfect scaling for 48 cores.

Peer reviewed. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC

Particle-In-Cell Simulation using Asynchronous Tasking

Author: Barreto João
Ceyrat Pedro
Fonseca Ricardo
Guidotti Nicolas
Martorell Xavier
Monteiro José
Peña Antonio J.
Rodrigues Rodrigo
Publication date: 01/01/2021

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Analysis of the Task Superscalar architecture hardware design

Author: Badia Sala Rosa Maria
Etsion Yoav
Jiménez González Daniel
Yazdanpanah Ahmadabadi Fahimeh
Álvarez Martínez Carlos
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013

In this paper, we analyze the operational flow of two hardware implementations of the Task Superscalar architecture. The Task Superscalar is an experimental task-based dataflow scheduler that dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks in an out-of-order manner. We present a base implementation of the Task Superscalar architecture, as well as a new design with improved performance. We study how both the base and the improved hardware designs process dependent and non-dependent tasks, and compare the simulation results with those of the runtime implementation.

This work is supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIN2007-60625, by the Generalitat de Catalunya (contract 2009-SGR-980), and by the European FP7 project TERAFLUX id. 249013, http://www.teraflux.eu. We would also like to thank the Xilinx University Program for its hardware and software donations.

Postprint (author’s final draft).

Elsevier - Publisher Connector

Crossref

UPCommons. Portal del coneixement obert de la UPC

A domain-specific high-level programming model

Author: Balarin
Bhattacharya
Blumofe
Board
Boulos
Cameron
Chandra
Grotker
Hiram
Houzet
Kirk
Lee
Munshi
Pacheco
Parhi
Polukhin
Reinders
Sanders
Skillicorn
Valiant
Publication venue: 'Wiley'
Publication date: 22/09/2015

Computing hardware continues to move toward more parallelism and more heterogeneity in order to obtain more computing power. From personal computers to supercomputers, we can find several levels of parallelism expressed by the interconnection of multi-core processors and many-core accelerators. Computing software, in turn, needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to fulfil this difficult task. Different PPMs are available, based on tasks, directives, or low-level languages or libraries. Each offers a higher or lower level of abstraction from the architecture through its own syntax. However, to offer an efficient PPM with an additional, higher level of abstraction while preserving performance, one idea is to restrict it to a specific domain and adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation and a dynamic runtime model of execution (StarPU). We show how the user can easily express a digital signal processing application and take advantage of task, data and graph parallelism in the implementation to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi).

Crossref

Hal - Université Grenoble Alpes

Different aspects of workflow scheduling in large-scale distributed systems

Author: Carretero Pérez Jesús
García Blas Francisco Javier
Karatza Helen D.
Rodrigo Duro Francisco José
Stavrinides Georgios L.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

As large-scale distributed systems gain momentum, the scheduling of workflow applications with multiple requirements in such computing platforms has become a crucial area of research. In this paper, we investigate the workflow scheduling problem in large-scale distributed systems, from the Quality of Service (QoS) and data locality perspectives. We present a scheduling approach, considering two models of synchronization for the tasks in a workflow application: (a) communication through the network and (b) communication through temporary files. Specifically, we investigate via simulation the performance of a heterogeneous distributed system, where multiple soft real-time workflow applications arrive dynamically. The applications are scheduled under various tardiness bounds, taking into account the communication cost in the first case study and the I/O cost and data locality in the second.

The work presented in this paper has been partially supported by EU, under the COST program Action IC1305, “Network for Sustainable Ultrascale Computing (NESUS)”, and by the Ministerio de Economía y Competitividad, Spain, under the project TIN2013-41350-P, “Scalable Data Management Techniques for High-End Computing Systems”.

Universidad Carlos III de Madrid e-Archivo