Search CORE

138,806 research outputs found

Bounding the execution time of parallel applications on unrelated multiprocessors

Author: Pathan Risat
Stenstr\uf6m Per
Voudouris Petros
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

Heterogeneous multiprocessors, that consist of processor types with different execution capabilities, are critical today, and in future, to offer high performance and high energy efficiency. In order to use them in hard real-time systems to support parallel processing, a tight estimation of the upper bound on the completion time (WCET) of parallel applications is needed. This paper presents, for the first time, a closed-form solution for the calculation of the WCET for task-based parallel applications modeled as directed acyclic-graphs (DAG) using the general unrelated multiprocessor model that is capable of modeling a wide range of heterogeneous multiprocessor platforms. The paper contributes with a polynomial time algorithm to calculate the WCET (i.e., makespan) for the unrelated model. In addition, it presents simulation results that are based on modeling a set of representative OpenMP task-based parallel applications from the BOTS benchmark suite

PubMed Central

Chalmers Research

Estimating the Potential Speedup of Computer Vision Applications on Embedded Multiprocessors

Author: Cleyet-Merle Sébastien
Issard Alain
Mancini Stéphane
Schwambach Vítor
Publication venue
Publication date: 26/02/2015
Field of study

Computer vision applications constitute one of the key drivers for embedded multicore architectures. Although the number of available cores is increasing in new architectures, designing an application to maximize the utilization of the platform is still a challenge. In this sense, parallel performance prediction tools can aid developers in understanding the characteristics of an application and finding the most adequate parallelization strategy. In this work, we present a method for early parallel performance estimation on embedded multiprocessors from sequential application traces. We describe its implementation in Parana, a fast trace-driven simulator targeting OpenMP applications on the STMicroelectronics' STxP70 Application-Specific Multiprocessor (ASMP). Results for the FAST key point detector application show an error margin of less than 10% compared to the reference cycle-approximate simulator, with lower modeling effort and up to 20x faster execution time.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance.

Author: Smith Melissa C.
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2002
Field of study

The use of a network of shared, heterogeneous workstations each harboring a Reconfigurable Computing (RC) system offers high performance users an inexpensive platform for a wide range of computationally demanding problems. However, effectively using the full potential of these systems can be challenging without the knowledge of the system’s performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of Reconfigurable Computing systems. This dissertation develops and validates an analytic performance modeling methodology for a class of fork-join algorithms executing on a High Performance Reconfigurable Computing (HPRC) platform. The model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message passing communication, and processor heterogeneity. Three fork-join class of applications, a Boolean Satisfiability Solver, a Matrix-Vector Multiplication algorithm, and an Advanced Encryption Standard algorithm are used to validate the model with homogeneous and simulated heterogeneous workstations. A synthetic load is used to validate the model under various loading conditions including simulating heterogeneity by making some workstations appear slower than others by the use of background loading. The performance modeling methodology proves to be accurate in characterizing the effects of reconfigurable devices, application load imbalance, background user load and heterogeneity for applications running on shared, homogeneous and heterogeneous HPRC resources. The model error in all cases was found to be less than five percent for application runtimes greater than thirty seconds and less than fifteen percent for runtimes less than thirty seconds. The performance modeling methodology enables us to characterize applications running on shared HPRC resources. Cost functions are used to impose system usage policies and the results of vii the modeling methodology are utilized to find the optimal (or near-optimal) set of workstations to use for a given application. The usage policies investigated include determining the computational costs for the workstations and balancing the priority of the background user load with the parallel application. The applications studied fall within the Master-Worker paradigm and are well suited for a grid computing approach. A method for using NetSolve, a grid middleware, with the model and cost functions is introduced whereby users can produce optimal workstation sets and schedules for Master-Worker applications running on shared HPRC resources

University of Tennessee, Knoxville: Trace

CiteSeerX

A Virtualized SGE-based Computational Cluster for Heterogeneous Environments

Author: Asif M.
Publication venue: IR-09-052
Publication date: 01/10/2009
Field of study

The computing and modeling environment of IIASA was studied in the context of computation-intensive ad resource-demanding applications/models which are being developed and used by the researchers/scientists of IIASA. High Performance Computing applicatins can be classified into two broad computing fields; sequential distributed and parallel distributed applications and these applications has been developed for heterogeneous operating system architectures such as Linux, Windows and Solaris etc. Majority of IIASA applications/models belong to the latter class of computing and these applications are resource demanding when the extensive and repetitive use of these applications is required according to the need of some research study. Not every sequential application can be easily parallelized; therefore, instead of re-programming sequental applications into parallel ones, the idea of distributing such applications on computing cluster/grid is often an effective approach for accelerating the work. In the light of available computing resources and modest modeling environment of IIASA, the virtualization and Sun Grid Engine (batch job scheduler and manager for cluster/grid) was efficiently exploited and designed, built and tested. This resulted in a computational cluster supporting multiple operating systems and multiple sequential distributed and parallel distributed applications/models along with multiple job execution types such as binaries and JAVA

International Institute for Applied Systems Analysis (IIASA)

EXPLORING MULTIPLE LEVELS OF PERFORMANCE MODELING FOR HETEROGENEOUS SYSTEMS

Author: Pallipuram Krishnamani Venkittaraman Vivek
Publication venue: Clemson University Libraries
Publication date: 01/12/2013
Field of study

The current trend in High-Performance Computing (HPC) is to extract concurrency from clusters that include heterogeneous resources such as General Purpose Graphical Processing Units (GPGPUs) and Field Programmable Gate Array (FPGAs). Although these heterogeneous systems can provide substantial performance for massively parallel applications, much of the available computing resources are often under-utilized due to inefficient application mapping, load balancing, and tuning. While several performance prediction models exist to efficiently tune applications, they often require significant computing architecture knowledge for reliable prediction. In addition, they do not address multiple levels of design space abstraction and it is often difficult to choose a reliable prediction model for a given design. In this research, we develop a multi-level suite of performance prediction models for heterogeneous systems that primarily targets Synchronous Iterative Algorithms (SIAs). The modeling suite aims to produce accurate and straightforward application runtime prediction prior to the actual large-scale implementation. This suite addresses two levels of system abstraction: 1) low-level where partial knowledge of the application implementation is present along with the system specifications and 2) high-level where the implementation details are minimum and only high-level computing system specifications are given. The performance prediction modeling suite is developed using our proposed Synchronous Iterative GPGPU Execution (SIGE) model for GPGPU clusters, motivated by the RC Amenability Test for Scalable Systems (RATSS) model for FPGA clusters. The low-level abstraction for GPGPU clusters consists of a regression-based performance prediction framework that statistically abstracts system architecture characteristics, enabling performance prediction without detailed architecture knowledge. In this framework, the overall execution time of an application is predicted using regression models developed for host-device computations and network-level communications performed in the algorithm. We have used a family of Spiking Neural Network (SNN) models and an Anisotropic Diffusion Filter (ADF) algorithm as SIA case studies for verification of the regression-based framework and achieved over 90% prediction accuracy compared to the actual implementations for several GPGPU cluster configurations tested. The results establish the adequacy of the low-level abstraction model for advanced, fine-grained performance prediction and design space exploration (DSE). The high-level abstraction consists of the following two primary modeling approaches: qualitative modeling that uses existing subjective-analytical models for computation and communication; and quantitative modeling that predicts computation and communication performance by measuring hardware events associated with objective-analytical models using micro-benchmarks. The performance prediction provided by the high-level abstraction approaches, albeit coarse-grained, delivers useful insight into application performance on the chosen heterogeneous system. A blend of the two high-level modeling approaches, labeled as hybrid modeling, is explored for insightful preliminary performance prediction. The performance prediction models in the multi-level suite are verified and compared for their accuracy and ease-of-use, allowing developers to choose a model that best satisfies their design space abstraction. We also construct a roadmap that guides user from optimal Application-to-Accelerator (A2A) mapping to fine-grained performance prediction, thereby providing a hierarchical approach to optimal application porting on the target heterogeneous system. The end goal of this dissertation research is to offer the HPC community a thorough, non-architecture specific, performance prediction framework in the form of a hierarchical modeling suite that enables them to optimally utilize the heterogeneous resources

Clemson University: TigerPrints

Numerical Representation of Directed Acyclic Graphs for Efficient Dataflow Embedded Resource Allocation

Author: Arrestier Florian
Desnos Karol
Juarez Eduardo
Menard Daniel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

International audienceStream processing applications running on Heterogeneous Multi-Processor Systems on Chips (HMPSoCs) require efficient resource allocation and management, both at compile-time and at runtime. To cope with modern adaptive applications whose behavior can not be exhaustively predicted at compile-time, runtime managers must be able to take resource allocation decisions on-the-fly, with a minimum overhead on application performance. Resource allocation algorithms often rely on an internal modeling of an application. Directed Acyclic Graph (DAGs) are the most commonly used models for capturing control and data dependencies between tasks. DAGs are notably often used as an intermediate representation for deploying applications modeled with a dataflow Model of Computation (MoC) on HMPSoCs. Building such intermediate representation at runtime for massively parallel applications is costly both in terms of computation and memory overhead. In this paper, an intermediate representation of DAGs for resource allocation is presented. This new representation shows improved performance for run-time analysis of dataflow graphs with less overhead in both computation time and memory footprint. The performances of the proposed representation are evaluated on a set of computer vision and machine learning applications

HAL-CentraleSupelec

HAL-Rennes 1

Implement of a high-performance computing system for parallel processing of scientific applications and the teaching of multicore and parallel programming

Author: Velarde Martinez Apolinar
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 07/05/2019
Field of study

[EN] Increasingly complex algorithms for the modeling and resolution of different problems, which are currently facing humanity, has made it necessary the advent of new data processing requirements and the consequent implementation of high performance computing systems; but due to the high economic cost of this type of equipment and considering that an education institution cannot acquire, it is necessary to develop and implement computable architectures that are economical and scalable in their construction, such as heterogeneous distributed computing systems, constituted by several clustering of multicore processing elements with shared and distributed memory systems. This paper presents the analysis, design and implementation of a high-performance computing system called Liebres InTELigentes, whose purpose is the design and execution of intrinsically parallel algorithms, which require high amounts of storage and excessive processing times. The proposed computer system is constituted by conventional computing equipment (desktop computers, lap top equipment and servers), linked by a high-speed network. The main objective of this research is to build technology for the purposes of scientific and educational research.This project is sponsored by Tecnologico Nacional de México TecNM. 2018-2 110Velarde Martinez, A. (2019). Implement of a high-performance computing system for parallel processing of scientific applications and the teaching of multicore and parallel programming. En INNODOCT/18. International Conference on Innovation, Documentation and Education. Editorial Universitat Politècnica de València. 203-213. https://doi.org/10.4995/INN2018.2018.8908OCS20321

Crossref

RiuNet