Search CORE

5,214 research outputs found

An intelligent allocation algorithm for parallel processing

Author: Ananthram Kishan G.
Carroll Chester C.
Homaifar Abdollah
Publication venue
Publication date
Field of study

The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network

NASA Technical Reports Server

Run-time Spatial Mapping of Streaming Applications to Heterogeneous Multi-Processor Systems

Author: Braak Timon D. ter
Hurink Johann L.
Hölzenspies Philip K.F.
Kuper Jan
Smit Gerard J.M.
Publication venue: Springer Verlag
Publication date: 01/01/2009
Field of study

In this paper, we define the problem of spatial mapping. We present reasons why performing spatial mappings at run-time is both necessary and desirable. We propose what is—to our knowledge—the first attempt at a formal description of spatial mappings for the embedded real-time streaming application domain. Thereby, we introduce criteria for a qualitative comparison of these spatial mappings. As an illustration of how our formalization relates to practice, we relate our own spatial mapping algorithm to the formal model

Springer - Publisher Connector

University of Twente Research Information

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Author: Alexandrov Vassil
McKee Gerard
Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014
Field of study

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.Comment: Computers in Biology and Medicin

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

University of St. Andrews - Pure

St Andrews Research Repository

A New Duplication Task Scheduling Algorithm in Heterogeneous Distributed Computing Systems

Author: A Nasr Aida
EL-Bahnasawy Nirmeen A
EL-Sayed Ayman
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/09/2016
Field of study

The efficient scheduling algorithm is critical to achieve high performance in parallel and distributed systems. The main objective of task scheduling is to assign the tasks onto the available processors with the aim of producing minimum schedule length and without violating the precedence constraints. So we developed new algorithm called Mean Communication Node with Duplication MCND algorithm to achieve high performance task scheduling. The MCND algorithm has two phases namely, task priority and processor selection. Our algorithm takes into account the average of parents' communication costs for each task to reduce the overhead communication. The algorithm uses new task duplication algorithm. We build a simulation to compare the MCND algorithm with CPOP with duplication algorithm. The algorithms are applied on real application. From results, the MCND algorithm shows the best result

Bulletin of Electrical Engineering and Informatics

Neliti

Crossref

DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

Author: An Tao
Boulton Mark
Cooper Ian
Dodson Richard
Dolensky Markus
Lao Baoqiang
Mei Ying
Pallot Dave
Tobar Rodrigo
Vinsen Kevin
Wang Feng
Wang Ruonan
Wicenec Andreas
Wu Chen
Publication venue
Publication date: 01/01/2017
Field of study

The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. The execution in DALiuGE is data-activated, where each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from less than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph; and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities.Comment: 31 pages, 12 figures, currently under review by Astronomy and Computin

arXiv.org e-Print Archive

Shanghai Astronomical Observatory,Chinese Academy of Sciences

Run-time scheduling and execution of loops on message passing machines

Author: Berryman Harry
Crowley Kay
Mirchandaney Ravi
Saltz Joel
Publication venue
Publication date
Field of study

Sparse system solvers and general purpose codes for solving partial differential equations are examples of the many types of problems whose irregularity can result in poor performance on distributed memory machines. Often, the data structures used in these problems are very flexible. Crucial details concerning loop dependences are encoded in these structures rather than being explicitly represented in the program. Good methods for parallelizing and partitioning these types of problems require assignment of computations in rather arbitrary ways. Naive implementations of programs on distributed memory machines requiring general loop partitions can be extremely inefficient. Instead, the scheduling mechanism needs to capture the data reference patterns of the loops in order to partition the problem. First, the indices assigned to each processor must be locally numbered. Next, it is necessary to precompute what information is needed by each processor at various points in the computation. The precomputed information is then used to generate an execution template designed to carry out the computation, communication, and partitioning of data, in an optimized manner. The design is presented for a general preprocessor and schedule executer, the structures of which do not vary, even though the details of the computation and of the type of information are problem dependent

NASA Technical Reports Server

Automated problem scheduling and reduction of synchronization delay effects

Author: Saltz Joel H.
Publication venue
Publication date
Field of study

It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable preformance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acrylic graphs, and (2) because such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs

NASA Technical Reports Server