4 research outputs found

    Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces

    Full text link

    Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces

    No full text
    In this paper we propose several alternative methods for the compile-time scheduling of Tiled Nested Loops onto a fixed-size parallel architecture. We investigate the distribution of tiles among processors, provided that we have chosen either a non-overlapping communication mode, which involves successive computation and communication steps, or an overlapping communication mode, which supposes a pipelined, concurrent execution of communication and computation. In order to utilize the available processors as efficiently as possible, we can either adopt a cyclic assignment schedule, assign neighboring tiles to the same CPU, or adapt the size and shape of tiles so that the required number of processors is exactly equal to the number of available ones. We theoretically and experimentally compare the proposed schedules, so as to design one which achieves the minimum total execution time, depending on the cluster configuration (i.e. number and type of nodes, interconnect bandwidth, etc.), the internal characteristics of the underlying architecture (i.e. NIC and DMA latencies, etc.) and the iteration space size and shape.
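    The tradeoff between the two communication modes can be pictured with a simple cost sketch. The Python snippet below is a minimal illustration, not the paper's model: it assumes a fixed number of pipelined time steps and per-tile compute/communication costs (all names and values are hypothetical) and compares the total time of the two schedules.

        # Minimal sketch (not the paper's exact model): compares a
        # non-overlapping schedule (compute, then communicate) with an
        # overlapping one (DMA communication pipelined with computation).
        # All parameter names and values are illustrative assumptions.

        def non_overlapping_time(steps, t_comp, t_comm):
            # Each time step is a full computation phase followed by a
            # communication phase; the two never overlap.
            return steps * (t_comp + t_comm)

        def overlapping_time(steps, t_comp, t_comm):
            # Communication of step k overlaps (e.g. via DMA) with the
            # computation of step k+1, so the steady state costs
            # max(t_comp, t_comm) per step, plus a fill/drain term
            # that cannot be hidden.
            return steps * max(t_comp, t_comm) + min(t_comp, t_comm)

        if __name__ == "__main__":
            steps = 64                   # number of pipelined time steps (assumed)
            t_comp, t_comm = 1.0, 0.6    # per-tile compute / communication cost (assumed)
            print("non-overlapping:", non_overlapping_time(steps, t_comp, t_comm))
            print("overlapping:   ", overlapping_time(steps, t_comp, t_comm))

    Under this toy model the overlapping schedule approaches a per-step cost of max(t_comp, t_comm) instead of their sum, which is the source of the speedup the schedules above are designed to exploit.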

    Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces

    No full text
    This paper describes the performance benefits attained by using enhanced network interfaces to achieve low-latency communication. We present a novel, pipelined scheduling approach which takes advantage of the DMA communication mode to send data to other nodes while the CPUs are performing calculations. We also use zero-copy communication through pinned-down physical memory regions provided by the NIC's driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a cluster of SMP nodes with a single PCI-SCI NIC inside each node. In order to schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. Experimental evaluation illustrates that memory mapped NICs with enhanced communication features enable the use of a more advanced pipelined (overlapping) schedule, which considerably improves performance compared to an ordinary blocking schedule implemented with conventional, CPU- and kernel-bound communication primitives.
    Keywords: memory mapped network interfaces, DMA, pipelined schedules, tile grouping, communication overlapping, SMPs
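    As a rough illustration of the overlapping schedule described above, the sketch below emulates "compute the current group while the previous group's boundary data is in flight", using a single worker thread as a stand-in for the NIC's DMA engine. compute_group and send_boundary are hypothetical placeholders, not the paper's kernels or PCI-SCI primitives.

        # Sketch of a pipelined (overlapping) schedule: the CPU computes
        # group k while the transfer of group k-1 proceeds asynchronously.
        from concurrent.futures import ThreadPoolExecutor

        def compute_group(group):
            # stand-in for the tile computations of one hyperplane group
            return sum(i * i for i in range(10_000))

        def send_boundary(group):
            # stand-in for a DMA transfer of the group's boundary data
            pass

        def pipelined_schedule(groups):
            with ThreadPoolExecutor(max_workers=1) as dma:
                pending = None
                for g in groups:
                    compute_group(g)                 # CPUs busy with current group
                    if pending is not None:
                        pending.result()             # previous transfer must finish first
                    pending = dma.submit(send_boundary, g)   # start next transfer
                if pending is not None:
                    pending.result()                 # drain the last transfer

        pipelined_schedule(range(8))

    A blocking schedule would instead call send_boundary synchronously after each compute_group, serializing computation and communication.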

    Supernode Transformation On Parallel Systems With Distributed Memory – An Analytical Approach

    Get PDF
    Supernode transformation, or tiling, is a technique that partitions algorithms to improve data locality and parallelism by balancing computation and inter-processor communication costs, so as to achieve the shortest total running time. It groups multiple iterations of nested loops into supernodes that are assigned to processors for parallel processing. A supernode transformation can be described by a supernode size and shape. This research focuses on supernode transformation on multi-processor architectures with distributed memory, including computer cluster systems and General Purpose Graphics Processing Units (GPGPUs). The research involves supernode scheduling, supernode mapping to processors, and finding the optimal supernode size for achieving the shortest total running time. The algorithms considered are two-level nested loops with regular data dependencies; the Longest Common Subsequence problem is used as an illustration. A novel mathematical model for the total running time is established as a function of the supernode size, algorithm parameters such as the problem size and the data dependences, the computation time of each loop iteration, architecture parameters such as the number of processors, and the communication cost. The optimal supernode size is derived from this closed-form model. The model and the optimal supernode size provide better results than previous research and are verified by simulations on multi-processor systems, including computer cluster systems and GPGPUs.
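    The abstract does not reproduce the closed-form model itself, so the sketch below only illustrates the general approach under an assumed, generic cost shape T(s) = a*s + b/s + c (a pipeline-fill term growing with supernode size plus a per-supernode overhead amortized over larger supernodes). The formula and coefficients are assumptions for illustration, not the dissertation's actual model.

        # Illustrative only: derive an optimal supernode size from an
        # assumed closed-form running-time model T(s) = a*s + b/s + c.
        import math

        def total_time(s, a, b, c):
            # a*s : pipeline fill / startup that grows with supernode size
            # b/s : per-supernode overhead amortized over larger supernodes
            # c   : size-independent work
            return a * s + b / s + c

        def optimal_size(a, b):
            # d/ds (a*s + b/s) = 0  =>  s* = sqrt(b / a)
            return math.sqrt(b / a)

        a, b, c = 0.05, 80.0, 1000.0     # assumed coefficients
        s_star = optimal_size(a, b)
        print(f"optimal supernode size ~ {s_star:.1f}, "
              f"T(s*) = {total_time(s_star, a, b, c):.1f}")

    The actual model in the work above additionally depends on the problem size, data dependences, number of processors and communication parameters, but the optimization step (minimizing a closed-form T over the supernode size) follows the same pattern.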