Search CORE

131 research outputs found

Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

Author: Zhang Yu
Publication venue
Publication date: 28/01/2009
Field of study

Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment

D-Scholarship@Pitt

Optimal all-to-all personalized exchange in self-routable multistage networks

Author: J. Wang
Y. Yang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L.

Author: Andy Yoo
Andy Yoo
Bruce Hendrickson
Bruce Hendrickson
Edmond Chow
Edmond Chow
Keith Henderson
Keith Henderson
Umit Catalyurek
William Mclendon
William Mclendon
§ † Lawrence
¶ D E Shaw
Publication venue
Publication date: 01/01/2005
Field of study

Abstract Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported

CiteSeerX

Circuit-Switched Gossiping in the 3-Dimensional Torus Networks

Author: Delmas Olivier
Pérennes Stéphane
Publication venue: HAL CCSD
Publication date: 01/07/1996
Field of study

In this paper we describe, in the case of short messages, an efficient gossiping algorithm for 3-dimensional torus networks (wrap-around or toroidal meshes) that uses synchronous circuit-switched routing. The algorithm is based on a recursive decomposition of a torus. The algorithm requires an optimal number of rounds and a quasi-optimal number of intermediate switch settings to gossip in an

7^i \times 7^i \times 7^i

torus

INRIA a CCSD electronic archive server

Model-driven approach for supporting the mapping of parallel algorithms to parallel computing platforms

Author: A. Gamatié
A.H. Karp
A.M. Aizcorbe
C. Baransel
D. Talia
G. Rudy
G.E. Moore
G.M. Amdahl
H.W. Lenstra
J.G. Peters
J.L. Gustafson
K. Czarnecki
K.M. İmre
L. Adhianto
M. Geimer
M.D. Hill
M.P. Frank
S.S. Shende
T.D. Han
X.D. Zhang
Y.-J. Suh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

The trend from single processor to parallel computer architectures has increased the importance of parallel computing. To support parallel computing it is important to map parallel algorithms to a computing platform that consists of multiple parallel processing nodes. In general different alternative mappings can be defined that perform differently with respect to the quality requirements for power consumption, efficiency and memory usage. The mapping process can be carried out manually for platforms with a limited number of processing nodes. However, for exascale computing in which hundreds of thousands of processing nodes are applied, the mapping process soon becomes intractable. To assist the parallel computing engineer we provide a model-driven approach to analyze, model, and select feasible mappings. We describe the developed toolset that implements the corresponding approach together with the required metamodels and model transformations. We illustrate our approach for the well-known complete exchange algorithm in parallel computing. © 2013 Springer-Verlag

Crossref

Bilkent University Institutional Repository

Optimizing Communication for Massively Parallel Processing

Author: Kumar Sameer
Publication venue
Publication date: 01/04/2005
Field of study

The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load-imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications to easily scale on a large number of processors. We also present automatic runtime techniques to relieve the programmer from the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors all-to-all communication can take several milli-seconds to finish. With synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework, to let the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis, has been designed to optimize communication in the context of processor virtualization and dynamic migrating objects. We present the streaming strategy to optimize fine grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor based collectives can be delayed by intermediate that processors busy with computation. We explore a next generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks

Illinois Digital Environment for Access to Learning and Scholarship Repository

Recommended from our members

OpenAtom: Ab initio Molecular Dynamics for Petascale Platforms

Author: Arya A.
Bhatele A.
Bohm E. J.
Kale L. V.
Martyna G. J.
Venkataraman R.
Publication venue: Lawrence Livermore National Laboratory
Publication date: 14/12/2012
Field of study

UNT Digital Library

Runtime Support for In-Core and Out-of-Core Data-Parallel Programs

Author: Thakur Rajeev
Publication venue: SURFACE at Syracuse University
Publication date: 01/01/1995
Field of study

Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines. In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all to all collective communication on fat tree and two dimensional mesh interconnection topologies. The performance of these algorithms on the CM 5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results. A number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks, resulting in out of core programs. This thesis also describes the design and implementation of efficient runtime support for out of core computations. Several optimizations for accessing out of core data are presented. An extended Two Phase Method is proposed for accessing sections of out of core arrays efficiently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out of core programs on the Touchstone Delta are presented

CiteSeerX

Syracuse University Research Facility and Collaborative Environment