A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids
The efficient implementation of collective communication operations has
received much attention. Initial efforts produced "optimal" trees based on
network communication models that assumed equal point-to-point latencies
between any two processes. This assumption is violated in most practical
settings, however, particularly in heterogeneous systems such as clusters of
SMPs and wide-area "computational Grids," with the result that collective
operations perform suboptimally. In response, more recent work has focused on
creating topology-aware trees for collective operations that minimize
communication across slower channels (e.g., a wide-area network). While these
efforts have significant communication benefits, they all limit their view of
the network to only two layers. We present a strategy based upon a multilayer
view of the network. By creating multilevel topology-aware trees we take
advantage of communication cost differences at every level in the network. We
used this strategy to implement topology-aware versions of several MPI
collective operations in MPICH-G2, the Globus Toolkit[tm]-enabled version of
the popular MPICH implementation of the MPI standard. Using information about
topology provided by MPICH-G2, we construct these multilevel topology-aware
trees automatically during execution. We present results demonstrating the
advantages of our multilevel approach by comparing it to the default
(topology-unaware) implementation provided by MPICH and a topology-aware
two-layer implementation. Comment: 16 pages, 8 figures
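The multilevel idea in this abstract can be illustrated with a small sketch. It assumes that every process carries a tuple of group identifiers ordered from the slowest network level to the fastest (here a hypothetical `(site, machine)` pair) and that each group leader simply fans out to its subgroups. The function name `multilevel_bcast_edges` and the data layout are illustrative assumptions, not MPICH-G2's interface or the tree shapes the authors use.

```python
from collections import OrderedDict


def multilevel_bcast_edges(root, coords):
    """Return (sender, receiver, level) edges of a broadcast tree.

    coords maps rank -> tuple of group ids, slowest level first,
    e.g. (site, machine). An edge at level 0 crosses sites; an edge at
    level len(tuple) stays inside the innermost group.
    """
    depth = len(coords[root])
    edges = []

    def spread(leader, members, level):
        if level == depth:
            # Innermost group: the leader sends directly to everyone else.
            edges.extend((leader, r, level) for r in members if r != leader)
            return
        # Split the current group into subgroups at this level.
        groups = OrderedDict()
        for r in members:
            groups.setdefault(coords[r][level], []).append(r)
        for gid, group in groups.items():
            if gid == coords[leader][level]:
                sub_leader = leader  # the leader already holds the data
            else:
                sub_leader = group[0]
                # Exactly one message enters each remote subgroup.
                edges.append((leader, sub_leader, level))
            spread(sub_leader, group, level + 1)

    spread(root, sorted(coords), 0)
    return edges


# Hypothetical layout: two sites, two machines per site.
coords = {
    0: ("A", "m0"), 1: ("A", "m0"), 2: ("A", "m1"),
    3: ("B", "m2"), 4: ("B", "m2"), 5: ("B", "m3"),
}
for edge in multilevel_bcast_edges(0, coords):
    print(edge)
```

In this layout only one message crosses the wide-area link between sites A and B, and only one message enters each remote machine; all remaining traffic stays at the fastest level.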
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations. Comment: 20 pages, 8 figures
NUMA-Aware Strategies for the Heterogeneous Execution of SPMV on Modern Supercomputers
The sparse matrix-vector product is a widespread operation in scientific computing. It is the dominant computational cost in many large-scale simulations that rely on iterative methods, and its performance is sensitive to the sparsity pattern, the storage format, the kernel implementation, and the target computing architecture. This work addresses the efficient execution of the sparse matrix-vector product on (potentially hybrid) modern supercomputers with non-uniform memory access configurations. A hierarchical parallel implementation is proposed to minimize the number of processes participating in distributed-memory parallelization. As a result, a single process per computing node is enough to engage all its hardware and ensure efficient memory access on manycore platforms. The benefits of this approach have been demonstrated on up to 9,600 cores of the MareNostrum 4 supercomputer at the Barcelona Supercomputing Center.
The work of A. Gorobets has been funded by the Russian Science Foundation, project 19-11-00299. The work of X. Álvarez-Farré, F. X. Trias and A. Oliva has been financially supported by the ANUMESOL project (ENE2017-88697-R) by the Spanish Research Agency (Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación), and the FusionCAT project (001-P-001722) by the Government of Catalonia (RIS3CAT FEDER). The studies of this work have been carried out using the MareNostrum 4 supercomputer of the Barcelona Supercomputing Center (projects IM-2020-2-0029 and IM-2020-3-0030); the TSUBAME3.0 supercomputer of the Global Scientific Information and Computing Center at Tokyo Institute of Technology; the Lomonosov-2 supercomputer of the shared research facilities of HPC computing resources at Lomonosov Moscow State University; and the K-60 hybrid cluster of the collective use center of the Keldysh Institute of Applied Mathematics.
The authors thankfully acknowledge these institutions for the compute time and technical support. Postprint (published version)
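For readers unfamiliar with the kernel, the following is a minimal sketch of a sparse matrix-vector product over a range of rows. It assumes the compressed sparse row (CSR) layout purely as a common example; the abstract does not state which storage format the authors use, and the function name `spmv_csr_rows` is hypothetical. Restricting the kernel to a row range mimics how the rows of a node-local matrix could be divided among threads or devices inside a single process.

```python
def spmv_csr_rows(row_ptr, col_idx, values, x, row_begin, row_end):
    """Return y[row_begin:row_end] for y = A @ x with A stored in CSR.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    y = []
    for row in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

# Two workers, each handling a contiguous block of rows.
top = spmv_csr_rows(row_ptr, col_idx, values, x, 0, 2)
bottom = spmv_csr_rows(row_ptr, col_idx, values, x, 2, 3)
print(top + bottom)  # [7.0, 6.0, 17.0]
```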
Improving the Performance of the MPI_Allreduce Collective Operation through Rank Renaming
Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014). Porto (Portugal), August 27-28, 2014.
Collective operations, a key factor in the overall efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime among a set of algorithms, based on platform-dependent parameters established beforehand, such as the message size or the number of processes. However, with progressively more cores per node, the cost of a collective algorithm must be attributed mainly to the process-to-processor mapping, because of its decisive influence on network traffic. The hierarchical design of collective algorithms seeks to minimize data movement through the slowest communication channels of the multi-core cluster. Nevertheless, the hierarchical implementation of some collectives becomes inefficient, and even impracticable, because of the definition of the operation itself. This paper proposes a new approach that starts from a frequently found regular mapping, either sequential or round-robin. While keeping the mapping, the rank assignment of the processes is temporarily changed before the collective algorithm executes. The new assignment makes the communication pattern adapt to the hierarchy of communication channels. We explore this technique for the Ring algorithm when used in the well-known MPI_Allreduce collective, and discuss the performance results obtained. Extensions to other algorithms and collective operations are proposed.
The work presented in this paper has been partially supported by the EU
under the COST programme Action IC1305, "Network for Sustainable
Ultrascale Computing (NESUS)", and by the computing facilities
of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT),
funded by the European Regional Development Fund
(ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of
Spain.
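The effect of rank renaming on a ring can be seen in a small sketch. It assumes a round-robin mapping (rank r placed on node r mod N) and renames ranks by sorting processes by node, so that ring neighbours share a node wherever possible. The helper names `rename_ranks` and `inter_node_ring_hops` are hypothetical, and this sorting rule is only one plausible renaming, not necessarily the scheme used in the paper.

```python
def rename_ranks(node_of):
    """Return new_rank, where new_rank[p] is the temporary rank of process p.

    node_of[p] is the node hosting process p. Processes are renumbered so
    that all processes of a node receive consecutive ranks.
    """
    order = sorted(range(len(node_of)), key=lambda p: (node_of[p], p))
    new_rank = [0] * len(node_of)
    for rank, proc in enumerate(order):
        new_rank[proc] = rank
    return new_rank


def inter_node_ring_hops(node_of, rank_of=None):
    """Count ring links (rank k -> rank k+1 mod P) that cross nodes."""
    p = len(node_of)
    if rank_of is None:
        rank_of = list(range(p))  # original ranks
    # process_at[k] is the process that currently holds rank k.
    process_at = [0] * p
    for proc, rank in enumerate(rank_of):
        process_at[rank] = proc
    return sum(
        node_of[process_at[k]] != node_of[process_at[(k + 1) % p]]
        for k in range(p)
    )


# Eight processes placed round-robin on two nodes.
node_of = [r % 2 for r in range(8)]
print(inter_node_ring_hops(node_of))                         # 8
print(inter_node_ring_hops(node_of, rename_ranks(node_of)))  # 2
```

With the original ranks every ring link crosses the network; after renaming only the two links that join the nodes do, while the process-to-node mapping itself is untouched.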
A Framework for Adaptive Collective Communications on Heterogeneous Hierarchical Networks
Extended version of the IPDPS 2006 paper.
Today, because of the wide variety of existing parallel systems consisting of collections of heterogeneous machines, it is very difficult for a user to solve a target problem with a single algorithm or to write portable programs that perform well on multiple computing platforms. The inherent heterogeneity and the diversity of networks in such environments make modeling communications for high-performance computing applications a great challenge. Our objective in this work is to propose a generic framework, based on communication models and adaptive techniques, for predicting communication performance on cluster-based hierarchical platforms. Toward this goal, we introduce the concept of a polyalgorithmic model of communications, which corresponds to selecting the best-suited communication algorithms and scheduling strategies given the characteristics of the hardware resources of the target parallel system. We apply this methodology to collective communication operations and show that the framework delivers significant performance while determining the best algorithm depending on the problem and architecture parameters.
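Model-driven algorithm selection of the kind described here can be sketched as follows. The sketch assumes the simple latency/bandwidth cost model (time = alpha + m * beta per message) and three textbook broadcast algorithms; these cost formulas, the parameter values, and the function names are assumptions chosen for illustration and are not the models developed in the paper.

```python
import math


def predicted_times(p, m, alpha, beta):
    """Predicted broadcast time in seconds for p >= 2 processes.

    m is the message size in bytes, alpha the per-message latency in
    seconds, and beta the transfer time per byte in seconds.
    """
    steps = math.ceil(math.log2(p))
    return {
        # Root sends the whole message to every other process in turn.
        "flat": (p - 1) * (alpha + m * beta),
        # The number of processes holding the message doubles each step.
        "binomial": steps * (alpha + m * beta),
        # Binomial scatter of m/p pieces followed by a ring allgather.
        "scatter_allgather": (steps + p - 1) * alpha
        + 2 * (p - 1) / p * m * beta,
    }


def select_algorithm(p, m, alpha, beta):
    """Pick the algorithm with the smallest predicted time."""
    times = predicted_times(p, m, alpha, beta)
    return min(times, key=times.get)


# Hypothetical platform: 10 microseconds latency, 1 GB/s bandwidth.
alpha, beta = 1e-5, 1e-9
print(select_algorithm(16, 1_000, alpha, beta))       # binomial
print(select_algorithm(16, 10_000_000, alpha, beta))  # scatter_allgather
```

The same selection step could be repeated per network level with level-specific `alpha` and `beta`, which is where a hierarchical platform would depart from this flat sketch.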
The Inter-cloud meta-scheduling
Inter-cloud is a recently emerging approach that expands cloud elasticity. By facilitating an adaptable setting, it aims at scalable resource provisioning that enables a diversity of cloud user requirements to be handled efficiently. This study's contribution is the optimization of inter-cloud job execution performance using meta-scheduling concepts. This includes the development of the inter-cloud meta-scheduling (ICMS) framework, the ICMS optimal schemes and the SimIC toolkit. The ICMS model is an architectural strategy for managing and scheduling user services in virtualized, dynamically inter-linked clouds. This is achieved by the development of a model that includes a set of algorithms, namely the Service-Request, Service-Distribution, Service-Availability and Service-Allocation algorithms. These, along with the resource management optimal schemes, offer the novel functionalities of the ICMS, where the message exchanging implements the job distribution method, the VM deployment offers the VM management features and the local resource management system details the management of the local cloud schedulers. The resulting system offers great flexibility by facilitating a lightweight resource management methodology while at the same time handling the heterogeneity of different clouds through advanced service level agreement coordination. Experimental results are encouraging: the proposed ICMS model improves the performance of service distribution for a variety of criteria such as service execution times, makespan, turnaround times, utilization levels and energy consumption rates for various inter-cloud entities, e.g. users, hosts and VMs. For example, ICMS improves the performance of a non-meta-brokering inter-cloud by 3%, while ICMS with full optimal schemes achieves a 9% improvement for the same configurations.
The whole experimental platform is implemented in the inter-cloud Simulation toolkit (SimIC) developed by the author, which is a discrete event simulation framework.
HIGH-PERFORMANCE SPECTRAL METHODS FOR COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS
Recent research shows that by leveraging the key spectral properties of eigenvalues and eigenvectors of graph Laplacians, more efficient algorithms can be developed for tackling many graph-related computing tasks. In this dissertation, spectral methods are used to obtain faster algorithms for very-large-scale integration (VLSI) computer-aided design (CAD) applications.
First, a scalable algorithmic framework is proposed for effective-resistance preserving spectral reduction of large undirected graphs. The proposed method allows computing much smaller graphs while preserving the key spectral (structural) properties of the original graph. Our framework is built upon the following three key components: a spectrum-preserving node aggregation and reduction scheme, a spectral graph sparsification framework with iterative edge weight scaling, as well as effective-resistance preserving post-scaling and iterative solution refinement schemes. We show that the resultant spectrally-reduced graphs can robustly preserve the first few nontrivial eigenvalues and eigenvectors of the original graph Laplacian and thus allow for developing highly-scalable spectral graph partitioning and circuit simulation algorithms.
Based on the framework of the spectral graph reduction, a Sparsified graph-theoretic Algebraic Multigrid (SAMG) is proposed for solving large Symmetric Diagonally Dominant (SDD) matrices. The proposed SAMG framework allows efficient construction of nearly-linear sized graph Laplacians for coarse-level problems while maintaining good spectral approximation during the AMG setup phase by leveraging a scalable spectral graph sparsification engine. Our experimental results show that the proposed method can offer more scalable performance than existing graph-theoretic AMG solvers for solving large SDD matrices in integrated circuit (IC) simulations, 3D-IC thermal analysis, image processing, finite element analysis as well as data mining and machine learning applications.
Finally, the spectral methods are applied to power grid and thermal integrity verification applications. This dissertation introduces a vectorless power grid and thermal integrity verification framework that allows computing worst-case voltage drop or thermal profiles across the entire chip under a set of local and global workload (power density) constraints. To address the computational challenges introduced by the large 3D mesh-structured thermal grids, we apply the spectral graph reduction approach for highly scalable vectorless thermal (or power grid) verification of large chip designs. The effectiveness and efficiency of our approach have been demonstrated through extensive experiments.