200 research outputs found

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Full text link
    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also helps to provide an easy way for new practitioners to understand this complex area of research.Comment: 46 pages, 16 figures, Technical Repor

    Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems

    Get PDF
    Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ). This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)

    Multicentered computer architecture for real-time data acquisition and display

    Get PDF

    Network-Oblivious Algorithms

    Get PDF
    A framework is proposed for the design and analysis of network-oblivious algorithms, namely algorithms that can run unchanged, yet efficiently, on a variety of machines characterized by different degrees of parallelism and communication capabilities. The framework prescribes that a network-oblivious algorithm be specified on a parallel model of computation where the only parameter is the problem\u2019s input size, and then evaluated on a model with two parameters, capturing parallelism granularity and communication latency. It is shown that for a wide class of network-oblivious algorithms, optimality in the latter model implies optimality in the decomposable bulk synchronous parallel model, which is known to effectively describe a wide and significant class of parallel platforms. The proposed framework can be regarded as an attempt to port the notion of obliviousness, well established in the context of cache hierarchies, to the realm of parallel computation. Its effectiveness is illustrated by providing optimal network-oblivious algorithms for a number of key problems. Some limitations of the oblivious approach are also discussed

    Real-time power system dynamic simulation

    Get PDF
    The present day digital computing resources are overburdened by the amount of calculation necessary for power system dynamic simulation. Although the hardware has improved significantly, the expansion of the interconnected systems, and the requirement for more detailed models with frequent solutions have increased the need for simulating these systems in real time. To achieve this, more effort has been devoted to developing and improving the application of numerical methods and computational techniques such as sparsity-directed approaches and network decomposition to power system dynamic studies. This project is a modest contribution towards solving this problem. It consists of applying a very efficient sparsity technique to the power system dynamic simulator under a wide range of events. The method used was first developed by Zollenkopf (^117) Following the structure of the linear equations related to power system dynamic simulator models, the original algorithm which was conceived for scalar calculation has been modified to use sets of 2 * 2 sub-matrices for both the dynamic and algebraic equations. The realisation of real-time simulators also requires the simplification of the power system models and the adoption of a few assumptions such as neglecting short time constants. Most of the network components are simulated. The generating units include synchronous generators and their local controllers, and the simulated network is composed of transmission lines and transformers with tap-changing and phase-shifting, non-linear static loads, shunt compensators and simplified protection. The simulator is capable of handling some of the severe events which occur in power systems such as islanding, island re-synchronisation and generator start-up and shut-down. To avoid the stiffness problem and ensure the numerical stability of the system at long time steps at a reasonable accuracy, the implicit trapezoidal rule is used for discretising the dynamic equations. The algebraisation of differential equations requires an iterative process. Also the non-linear network models are generally better solved by the Newton-Raphson iterative method which has an efficient quadratic rate of convergence. This has favoured the adoption of the simultaneous technique over the classical partitioned method. In this case the algebraised differential equations and the non-linear static equations are solved as one set of algebraic equations. Another way of speeding-up centralised simulators is the adoption of distributed techniques. In this case the simulated networks are subdivided into areas which are computed by a multi-task machine (Perkin Elmer PE3230). A coordinating subprogram is necessary to synchronise and control the computation of the different areas, and perform the overall solution of the system. In addition to this decomposed algorithm the developed technique is also implemented in the parallel simulator running on the Array Processor FPS 5205 attached to a Perkin Elmer PE 3230 minicomputer, and a centralised version run on the host computer. Testing these simulators on three networks under a range of events would allow for the assessment of the algorithm and the selection of the best candidate hardware structure to be used as a dedicated machine to support the dynamic simulator. The results obtained from this dynamic simulator are very impressive. Great speed-up is realised, stable solutions under very severe events are obtained showing the robustness of the system, and accurate long-term results are obtained. Therefore, the present simulator provides a realistic test bed to the Energy Management System. It can also be used for other purposes such as operator training

    Towards larger scale collective operations in the Message Passing Interface

    Get PDF
    Supercomputers continue to expand both in size and complexity as we reach the beginning of the exascale era. Networks have evolved, from simple mechanisms which transport data to subsystems of computers which fulfil a significant fraction of the workload that computers are tasked with. Inevitably with this change, assumptions which were made at the beginning of the last major shift in computing are becoming outdated. We introduce a new latency-bandwidth model which captures the characteristics of sending multiple small messages in quick succession on modern networks. Contrary to other models representing the same effects, the pipelining latency-bandwidth model is simple and physically based. In addition, we develop a discrete-event simulation, Fennel, to capture non-analytical effects of communication within models. AllReduce operations with small messages are common throughout supercomputing, particularly for iterative methods. The performance of network operations are crucial to the overall time-to-solution of an application as a whole. The Message Passing Interface standard was introduced to abstract complex communications from application level development. The underlying algorithms used for the implementation to achieve the specified behaviour, such as the recursive doubling algorithm for AllReduce, have to evolve with the computers on which they are used. We introduce the recursive multiplying algorithm as a generalisation of recursive doubling. By utilising the pipelining nature of modern networks, we lower the latency of AllReduce operations and enable greater choice of schedule. A heuristic is used to quickly generate a near-optimal schedule, by using the pipelining latency-bandwidth model. Alongside recursive multiplying, the endpoints of collective operations must be able to handle larger numbers of incoming messages. Typically this is done by duplicating receive queues for remote peers, but this requires a linear amount of memory space for the size of the application. We introduce a single-consumer multipleproducer queue which is designed to be used with MPI as a protocol to insert messages remotely, with minimal contention for shared receive queues
    • …
    corecore