CLEX: Yet Another Supercomputer Architecture?
We propose the CLEX supercomputer topology and routing scheme. We prove that
CLEX can utilize a constant fraction of the total bandwidth for point-to-point
communication, at delays proportional to the sum of the number of intermediate
hops and the maximum physical distance between any two nodes. Moreover,
applying an asymmetric bandwidth assignment to the links, all-to-all
communication can be realized -optimally both with regard to
bandwidth and delays. This is achieved at node degrees of ,
for an arbitrary small constant . In contrast, these
results are impossible in any network featuring constant or polylogarithmic
node degrees. Through simulation, we assess the benefits of an implementation
of the proposed communication strategy. Our results indicate that, for a
million processors, CLEX can increase bandwidth utilization and reduce average
routing path length by at least factors respectively in comparison to
a torus network. Furthermore, the CLEX communication scheme features several
other properties, such as deadlock-freedom, inherent fault-tolerance, and
canonical partition into smaller subsystems.
OutFlank Routing: Increasing Throughput in Toroidal Interconnection Networks
We present a new, deadlock-free, routing scheme for toroidal interconnection
networks, called OutFlank Routing (OFR). OFR is an adaptive strategy which
exploits non-minimal links, both in the source and in the destination nodes.
When minimal links are congested, OFR deroutes packets to carefully chosen
intermediate destinations, in order to obtain travel paths which are only an
additive constant longer than the shortest ones. Since routing performance is
very sensitive to changes in the traffic model or in the router parameters, an
accurate discrete-event simulator of the toroidal network has been developed to
empirically validate OFR, by comparing it against other relevant routing
strategies, over a range of typical real-world traffic patterns. On the
16x16x16 (4096 nodes) simulated network OFR exhibits improvements of the
maximum sustained throughput between 14% and 114%, with respect to Adaptive
Bubble Routing. Comment: 9 pages, 5 figures, to be presented at ICPADS 201
Information Spreading on Almost Torus Networks
Epidemic modeling has been used extensively in recent years in the field of
telecommunications and computer networks. We consider the popular
Susceptible-Infected-Susceptible spreading model as the metric for information
spreading. In this work, we analyze information spreading on a particular class
of networks called almost torus networks, and on the lattice, which can be
considered the limit as the torus length goes to infinity. Almost torus
networks consist of the torus network topology with some nodes or edges
removed. We find explicit expressions for the characteristic polynomial of
these graphs and tight lower bounds for its computation. These expressions
allow us to estimate their spectral radius and thus how the information spreads
on these networks.
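
The link between spectral radius and spreading mentioned in this abstract can be illustrated with a small sketch. It assumes the commonly used mean-field approximation of the SIS model, under which the epidemic threshold is the reciprocal of the spectral radius of the adjacency matrix; the abstract does not say which approximation the authors use. The function names, the 6 x 6 torus size, and the choice of removed node are hypothetical, and the spectral radius is estimated numerically by shifted power iteration rather than with the paper's closed-form expressions.

```python
def torus_adjacency(n, removed=frozenset()):
    """Adjacency lists of an n x n torus (n >= 3) with the nodes in `removed` deleted."""
    nodes = [(i, j) for i in range(n) for j in range(n) if (i, j) not in removed]
    index = {v: k for k, v in enumerate(nodes)}
    adj = [[] for _ in nodes]
    for (i, j), k in index.items():
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            w = ((i + di) % n, (j + dj) % n)  # wrap-around neighbour
            if w in index:
                adj[k].append(index[w])
    return adj


def spectral_radius(adj, iterations=500):
    """Estimate the largest adjacency eigenvalue by power iteration on A + shift*I.

    The shift (the maximum degree) makes the iterated matrix positive
    semidefinite, so the iteration also converges on bipartite graphs
    such as even-sided tori.
    """
    shift = max(len(nb) for nb in adj)
    x = [1.0] * len(adj)
    for _ in range(iterations):
        y = [shift * x[k] + sum(x[w] for w in adj[k]) for k in range(len(adj))]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x, with x of unit norm
    return sum(x[k] * sum(x[w] for w in adj[k]) for k in range(len(adj)))


full = spectral_radius(torus_adjacency(6))
almost = spectral_radius(torus_adjacency(6, removed={(0, 0)}))
print("torus:        radius", full, "threshold", 1 / full)
print("almost torus: radius", almost, "threshold", 1 / almost)
```

Removing a node lowers the spectral radius, so under the stated assumption the almost torus has a higher epidemic threshold than the full torus.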
Symmetric Interconnection Networks from Cubic Crystal Lattices
Torus networks of moderate degree have been widely used in the supercomputer
industry. Tori are superb when used for executing applications that require
near-neighbor communications. Nevertheless, they are not so good when dealing
with global communications. Hence, typical 3D implementations have evolved to
5D networks, among other reasons, to reduce network distances. Most of these
big systems are mixed-radix tori which are not the best option for minimizing
distances and efficiently using network resources. This paper is focused on
improving the topological properties of these networks.
By using integral matrices to deal with Cayley graphs over Abelian groups, we
have been able to propose and analyze a family of high-dimensional grid-based
interconnection networks. As they are built over -dimensional grids that
induce a regular tiling of the space, these topologies have been denoted
lattice graphs. We will focus on cubic crystal lattices for modeling
symmetric 3D networks. Other higher dimensional networks can be composed over
these graphs, as illustrated in this research. Easy network partitioning can
also take advantage of this network composition operation. Minimal routing
algorithms are also provided for these new topologies. Finally, some practical
issues such as implementability and preliminary performance evaluations have
been addressed.
Scalable data abstractions for distributed parallel computations
The ability to express a program as a hierarchical composition of parts is an
essential tool in managing the complexity of software, and a key abstraction
it provides is the separation of the representation of data from the computation.
Many current parallel programming models use a shared memory model to provide
data abstraction but this doesn't scale well with large numbers of cores due to
non-determinism and access latency. This paper proposes a simple programming
model that allows scalable parallel programs to be expressed with distributed
representations of data and it provides the programmer with the flexibility to
employ shared or distributed styles of data-parallelism where applicable. It is
capable of an efficient implementation, and with the provision of a small set
of primitive capabilities in the hardware, it can be compiled to operate
directly on the hardware, in the same way stack-based allocation operates for
subroutines in sequential machines.
Topology Architecture and Routing Algorithms of Octagon-Connected Torus Interconnection Network
Two important issues in the design of interconnection networks for massively parallel computers are scalability and small diameter. A new interconnection network topology, called the octagon-connected torus (OCT), is proposed. The OCT network combines the small diameter of the octagon topology with the scalability of the torus topology, and it is also regular and symmetric. The nodes of the OCT network adopt the Johnson coding scheme, which makes routing algorithms simple and efficient. Both unicasting and broadcasting routing algorithms are designed for the OCT network, each based on the Johnson coding scheme. A detailed analysis shows that the OCT network is a better interconnection network in terms of both topological properties and communication performance.
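
For readers unfamiliar with the Johnson coding scheme named in this abstract, the following is a minimal sketch of the code itself. It only generates an n-bit Johnson (twisted-ring) code and checks one of its properties: the Hamming distance between two codewords equals their distance around the ring of 2n codewords. How the OCT paper assigns codewords to nodes or uses them in routing is not stated in the abstract, so nothing here should be read as the authors' algorithm; the function names are hypothetical.

```python
def johnson_code(n):
    """Return the 2n codewords of the n-bit Johnson (twisted-ring) code."""
    words = []
    word = [0] * n
    for _ in range(2 * n):
        words.append(tuple(word))
        # Shift right and feed the inverted last bit back into the front.
        word = [1 - word[-1]] + word[:-1]
    return words


def hamming(a, b):
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


words = johnson_code(4)  # 8 codewords, enough to label the nodes of an octagon
for i, w in enumerate(words):
    print(i, "".join(map(str, w)))
```

With n = 4 the code has eight codewords, and adjacent codewords (including the last and the first) differ in exactly one bit.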
A Visual Analytics System for Optimizing Communications in Massively Parallel Applications
Current and future supercomputers have tens of thousands of compute nodes interconnected with high-dimensional networks and complex network topologies for improved performance. Application developers are required to write scalable parallel programs in order to achieve high throughput on these machines. Application performance is largely determined by efficient inter-process communication. A common way to analyze and optimize performance is through profiling parallel codes to identify communication bottlenecks. However, understanding gigabytes of profile data is not a trivial task. In this paper, we present a visual analytics system for identifying the scalability bottlenecks and improving the communication efficiency of massively parallel applications. Visualization methods used in this system are designed to comprehend large-scale and varied communication patterns on thousands of nodes in complex networks such as the 5D torus and the dragonfly. We also present efficient rerouting and remapping algorithms that can be coupled with our interactive visual analytics design for performance optimization. We demonstrate the utility of our system with several case studies using three benchmark applications on two leading supercomputers. The mapping suggestion from our system led to 38% improvement in hop-bytes for MiniAMR application on 4,096 MPI processes. This research has been sponsored in part by the U.S. National Science Foundation through grant IIS-1320229, and the U.S. Department of Energy through grants DE-SC0012610 and DE-SC0014917. This research has been funded in part and used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-06CH11357. This work was supported in part by the DOE Office of Science, ASCR, under award numbers 57L38, 57L32, 57L11, 57K50, and 508050
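
The hop-bytes figure quoted in this abstract can be made concrete with a small sketch. It assumes the usual definition of hop-bytes (each message's size multiplied by the number of network hops between the nodes its endpoints are mapped to, summed over all messages) and minimal routing on a torus; the abstract does not spell out either. The traffic table, the 4 x 4 torus, and both mappings are made-up illustrative data, and the function names are hypothetical.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus with side lengths dims."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y, k in zip(a, b, dims))


def hop_bytes(traffic, mapping, dims):
    """Sum of bytes * hops over all (source rank, destination rank) pairs.

    traffic: {(src_rank, dst_rank): bytes sent}
    mapping: {rank: torus coordinate tuple}
    """
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in traffic.items()
    )


dims = (4, 4)                          # hypothetical 4 x 4 torus
traffic = {(0, 1): 100, (0, 2): 10}    # rank 0 talks mostly to rank 1
far = {0: (0, 0), 1: (2, 2), 2: (0, 1)}    # heavy pair placed far apart
near = {0: (0, 0), 1: (0, 1), 2: (2, 2)}   # heavy pair placed next to each other

print(hop_bytes(traffic, far, dims), hop_bytes(traffic, near, dims))
```

Placing the heavily communicating ranks on neighbouring nodes lowers the metric, which is the kind of improvement a remapping suggestion aims for.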
Optimising Simulation Data Structures for the Xeon Phi
In this paper, we propose a lock-free architecture
to accelerate logic gate circuit simulation using SIMD multi-core
machines. We evaluate its performance on different test circuits
simulated on the Intel Xeon Phi and two other machines. Comparisons
are presented of this software/hardware combination with
reported performances of GPU and other multi-core simulation
platforms. Comparisons are also given between the lock-free
architecture and a leading commercial simulator running on the
same Intel hardware.