219 research outputs found
Broadcasting in Hypercubes under Circuit Switched Model
In this paper, we propose a method for constructing almost optimal broadcast schemes on an n-dimensional hypercube in the circuit switched,-port model. In this model, an initiator must inform all the nodes of the network in a sequence of rounds. During a round, vertices communicate along arc-disjoint dipaths. Our construction is based on particular sequences of nested binary codes having the property that each code can inform the next one in a single round. This last property is ensured by a flow technique and results about symmetric flow networks. We apply the method to design optimal schemes improving and generalizing the previous results.
Symmetric flows and broadcasting in hypercubes
In this paper, we propose a method for constructing almost optimal broadcast schemes on an n-dimensional hypercube in the circuit switched,-port model. In this model, an initiator must inform all the nodes of the network in a sequence of rounds. During a round, vertices communicate along arc-disjoint dipaths. Our construction is based on particular sequences of nested binary codes having the property that each code can inform the next one in a single round. This last property is ensured by a flow technique and results about symmetric flow networks. We apply the method to design optimal schemes improving and generalizing the previous results.
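For orientation, the round-based broadcast model described in the two abstracts above can be compared with the classical binomial-tree broadcast, in which each informed node calls one direct neighbor per round. The sketch below is that textbook baseline only, not the nested-code construction of the paper; the function name `binomial_broadcast` and the integer node labels are assumptions made for illustration.

```python
def binomial_broadcast(n, source=0):
    """Baseline broadcast on an n-dimensional hypercube.

    Nodes are the integers 0 .. 2**n - 1; two nodes are adjacent when
    their labels differ in exactly one bit. In round `dim`, every
    informed node calls its neighbor across dimension `dim`.
    Returns a list of rounds, each a list of (sender, receiver) pairs.
    """
    informed = {source}
    rounds = []
    for dim in range(n):
        # Each call uses a single hypercube edge, so calls never collide.
        calls = [(u, u ^ (1 << dim)) for u in sorted(informed)]
        informed.update(receiver for _, receiver in calls)
        rounds.append(calls)
    return rounds


rounds = binomial_broadcast(3)
informed_nodes = {0} | {receiver for calls in rounds for _, receiver in calls}
```

This baseline needs n rounds because the informed set can at most double per round when each node makes one neighbor call. Circuit-switched schemes such as the one in the abstracts instead let a node reach distant nodes along arc-disjoint dipaths within a single round.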
Circuit-Switched Gossiping in the 3-Dimensional Torus Networks
In this paper we describe, in the case of short messages, an efficient gossiping algorithm for 3-dimensional torus networks (wrap-around or toroidal meshes) that uses synchronous circuit-switched routing. The algorithm is based on a recursive decomposition of a torus. The algorithm requires an optimal number of rounds and a quasi-optimal number of intermediate switch settings to gossip in an torus
Optical control plane: theory and algorithms
In this thesis we propose a novel way to achieve global network information dissemination in which some wavelengths are reserved exclusively for global control information exchange. We study the routing and wavelength assignment (RWA) problem for the special communication pattern of non-blocking all-to-all broadcast in WDM optical networks. We provide efficient solutions to reduce the number of wavelengths needed for non-blocking all-to-all broadcast, in the absence of wavelength converters, for network information dissemination. We adopt an approach in which we consider all nodes to be tap-and-continue capable, thus studying light-trees rather than lightpaths. To the best of our knowledge, this thesis is the first to consider “tap-and-continue” capable nodes in the context of conflict-free all-to-all broadcast. The problem of all-to-all broadcast using individual lightpaths has been proven to be NP-complete [6]. We provide optimal RWA solutions for conflict-free all-to-all broadcast for some particular cases of regular topologies, namely the ring, the torus and the hypercube. We make an important contribution on hypercube decomposition into edge-disjoint structures. We also present near-optimal polynomial-time solutions for the general case of arbitrary topologies. Furthermore, we apply for the first time the “cactus” representation of all minimum edge-cuts of graphs with arbitrary topologies to the problem of all-to-all broadcast in optical networks. Using this representation recursively, we obtain near-optimal results for the number of wavelengths needed by the non-blocking all-to-all broadcast. The second part of this thesis focuses on the more practical case of multi-hop RWA for non-blocking all-to-all broadcast in the presence of Optical-Electrical-Optical conversion. We propose two simple but efficient multi-hop RWA models.
In addition to reducing the number of wavelengths we also concentrate on reducing the number of optical receivers, another important optical resource. We analyze these models on the ring and the hypercube, as special cases of regular topologies. Lastly, we develop a good upper-bound on the number of wavelengths in the case of non-blocking multi-hop all-to-all broadcast on networks with arbitrary topologies and offer a heuristic algorithm to achieve it. We propose a novel network partitioning method based on “virtual perfect matching” for use in the RWA heuristic algorithm
Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks
Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment
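The observation that all-to-all traffic loads the center of a mesh more heavily than its periphery can be checked with a small count. The sketch below assumes dimension-ordered (XY) routing on a k x k mesh and one message per ordered source-destination pair; these are illustrative assumptions, and the thesis's own routing algorithms and analytical models may differ. The name `xy_link_loads` is hypothetical.

```python
from collections import Counter
from itertools import product


def xy_link_loads(k):
    """Count how many source-destination pairs cross each directed link
    of a k x k mesh when every pair is routed X first, then Y."""
    loads = Counter()
    nodes = list(product(range(k), repeat=2))
    for (sx, sy), (dx, dy) in product(nodes, repeat=2):
        x, y = sx, sy
        # Travel along the source row until the destination column is reached.
        while x != dx:
            nx = x + (1 if dx > x else -1)
            loads[((x, y), (nx, y))] += 1
            x = nx
        # Then travel along the destination column.
        while y != dy:
            ny = y + (1 if dy > y else -1)
            loads[((x, y), (x, ny))] += 1
            y = ny
    return loads


loads = xy_link_loads(4)
edge_link = loads[((0, 0), (1, 0))]    # link leaving the left border
center_link = loads[((1, 0), (2, 0))]  # link crossing the middle of the row
```

On a 4 x 4 mesh under these assumptions, the border link carries 12 pairs and the middle link carries 16, which is the kind of imbalance that motivates giving central connections more links.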
New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance
Distributed-memory systems are key to achieving high-performance computing and are the architectures most favored for advanced research problems. Mesh-connected multicomputers are among the most popular architectures and have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique, in which a packet is divided into small flits, has been widely used in the design of distributed-memory systems. Multicast communication, in which one source node sends the same message to several destination nodes, has also been widely used in such systems. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Developing fault-tolerant multicast routing algorithms for 2D mesh networks is an important issue. This dissertation presents new fault-tolerant multicast routing algorithms for distributed-memory systems built on wormhole-routed 2D meshes. The algorithms are described for fault-tolerant routing in 2D mesh networks, but they can also be extended to other topologies. They combine a unicast-based multicast algorithm with tree-based multicast algorithms. They work effectively for the most commonly encountered faults in mesh networks: f-rings, f-chains and concave fault regions. It is shown that the proposed routing algorithms remain effective even in the presence of many fault regions and large fault regions. The algorithms are proved to be deadlock-free, and the problem of overlapping fault regions is solved. Four essential performance metrics for mesh networks are considered and calculated. The algorithms use limited global information for multicasting, a compromise between the local-information-based and global-information-based approaches. Data mining is used to validate the results and to enlarge the sample.
The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms
Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies
The all-to-all collective communications primitive is widely used in machine
learning (ML) and high performance computing (HPC) workloads, and optimizing
its performance is of interest to both ML and HPC communities. All-to-all is a
particularly challenging workload that can severely strain the underlying
interconnect bandwidth at scale. This is mainly because of the quadratic
scaling in the number of messages that must be simultaneously serviced combined
with large message sizes. This paper takes a holistic approach to optimize the
performance of all-to-all collective communications on supercomputer-scale
direct-connect interconnects. We address several algorithmic and practical
challenges in developing efficient and bandwidth-optimal all-to-all schedules
for any topology, lowering the schedules to various backends and fabrics that
may or may not expose additional forwarding bandwidth, establishing an upper
bound on all-to-all throughput, and exploring novel topologies that deliver
near-optimal all-to-all performance
Neural networks-on-chip for hybrid bio-electronic systems
PhD Thesis. By modelling the brain's computation we can further our understanding
of its function and develop novel treatments for neurological disorders. The
brain is incredibly powerful and energy efficient, but its computation does
not fit well with the traditional computer architecture developed over the
previous 70 years. Therefore, there is growing research focus in developing
alternative computing technologies to enhance our neural modelling capability,
with the expectation that the technology in itself will also benefit from
increased awareness of neural computational paradigms.
This thesis focuses upon developing a methodology to study the design
of neural computing systems, with an emphasis on studying systems suitable
for biomedical experiments. The methodology allows for the design to be
optimized according to the application. For example, different case studies
highlight how to reduce energy consumption, reduce silicon area, or to
increase network throughput.
High performance processing cores are presented for both Hodgkin-Huxley
and Izhikevich neurons incorporating novel design features. Further, a complete
energy/area model for a neural-network-on-chip is derived, which is
used in two exemplar case-studies: a cortical neural circuit to benchmark
typical system performance, illustrating how a 65,000 neuron network could
be processed in real-time within a 100mW power budget; and a scalable
high-performance processing platform for a cerebellar neural prosthesis. From
these case-studies, the contribution of network granularity towards optimal
neural-network-on-chip performance is explored
I/O embedding and broadcasting in star interconnection networks
The issues of communication between a host or central controller and processors in large interconnection networks are very important and have been studied in the past by several researchers. There is a plethora of problems that arise when processors are asked to exchange information on parallel computers on which processors are interconnected according to a specific topology. In robust networks, it is desirable at times to send (receive) data/control information to (from) all the processors in minimal time. This type of communication is commonly referred to as broadcasting. To speed up broadcasting in a given network without modifying its topology, certain processors called stations can be specified to act as relay agents. In this thesis, broadcasting issues in a star-based interconnection network are studied. The model adopted assumes all-port communication and a wormhole switching mechanism. Initially, the problem treated is one of finding the minimum number of stations required to cover all the nodes in the star graph with i-adjacency. We consider 1-, 2-, and 3-adjacencies and determine the upper bound on the number of stations required to cover the nodes for each case. After deriving the number of stations, two algorithms are designed to broadcast the messages first from the host to stations, and then from stations to remaining nodes. In addition, a Binary-based Algorithm is designed to allow routing in the network by directly working on the binary labels assigned to the star graph. No look-up table is consulted during routing and a minimum number of bits is used to represent a node label. At the end, the thesis sheds light on another algorithm for routing using parallel paths in the star network.
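As background for the star-graph setting, the sketch below uses the standard definition of the star graph (nodes are the permutations of n symbols, and two nodes are adjacent when one is obtained from the other by swapping the first symbol with the symbol in another position) and measures how many hops a flood from one node needs to reach every other node. It ignores stations, wormhole switching and the thesis's own algorithms; the function names are hypothetical.

```python
from collections import deque


def star_neighbors(perm):
    """Yield the n - 1 neighbors of a node: swap position 0 with position i."""
    for i in range(1, len(perm)):
        p = list(perm)
        p[0], p[i] = p[i], p[0]
        yield tuple(p)


def broadcast_distances(n):
    """Breadth-first search from the identity permutation.

    Returns a dict mapping each node to its hop distance from the source,
    which is the step at which an all-port flood would first reach it.
    """
    source = tuple(range(n))
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in star_neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


dist = broadcast_distances(4)
farthest = max(dist.values())
```

For n = 4 the graph has 24 nodes of degree 3 and the farthest node is 4 hops away. Placing relay stations, as the thesis proposes, is one way to organize broadcasting from a host in such a network.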
RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems
Distributed deep learning (DDL) systems strongly depend on network
performance. Current electronic packet switched (EPS) network architectures and
technologies suffer from variable diameter topologies, low-bisection bandwidth
and over-subscription affecting completion time of communication and collective
operations.
We introduce a near-exascale, full-bisection bandwidth, all-to-all,
single-hop, all-optical network architecture with nanosecond reconfiguration
called RAMP, which supports large-scale distributed and parallel computing
systems (12.8 Tbps per node for up to 65,536 nodes).
For the first time, a custom RAMP-x MPI strategy and a network transcoder are
proposed to run MPI collective operations across the optical circuit switched
(OCS) network in a schedule-less and contention-less manner. RAMP achieves
7.6-171× speed-up in completion time across all MPI operations compared
to realistic EPS and OCS counterparts. It can also deliver a 1.3-16× and
7.8-58× reduction in Megatron and DLRM training time respectively, while
offering 42-53× and 3.3-12.4× improvement in energy consumption
and cost respectively.