200 research outputs found
Node-Type-Based Load-Balancing Routing for Parallel Generalized Fat-Trees
High-Performance Computing (HPC) clusters are made up of a variety of node
types (usually compute, I/O, service, and GPGPU nodes) and applications don't
use nodes of a different type the same way. Resulting communication patterns
reflect organization of groups of nodes, and current optimal routing algorithms
for all-to-all patterns will not always maximize performance for group-specific
communications. Since application communication patterns are rarely available
beforehand, we choose to rely on node types as a good guess for node usage. We
provide a description of node type heterogeneity and analyse performance
degradation caused by unlucky repartition of nodes of the same type. We provide
an extension to routing algorithms for Parallel Generalized Fat-Tree topologies
(PGFTs) which balances load amongst groups of nodes of the same type. We show
how it removes these performance issues by comparing results in a variety of
situations against corresponding classical algorithms
Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
Hop-Constrained Oblivious Routing
We prove the existence of an oblivious routing scheme that is
-competitive in terms of , thus
resolving a well-known question in oblivious routing.
Concretely, consider an undirected network and a set of packets each with its
own source and destination. The objective is to choose a path for each packet,
from its source to its destination, so as to minimize , defined as follows: The dilation is the maximum path hop-length,
and the congestion is the maximum number of paths that include any single edge.
The routing scheme obliviously and randomly selects a path for each packet
independent of (the existence of) the other packets. Despite this
obliviousness, the selected paths have within a
factor of the best possible value. More precisely, for
any integer hop-bound , this oblivious routing scheme selects paths of
length at most and is -competitive in terms of in comparison to the best possible
achievable via paths of length at most hops. These paths can
be sampled in polynomial time.
This result can be viewed as an analogue of the celebrated oblivious routing
results of R\"{a}cke [FOCS 2002, STOC 2008], which are -competitive
in terms of , but are not competitive in terms of
Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks
Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment
Multi-Path Alpha-Fair Resource Allocation at Scale in Distributed Software Defined Networks
The performance of computer networks relies on how bandwidth is shared among
different flows. Fair resource allocation is a challenging problem particularly
when the flows evolve over time. To address this issue, bandwidth sharing
techniques that quickly react to the traffic fluctuations are of interest,
especially in large scale settings with hundreds of nodes and thousands of
flows. In this context, we propose a distributed algorithm based on the
Alternating Direction Method of Multipliers (ADMM) that tackles the multi-path
fair resource allocation problem in a distributed SDN control architecture. Our
ADMM-based algorithm continuously generates a sequence of resource allocation
solutions converging to the fair allocation while always remaining feasible, a
property that standard primal-dual decomposition methods often lack. Thanks to
the distribution of all computer intensive operations, we demonstrate that we
can handle large instances at scale
Perfectly Oblivious (Parallel) RAM Revisited, and Improved Constructions
Oblivious RAM (ORAM)
is a technique for compiling any RAM program to an oblivious counterpart, i.e.,
one whose access patterns do not leak information about the secret inputs.
Similarly, Oblivious Parallel RAM (OPRAM) compiles a
{\it parallel} RAM program to an oblivious counterpart.
In this paper, we care about ORAM/OPRAM with {\it perfect security}, i.e.,
the access patterns must be {\it identically distributed}
no matter what the program\u27s memory request sequence is.
In the past, two types of perfect ORAMs/OPRAMs
have been considered:
constructions whose performance bounds hold {\it in expectation} (but may occasionally
run more slowly);
and constructions whose performance bounds hold {\it deterministically} (even though
the algorithms themselves are randomized).
In this paper, we revisit the performance metrics for perfect
ORAM/OPRAM, and
show novel constructions that achieve asymptotical improvements
for all performance metrics.
Our first result
is a new perfectly secure OPRAM
scheme with {\it expected} overhead.
In comparison, prior literature
has been stuck at for more than a decade.
Next, we show how to construct a perfect ORAM
with
{\it deterministic} simulation overhead. We further show how
to make the scheme parallel, resulting in an perfect OPRAM
with
{\it deterministic} simulation overhead.
For perfect ORAMs/OPRAMs
with deterministic performance bounds, our results achieve
{\it subexponential} improvement over the state-of-the-art.
Specifically, the best known prior scheme
incurs more than deterministic simulation overhead
(Raskin and Simkin, Asiacrypt\u2719); moreover, their scheme works
only for the sequential setting and is {\it not} amenable to parallelization.
Finally, we additionally consider perfect ORAMs/OPRAMs
whose performance bounds hold with high probability.
For this new performance metric, we show new constructions
whose simulation overhead is upper bounded by
except with negligible in probability, i.e., we prove
high-probability performance bounds that match the expected
bounds mentioned earlier
FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short
We introduce FatPaths: a simple, generic, and robust routing architecture
that enables state-of-the-art low-diameter topologies such as Slim Fly to
achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC
supercomputers as well as cloud data centers and clusters. FatPaths exposes and
exploits the rich ("fat") diversity of both minimal and non-minimal paths for
high-performance multi-pathing. Moreover, FatPaths uses a redesigned "purified"
transport layer that removes virtually all TCP performance issues (e.g., the
slow start), and incorporates flowlet switching, a technique used to prevent
packet reordering in TCP networks, to enable very simple and effective load
balancing. Our design enables recent low-diameter topologies to outperform
powerful Clos designs, achieving 15% higher net throughput at 2x lower latency
for comparable cost. FatPaths will significantly accelerate Ethernet clusters
that form more than 50% of the Top500 list and it may become a standard routing
scheme for modern topologies
On Fault Tolerance Methods for Networks-on-Chip
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.
This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.
The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.
The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated.
At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast
- …