47 research outputs found
VLAN-based Minimal Paths in PC Cluster with Ethernet on Mesh and Torus
Abstract: In a PC cluster with Ethernet, well-distribute…
Throttling Control for Bufferless Routing in On-Chip Networks
As the number of cores integrated on a single die grows, buffers consume significant energy and occupy chip area. Bufferless deflection routing, which eliminates a router's input port buffers, can considerably help save energy and chip area while providing performance similar to that of existing buffered routing, especially for low-to-medium network loads. However, when congestion increases, bufferless routing frequently causes flit deflections and misrouting, leading to degraded network performance. In this paper, we propose IRT (Injection Rate Throttling), a local throttling mechanism that reduces deflection and misrouting in high-load bufferless networks. IRT controls the injection rate independently at each network node, reducing network congestion. Our results from a cycle-accurate simulator show that IRT reduces average transmission latency by 8.65% compared to traditional bufferless routing.
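The local throttling idea above can be illustrated with a small sketch: each node tracks its recent deflection rate and, once per control interval, lowers or raises its own injection probability. The class name, thresholds, and step sizes below are illustrative assumptions, not details from the paper.

```python
import random

class ThrottledNode:
    """Hypothetical sketch of IRT-style per-node injection rate throttling.

    Each node monitors its recent deflection rate and lowers its injection
    probability when deflections exceed a threshold; all parameters here
    are illustrative assumptions."""

    def __init__(self, threshold=0.3, min_rate=0.1, step=0.05):
        self.threshold = threshold    # deflection rate that triggers throttling
        self.min_rate = min_rate      # floor on injection probability
        self.step = step              # adjustment per control interval
        self.injection_rate = 1.0     # start unthrottled
        self.deflections = 0
        self.routed = 0

    def record_flit(self, deflected):
        """Account for one routed flit, deflected or not."""
        self.routed += 1
        self.deflections += int(deflected)

    def update(self):
        """Run once per control interval, independently at each node."""
        if self.routed == 0:
            return
        rate = self.deflections / self.routed
        if rate > self.threshold:
            # congested: back off, but never below the floor
            self.injection_rate = max(self.min_rate,
                                      self.injection_rate - self.step)
        else:
            # calm: recover toward full injection
            self.injection_rate = min(1.0, self.injection_rate + self.step)
        self.deflections = self.routed = 0

    def may_inject(self, rng=random):
        """Probabilistically gate a new flit injection."""
        return rng.random() < self.injection_rate
```

Because each node acts only on locally observed deflections, no global coordination is needed, which matches the paper's point that throttling is controlled independently per node.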
Cabinet Layout Optimization of Supercomputer Topologies for Shorter Cable Length
Abstract—As the scales of supercomputers increase, total cable length becomes enormous, e.g., up to thousands of kilometers. Recent high-radix switches with dozens of ports make switch layout and system packaging more complex. In this study, we optimize the physical layout of switch topologies on a machine room floor with the goal of reducing cable length. For a given topology, we use graph clustering algorithms to group switches logically into cabinets so that the number of inter-cabinet cables is small. Then, we map the cabinets onto the physical floor space so as to minimize total cable length, by modeling and solving the mapping as a facility location problem. Our evaluation results show that, compared to standard clustering/mapping approaches and for popular network topologies, our clustering approach can reduce the number of inter-cabinet cables by up to 40.3% and our mapping approach can reduce the inter-cabinet cable length by up to 39.6%. Index Terms—Topology, cabinet layout, interconnection networks, high performance computing, high-radix switches
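The first stage described above, grouping switches into cabinets so that few cables cross cabinet boundaries, can be sketched with a simple greedy heuristic: grow each cabinet by repeatedly pulling in the unassigned switch with the most links into it. This is a stand-in for the graph clustering algorithms the paper uses; the function names and the heuristic itself are assumptions.

```python
from collections import defaultdict

def cluster_switches(edges, cabinet_capacity):
    """Greedily group switches into cabinets of bounded size.

    Illustrative sketch: each cabinet grows by adding the unassigned switch
    with the most links into the current cabinet, so inter-cabinet cables
    tend to be few. Not the paper's actual clustering algorithm."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    unassigned = set(adj)
    cabinets = []
    while unassigned:
        seed = min(unassigned)            # deterministic seed choice
        cabinet = {seed}
        unassigned.remove(seed)
        while len(cabinet) < cabinet_capacity and unassigned:
            # outside switch with the most links into the cabinet
            best = max(sorted(unassigned),
                       key=lambda s: len(adj[s] & cabinet))
            if not adj[best] & cabinet:
                break                     # no connected candidate remains
            cabinet.add(best)
            unassigned.remove(best)
        cabinets.append(cabinet)
    return cabinets

def inter_cabinet_cables(edges, cabinets):
    """Count cables whose two endpoints land in different cabinets."""
    where = {s: i for i, cab in enumerate(cabinets) for s in cab}
    return sum(1 for u, v in edges if where[u] != where[v])
```

For example, two 4-switch rings joined by a single cable cluster into two cabinets of four, leaving exactly one inter-cabinet cable; the second-stage mapping onto floor positions (the facility location step) would then place those two cabinets adjacently.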
A Case for Offloading Federated Learning Server on Smart NIC
Federated learning is a distributed machine learning approach in which weight
parameters trained locally by clients are aggregated into global parameters
by a server. The global parameters can be trained without uploading
privacy-sensitive raw data owned by the clients to the server. The
aggregation on the server is simply an averaging of the local weight
parameters, so it is an I/O-intensive task in which network processing
accounts for a large portion of the load compared to computation. The network
processing workload further increases as the number of clients increases. To
mitigate this workload, in this paper the federated learning server is
offloaded to an NVIDIA BlueField-2 DPU, a smart NIC (Network Interface Card)
with eight processing cores. Dedicated processing cores are assigned via DPDK
(Data Plane Development Kit) for receiving the local weight parameters and
sending the global parameters. The aggregation task is parallelized by
exploiting the multiple cores available on the DPU. To further improve
performance, an approximate design that eliminates exclusive access control
between the computation threads is also implemented. Evaluation results show
that the federated learning server on the DPU shortens execution time by 1.32
times compared with the host CPU, with negligible accuracy loss.
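The server-side aggregation described above, an element-wise average of the clients' local weights, can be sketched as follows. Here the parameter vector is split by index across threads, so each thread owns a disjoint slice and no locking is needed; this partitioning scheme is an assumption used for illustration, not necessarily how the DPU implementation divides the work.

```python
import threading

def aggregate(client_weights, num_threads=8):
    """Sketch of the federated learning server's aggregation step.

    Global parameters are the element-wise average of the clients' local
    weight vectors. Work is split across threads by parameter index,
    mirroring how the DPU's eight cores could divide the vector; the
    partitioning scheme is an illustrative assumption."""
    n_params = len(client_weights[0])
    n_clients = len(client_weights)
    global_weights = [0.0] * n_params

    def worker(start, stop):
        # each thread writes only its own slice: no shared-index conflicts
        for i in range(start, stop):
            global_weights[i] = sum(w[i] for w in client_weights) / n_clients

    chunk = (n_params + num_threads - 1) // num_threads
    threads = [threading.Thread(target=worker,
                                args=(t * chunk,
                                      min((t + 1) * chunk, n_params)))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_weights
```

Since disjoint index ranges never collide, this layout avoids exclusive access control between computation threads entirely, which is in the same spirit as the lock-free approximate design the abstract mentions.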