709 research outputs found
Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
CLEX: Yet Another Supercomputer Architecture?
We propose the CLEX supercomputer topology and routing scheme. We prove that
CLEX can utilize a constant fraction of the total bandwidth for point-to-point
communication, at delays proportional to the sum of the number of intermediate
hops and the maximum physical distance between any two nodes. Moreover, %
applying an asymmetric bandwidth assignment to the links, all-to-all
communication can be realized -optimally both with regard to
bandwidth and delays. This is achieved at node degrees of ,
for an arbitrary small constant . In contrast, these
results are impossible in any network featuring constant or polylogarithmic
node degrees. Through simulation, we assess the benefits of an implementation
of the proposed communication strategy. Our results indicate that, for a
million processors, CLEX can increase bandwidth utilization and reduce average
routing path length by at least factors respectively in comparison to
a torus network. Furthermore, the CLEX communication scheme features several
other properties, such as deadlock-freedom, inherent fault-tolerance, and
canonical partition into smaller subsystems
The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks
The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-016-1640-zIn large-scale supercomputers, the interconnection network plays a key role in system performance. Network topology highly defines the performance and cost of the interconnection network. Direct topologies are sometimes used due to its reduced hardware cost, but the number of network dimensions is limited by the physical 3D space, which leads to an increase of the communication latency and a reduction of network throughput for large machines. Indirect topologies can provide better performance for large machines, but at higher hardware cost. In this paper, we propose a new family of hybrid topologies, the k-ary n-direct s-indirect, that combines the best features from both direct and indirect topologies to efficiently connect an extremely high number of processing nodes. The proposed network is an n-dimensional topology where the k nodes of each dimension are connected through a small indirect topology of s stages. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to indirect topologies, but at a lower hardware cost. In particular, it doubles the throughput obtained per cost unit compared with indirect topologies in most of the cases. Moreover, their fault-tolerance degree is similar to the one achieved by direct topologies built with switches with the same number of ports.This work was supported by the Spanish Ministerio de Economa y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01 and by Programa de Ayudas de Investigacion y Desarrollo (PAID) from Universitat Politecnica de Valencia.Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks. Journal of Supercomputing. 72(3):1035-1062. https://doi.org/10.1007/s11227-016-1640-z10351062723Connect-IB. http://www.mellanox.com/related-docs/prod_adapter_cards/PB_Connect-IB.pdf . Accessed 3 Feb 2016Mellanox store. http://www.mellanoxstore.com . Accessed 3 Feb 2016Mellanox technology. http://www.mellanox.com . Accessed 3 Feb 2016Myricom. http://www.myri.com . Accessed 3 Feb 2016Quadrics homepage. http://www.quadrics.com . Accessed 22 Sept 2008TOP500 supercomputer site. http://www.top500.org . Accessed 3 Feb 2016Balkan A, Qu G, Vishkin U (2009) Mesh-of-trees and alternative interconnection networks for single-chip parallelism. IEEE Trans Very Large Scale Integr(VLSI) Syst 17(10):1419–1432. doi: 10.1109/TVLSI.2008.2003999Bermudez Garzon D, Gomez ME, Lopez P, Duato J, Gomez C (2014) FT-RUFT: a performance and fault-tolerant efficient indirect topology. In: 22nd Euromicro international conference on parallel, distributed and network-based processing (PDP). IEEE, pp 405–409Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114Boku T, Nakazawa K, Nakamura H, Sone T, Mishima T, Itakura K (1996) Adaptive routing technique on hypercrossbar network and its evaluation. Syst Comput Jpn 27(4):55–64Dally W, Towles B (2004) Principles and practices of interconnection networks. Morgan Kaufmann, San FranciscoDas R, Eachempati S, Mishra A, Narayanan V, Das C (2009) Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. In: IEEE 15th international symposium on high performance computer architecture (HPCA’09), pp 175–186. doi: 10.1109/HPCA.2009.4798252Mahdaly AI, Mouftah HT, Hanna NN (1990) Topological properties of WK-recursive networks. In: Proceedings of IEEE workshop on future trends of distributed computing systems, pp 374–380. doi: 10.1109/FTDCS.1990.138349Duato J (1996) A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks. IEEE Trans Parallel Distrib Syst 7:841–854. doi: 10.1109/71.532115Duato J, Yalamanchili S, Lionel N (2002) Interconnection networks: an engineering approach. Morgan Kaufmann Publishers Inc., USAFlich J, Malumbres M, López P, Duato J (2000) Improving routing performance in Myrinet networks. In: International on parallel and distributed processing symposium, p 27. doi: 10.1109/IPDPS.2000.845961García M, Beivide R, Camarero C, Valero M, Rodríguez G, Minkenberg C (2015) On-the-fly adaptive routing for dragonfly interconnection networks. J Supercomput 71(3):1116–1142Gómez C, Gilabert F, Gómez M, López P, Duato J (2007) Deterministic versus adaptive routing in fat-trees. In: IEEE international on parallel and distributed processing symposium (IPDPS’07), pp 1–8. doi: 10.1109/IPDPS.2007.370482Gómez C, Gilabert F, Gómez M, López P, Duato J (2008) RUFT: simplifying the fat-tree topology. In: 14th IEEE international conference on parallel and distributed systems (ICPADS’08), pp 153–160. doi: 10.1109/ICPADS.2008.44Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. In: SIGCOMM ’09: proceedings of the ACM SIGCOMM 2009 conference on data communication. ACM, New York, pp 63–74. doi: 10.1145/1592568.1592577 . http://www.bibsonomy.org/bibtex/23a5da89fbf099e3c70f4559ab38082c5/chesteve . Accessed 22 Sept 2008Gupta A, Dally W (2006) Topology optimization of interconnection networks. Comput Arch Lett 5(1):10–13. doi: 10.1109/L-CA.2006.8Kim J, Dally W, Abts D (2007) Flattened butterfly: a cost-efficient topology for high-radix networks. In: Proceedings of the 34th annual international symposium on computer architecture (ISCA’07). ACM, New York, pp 126–137. doi: 10.1145/1250662.1250679Kim J, Dally W, Scott S, Abts D (2008) Technology-driven, highly-scalable dragonfly topology. In: Proceedings of the 35th annual international symposium on computer architecture (ISCA’08). IEEE Computer Society, Washington, DC, pp 77–88. doi: 10.1109/ISCA.2008.19Leighton F (1992) Introduction to parallel algorithms and architectures: arrays, trees, hypercubes v. 1. M. Kaufmann Publishers, San FranciscoLeiserson CE (1985) Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans Comput 34(10):892–901Matsutani H, Koibuchi M, Amano H (2007) Performance, cost, and energy evaluation of fat H-tree: a cost-efficient tree-based on-chip network. In: IEEE international on parallel and distributed processing symposium (IPDPS’07), pp 1–10. doi: 10.1109/IPDPS.2007.370271Rahmati D, Kiasari A, Hessabi S, Sarbazi-Azad H (2006) A performance and power analysis of wk-recursive and mesh networks for network-on-chips. In: International conference on computer design (ICCD’06), pp 142–147. doi: 10.1109/ICCD.2006.4380807Towles B, Dally WJ (2002) Worst-case traffic for oblivious routing functions. In: Proceedings of the fourteenth annual ACM symposium on parallel algorithms and architectures (SPAA’02). ACM, New York, pp 1–8. doi: 10.1145/564870.564872Yang Y, Funahashi A, Jouraku A, Nishi H, Amano H, Sueyoshi T (2001) Recursive diagonal torus: an interconnection network for massively parallel computers. IEEE Trans Parallel Distrib Syst 12(7):701–715. doi: 10.1109/71.94074
Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology
Actualmente, los clústeres de PCs están considerados como una alternativa eficiente a la hora de construir supercomputadores en los que miles de nodos de computación se conectan mediante una red de interconexión. La red de interconexión tiene que ser diseñada cuidadosamente, puesto que tiene una gran influencia sobre las prestaciones globales del sistema. Dos de los principales parámetros de diseño de las redes de interconexión son la topología y el encaminamiento. La topología define la interconexión de los elementos de la red entre sí, y entre éstos y los nodos de computación. Por su parte, el encaminamiento define los caminos que siguen los paquetes a través de la red.
Las prestaciones han sido tradicionalmente la principal métrica a la hora de evaluar las redes de interconexión. Sin embargo, hoy en día hay que considerar dos métricas adicionales: el coste y la tolerancia a fallos. Las redes de interconexión además de escalar en prestaciones también deben hacerlo en coste. Es decir, no sólo tienen que mantener su productividad conforme aumenta el tamaño de la red, sino que tienen que hacerlo sin incrementar sobremanera su coste. Por otra parte, conforme se incrementa el número de nodos en las máquinas de tipo clúster, la red de interconexión debe crecer en concordancia. Este incremento en el número de elementos de la red de interconexión aumenta la probabilidad de aparición de fallos, y por lo tanto, la tolerancia a fallos es prácticamente obligatoria para las redes de interconexión actuales.
Esta tesis se centra en la topología fat-tree, ya que es una de las topologías más comúnmente usadas en los clústeres. El objetivo de esta tesis es aprovechar sus características particulares para proporcionar tolerancia a fallos y un algoritmo de encaminamiento capaz de equilibrar la carga de la red proporcionando una buena solución de compromiso entre las prestaciones y el coste.Gómez Requena, C. (2010). Low-Memory Techniques for Routing and Fault-Tolerance on the Fat-Tree Topology [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8856Palanci
Speeding-up the fault-tolerance analysis of interconnection networks
© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksAnalyzing the fault-tolerance of interconnection
networks implies checking the connectivity of each sourcedestination
pair. The size of the exploration space of such
operation skyrockets with the network size and with the number
of link faults. However, this problem is highly parallelizable
since the exploration of each path between a source–destination
pair is independent of the other paths. This paper presents
an approach to analyze the fault-tolerance degree of multistage
interconnection networks using GPUs in order to speed-up it.
This approach uses CUDA as parallel programming tool on a
GPU in order to take advantage of all available cores. Results
show that the execution time of the fault-tolerance exploration
can be significantly reduced.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01.Bermúdez Garzón, DF.; Gómez Requena, C.; López Rodríguez, PJ.; Gómez Requena, ME. (2015). Speeding-up the fault-tolerance analysis of interconnection networks. IEEE. https://doi.org/10.1109/HPCSim.2015.7237035
A multipath analysis of biswapped networks.
Biswapped networks of the form have recently been proposed as interconnection networks to be implemented as optical transpose interconnection systems. We provide a systematic construction of vertex-disjoint paths joining any two distinct vertices in , where is the connectivity of . In doing so, we obtain an upper bound of on the -diameter of , where is the diameter of and the -diameter. Suppose that we have a deterministic multipath source routing algorithm in an interconnection network that finds mutually vertex-disjoint paths in joining any distinct vertices and does this in time polynomial in , and (and independently of the number of vertices of ). Our constructions yield an analogous deterministic multipath source routing algorithm in the interconnection network that finds mutually vertex-disjoint paths joining any distinct vertices in so that these paths all have length bounded as above. Moreover, our algorithm has time complexity polynomial in , and . We also show that if is Hamiltonian then is Hamiltonian, and that if is a Cayley graph then is a Cayley graph
Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers
A massive gap exists between current quantum computing (QC) prototypes, and
the size and scale required for many proposed QC algorithms. Current QC
implementations are prone to noise and variability which affect their
reliability, and yet with less than 80 quantum bits (qubits) total, they are
too resource-constrained to implement error correction. The term Noisy
Intermediate-Scale Quantum (NISQ) refers to these current and near-term systems
of 1000 qubits or less. Given NISQ's severe resource constraints, low
reliability, and high variability in physical characteristics such as coherence
time or error rates, it is of pressing importance to map computations onto them
in ways that use resources efficiently and maximize the likelihood of
successful runs.
This paper proposes and evaluates backend compiler approaches to map and
optimize high-level QC programs to execute with high reliability on NISQ
systems with diverse hardware characteristics. Our techniques all start from an
LLVM intermediate representation of the quantum program (such as would be
generated from high-level QC languages like Scaffold) and generate QC
executables runnable on the IBM Q public QC machine. We then use this framework
to implement and evaluate several optimal and heuristic mapping methods. These
methods vary in how they account for the availability of dynamic machine
calibration data, the relative importance of various noise parameters, the
different possible routing strategies, and the relative importance of
compile-time scalability versus runtime success. Using real-system
measurements, we show that fine grained spatial and temporal variations in
hardware parameters can be exploited to obtain an average x (and up to
x) improvement in program success rate over the industry standard IBM
Qiskit compiler.Comment: To appear in ASPLOS'1
- …