High-Speed Message Routing Mechanisms for Massively Parallel Computers
Massively parallel processing (MPP) systems are now moving into many fields that were once the exclusive domain of traditional vector processors and SIMD machines. To exploit the rapid progress of readily available high-performance CPUs, these systems connect hundreds to thousands of them into a homogeneous multiprocessor. In practice, however, their performance on real problems is not necessarily good and always falls far short of the nominal peak. Because all interprocessor communication in these systems goes through the interconnection network, the network and the communication mechanisms used on it are the decisive factors in the peak performance that can actually be achieved.
This work proposes two methods for realizing efficient communication mechanisms in MPP interconnection networks. The first is the "express router" (特急ルータ), whose suitability for use in an interconnection network is verified. The express router uses multiple bidirectional register-insertion buses to realize a hybrid time- and space-division network. Accurate models were developed to predict the performance of its switch and buffer circuits for different radices and numbers of dimensions. The results confirm that the express router satisfies all the conditions for efficient communication. More importantly, it can route without sacrificing low latency or high throughput even when the network contains faults or traffic is congested. In simulation, the express router outperforms previously published deterministic (fixed-path) routers and matches or exceeds other adaptive routers.
The second is path-length-restricted multicast communication. Multicast contributes to speedup in many parallel processing problems. This work studies the deadlock problem that multicast raises under wormhole routing, proposes path-length-restricted multicast as a solution, evaluates its communication performance by simulation, and compares it with unicast-based and multipath-based multicast. The results show that the proposed scheme is particularly effective for multicast to clusters for barrier synchronization, multicast to nearest-neighbor nodes, and broadcast to all nodes.
Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable because of irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate the negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
The fail-in-place strategy, a well-established method in storage systems of repairing only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
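
As an illustration of the channel dependency graph mentioned in the abstract above, the following minimal sketch builds the graph from a set of fixed routes and tests it for cycles, which is the classical acyclicity check for deadlock freedom under deterministic routing. It is not the thesis's routing algorithm, which integrates deadlock avoidance into path calculation. The function names, the 4-node ring, and the virtual-channel assignment are assumptions made for the example.

```python
from collections import defaultdict

def build_cdg(paths):
    """Build a channel dependency graph {channel: set of next channels}.

    Each path is a list of channels in traversal order. A packet holding
    one channel while requesting the next creates a dependency edge.
    """
    cdg = defaultdict(set)
    for path in paths:
        for held, requested in zip(path, path[1:]):
            cdg[held].add(requested)
        if path:
            cdg[path[-1]]  # register the final channel as a node
    return dict(cdg)

def has_cycle(graph):
    """Depth-first search with three colours; a grey-to-grey edge is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

# Hypothetical unidirectional 4-node ring: every node sends two hops ahead.
# A channel is written (source node, destination node).
ring_paths = [
    [(0, 1), (1, 2)],
    [(1, 2), (2, 3)],
    [(2, 3), (3, 0)],
    [(3, 0), (0, 1)],
]

# Same routes with an assumed second virtual channel: a packet moves to
# virtual channel 1 after it has crossed link 3 -> 0. A channel is now
# written (source node, destination node, virtual channel).
dateline_paths = [
    [(0, 1, 0), (1, 2, 0)],
    [(1, 2, 0), (2, 3, 0)],
    [(2, 3, 0), (3, 0, 0)],
    [(3, 0, 0), (0, 1, 1)],
]

print(has_cycle(build_cdg(ring_paths)))      # True: routes can deadlock
print(has_cycle(build_cdg(dateline_paths)))  # False: dependencies are acyclic
```

The second route set shows why the number of available virtual channels matters: one extra channel on a single link is enough to break the cycle in this toy case.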
Quarc: an architecture for efficient on-chip communication
The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems.
Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in the system-on-chip realm.
By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, prevents networks on chip from delivering multicast messages as efficiently. Relying on extensive research on multicast communication in parallel computers, several network-on-chip architectures have offered mechanisms to perform the operation while conforming to the resource constraints of the network-on-chip paradigm. In the majority of these networks on chip, multicast communication is implemented by establishing a connection between the source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and is therefore undesirable, particularly in latency-sensitive services such as cache coherence.
To address high-performance multicast communication, this research presents Quarc, a novel network-on-chip architecture. The Quarc architecture targets an area-efficient, low-power, high-performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface.
The cost and performance comparison of the Quarc architecture against other network-on-chip architectures reveals that the Quarc architecture is highly efficient.
Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality-of-service-aware communication.
A survey of routing techniques in store-and-forward and wormhole interconnects.
This paper presents an overview of algorithms for directing messages through networks of varying topology. These are commonly referred to as routing algorithms in the literature that is presented. In addition to providing background on networking terminology and router basics, the paper explains the issues of deadlock and livelock as they apply to routing. After this, there is a discussion of routing algorithms for both store-and-forward and wormhole-switched networks. The paper covers both algorithms that do and do not adapt to conditions in the network. Techniques targeting structured as well as irregular topologies are discussed. Following this, strategies for routing in the presence of faulty nodes and links in the network are described
The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks
The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-016-1640-z
In large-scale supercomputers, the interconnection network plays a key role in system performance. Network topology largely determines the performance and cost of the interconnection network. Direct topologies are sometimes used due to their reduced hardware cost, but the number of network dimensions is limited by the physical 3D space, which leads to an increase in communication latency and a reduction in network throughput for large machines. Indirect topologies can provide better performance for large machines, but at higher hardware cost. In this paper, we propose a new family of hybrid topologies, the k-ary n-direct s-indirect, that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of processing nodes. The proposed network is an n-dimensional topology where the k nodes of each dimension are connected through a small indirect topology of s stages. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to those of indirect topologies, but at a lower hardware cost. In particular, it doubles the throughput obtained per cost unit compared with indirect topologies in most of the cases. Moreover, their fault-tolerance degree is similar to the one achieved by direct topologies built with switches with the same number of ports.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01 and by Programa de Ayudas de Investigacion y Desarrollo (PAID) from Universitat Politecnica de Valencia.
Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks. Journal of Supercomputing. 72(3):1035-1062.
https://doi.org/10.1007/s11227-016-1640-z
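
The structure described in the abstract above can be sketched as simple counting. This is a hedged illustration rather than the paper's model: it assumes that "the k nodes of each dimension" means the k nodes that share every coordinate except one, as along a line of a k-ary n-cube. The function names are hypothetical, and the internal s-stage indirect network is not modelled.

```python
def kary_ndirect_counts(k: int, n: int) -> dict:
    """Count nodes and per-dimension indirect subnetworks.

    Assumption: each node has n coordinates in range(k), and one small
    indirect subnetwork joins the k nodes that differ only in one
    given coordinate.
    """
    if k < 2 or n < 1:
        raise ValueError("need k >= 2 and n >= 1")
    return {
        "nodes": k ** n,
        # k**(n-1) lines per dimension, times n dimensions
        "indirect_subnetworks": n * k ** (n - 1),
        "nodes_per_subnetwork": k,
    }

def line_members(coord: tuple, dim: int, k: int) -> list:
    """Nodes sharing coord's indirect subnetwork along dimension dim."""
    return [coord[:dim] + (i,) + coord[dim + 1:] for i in range(k)]

counts = kary_ndirect_counts(k=4, n=2)
print(counts)
# {'nodes': 16, 'indirect_subnetworks': 8, 'nodes_per_subnetwork': 4}
print(line_members((1, 2), dim=0, k=4))
# [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Under this reading, each node attaches to n subnetworks, one per dimension, so a message between two nodes crosses at most n of them, one for each coordinate that differs.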
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration
The modeling and design of a fault-tolerant multiprocessor system is addressed. In particular, the behavior of the system during recovery and restoration after a fault has occurred is investigated. Given that a multicomputer system is designed using the Algorithm to Architecture to Mapping Model (ATAMM), and that a fault (death of a computing resource) occurs during its normal steady-state operation, a model is presented as a viable research tool for predicting the performance bounds of the system during its recovery and restoration phases. Furthermore, the bounds of the performance behavior of the system during this transient mode can be assessed. These bounds include the time to recover from the fault (t_rec), the time to restore the system (t_res), and whether there is a permanent delay in the system's Time Between Input and Output (TBIO) after the system has reached a steady state. An implementation of an ATAMM-based computer was developed with the Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is in turn used to validate the usefulness and accuracy of the new model in tracking the propagation of the delay through the system and predicting the behavior in the transient state of recovery and restoration. The model is validated as an accurate method to predict the transient behavior of an ATAMM-based multicomputer during recovery and restoration.