
    High-Speed Message Routing Mechanisms for Massively Parallel Computers

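    Massively parallel processing (MPP) systems are now moving into many fields that were once the stronghold of traditional vector processors and SIMD machines. These systems exploit the rapid progress of readily available high-performance CPUs, connecting hundreds to thousands of them into a homogeneous multiprocessor system. However, their performance on real problems is not always good and invariably falls far short of the nominal peak performance. Because all inter-processor communication in these systems goes through the interconnection network, the decisive factors determining the achievable peak performance are the interconnection network and the communication mechanism it uses. This thesis proposes two methods for realizing efficient communication mechanisms for MPP interconnection networks. The first is the "express router" (特急ルヌタ), whose suitability for use in interconnection networks is verified. The express router uses multiple unidirectional register-insertion paths to realize a network that combines time division and space division. Accurate models were developed to predict the performance of the express router's switch and buffer circuits for different radices and numbers of dimensions. The results confirm that the express router satisfies all the conditions required for efficient communication. More importantly, the express router can perform routing that preserves low latency and high throughput even when the network contains faults or traffic is congested. Simulation shows that the express router outperforms previously published fixed-path (deterministic) routers and matches or exceeds other adaptive routers. The second is a path-length-limited multicast communication scheme. Multicast communication contributes to speedup in many parallel processing problems. The thesis therefore studies the deadlock problem that arises for multicast communication under wormhole routing. As a solution, it proposes path-length-limited multicast, evaluates its communication performance by simulation, and compares it with multicast implemented by unicast-based and multipath-based schemes. The results show that the proposed path-length-limited multicast is a particularly good solution for multicast to a cluster for barrier synchronization, multicast to nearest-neighbor nodes, and broadcast to all nodes.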

    Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, in terms of both hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. 
This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately as all previous solutions do.
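
For readers unfamiliar with the term, the following is a minimal sketch of what a channel dependency graph is and why cycles in it matter. It is not the thesis's routing algorithm. It only illustrates the classical criterion (due to Dally and Seitz) that a deterministic routing is deadlock-free if its channel dependency graph is acyclic. The function names `build_cdg` and `has_cycle`, the channel labels, and the four-node ring example are all assumptions made for illustration.

```python
def build_cdg(routes):
    """Build a channel dependency graph from a list of routes.

    Each route is a sequence of channel labels in the order a packet
    acquires them. An edge a -> b means some packet may hold channel a
    while requesting channel b.
    """
    cdg = {}
    for path in routes:
        for held, requested in zip(path, path[1:]):
            cdg.setdefault(held, set()).add(requested)
            cdg.setdefault(requested, set())
    return cdg


def has_cycle(cdg):
    """Return True if the dependency graph contains a directed cycle.

    Iterative depth-first search: a channel is "open" while it is on the
    current DFS stack and "done" once all its successors were explored.
    """
    state = {}
    for start in cdg:
        if start in state:
            continue
        state[start] = "open"
        stack = [(start, iter(cdg[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                state[node] = "done"
                stack.pop()
            elif state.get(nxt) == "open":
                return True  # back edge: cyclic dependency, deadlock possible
            elif nxt not in state:
                state[nxt] = "open"
                stack.append((nxt, iter(cdg[nxt])))
    return False


# Hypothetical 4-node unidirectional ring: channel c{i} leads from node i
# to node (i + 1) % 4, and every node sends to the node two hops away.
ring_routes = [[f"c{i}", f"c{(i + 1) % 4}"] for i in range(4)]

cyclic = has_cycle(build_cdg(ring_routes))        # all four routes
acyclic = has_cycle(build_cdg(ring_routes[:3]))   # drop the wrap-around route
```

In the ring, the four two-hop routes create the dependency chain c0, c1, c2, c3 and back to c0, so the check reports a cycle. Dropping the route that crosses the wrap-around link breaks the chain. Virtual channels are the usual hardware means of breaking such cycles without removing routes, which is why the abstract treats their number as the limiting resource.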

    Quarc: an architecture for efficient on-chip communication

    The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in the system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, prevents networks on chip from delivering multicast messages as efficiently. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation while conforming to the resource constraints of the network on chip paradigm. Multicast communication in the majority of these networks on chip is implemented by establishing a connection between the source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and is therefore undesirable, particularly in latency-sensitive services such as cache coherence. To address high-performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low-power, high-performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface. 
The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that Quarc is highly efficient. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication.

    The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-016-1640-z
    In large-scale supercomputers, the interconnection network plays a key role in system performance. Network topology largely determines the performance and cost of the interconnection network. Direct topologies are sometimes used due to their reduced hardware cost, but the number of network dimensions is limited by the physical 3D space, which leads to an increase of the communication latency and a reduction of network throughput for large machines. Indirect topologies can provide better performance for large machines, but at higher hardware cost. In this paper, we propose a new family of hybrid topologies, the k-ary n-direct s-indirect, that combines the best features of both direct and indirect topologies to efficiently connect an extremely high number of processing nodes. The proposed network is an n-dimensional topology where the k nodes of each dimension are connected through a small indirect topology of s stages. This combination results in a family of topologies that provides high performance, with latency and throughput figures of merit close to those of indirect topologies, but at a lower hardware cost. In particular, it doubles the throughput obtained per cost unit compared with indirect topologies in most cases. Moreover, their fault-tolerance degree is similar to the one achieved by direct topologies built with switches with the same number of ports.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01 and by Programa de Ayudas de Investigación y Desarrollo (PAID) from Universitat Politècnica de València.
    Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The k-ary n-direct s-indirect family of topologies for large-scale interconnection networks. Journal of Supercomputing. 72(3):1035-1062.

    A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration

    The modeling and design of a fault-tolerant multiprocessor system is addressed. In particular, the behavior of the system during recovery and restoration after a fault has occurred is investigated. Given that a multicomputer system is designed using the Algorithm to Architecture to Mapping Model (ATAMM), and that a fault (death of a computing resource) occurs during its normal steady-state operation, a model is presented as a viable research tool for predicting the performance bounds of the system during its recovery and restoration phases. The bounds on the performance behavior of the system during this transient mode include the time to recover from the fault (t_rec), the time to restore the system (t_res), and whether there is a permanent delay in the system's Time Between Input and Output (TBIO) after the system has reached a steady state. An implementation of an ATAMM-based computer was developed with the Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written, based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is in turn used to validate the usefulness and accuracy of the new model in tracking the propagation of the delay through the system and predicting the behavior in the transient state of recovery and restoration. The model is validated as an accurate method to predict the transient behavior of an ATAMM-based multicomputer during recovery and restoration.