50 research outputs found

    On the performance of routing algorithms in wormhole-switched multicomputer networks

    Get PDF
    This paper presents a comparative performance study of adaptive and deterministic routing algorithms in wormhole-switched hypercubes and investigates the performance vicissitudes of these routing schemes under a variety of network operating conditions. Despite the previously reported results, our results show that the adaptive routing does not consistently outperform the deterministic routing even for high dimensional networks. In fact, it appears that the superiority of adaptive routing is highly dependent to the broadcast traffic rate generated at each node and it begins to deteriorate by growing the broadcast rate of generated message

    Design of a communications interface for a very high performance computer

    Get PDF
    PetaFLOPS computing power is the newest goal of Federal Government agencies, in the increasingly active supercomputer field. To obtain this performance goal by the year 2007, sophisticated parallel processing designs are required. To effectively create network interfaces/routers for interprocessor communications in such computer systems, it requires optimal hardware and software codesigns. An interface is presented for the NJIT New Millennium Computing Point Design, a system that targets 100 TeraFLOPS performance by the year 2005. The router handles store-and-forward switching and wormhole routing for the system

    Quarc: an architecture for efficient on-chip communication

    Get PDF
    The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, inhibits the multicast messages to be delivered as efficiently by networks on chip. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation, while conforming to resource constraints of the network on chip paradigm. Multicast communication in majority of these networks on chip is implemented by establishing a connection between source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and, therefore, is not desirable; in particular in latency sensitive services such as cache coherence. To address high performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low power, high performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface. The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that the Quarc architecture is a highly efficient architecture. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication

    On the design and implementation of broadcast and global combine operations using the postal model

    Get PDF
    There are a number of models that were proposed in recent years for message passing parallel systems. Examples are the postal model and its generalization the LogP model. In the postal model a parameter λ is used to model the communication latency of the message-passing system. Each node during each round can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of hand will arrive at the receiving node at round r + λ - 1. Our goal in this paper is to bridge the gap between the theoretical modeling and the practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely, the broadcast operation and the global combine operation. Those practical issues include, for example, 1) techniques for measurement of the value of λ on a given machine, 2) creating efficient broadcast algorithms that get the latency hand the number of nodes n as parameters and 3) creating efficient global combine algorithms for parallel machines with λ which is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning, for example, a properly tuned broadcast improves the known implementation by more than 20%

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Get PDF
    Distributed-memory systems are a key to achieve high performance computing and the most favorable architectures used in advanced research problems. Mesh connected multicomputer are one of the most popular architectures that have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique has been widely used in design of distributed-memory systems in which the packet is divided into small flits. Also, the multicast communication has been widely used in distributed-memory systems which is one source node sends the same message to several destination nodes. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Development of fault tolerant multicast routing algorithms in 2D mesh networks is an important issue. This dissertation presents, new fault tolerant multicast routing algorithms for distributed-memory systems performance using wormhole routed 2D mesh. These algorithms are described for fault tolerant routing in 2D mesh networks, but it can also be extended to other topologies. These algorithms are a combination of a unicast-based multicast algorithm and tree-based multicast algorithms. These algorithms works effectively for the most commonly encountered faults in mesh networks, f-rings, f-chains and concave fault regions. It is shown that the proposed routing algorithms are effective even in the presence of a large number of fault regions and large size of fault region. These algorithms are proved to be deadlock-free. Also, the problem of fault regions overlap is solved. Four essential performance metrics in mesh networks will be considered and calculated; also these algorithms are a limited-global-information-based multicasting which is a compromise of local-information-based approach and global-information-based approach. Data mining is used to validate the results and to enlarge the sample. The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms

    Performance analysis of wormhole routing in multicomputer interconnection networks

    Get PDF
    Perhaps the most critical component in determining the ultimate performance potential of a multicomputer is its interconnection network, the hardware fabric supporting communication among individual processors. The message latency and throughput of such a network are affected by many factors of which topology, switching method, routing algorithm and traffic load are the most significant. In this context, the present study focuses on a performance analysis of k-ary n-cube networks employing wormhole switching, virtual channels and adaptive routing, a scenario of especial interest to current research. This project aims to build upon earlier work in two main ways: constructing new analytical models for k-ary n-cubes, and comparing the performance merits of cubes of different dimensionality. To this end, some important topological properties of k-ary n-cubes are explored initially; in particular, expressions are derived to calculate the number of nodes at/within a given distance from a chosen centre. These results are important in their own right but their primary significance here is to assist in the construction of new and more realistic analytical models of wormhole-routed k-ary n-cubes. An accurate analytical model for wormhole-routed k-ary n-cubes with adaptive routing and uniform traffic is then developed, incorporating the use of virtual channels and the effect of locality in the traffic pattern. New models are constructed for wormhole k-ary n-cubes, with the ability to simulate behaviour under adaptive routing and non-uniform communication workloads, such as hotspot traffic, matrix-transpose and digit-reversal permutation patterns. The models are equally applicable to unidirectional and bidirectional k-ary n-cubes and are significantly more realistic than any in use up to now. With this level of accuracy, the effect of each important network parameter on the overall network performance can be investigated in a more comprehensive manner than before. Finally, k-ary n-cubes of different dimensionality are compared using the new models. The comparison takes account of various traffic patterns and implementation costs, using both pin-out and bisection bandwidth as metrics. Networks with both normal and pipelined channels are considered. While previous similar studies have only taken account of network channel costs, our model incorporates router costs as well thus generating more realistic results. In fact the results of this work differ markedly from those yielded by earlier studies which assumed deterministic routing and uniform traffic, illustrating the importance of using accurate models to conduct such analyses

    High-Speed Message Routing Mechanisms for Massively Parallel Computers

    Get PDF
    現在超並列処理システム(MPP)は、伝統的なベクトルプロセッサやSIMDマシンの 牙城であった多くの分野に進出している。これらのシステムは、入手が容易な高性能 CPUの急激な進歩をうまく利用し、これらを数百~数千個接続して均質なマルチプ ロセッサのシステムとして構成したものである。しかし、これらのシステムの性能は、 現実の問題を解くときは必ずしも良くなく、常に公称の最高性能にははるかに及ばな いのが現状である。これらのシステムではプロセッサ間の通信はすべて相互結合網に よって行われるので、実現可能な最高性能を決める決定的な要素は相互結合網と、そ れに使われる通信機構である。 本論文ではMPPの相互結合網に使われる、効率的な通信機構を実現する2つの方法 を提案する。第1は「特急ルータ」の提案であり、これを相互結合網に用いた場合の 適合性を検註する。特急ルータは多重の単方向レジスタ挿入パスを利用して、時間 空間混合分割型ネットワークを実現するためのものである。異なる基数や次元数につ いて、特急ルータのスイッチ回路とバッファ回路の性能を予測するための正確なモデ ルを開発した。この結果、特急ルータは効率的な通信を行うためのすべての条件を満 足していることが確かめられた。さらに重要な点は、特急ルータはネットワークに故 障のある場合や、通信が錯綜する場合にも、低遅延時間、高スループットを損なわな い経路制御が行えることである。シミュレーションによって評価した特急ルータのの 性能は、これまでに発表された固定経路選択方式のルータより優れており、また他の 適応経路制御方式のルータに比べても、同程度あるいはそれを越えていることが確か められた。 第2は経路長制限方式のマルチキャスト通信の提案である。マルチキャスト通信は 多くの並列処理問題において速度向上に寄与する通信方式である。そこでワームホー ル通信方式において問題となるマルチキャスト通信におけるデッドロックの問題につ いて研究した。そしてこの問題を解決する方法として経路長制限方式のマルチキャス ト通信を提案し、この方式による通信性能をシミュレーションによって評価し、ユニ キャスト方式やマルチパス方式によるマルチキャスト通信の性能と比較した。その結 果、提案する経路長制限方式のマルチキャスト通信は、パリヤ同期のためのクラスタ へのマルチキャスト通信や、最近傍ノードへのマルチキャストや全ノードへの放送の 場合に、特に優れた解決法となることを明らかにした

    High-Speed Message Routing Mechanisms for Massively Parallel Computers

    Get PDF
    現在超並列処理システム(MPP)は、伝統的なベクトルプロセッサやSIMDマシンの 牙城であった多くの分野に進出している。これらのシステムは、入手が容易な高性能 CPUの急激な進歩をうまく利用し、これらを数百~数千個接続して均質なマルチプ ロセッサのシステムとして構成したものである。しかし、これらのシステムの性能は、 現実の問題を解くときは必ずしも良くなく、常に公称の最高性能にははるかに及ばな いのが現状である。これらのシステムではプロセッサ間の通信はすべて相互結合網に よって行われるので、実現可能な最高性能を決める決定的な要素は相互結合網と、そ れに使われる通信機構である。 本論文ではMPPの相互結合網に使われる、効率的な通信機構を実現する2つの方法 を提案する。第1は「特急ルータ」の提案であり、これを相互結合網に用いた場合の 適合性を検註する。特急ルータは多重の単方向レジスタ挿入パスを利用して、時間 空間混合分割型ネットワークを実現するためのものである。異なる基数や次元数につ いて、特急ルータのスイッチ回路とバッファ回路の性能を予測するための正確なモデ ルを開発した。この結果、特急ルータは効率的な通信を行うためのすべての条件を満 足していることが確かめられた。さらに重要な点は、特急ルータはネットワークに故 障のある場合や、通信が錯綜する場合にも、低遅延時間、高スループットを損なわな い経路制御が行えることである。シミュレーションによって評価した特急ルータのの 性能は、これまでに発表された固定経路選択方式のルータより優れており、また他の 適応経路制御方式のルータに比べても、同程度あるいはそれを越えていることが確か められた。 第2は経路長制限方式のマルチキャスト通信の提案である。マルチキャスト通信は 多くの並列処理問題において速度向上に寄与する通信方式である。そこでワームホー ル通信方式において問題となるマルチキャスト通信におけるデッドロックの問題につ いて研究した。そしてこの問題を解決する方法として経路長制限方式のマルチキャス ト通信を提案し、この方式による通信性能をシミュレーションによって評価し、ユニ キャスト方式やマルチパス方式によるマルチキャスト通信の性能と比較した。その結 果、提案する経路長制限方式のマルチキャスト通信は、パリヤ同期のためのクラスタ へのマルチキャスト通信や、最近傍ノードへのマルチキャストや全ノードへの放送の 場合に、特に優れた解決法となることを明らかにした
    corecore