23 research outputs found

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Get PDF
    Distributed-memory systems are a key to achieve high performance computing and the most favorable architectures used in advanced research problems. Mesh connected multicomputer are one of the most popular architectures that have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique has been widely used in design of distributed-memory systems in which the packet is divided into small flits. Also, the multicast communication has been widely used in distributed-memory systems which is one source node sends the same message to several destination nodes. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Development of fault tolerant multicast routing algorithms in 2D mesh networks is an important issue. This dissertation presents, new fault tolerant multicast routing algorithms for distributed-memory systems performance using wormhole routed 2D mesh. These algorithms are described for fault tolerant routing in 2D mesh networks, but it can also be extended to other topologies. These algorithms are a combination of a unicast-based multicast algorithm and tree-based multicast algorithms. These algorithms works effectively for the most commonly encountered faults in mesh networks, f-rings, f-chains and concave fault regions. It is shown that the proposed routing algorithms are effective even in the presence of a large number of fault regions and large size of fault region. These algorithms are proved to be deadlock-free. Also, the problem of fault regions overlap is solved. Four essential performance metrics in mesh networks will be considered and calculated; also these algorithms are a limited-global-information-based multicasting which is a compromise of local-information-based approach and global-information-based approach. Data mining is used to validate the results and to enlarge the sample. The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms

    Tree-Based Multicasting in Wormhole-Routed Irregular Topologies

    Get PDF
    A deadlock-free tree-based multicast routing algorithm is presented for all direct networks, regardless of interconnection topology. The algorithm delivers a message to any number of destinations using only a single startup phase. In contrast to existing tree-based schemes, this algorithm applies to all interconnection topologies, requires only fixed-sized input buffers that are independent of maximum message length, and uses a single asynchronous flit replication mechanism. The theoretical basis of the technique used here is sufficiently general to develop other tree-based multicasting algorithms for regular and irregular topologies. Simulation results demonstrate that this tree-based algorithm provides a very promising means of achieving very low latency multicast

    Efficient Multicast Algorithms for Mesh and Torus Networks

    Get PDF
    With the increasing popularity of multicomputers, efficient way of communication within its processors has become a popular area of research. Multicomputers refer to a computer system that has multiple processors, they have high computational power and they can perform multiple tasks concurrently. Mesh and Torus are some of the commonly used network topologies in building multicomputer systems. Their performance highly depends on the underlying network communication such as multicast. Multicast is a communication method in which a message is sent from a source node to a certain number of destinations. Two major parameters used to evaluate multicast are time that a multicast process takes to deliver the message to all destinations and traffic that indicates the number of links used for this process. Research indicates that in general, it is NP- complete to find an optimal multicasting algorithm which is efficient on both time and traffic. This thesis suggests two new algorithms to achieve multicast in mesh and torus networks. Extensive simulations of these algorithms show that in practice they perform better than existing ones

    High-Speed Message Routing Mechanisms for Massively Parallel Computers

    Get PDF
    現在超並列処理システム(MPP)は、伝統的なベクトルプロセッサやSIMDマシンの 牙城であった多くの分野に進出している。これらのシステムは、入手が容易な高性能 CPUの急激な進歩をうまく利用し、これらを数百~数千個接続して均質なマルチプ ロセッサのシステムとして構成したものである。しかし、これらのシステムの性能は、 現実の問題を解くときは必ずしも良くなく、常に公称の最高性能にははるかに及ばな いのが現状である。これらのシステムではプロセッサ間の通信はすべて相互結合網に よって行われるので、実現可能な最高性能を決める決定的な要素は相互結合網と、そ れに使われる通信機構である。 本論文ではMPPの相互結合網に使われる、効率的な通信機構を実現する2つの方法 を提案する。第1は「特急ルータ」の提案であり、これを相互結合網に用いた場合の 適合性を検註する。特急ルータは多重の単方向レジスタ挿入パスを利用して、時間 空間混合分割型ネットワークを実現するためのものである。異なる基数や次元数につ いて、特急ルータのスイッチ回路とバッファ回路の性能を予測するための正確なモデ ルを開発した。この結果、特急ルータは効率的な通信を行うためのすべての条件を満 足していることが確かめられた。さらに重要な点は、特急ルータはネットワークに故 障のある場合や、通信が錯綜する場合にも、低遅延時間、高スループットを損なわな い経路制御が行えることである。シミュレーションによって評価した特急ルータのの 性能は、これまでに発表された固定経路選択方式のルータより優れており、また他の 適応経路制御方式のルータに比べても、同程度あるいはそれを越えていることが確か められた。 第2は経路長制限方式のマルチキャスト通信の提案である。マルチキャスト通信は 多くの並列処理問題において速度向上に寄与する通信方式である。そこでワームホー ル通信方式において問題となるマルチキャスト通信におけるデッドロックの問題につ いて研究した。そしてこの問題を解決する方法として経路長制限方式のマルチキャス ト通信を提案し、この方式による通信性能をシミュレーションによって評価し、ユニ キャスト方式やマルチパス方式によるマルチキャスト通信の性能と比較した。その結 果、提案する経路長制限方式のマルチキャスト通信は、パリヤ同期のためのクラスタ へのマルチキャスト通信や、最近傍ノードへのマルチキャストや全ノードへの放送の 場合に、特に優れた解決法となることを明らかにした

    High-Speed Message Routing Mechanisms for Massively Parallel Computers

    Get PDF
    現在超並列処理システム(MPP)は、伝統的なベクトルプロセッサやSIMDマシンの 牙城であった多くの分野に進出している。これらのシステムは、入手が容易な高性能 CPUの急激な進歩をうまく利用し、これらを数百~数千個接続して均質なマルチプ ロセッサのシステムとして構成したものである。しかし、これらのシステムの性能は、 現実の問題を解くときは必ずしも良くなく、常に公称の最高性能にははるかに及ばな いのが現状である。これらのシステムではプロセッサ間の通信はすべて相互結合網に よって行われるので、実現可能な最高性能を決める決定的な要素は相互結合網と、そ れに使われる通信機構である。 本論文ではMPPの相互結合網に使われる、効率的な通信機構を実現する2つの方法 を提案する。第1は「特急ルータ」の提案であり、これを相互結合網に用いた場合の 適合性を検註する。特急ルータは多重の単方向レジスタ挿入パスを利用して、時間 空間混合分割型ネットワークを実現するためのものである。異なる基数や次元数につ いて、特急ルータのスイッチ回路とバッファ回路の性能を予測するための正確なモデ ルを開発した。この結果、特急ルータは効率的な通信を行うためのすべての条件を満 足していることが確かめられた。さらに重要な点は、特急ルータはネットワークに故 障のある場合や、通信が錯綜する場合にも、低遅延時間、高スループットを損なわな い経路制御が行えることである。シミュレーションによって評価した特急ルータのの 性能は、これまでに発表された固定経路選択方式のルータより優れており、また他の 適応経路制御方式のルータに比べても、同程度あるいはそれを越えていることが確か められた。 第2は経路長制限方式のマルチキャスト通信の提案である。マルチキャスト通信は 多くの並列処理問題において速度向上に寄与する通信方式である。そこでワームホー ル通信方式において問題となるマルチキャスト通信におけるデッドロックの問題につ いて研究した。そしてこの問題を解決する方法として経路長制限方式のマルチキャス ト通信を提案し、この方式による通信性能をシミュレーションによって評価し、ユニ キャスト方式やマルチパス方式によるマルチキャスト通信の性能と比較した。その結 果、提案する経路長制限方式のマルチキャスト通信は、パリヤ同期のためのクラスタ へのマルチキャスト通信や、最近傍ノードへのマルチキャストや全ノードへの放送の 場合に、特に優れた解決法となることを明らかにした

    Path-Based partitioning methods for 3D Networks-on-Chip with minimal adaptive routing

    Full text link
    © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Combining the benefits of 3D ICs and Networks-on-Chip (NoCs) schemes provides a significant performance gain in Chip Multiprocessors (CMPs) architectures. As multicast communication is commonly used in cache coherence protocols for CMPs and in various parallel applications, the performance of these systems can be significantly improved if multicast operations are supported at the hardware level. In this paper, we present several partitioning methods for the path-based multicast approach in 3D mesh-based NoCs, each with different levels of efficiency. In addition, we develop novel analytical models for unicast and multicast traffic to explore the efficiency of each approach. In order to distribute the unicast and multicast traffic more efficiently over the network, we propose the Minimal and Adaptive Routing (MAR) algorithm for the presented partitioning methods. The analytical and experimental results show that an advantageous method named Recursive Partitioning (RP) outperforms the other approaches. RP recursively partitions the network until all partitions contain a comparable number of switches and thus the multicast traffic is equally distributed among several subsets and the network latency is considerably decreased. The simulation results reveal that the RP method can achieve performance improvement across all workloads while performance can be further improved by utilizing the MAR algorithm. Nineteen percent average and 42 percent maximum latency reduction are obtained on SPLASH-2 and PARSEC benchmarks running on a 64-core CMP.Ebrahimi, M.; Daneshtalab, M.; Liljeberg, P.; Plosila, J.; Flich Cardo, J.; Tenhunen, H. (2014). Path-Based partitioning methods for 3D Networks-on-Chip with minimal adaptive routing. IEEE Transactions on Computers. 63(3):718-733. doi:10.1109/TC.2012.255S71873363

    Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers

    Get PDF
    Workstation cluster multicomputers are increasingly being applied for solving scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communications. These operations include synchronization, global reduction, broadcast/multicast operations and orderly access to shared global variables. Hall has demonstrated that a .secondary network with wide tree topology and centralized coordination processors (COP) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test my hypothesis, I interfaced COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver, and a function library which allows to use COP system in place of PVM function calls in application programs. My implementation makes it possible to easily port any existing PVM applications to perform fast global operations using the COP system. To evaluate the performance improvements of using a COP system, I measured cost of various PVM global functions, derived the cost of equivalent COP library global functions, and compared the results. To analyze the cost of global operations on overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes. The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 show that speedup provided by my global function libraries and the COP system reduces overall execution time for this and similar applications by above 1.5 times. Additionally, the performance improvement seen by applications increases as the cluster size increases, thus providing a scalable solution for performing global operations

    Exploring Adaptive Implementation of On-Chip Networks

    Get PDF
    As technology geometries have shrunk to the deep submicron regime, the communication delay and power consumption of global interconnections in high performance Multi- Processor Systems-on-Chip (MPSoCs) are becoming a major bottleneck. The Network-on- Chip (NoC) architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication issues such as performance limitations of long interconnects and integration of large number of Processing Elements (PEs) on a chip. The choice of routing protocol and NoC structure can have a significant impact on performance and power consumption in on-chip networks. In addition, building a high performance, area and energy efficient on-chip network for multicore architectures requires a novel on-chip router allowing a larger network to be integrated on a single die with reduced power consumption. On top of that, network interfaces are employed to decouple computation resources from communication resources, to provide the synchronization between them, and to achieve backward compatibility with existing IP cores. Three adaptive routing algorithms are presented as a part of this thesis. The first presented routing protocol is a congestion-aware adaptive routing algorithm for 2D mesh NoCs which does not support multicast (one-to-many) traffic while the other two protocols are adaptive routing models supporting both unicast (one-to-one) and multicast traffic. A streamlined on-chip router architecture is also presented for avoiding congested areas in 2D mesh NoCs via employing efficient input and output selection. The output selection utilizes an adaptive routing algorithm based on the congestion condition of neighboring routers while the input selection allows packets to be serviced from each input port according to its congestion level. Moreover, in order to increase memory parallelism and bring compatibility with existing IP cores in network-based multiprocessor architectures, adaptive network interface architectures are presented to use multiple SDRAMs which can be accessed simultaneously. In addition, a smart memory controller is integrated in the adaptive network interface to improve the memory utilization and reduce both memory and network latencies. Three Dimensional Integrated Circuits (3D ICs) have been emerging as a viable candidate to achieve better performance and package density as compared to traditional 2D ICs. In addition, combining the benefits of 3D IC and NoC schemes provides a significant performance gain for 3D architectures. In recent years, inter-layer communication across multiple stacked layers (vertical channel) has attracted a lot of interest. In this thesis, a novel adaptive pipeline bus structure is proposed for inter-layer communication to improve the performance by reducing the delay and complexity of traditional bus arbitration. In addition, two mesh-based topologies for 3D architectures are also introduced to mitigate the inter-layer footprint and power dissipation on each layer with a small performance penalty.Siirretty Doriast
    corecore