10 research outputs found

    Performance analysis of wormhole switched interconnection networks with virtual channels and finite buffers

    Get PDF
    An efficient interconnection network that provides high bandwidth and low latency interprocessor communication is critical to harness fully the computational power of large scale multicomputer. K-ary n-cube networks have been widely adopted in contemporary multicomputers due to their desirable properties. As such, the present study focuses on a performance analysis of K-ary n-cubes employing wormhole switching, virtual channels, and adaptive routing. The objective of this dissertation is twofold: to examine the performance of these networks, and to compare the performance merits of various topologies under different working conditions, by means of analytical modelling. Most existing analytical models reported in the literature have used a method originally proposed by Dally to capture the effects of virtual channels on network performance. This method is based on a Markov chain and it has been shown that its prediction accuracy degrades as traffic increases. Moreover, these studies have also constrained the buffer capacity to a single flit per channel, a simplifying assumption that has often been invoked to ease the derivation of the analytical models. Motivated by these observations, the first part of this research proposes a new method for modelling virtual channels, based on an M/G/1 queue. Owing to the generality of this method. Daily's method is shown to be a special case when the message service time is exponentially distributed. The second part of this research uses theoretical results of queuing systems to relax the single-flit buffer assumption. New analytical models are then proposed to capture the effects of deploying arbitrary size buffers on the performance of deterministic and adaptive routing algorithms. Simulation experiments reveal that results from the proposed analytical models are in close agreement with those obtained through simulation. Building on these new analytical models, the third part of this research compares the relative performance merits of K-ary n-cubes under different operating conditions, in the presence of finite size buffers and multiple virtual channels. Namely, the analysis first revisits the relative performance merits of the well-known 2D torus, 3D torus and hypercube under different implementation constraints. The analysis has then been extended to investigate the performance impact of arranging the total buffer space, allocated to a physical channel, into multiple virtual channels. Finally, the performance of adaptive routing has been compared to that of deterministic routing. While previous similar studies have only taken account of channel and router costs, the present analysis incorporates different intra-router delays, as well, and thus generates more realistic results. In fact, the results of this research differ notably from those reported in previous studies, illustrating the sensitivity of such studies to the level of detail, degree of accuracy and the realism of the assumptions adopted

    Quarc: an architecture for efficient on-chip communication

    Get PDF
    The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, inhibits the multicast messages to be delivered as efficiently by networks on chip. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation, while conforming to resource constraints of the network on chip paradigm. Multicast communication in majority of these networks on chip is implemented by establishing a connection between source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and, therefore, is not desirable; in particular in latency sensitive services such as cache coherence. To address high performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low power, high performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface. The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that the Quarc architecture is a highly efficient architecture. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication

    Static allocation of computation to processors in multicomputers

    Get PDF

    Message-driven dynamics

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.Includes bibliographical references (p. 251-260).by Richard Anton Lethin.Ph.D

    Automatic synthesis and optimization of chip multiprocessors

    Get PDF
    The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed

    Design and analysis of a 3-dimensional cluster multicomputer architecture using optical interconnection for petaFLOP computing

    Get PDF
    In this dissertation, the design and analyses of an extremely scalable distributed multicomputer architecture, using optical interconnects, that has the potential to deliver in the order of petaFLOP performance is presented in detail. The design takes advantage of optical technologies, harnessing the features inherent in optics, to produce a 3D stack that implements efficiently a large, fully connected system of nodes forming a true 3D architecture. To adopt optics in large-scale multiprocessor cluster systems, efficient routing and scheduling techniques are needed. To this end, novel self-routing strategies for all-optical packet switched networks and on-line scheduling methods that can result in collision free communication and achieve real time operation in high-speed multiprocessor systems are proposed. The system is designed to allow failed/faulty nodes to stay in place without appreciable performance degradation. The approach is to develop a dynamic communication environment that will be able to effectively adapt and evolve with a high density of missing units or nodes. A joint CPU/bandwidth controller that maximizes the resource allocation in this dynamic computing environment is introduced with an objective to optimize the distributed cluster architecture, preventing performance/system degradation in the presence of failed/faulty nodes. A thorough analysis, feasibility study and description of the characteristics of a 3-Dimensional multicomputer system capable of achieving 100 teraFLOP performance is discussed in detail. Included in this dissertation is throughput analysis of the routing schemes, using methods from discrete-time queuing systems and computer simulation results for the different proposed algorithms. A prototype of the 3D architecture proposed is built and a test bed developed to obtain experimental results to further prove the feasibility of the design, validate initial assumptions, algorithms, simulations and the optimized distributed resource allocation scheme. Finally, as a prelude to further research, an efficient data routing strategy for highly scalable distributed mobile multiprocessor networks is introduced

    Quantitative performance evaluation of SCI memory hierarchies

    Get PDF

    Deployment and Debugging of Real-Time Applications on Multicore Architectures

    Get PDF
    It is essential to enable information extraction from software. Program tracing techniques are an example of information extraction. Program tracing extracts information from the program during execution. Tracing helps with the testing and validation of software to ensure that the software under test is correct. Information extraction is done by instrumenting the program. Logged information can be stored in dedicated logging memories or can be buffered and streamed off-chip to an external monitor. The designer inspects the trace after execution to identify potentially erroneous state information. In addition, the trace can provide the state information that serves as input to generate the erroneous output for reproducibility. Information extraction can be difficult and expensive due to the increase in size and complexity of modern software systems. For the sub-class of software systems known as real-time systems, these issues are further aggravated. This is because real-time systems demand timing guarantees in addition to functional correctness. Consequently, any instrumentation to the original program code for the purpose of information extraction may affect the temporal behaviors of the program. This perturbation of temporal behaviors can lead to the violation of timing constraints, which may bias the program execution and/or cause the program to miss its deadline. As a result, there is considerable interest in devising techniques to allow for information extraction without missing a program’s deadline that is known as time-aware instrumentation. This thesis investigates time-aware instrumentation mechanisms to instrument programs while respecting their timing constraints and functional behavior. Knowledge of the underlying hardware on which the software runs, enables the extraction of more information via the instrumentation process. Chip-multiprocessors offer a solution to the performance bottleneck on uni-processors. Providing timing guarantees for hard real-time systems, however, on chip-multiprocessors is difficult. This is because conventional communication interconnects are designed to optimize the average-case performance. Therefore, researchers propose interconnects such as the priority-aware networks to satisfy the requirements of hard real-time systems. The priority-aware interconnects, however, lack the proper analysis techniques to facilitate the deployment of real-time systems. This thesis also investigates latency and buffer space analysis techniques for pipelined communication resource models, as well as algorithms for the proper deployment of real-time applications to these platforms. The analysis techniques proposed in this thesis provide guarantees on the schedulability of real-time systems on chip-multiprocessors. These guarantees are based on reducing contention in the interconnect while simultaneously accurately computing the worst-case communication latencies. While these worst-case latencies provide bounds for computing the overall worst-case execution time of applications on chip-multiprocessors, they also provide means to assigning instrumentation budgets required by time-aware instrumentation. Leveraging these platform-specific analysis techniques for the assignment of instrumentation budgets, allows for extracting more information from the instrumentation process

    Runtime Support for In-Core and Out-of-Core Data-Parallel Programs

    Get PDF
    Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be directly used in application programs for greater efficiency, portability and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines. In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all to all collective communication on fat tree and two dimensional mesh interconnection topologies. The performance of these algorithms on the CM 5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparing with experimental results. A number of applications deal with very large data sets which cannot fit in main memory, and hence have to be stored in files on disks, resulting in out of core programs. This thesis also describes the design and implementation of efficient runtime support for out of core computations. Several optimizations for accessing out of core data are presented. An extended Two Phase Method is proposed for accessing sections of out of core arrays efficiently. This method uses collective I/O and the I/O workload is divided among processors dynamically, depending on the access requests. Performance results obtained using this runtime support for out of core programs on the Touchstone Delta are presented
    corecore