
    A Compiler and Runtime Infrastructure for Automatic Program Distribution

    This paper presents the design and implementation of a compiler and runtime infrastructure for automatic program distribution. We are building a research infrastructure that enables experimentation with various program partitioning and mapping strategies and the study of automatic distribution's effect on resource consumption (e.g., CPU, memory, communication). Since many optimization techniques face conflicting optimization targets (e.g., memory and communication), we believe it is important to be able to study their interaction. We present a set of techniques that enable flexible resource modeling and program distribution: dependence analysis, weighted graph partitioning, code and communication generation, and profiling. We have developed these ideas in the context of the Java language. We present in detail the design and implementation of each technique as part of our compiler and runtime infrastructure. We then evaluate our design and present preliminary experimental data for each component, as well as for the entire system.
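    As a rough illustration of the weighted graph partitioning step mentioned above, the sketch below (in Java, matching the language the work targets) models program units as nodes weighted by estimated CPU cost and edges weighted by estimated communication volume, then assigns them greedily to machines. This is a minimal sketch under assumed data structures, not the paper's implementation; all class and method names are hypothetical.

    import java.util.*;

    // Hypothetical sketch: program units as a weighted graph, with a greedy
    // heuristic that keeps heavily communicating units together while
    // roughly balancing CPU load across partitions.
    public class GreedyPartitioner {

        static final class Node {
            final String name;
            final double cpuWeight;                           // e.g. profiled execution cost
            final Map<Node, Double> edges = new HashMap<>();  // estimated communication volume

            Node(String name, double cpuWeight) {
                this.name = name;
                this.cpuWeight = cpuWeight;
            }
        }

        /** Assigns each node to one of k partitions, preferring the partition
         *  that maximizes co-located communication minus its current load. */
        static Map<Node, Integer> partition(List<Node> nodes, int k) {
            double[] load = new double[k];
            Map<Node, Integer> assignment = new HashMap<>();

            // Place heavy nodes first so the balance term is meaningful.
            List<Node> order = new ArrayList<>(nodes);
            order.sort(Comparator.comparingDouble((Node n) -> n.cpuWeight).reversed());

            for (Node n : order) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double localComm = 0;
                    for (Map.Entry<Node, Double> e : n.edges.entrySet()) {
                        Integer other = assignment.get(e.getKey());
                        if (other != null && other == p) localComm += e.getValue();
                    }
                    // Trade saved communication against partition load.
                    double score = localComm - load[p];
                    if (score > bestScore) { bestScore = score; best = p; }
                }
                assignment.put(n, best);
                load[best] += n.cpuWeight;
            }
            return assignment;
        }
    }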

    Parallel rendering algorithms for distributed-memory multicomputers

    Ankara: Department of Computer Engineering and Information Science and the Institute of Engineering and Science, Bilkent University, 1997. Thesis (Ph.D.) -- Bilkent University, 1997. Includes bibliographical references (leaves 166-176). Kurç, Tahsin Mertefe, Ph.D.

    Submicron Systems Architecture Project: Semiannual Technical Report

    No abstract available

    Performance evaluation of distributed crossbar switch hypermesh

    The interconnection network is one of the most crucial components in any multicomputer, as it greatly influences overall system performance. Several recent studies have suggested that hypergraph networks, such as the Distributed Crossbar Switch Hypermesh (DCSH), exhibit superior topological and performance characteristics over many traditional graph networks, e.g. k-ary n-cubes. Previous work on the DCSH has focused on implementation issues and performance comparisons with existing networks; these comparisons have so far been confined to deterministic routing and unicast (one-to-one) communication. Using analytical models validated through simulation experiments, this thesis extends that analysis to include adaptive routing and broadcast communication. The study concentrates on wormhole switching, which has been widely adopted in practical multicomputers thanks to its low buffering requirement and the reduced dependence of latency on distance under low traffic.

    Adaptive routing has recently been proposed as a means of improving network performance, but while the comparative evaluation of adaptive and deterministic routing has been widely reported in the literature, the focus has been on graph networks. The first part of this thesis deals with adaptive routing, developing an analytical model of latency in the DCSH that is used throughout the rest of the work for performance comparisons, together with an investigation of different routing algorithms in this network. Conventional k-ary n-cubes have been the underlying topology of contemporary multicomputers, but it is only recently that adaptive routing has been incorporated into such systems. The thesis studies the relative performance merits of the DCSH and k-ary n-cubes under an adaptive routing strategy, taking into consideration real-world factors such as router complexity and the bandwidth constraints imposed by implementation technology.

    However, in any network, the routing of unicast messages is not the only factor in traffic control. In many situations (for example, parallel iterative algorithms, memory update and invalidation procedures in shared-memory systems, and global notification of network errors), there is a significant requirement for broadcast traffic. The DCSH, by virtue of its use of hypergraph links, can implement broadcast operations particularly efficiently. The second part of the thesis examines how DCSH and k-ary n-cube performance is affected by the presence of a broadcast traffic component.

    In general, these studies demonstrate that, because of their relatively high diameter, k-ary n-cubes perform poorly when message lengths are short. This is consistent with earlier, more simplistic analyses which led to the proposal of the express-cube, an enhancement of the basic k-ary n-cube structure that provides additional express channels, allowing messages to bypass groups of nodes along their paths. The final part of the thesis investigates whether this "partial bypassing" can compete with the "total bypassing" capability provided inherently by the DCSH topology.
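    To make the latency argument concrete, the following Java snippet evaluates the standard no-contention wormhole latency approximation, latency ≈ D * t_r + L / B (hop count times per-hop routing delay, plus message length over channel bandwidth). This is a textbook first-order model, not the validated analytical model developed in the thesis, and the numbers in the example are purely illustrative; it simply shows why a high-diameter network is penalised most when messages are short.

    // Hedged illustration of first-order wormhole-switching latency:
    // the per-hop term dominates for short messages, so high-diameter
    // networks suffer most at small message lengths.
    public class WormholeLatency {

        static double latency(int hops, double perHopDelay, int messageFlits,
                              double flitsPerCycle) {
            return hops * perHopDelay + messageFlits / flitsPerCycle;
        }

        public static void main(String[] args) {
            // Compare a low-diameter path (2 hops) with a longer one (10 hops)
            // for short and long messages; all parameter values are made up.
            for (int flits : new int[]{8, 256}) {
                double low = latency(2, 1.0, flits, 1.0);
                double high = latency(10, 1.0, flits, 1.0);
                System.out.printf("message=%d flits: 2 hops -> %.0f cycles, "
                        + "10 hops -> %.0f cycles%n", flits, low, high);
            }
        }
    }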

    Static allocation of computation to processors in multicomputers


    Reactive-Process Programming and Distributed Discrete-Event Simulation

    The same forces that spurred the development of multicomputers - the demand for better performance and economy - are driving the evolution of multicomputers in the direction of more abundant and less expensive computing nodes - the direction of fine-grain multicomputers. This evolution in multicomputer architecture derives from advances in integrated circuit, packaging, and message-routing technologies, and carries far-reaching implications for programming and applications. This thesis pursues that trend with a balanced treatment of multicomputer programming and applications. First, a reactive-process programming system - Reactive-C - is investigated; then, a model application - discrete-event simulation - is developed; finally, a number of logic-circuit simulators written in the Reactive-C notation are evaluated.

    One difficulty in multicomputer applications is the inefficiency of many distributed algorithms compared to their sequential counterparts. When better formulations are developed, they often scale poorly with increasing numbers of nodes, and their beneficial effects eventually vanish when many nodes are used. However, the rules for programming are quite different when nodes are plentiful and cheap: the primary concern is to utilize all of the concurrency available in an application, rather than to utilize all of the computing cycles available in a machine.

    We have shown in our research that it is possible to extract the maximum concurrency of a simulation subject, even one as difficult as a logic circuit, when one simulation element is assigned to each node. Despite the initial inefficiency of a straightforward algorithm, as the number of nodes increases the computation time decreases linearly until there are only a few elements in each node. We conclude by suggesting a technique to further increase the available concurrency when there are many more nodes than simulation elements.
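    The sketch below illustrates the "one simulation element per node" idea in Java rather than the Reactive-C notation used in the thesis: each gate is a reactive element that consumes timestamped input events and emits an output event (after a fixed gate delay) only when its output value changes. Message transport between nodes is abstracted away, and all names are hypothetical.

    import java.util.*;

    // Hypothetical reactive simulation element: one logic gate that reacts
    // to incoming timestamped events and produces at most one output event
    // per input, mirroring a message-driven, element-per-node simulator.
    public class ReactiveGate {

        record Event(String signal, boolean value, long time) {}

        private final Map<String, Boolean> inputs = new HashMap<>();
        private final long gateDelay;
        private boolean lastOutput;

        ReactiveGate(Collection<String> inputNames, long gateDelay) {
            for (String s : inputNames) inputs.put(s, false);
            this.gateDelay = gateDelay;
            this.lastOutput = evaluate();
        }

        /** React to one incoming event; return the output event it triggers,
         *  or an empty Optional if the gate's output did not change. */
        Optional<Event> react(Event in) {
            inputs.put(in.signal(), in.value());
            boolean out = evaluate();
            if (out == lastOutput) return Optional.empty();
            lastOutput = out;
            return Optional.of(new Event("out", out, in.time() + gateDelay));
        }

        /** AND gate as a stand-in for an arbitrary logic function. */
        private boolean evaluate() {
            return inputs.values().stream().allMatch(Boolean::booleanValue);
        }
    }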