13 research outputs found

    Efficient processor management strategies for multicomputer systems

    Get PDF
    Multicomputers are cost-effective alternatives to the conventional supercomputers. Contemporary processor management schemes tend to underutilize the processors and leave many of the processors in the system idle while jobs are waiting for execution;Instead of designing faster processors or interconnection networks, a substantial performance improvement can be obtained by implementing better processor management strategies. This dissertation studies the performance issues related to the processor management schemes and proposes several ways to enhance the multicomputer systems by means of processor management. The proposed schemes incorporate the concepts of size-reduction, non-contiguous allocation, as well as job migration. Job scheduling using a bypass-queue is also studied. All the proposed schemes are proven effective in improving the system performance via extensive simulations. Each proposed scheme has different implementation cost and constraints. In order to take advantage of these schemes, judicious selection of system parameters is important and is discussed

    Design and analysis of a 3-dimensional cluster multicomputer architecture using optical interconnection for petaFLOP computing

    Get PDF
    In this dissertation, the design and analyses of an extremely scalable distributed multicomputer architecture, using optical interconnects, that has the potential to deliver in the order of petaFLOP performance is presented in detail. The design takes advantage of optical technologies, harnessing the features inherent in optics, to produce a 3D stack that implements efficiently a large, fully connected system of nodes forming a true 3D architecture. To adopt optics in large-scale multiprocessor cluster systems, efficient routing and scheduling techniques are needed. To this end, novel self-routing strategies for all-optical packet switched networks and on-line scheduling methods that can result in collision free communication and achieve real time operation in high-speed multiprocessor systems are proposed. The system is designed to allow failed/faulty nodes to stay in place without appreciable performance degradation. The approach is to develop a dynamic communication environment that will be able to effectively adapt and evolve with a high density of missing units or nodes. A joint CPU/bandwidth controller that maximizes the resource allocation in this dynamic computing environment is introduced with an objective to optimize the distributed cluster architecture, preventing performance/system degradation in the presence of failed/faulty nodes. A thorough analysis, feasibility study and description of the characteristics of a 3-Dimensional multicomputer system capable of achieving 100 teraFLOP performance is discussed in detail. Included in this dissertation is throughput analysis of the routing schemes, using methods from discrete-time queuing systems and computer simulation results for the different proposed algorithms. A prototype of the 3D architecture proposed is built and a test bed developed to obtain experimental results to further prove the feasibility of the design, validate initial assumptions, algorithms, simulations and the optimized distributed resource allocation scheme. Finally, as a prelude to further research, an efficient data routing strategy for highly scalable distributed mobile multiprocessor networks is introduced

    Mechanisms for efficient, protected messaging

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 143-149).by Whay Sing Lee.Ph.D

    An optimized hardware architecture and communication protocol for scheduled communication

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.Includes bibliographical references (p. 173-177).by David Shoemaker.Ph.D

    Design and evaluation of the Hamal parallel computer

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2003."December 2002."Includes bibliographical references (p. 145-152).This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new set of design challenges. Many problems must be addressed by an architecture in order for it to be successful; of these, we focus on three in particular. First, a scalable memory system is required. Second, the network messaging protocol must be fault-tolerant. Third, the overheads of thread creation, thread management and synchronization must be extremely low. This thesis presents the complete system design for Hamal, a shared-memory architecture which addresses these concerns and is directly scalable to one million nodes. Virtual memory and distributed objects are implemented in a manner that requires neither inter-node synchronization nor the storage of globally coherent translations at each node. We develop a lightweight fault-tolerant messaging protocol that guarantees message delivery and idempotence across a discarding network. A number of hardware mechanisms provide efficient support for massive multithreading and fine-grained synchronization.(cont.) Experiments are conducted in simulation, using a trace-driven network simulator to investigate the messaging protocol and a cycle-accurate simulator to evaluate the Hamal architecture. We determine implementation parameters for the messaging protocol which optimize performance. A discarding network is easier to design and can be clocked at a higher rate, and we find that with this protocol its performance can approach that of a non-discarding network. Our simulations of Hamal demonstrate the effectiveness of its thread management and synchronization primitives. In particular, we find register-based synchronization to be an extremely efficient mechanism which can be used to implement a software barrier with a latency of only 523 cycles on a 512 node machine.by J.B. Grossman.Ph.D

    Design and Evaluation of the Hamal Parallel Computer

    Get PDF
    Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new set of design challenges. Many problems must be addressed by an architecture in order for it to be successful; of these, we focus on three in particular. First, a scalable memory system is required. Second, the network messaging protocol must be fault-tolerant. Third, the overheads of thread creation, thread management and synchronization must be extremely low. This thesis presents the complete system design for Hamal, a shared-memory architecture which addresses these concerns and is directly scalable to one million nodes. Virtual memory and distributed objects are implemented in a manner that requires neither inter-node synchronization nor the storage of globally coherent translations at each node. We develop a lightweight fault-tolerant messaging protocol that guarantees message delivery and idempotence across a discarding network. A number of hardware mechanisms provide efficient support for massive multithreading and fine-grained synchronization. Experiments are conducted in simulation, using a trace-driven network simulator to investigate the messaging protocol and a cycle-accurate simulator to evaluate the Hamal architecture. We determine implementation parameters for the messaging protocol which optimize performance. A discarding network is easier to design and can be clocked at a higher rate, and we find that with this protocol its performance can approach that of a non-discarding network. Our simulations of Hamal demonstrate the effectiveness of its thread management and synchronization primitives. In particular, we find register-based synchronization to be an extremely efficient mechanism which can be used to implement a software barrier with a latency of only 523 cycles on a 512 node machine

    Tiled microprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D
    corecore