51 research outputs found

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 165-170).This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow. In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern.(cont.) Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler. To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.by Christopher Francis Batten.Ph.D

    SL: a "quick and dirty" but working intermediate language for SVP systems

    Get PDF
    The CSA group at the University of Amsterdam has developed SVP, a framework to manage and program many-core and hardware multithreaded processors. In this article, we introduce the intermediate language SL, a common vehicle to program SVP platforms. SL is designed as an extension to the standard C language (ISO C99/C11). It includes primitive constructs to bulk create threads, bulk synchronize on termination of threads, and communicate using word-sized dataflow channels between threads. It is intended for use as target language for higher-level parallelizing compilers. SL is a research vehicle; as of this writing, it is the only interface language to program a main SVP platform, the new Microgrid chip architecture. This article provides an overview of the language, to complement a detailed specification available separately.Comment: 22 pages, 3 figures, 18 listings, 1 tabl

    Microgrid - The microthreaded many-core architecture

    Full text link
    Traditional processors use the von Neumann execution model, some other processors in the past have used the dataflow execution model. A combination of von Neuman model and dataflow model is also tried in the past and the resultant model is referred as hybrid dataflow execution model. We describe a hybrid dataflow model known as the microthreading. It provides constructs for creation, synchronization and communication between threads in an intermediate language. The microthreading model is an abstract programming and machine model for many-core architecture. A particular instance of this model is named as the microthreaded architecture or the Microgrid. This architecture implements all the concurrency constructs of the microthreading model in the hardware with the management of these constructs in the hardware.Comment: 30 pages, 16 figure

    Multiphasic Growth Factor Release from Fibrin Microthreads

    Get PDF
    Biomaterial scaffolds are used to aid in regeneration of tissue for volumetric muscle loss (VML) where natural regeneration fails. Fibrin microthreads are beneficial due to their mechanical properties, uniaxial cell alignment, and low cytotoxicity. We designed a composite system of a microthread bundle coated with a hydrogel to meet the unmet need of delivering two factors in different time domains to mimic natural healing and assist the regeneration process. Release from the microthreads and hydrogels was quantified with a protein assay and microthread release was validated semi-qualitatively with fluorescence microscopy. Results demonstrated that we created a dual factor, multi-phase release system to aid in skeletal muscle regeneration to potentially heal patients with VML injuries

    PIKA: A Network Service for Multikernel Operating Systems

    Get PDF
    PIKA is a network stack designed for multikernel operating systems that target potential future architectures lacking cache-coherent shared memory but supporting message passing. PIKA splits the network stack into several servers that communicate using a low-overhead message passing layer. A key challenge faced by PIKA is the maintenance of shared state, such as a single accept queue and load balance information. PIKA addresses this challenge using a speculative 3-way handshake for connection acceptance, and a new distributed load balancing scheme for spreading connections. A PIKA prototype achieves competitive performance, excellent scalability, and low service times under load imbalance on commodity hardware. Finally, we demonstrate that splitting network stack processing by function across separate cores is a net loss on commodity hardware, and we describe conditions under which it may be advantageous

    Exploiting parallelism in n-D convex hull algorithms

    Get PDF
    PhD ThesisThe convex hull is a problem of primary importance because of its applications in computational geometry. A number of sequential and parallel algorithms for computing the convex hull of a finite set of points in the lower dimensions are known. In compar- ison, the general n-D problem is not as well understood and parallel algorithms are not so prevalent because the 2-D and 3-D methods are not easily extended to the general case. This thesis presents parallel algorithms for evaluating the general n- D convex hull problem (where 2-D and 3-D are special cases) using Swart's sequential algorithm. One of our methods combines a gift-wrapping technique with partitioning and merge algorithms > where the original list is split into p 1 partitions followed by the computation of the subhulls using the sequential n-D gift-wrapping method. The partial hulls are then combined using a fanin tree. The second method computes the convex hull in parallel by wrapping around the edges until a complete facial lattice structure of the polytope is generated. Several parameterised versions of the proposed algorithms have been implemented on the shared memory and message passing architectures. In the former, performance on an Encore Multimax using Encore Parallel Threads and the more lightweight Microthread programming utilities are examined. In the latter, performance on a transputer based machine using CS- Tools is discussed. We have shown that our techniques will be useful in the construction of faster algorithms which employ the n-D convex hull algorithms as a sub-algorithmCommonwealth Scholarship Commission in the United Kingdo

    An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor

    Get PDF
    Over the years, the von Neumann model of computing has undergone many enhancements. These changes include an improved memory hierarchy, multiple instruction issue and branch predic tion. Since the model\u27s introduction, the performance of processors has increased at a much greater rate than that of memory. Several modifications to hide this ever widening gap in performance are being examined in current research. A very promising one is the Simultaneous Multithreaded processor. This architecture strives to further reduce the effects of long latency instructions, such as memory accesses, by allowing multiple threads of execution to be active in the processor at the same time. With the introduction of multiple active threads in a single processor, several new aspects of processor operation can have a sizeable effect on performance. One such aspect is how to choose from which thread to fetch instructions during the next cycle. For this project, three different classes of fetch scheduling mechanisms were defined and exam ples of each were either studied or proposed. The proposed mechanisms were then tested using a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading sim ulator based on the Simple Scalar tool set from the University of Wisconsin-Madison. With the proper configuration, each of the proposed mechanisms improved the performance of the simulated architecture. However, the best increase in performance was produced by the Event History Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling mechanism only 0.070% of the time
    • …
    corecore