
    Parametric data-parallel architectures for TLM acceleration

    We discuss the architecture and microarchitecture of a scalable, parametric vector accelerator for the transmission line modelling (TLM) algorithm. Architecture-level experimentation demonstrates an order-of-magnitude complexity reduction at a vector length of sixteen 32-bit single-precision elements. We envisage the proposed architecture being replicated in an SoC environment, thus forming a multiprocessor system capable of tapping parallelism at both the thread level and the data level.
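    The scatter stage of the TLM update is the part that vectorizes naturally: every node applies the same small scattering matrix to its incident voltages. Below is a hedged illustration, not the paper's implementation: the classic lossless 2D shunt-node scatter step written as a flat loop over nodes, so each iteration maps onto one vector lane. The array names and structure-of-arrays layout are assumptions.

```c
/* Hedged sketch: the classic lossless 2D TLM shunt-node scatter step,
 * S = (1/2)(ones(4) - 2I), written as a flat loop so a vector unit can
 * process many nodes per instruction. Names and layout are illustrative. */
#include <stddef.h>

void tlm_scatter(size_t n_nodes,
                 const float *v1, const float *v2,
                 const float *v3, const float *v4,
                 float *r1, float *r2, float *r3, float *r4)
{
    for (size_t i = 0; i < n_nodes; ++i) {
        /* Reflected voltage on each port: r_k = (sum of incidents)/2 - v_k. */
        float half_sum = 0.5f * (v1[i] + v2[i] + v3[i] + v4[i]);
        r1[i] = half_sum - v1[i];
        r2[i] = half_sum - v2[i];
        r3[i] = half_sum - v3[i];
        r4[i] = half_sum - v4[i];
    }
}
```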

    Scale Control Processor Test-Chip

    We are investigating vector-thread architectures, which provide competitive performance and efficiency across a broad class of application domains. Vector-thread architectures unify data-level, thread-level, and instruction-level parallelism, providing new ways of parallelizing codes that are difficult to vectorize or that incur excessive synchronization costs when multithreaded. To illustrate these ideas we have developed the Scale processor, an example of a vector-thread architecture designed for low-power and high-performance embedded systems. The prototype includes a single-issue 32-bit RISC control processor, a vector-thread unit which supports up to 128 virtual processor threads and can execute up to 16 instructions per cycle, and a 32 KB shared primary cache.

    Since the Scale vector-thread processor is a large and complex design (especially for an academic project), we first designed and fabricated the Scale Test Chip (STC1). STC1 includes a simplified version of the Scale control processor, 8 KB of RAM, a host interface, and a custom clock generator. STC1 helped mitigate the risk involved in fabricating the full Scale chip in several ways. First, we were able to establish and test our CAD toolflow, which included several custom tools that had not previously been used in any tapeout. Second, we were able to better characterize our target package and process: for example, STC1 enabled us to better correlate the static timing numbers from our CAD tools with actual silicon, and to characterize the expected rise/fall times of our external signal pins. Finally, STC1 allowed us to test our custom clock generator. We used our experiences with STC1 to help us implement the Scale vector-thread processor. Scale was taped out on October 15, 2006, and is currently being fabricated through MOSIS. This report discusses the fabrication of STC1 and presents power and performance results.

    A system-on-chip vector multiprocessor for transmission line modelling acceleration

    We discuss a configurable, system-on-chip vector multiprocessor for accelerating the Transmission Line Modeling (TLM) algorithm, with an architecture capable of exploiting the two primary forms of parallelism in the code: thread-level and data-level parallelism. Theoretical results demonstrate an order-of-magnitude reduction in the dynamic instruction count for a scalar-processor/vector-coprocessor configuration at a vector length of sixteen 32-bit single-precision elements. Furthermore, a multi-vector SoC architecture consisting of ten such vector accelerators provides a near-linear theoretical performance benefit on the order of 88% in three out of four benchmark configurations, a benefit orthogonal to that realized by vectorization alone. We discuss this potent architecture in detail and present implementation data for the 2-way multiprocessor VLSI macrocell.
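    To make the thread-plus-data-level split concrete, here is a hedged software analogue, not the paper's SoC design: OpenMP threads stand in for the vector accelerators, and each thread runs the vectorizable scatter loop from the previous sketch over its own slice of the mesh. The chunking scheme and the tlm_scatter prototype are illustrative assumptions.

```c
/* Hedged sketch: thread-level parallelism across "accelerators" (OpenMP
 * threads) layered on the data-level parallelism inside tlm_scatter().
 * The even chunking scheme is an illustrative assumption. */
#include <stddef.h>

/* The vectorizable scatter loop from the previous sketch. */
void tlm_scatter(size_t n_nodes,
                 const float *v1, const float *v2,
                 const float *v3, const float *v4,
                 float *r1, float *r2, float *r3, float *r4);

void tlm_scatter_mt(size_t n_nodes, int n_accel,
                    const float *v1, const float *v2,
                    const float *v3, const float *v4,
                    float *r1, float *r2, float *r3, float *r4)
{
    size_t chunk = (n_nodes + (size_t)n_accel - 1) / (size_t)n_accel;
    #pragma omp parallel for
    for (int t = 0; t < n_accel; ++t) {
        size_t lo = (size_t)t * chunk;
        if (lo >= n_nodes)
            continue;                 /* nothing left for this thread */
        size_t hi = lo + chunk < n_nodes ? lo + chunk : n_nodes;
        /* Each "accelerator" scatters its own slice of the mesh. */
        tlm_scatter(hi - lo, v1 + lo, v2 + lo, v3 + lo, v4 + lo,
                    r1 + lo, r2 + lo, r3 + lo, r4 + lo);
    }
}
```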

    Simplified vector-thread architectures for flexible and efficient data-parallel accelerators

    Thesis (Ph.D.), Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010; by Christopher Francis Batten. Includes bibliographical references (p. 165-170).

    This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators: the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability of each microthread to direct its own control flow.

    In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern. Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control-processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to handle both regular and irregular DLP more efficiently than previous VT architectures. I also propose an explicitly data-parallel VT programming methodology based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler.

    To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.
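    The regular/irregular DLP distinction the thesis draws can be seen in a loop as small as the one below. This plain-C sketch illustrates only the workload, not the Maven ISA: in VT terms, the control thread would vector-load x[], vector-fetch the loop body to all microthreads, and each microthread would resolve the data-dependent branch on its own, where a pure vector-SIMD machine would need masking.

```c
/* Hedged sketch of the workload only, not the Maven ISA: a loop mixing
 * regular DLP (identical arithmetic on every element) with irregular DLP
 * (a data-dependent branch each microthread would resolve itself). */
#include <stddef.h>

void saturate(size_t n, const int *x, int *y, int limit)
{
    for (size_t i = 0; i < n; ++i) {
        int v = 2 * x[i];   /* regular: same work per element */
        if (v > limit)      /* irregular: data-dependent control flow;
                               vector-SIMD must mask it, a VT microthread
                               can simply branch */
            v = limit;
        y[i] = v;
    }
}
```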

    Breadth First Search Vectorization on the Intel Xeon Phi

    Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large-scale analysis of information in a variety of applications, including social networks, graph databases, and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize BFS. However, due to the irregular memory access patterns and the unstructured nature of large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather functions. Given its potential benefits, graph traversal on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm; the techniques also apply to the current state-of-the-art hybrid (top-down plus bottom-up) algorithms. We focus on exploiting the vector unit by developing an improved, highly vectorized OpenMP-parallel algorithm using vector intrinsics, and on understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS on the version of the Xeon Phi used in our experiments. The vectorized top-down BFS source code presented in this paper is available on request as free-to-use software.
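    For orientation, here is a hedged sketch of the level-synchronous top-down BFS that such work starts from, in plain C over a CSR graph; the row_ptr/col_idx arrays and frontier-swapping scheme are standard choices, not taken from the paper. The comments mark where the Xeon Phi's gather, compare-mask, and scatter operations would apply; the intrinsics themselves are omitted.

```c
/* Hedged sketch: level-synchronous top-down BFS over a CSR graph.
 * Comments mark where 512-bit gathers/scatters would apply; not the
 * paper's code. Distances are written into dist[], -1 = unreached. */
#include <stdlib.h>

void bfs_top_down(int n, const int *row_ptr, const int *col_idx,
                  int src, int *dist)
{
    int *frontier = malloc((size_t)n * sizeof *frontier);
    int *next     = malloc((size_t)n * sizeof *next);
    for (int i = 0; i < n; ++i)
        dist[i] = -1;
    dist[src] = 0;
    frontier[0] = src;
    int fsize = 1, level = 0;

    while (fsize > 0) {
        int nsize = 0;
        for (int f = 0; f < fsize; ++f) {      /* OpenMP-parallel in practice */
            int u = frontier[f];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = col_idx[e];
                if (dist[v] < 0) {             /* gather dist[v], compare-mask */
                    dist[v] = level + 1;       /* scatter the new distance     */
                    next[nsize++] = v;         /* compress surviving vertices  */
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        fsize = nsize;
        ++level;
    }
    free(frontier);
    free(next);
}
```

    The hybrid algorithms the paper mentions keep this top-down loop for small frontiers and flip the direction for large ones, scanning unreached vertices for any parent in the frontier instead.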

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high-performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore, with multiple memory/processor nodes on each chip, and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
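    To give the message-driven, split-transaction idea some shape, here is a heavily hedged, hypothetical illustration in C, not the MIND design: a "parcel" names an action and a target location, and a node drains its inbound queue, running each parcel's handler against local memory, so the work happens where the data lives and the sender never blocks waiting for a reply.

```c
/* Hypothetical illustration of message-driven execution, not the MIND
 * design: parcels carry an action, a target, and an operand; a node's
 * dispatch loop runs each handler against local memory. */
#include <stdint.h>

typedef struct parcel {
    int      action;    /* index into the handler table            */
    int64_t *target;    /* memory word the parcel operates on       */
    int64_t  operand;   /* payload carried with the message         */
} parcel;

typedef void (*handler_fn)(int64_t *target, int64_t operand);

/* Example handler: an add performed at the memory, split-transaction
 * style -- the sender issued the parcel and moved on. */
static void h_add(int64_t *target, int64_t operand)
{
    *target += operand;
}

static handler_fn handlers[] = { h_add };

/* One node's dispatch loop: drain the inbound parcel queue. */
void run_node(parcel *queue, int n_parcels)
{
    for (int i = 0; i < n_parcels; ++i)
        handlers[queue[i].action](queue[i].target, queue[i].operand);
}
```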