
    Tiled microprocessors

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 251-258).
    Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to hundreds or thousands of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes the AsTrO taxonomy for categorizing SONs. To develop these ideas, the thesis details the design, implementation, and analysis of a tiled microprocessor prototype, the Raw microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile, exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.
    by Michael Bedford Taylor. Ph.D.
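    The 5-tuple metric is the abstract's most concrete technical contribution, so a small model helps make it tangible. The sketch below totals the five cost components of transporting one operand between remote ALUs; the component names and the illustrative numbers are assumptions based on the SON literature, not figures quoted from the thesis.

```python
from dataclasses import dataclass

@dataclass
class SonCost:
    """Minimal 5-tuple cost model for a Scalar Operand Network (SON).

    Occupancy is cycles an ALU is kept busy; latency is cycles the
    operand spends in flight. The breakdown below is an assumption in
    the spirit of the SON literature, not the thesis's exact definition.
    """
    send_occupancy: int       # cycles the sender ALU spends injecting the operand
    send_latency: int         # cycles from injection until the operand enters the network
    network_hop_latency: int  # cycles per router-to-router hop
    receive_latency: int      # cycles from network exit to the receiver's bypass
    receive_occupancy: int    # cycles the receiver ALU spends pulling the operand

    def operand_cost(self, hops: int) -> int:
        """End-to-end cycles to move one operand between ALUs `hops` apart."""
        return (self.send_occupancy + self.send_latency
                + hops * self.network_hop_latency
                + self.receive_latency + self.receive_occupancy)

# Illustrative numbers only: static SONs like Raw's are usually
# summarized as very low-occupancy.
raw_like = SonCost(0, 1, 1, 1, 0)
print(raw_like.operand_cost(hops=3))  # -> 5 cycles for a 3-hop transfer
```

    The point of separating the five terms is that occupancy and latency trade off differently: a network with high send occupancy throttles the sending ALU even when the destination is only a hop away.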

    Tile size selection for low-power tile-based architectures

    In this paper, we investigate the power implications of tile size selection for tile-based processors. We refer to this investigation as a tile granularity study. This is accomplished by distilling the architectural cost of tiles with different computational widths into a system metric we call the Granularity Indicator (GI). The GI is then compared against the communication exposed when algorithms are partitioned across multiple tiles. Through this comparison, the tile granularity that best fits a given set of algorithms can be determined, reducing the system power for that set of algorithms. When the GI analysis is applied to the Synchroscalar tile architecture [1], we find that Synchroscalar's already low power consumption can be further reduced by 14% when customized for execution of the 802.11a receiver. In addition, the GI can also be used to evaluate tile size when considering multiple applications simultaneously, providing a convenient platform for hardware-software co-design.
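    The abstract does not give the GI formula, so the following is a purely hypothetical sketch of how such a granularity study could play out: an assumed per-tile cost that grows with computational width is weighed against the communication exposed by partitioning, and the width minimizing the combined figure is selected. Every function and constant here is illustrative, not taken from the paper.

```python
# Hypothetical tile-granularity comparison in the spirit of the
# Granularity Indicator (GI). `tile_cost` and the communication model
# are assumptions for illustration only.

def tile_cost(width: int) -> float:
    """Assumed architectural cost of one tile of computational `width`
    (e.g., area/energy growing super-linearly with width)."""
    return width ** 1.5

def system_power(width: int, total_work: int, comm_exposed) -> float:
    """Relative power: compute cost across tiles plus the communication
    an algorithm exposes when partitioned onto tiles of this width."""
    n_tiles = -(-total_work // width)   # ceiling division
    return n_tiles * tile_cost(width) + comm_exposed(n_tiles)

# Communication exposed by a hypothetical partitioning: more tiles
# means more inter-tile traffic.
comm = lambda n_tiles: 4.0 * max(n_tiles - 1, 0)

best = min([1, 2, 4, 8, 16], key=lambda w: system_power(w, 16, comm))
print(best)  # the tile width that minimizes the combined metric
```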

    The "MIND" Scalable PIM Architecture

    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real-time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
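    A rough software analogy of the execution model named in the abstract: in message-driven, split-transaction processing, work travels as a message ("parcel" in the PIM literature) to the node owning the data, and replies arrive as further messages rather than by blocking the requester. The class below is a minimal sketch of that idea; all names are ours, not MIND's, and a real MIND node would realize this in logic next to its local DRAM.

```python
import queue, threading, time

class Node:
    """Toy PIM node: a local memory slice plus a message-driven event loop."""
    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = memory          # this node's local DRAM slice
        self.inbox = queue.Queue()    # incoming parcels
        self.peers = {}               # node_id -> Node

    def send(self, dest, handler, **payload):
        # Fire-and-forget: the sender never blocks on a remote access.
        self.peers[dest].inbox.put((handler, payload))

    def run(self):
        while True:
            handler, p = self.inbox.get()   # message-driven dispatch
            if handler == "load":
                # Split transaction: the reply is itself a parcel, so this
                # node stays free to serve other requests in the meantime.
                self.send(p["reply_to"], "loaded",
                          addr=p["addr"], value=self.memory[p["addr"]])
            elif handler == "loaded":
                print(f"node {self.node_id}: mem[{p['addr']:#x}] = {p['value']}")

a, b = Node(0, {}), Node(1, {0x10: 42})
a.peers[1], b.peers[0] = b, a
for n in (a, b):
    threading.Thread(target=n.run, daemon=True).start()
a.send(1, "load", addr=0x10, reply_to=0)   # remote load without blocking
time.sleep(0.1)                            # let the reply parcel arrive
```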

    Resource allocation and scalability in dynamic wavelength-routed optical networks.

    This thesis investigates the potential benefits of dynamic operation of wavelength-routed optical networks (WRONs) compared to the static approach. It is widely believed that dynamic operation of WRONs would overcome the inefficiencies of static allocation by improving resource use. By rapidly allocating resources only when and where required, dynamic networks could potentially provide the same service as static networks but at decreased cost, a prospect very attractive to network operators. This hypothesis, however, has not been verified. It is therefore the focus of this thesis to investigate whether dynamic operation of WRONs can save a significant number of wavelengths compared to the static approach whilst maintaining acceptable levels of delay and scalability. Firstly, the wavelength-routed optical-burst-switching (WR-OBS) network architecture is selected as the dynamic architecture to be studied, owing to its feasibility of implementation and its improved network performance. Then, the wavelength requirements of dynamic WR-OBS are evaluated by means of novel analysis and simulation and compared to those of static networks for uniform and non-uniform traffic demand. It is shown that dynamic WR-OBS saves wavelengths with respect to the static approach only at low loads, especially for sparsely connected networks, and that wavelength conversion is a key capability for significantly increasing the benefits of dynamic operation. The mean delay introduced by dynamic operation of WR-OBS is then assessed. The results show that the extra delay is not so significant as to violate the end-to-end limits of time-sensitive applications. Finally, the scalability limits of WR-OBS as a function of the computational complexity of the lightpath allocation algorithm are studied. The trade-off between request processing time and blocking probability is investigated, and a new low-blocking and scalable lightpath allocation algorithm that improves this trade-off is proposed. The presented algorithms and results can be used in the analysis and design of dynamic WRONs.
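    The thesis's own allocation algorithm is not reproduced in the abstract, so as a concrete baseline, here is classic first-fit wavelength assignment under the wavelength-continuity constraint: without converters, a lightpath must occupy the same wavelength on every link of its route, and a request blocks when no wavelength is free end-to-end. This is exactly the restriction that wavelength conversion relaxes.

```python
# First-fit wavelength assignment with the wavelength-continuity
# constraint (no converters). Link names and capacities are illustrative.

def first_fit(route_links, free, n_wavelengths):
    """route_links: list of link ids; free: {link: set of free wavelengths}.
    Returns the assigned wavelength, or None if the request is blocked."""
    for w in range(n_wavelengths):                 # lowest index first
        if all(w in free[link] for link in route_links):
            for link in route_links:
                free[link].discard(w)              # allocate on every hop
            return w
    return None                                    # blocked

W = 4
free = {link: set(range(W)) for link in ["a-b", "b-c", "c-d"]}
print(first_fit(["a-b", "b-c"], free, W))  # -> 0
print(first_fit(["b-c", "c-d"], free, W))  # -> 1 (wavelength 0 busy on b-c)
```

    With converters, each link can be assigned independently, which removes the end-to-end continuity requirement and is why conversion markedly increases the benefit of dynamic operation.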

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

    Flip: Data-Centric Edge CGRA Accelerator

    Coarse-Grained Reconfigurable Arrays (CGRAs) are promising edge accelerators due to their outstanding balance of flexibility, performance, and energy efficiency. Classic CGRAs statically map compute operations onto processing elements (PEs) and route the data dependencies among the operations through the Network-on-Chip. However, CGRAs are designed for fine-grained static instruction-level parallelism and struggle to accelerate applications with dynamic and irregular data-level parallelism, such as graph processing. To address this limitation, we present Flip, a novel accelerator that enhances traditional CGRA architectures to boost the performance of graph applications. Flip retains the classic CGRA execution model while introducing a special data-centric mode for efficient graph processing. Specifically, it exploits the natural data parallelism of graph algorithms by mapping graph vertices onto PEs rather than operations, and by supporting dynamic routing of temporary data according to the runtime evolution of the graph frontier. Experimental results demonstrate that Flip achieves up to 36x speedup with merely 19% more area compared to classic CGRAs. Compared to state-of-the-art large-scale graph processors, Flip has similar energy efficiency and 2.2x better area efficiency at a much-reduced power/area budget.
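    To make the data-centric mode concrete, here is a software analogy of mapping vertices onto PEs and routing the frontier at runtime: a frontier-driven BFS whose active vertices are bucketed to their owning PE on each step. The vertex-to-PE hash and the sequential "PE loop" are illustrative stand-ins for Flip's hardware, not its implementation.

```python
# Frontier-driven BFS with vertices statically owned by PEs and the
# active frontier dynamically routed to owners each iteration.

def bfs_over_pes(adj, source, n_pes=4):
    owner = lambda v: v % n_pes            # assumed vertex-to-PE mapping
    dist = {source: 0}
    frontier = [source]
    while frontier:
        # Dynamic routing: bucket the current frontier by owning PE.
        buckets = [[] for _ in range(n_pes)]
        for v in frontier:
            buckets[owner(v)].append(v)
        next_frontier = []
        for pe, verts in enumerate(buckets):   # each "PE" drains its bucket
            for v in verts:
                for u in adj.get(v, []):
                    if u not in dist:
                        dist[u] = dist[v] + 1
                        next_frontier.append(u)
        frontier = next_frontier
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(bfs_over_pes(adj, source=0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```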

    Performance, scalability, and flexibility in the RAW network router

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 46).
    Conventional high-speed Internet routers are built using custom-designed microprocessors, dubbed network processors, to efficiently handle the task of packet routing. While capable of meeting the performance demanded of them, these custom network processors generally lack the flexibility to incorporate new features and do not scale well beyond the configurations for which they were designed. Furthermore, they tend to suffer from long and costly development cycles, since each new generation must be redesigned to support new features and fabricated anew in hardware. This thesis presents a new design for a network processor, one implemented entirely in software on a tiled, general-purpose microprocessor. The network processor is implemented on the Raw microprocessor, a general-purpose microchip developed by the Computer Architecture Group at MIT. The Raw chip consists of sixteen identical processing tiles arranged in a four-by-four matrix and connected by four inter-tile communication networks; the Raw chip is designed to scale up merely by adding more tiles to the matrix. By taking advantage of the parallelism inherent in the task of packet forwarding on this inherently parallel microprocessor, the Raw network processor is able to achieve performance that matches or exceeds that of commercially available custom-designed network processors. At the same time, it maintains the flexibility to incorporate new features, since it is implemented entirely in software, as well as the scalability to handle more ports by simply adding more tiles to the microprocessor.
    by Anthony M. DeGangi. M.Eng.
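    Packet forwarding parallelizes naturally because packets are largely independent of one another, which is what lets a software router spread work across tiles. The sketch below fans a batch of packets out to worker threads standing in for tiles, each performing a toy longest-prefix-match lookup; the table, addresses, and thread model are illustrative assumptions, not the thesis's actual pipeline.

```python
# Toy longest-prefix-match forwarding with independent packets fanned
# out across workers ("tiles"). Table entries: (prefix, prefix_len, port).
from concurrent.futures import ThreadPoolExecutor

TABLE = [("10.0.0.0", 8, "port1"), ("10.1.0.0", 16, "port2"),
         ("0.0.0.0", 0, "port0")]   # last entry is the default route

def ip_to_int(ip):
    a, b, c, d = map(int, ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def lookup(dst):
    """Return the output port of the longest matching prefix."""
    addr, best = ip_to_int(dst), None
    for prefix, plen, port in TABLE:
        mask = 0 if plen == 0 else (~0 << (32 - plen)) & 0xFFFFFFFF
        if (addr & mask) == (ip_to_int(prefix) & mask):
            if best is None or plen > best[0]:
                best = (plen, port)
    return best[1]

packets = ["10.1.2.3", "10.0.9.9", "192.168.1.1"]
with ThreadPoolExecutor(max_workers=4) as tiles:   # one worker per "tile"
    print(list(tiles.map(lookup, packets)))        # ['port2', 'port1', 'port0']
```

    Because each lookup touches only read-only state, adding workers (or tiles) scales throughput without synchronization, which mirrors the thesis's argument for scaling port count by adding tiles.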