1,139 research outputs found

    Modelling, Synthesis, and Configuration of Networks-on-Chips


    ATAC: A Manycore Processor with On-Chip Optical Network

    Ever since industry turned to parallelism instead of frequency scaling to improve processor performance, multicore processors have continued to scale to larger and larger numbers of cores. Some believe that multicores will have 1000 cores or more by the middle of the next decade. However, their promise of increased performance will only be reached if their inherent scaling and programming challenges are overcome. Meanwhile, recent advances in nanophotonic device manufacturing are making chip-stack optics a reality: an interconnect technology that can provide significantly more bandwidth at lower power than conventional electrical analogs. Perhaps more importantly, optical interconnect also has the potential to enable new, easy-to-use programming models built on an inexpensive broadcast mechanism. This paper introduces ATAC, a new manycore architecture that capitalizes on the recent advances in optics to address a number of the challenges that future manycore designs will face. The new constraints and opportunities associated with on-chip optical interconnect are presented and explored in the design of ATAC. Furthermore, this paper introduces ACKwise, a novel directory-based cache coherence protocol that takes advantage of the special properties of ATAC to achieve high performance and scalability on large-scale manycores. Early performance results show that a 1000-core ATAC chip achieves a speedup of as much as 39% when compared with a similarly sized manycore with an electrical mesh network.

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011 (in conjunction with HPCA-17)

    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.

    PASoC: A Predictable Accelerator Rich SoC for Safety-Critical Systems

    This thesis presents a model of a Predictable Accelerator-rich System-on-Chip (PASoC) for safety-critical systems, which guarantees timing predictability of a memory access in the system. Accelerator-rich SoCs were first adopted for general-purpose computing, so the timing predictability of such systems has not been well explored, despite their use in safety-critical systems. This thesis takes initial steps in exploring the predictability of ASoCs by combining CPU clusters with one or more hardware accelerators. The PASoC allows multiple coherent agents to interact with each other over a shared memory bus and a shared LLC. These agents can be a cluster of cache-coherent homogeneous cores, and fully or one-way coherent hardware accelerators. PASoC ensures the predictability of a memory request through modifications to the hardware architecture and the cache coherence protocols. PASoC supports predictable cache coherence within the cluster of cores and across agents. The former uses linear cache coherence, and the latter uses a modified version of the predictable Modified Shared Invalid (MSI) cache coherence protocol. The thesis analyzes the per-request worst-case latency of a memory request from any of the agents and evaluates the design on the gem5 simulator. Finally, this work presents some observations based on the analysis that can help in future designs of PASoCs.

    Dynamic Memory Bandwidth Allocation for Real-Time GPU-Based SoC Platforms

    Heterogeneous SoC platforms, comprising both general-purpose CPUs and accelerators such as a GPU, are becoming increasingly attractive for real-time and mixed-criticality systems to cope with the computational demand of data-parallel applications. However, contention for access to shared main memory can lead to significant performance degradation on both CPU and GPU. Existing work has shown that memory bandwidth throttling is effective in protecting real-time applications from memory-intensive, best-effort ones; however, due to the inherent pessimism involved in worst-case execution time estimation, such approaches can unduly restrict the bandwidth available to best-effort applications. In this work, we propose a novel memory bandwidth allocation scheme where we dynamically monitor the progress of a real-time application and increase the bandwidth share of best-effort ones whenever it is safe to do so. Specifically, we demonstrate our approach by protecting a real-time GPU kernel from best-effort CPU tasks. Based on profiling information, we first build a worst-case execution time estimation model for the GPU kernel. Using this model, we then show how to dynamically recompute online the maximum memory budget that can be allocated to best-effort tasks without exceeding the kernel's assigned execution budget. We implement our proposed technique on an NVIDIA embedded SoC and demonstrate its effectiveness on a variety of GPU and CPU benchmarks.
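
    The online budget recomputation described in this abstract can be illustrated with a minimal sketch. The abstract does not give the worst-case execution time model, so the code below assumes a simple linear one: the kernel's remaining worst-case time is taken to grow linearly with the bandwidth granted to best-effort tasks. The function name, all parameters, and the linear model are hypothetical and are not taken from the paper.

```python
def max_best_effort_bandwidth(elapsed_ms, progress, exec_budget_ms,
                              isolated_time_ms, slowdown_ms_per_mbps,
                              bandwidth_cap_mbps):
    """Largest best-effort bandwidth (MB/s) that keeps the kernel in budget.

    ASSUMED model (not from the paper): with best-effort bandwidth bw, the
    remaining worst-case time of the kernel is
        (1 - progress) * (isolated_time_ms + slowdown_ms_per_mbps * bw).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if slowdown_ms_per_mbps <= 0.0:
        raise ValueError("slowdown_ms_per_mbps must be positive")

    remaining = 1.0 - progress
    if remaining == 0.0:
        # Kernel has finished: best-effort tasks may use the full cap.
        return bandwidth_cap_mbps

    # Time still available before the assigned execution budget is exhausted.
    slack_ms = exec_budget_ms - elapsed_ms

    # Solve elapsed + remaining * (isolated + slowdown * bw) <= budget for bw.
    bw = (slack_ms / remaining - isolated_time_ms) / slowdown_ms_per_mbps

    # Clamp: never negative, never above the platform cap.
    return min(max(bw, 0.0), bandwidth_cap_mbps)
```

    Calling this function at each monitoring point with the measured elapsed time and progress gives a larger budget when the kernel runs ahead of its worst-case estimate and a budget of zero when it has fallen behind, which matches the "increase whenever it is safe" behavior the abstract describes.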

    Micro benchmarks for multiprocessor memory hierarchy performance

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (leaves 67-69). By Cristina-Ana-Maria Hristea. M.Eng.

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    Effects of Communication Protocol Stack Offload on Parallel Performance in Clusters

    The primary research objective of this dissertation is to demonstrate that the effects of communication protocol stack offload (CPSO) on application execution time can be attributed to two complementary sources. First, the application-specific computation may be executed concurrently with the asynchronous communication performed by the communication protocol stack offload engine. Second, the protocol stack processing can be accelerated or decelerated by the offload engine. These two types of performance effects can be quantified with the degree of overlapping Do and the degree of acceleration Daccs. The composite communication speedup metric Scomm(Do, Daccs) can be used to quantify the combined effects of the protocol stack offload. The thesis of this dissertation is validated empirically. The degree of overlapping Do, the degree of acceleration Daccs, and the communication speedup Scomm are derived from experiments performed on the system configurations of interest. It is shown that the proposed metrics adequately describe the effects of the protocol stack offload on the application execution time. Additionally, a set of analytical models of the networking subsystem of a PC-based cluster node is developed. The metrics Do, Daccs, and Scomm are obtained from these models. The models are evaluated as to their complexity and precision by comparing the modeling results with the measured values of Do, Daccs, and Scomm. The primary contributions of this dissertation research are as follows. First, the metrics Daccs and Scomm are introduced to complement the Do metric in evaluating the effects of optimizations in the networking subsystem on parallel performance in clusters. The metrics are shown to adequately describe CPSO performance effects. Second, a method for assessing the effects of CPSO scenarios on application performance is developed and presented. Third, a set of analytical models of cluster node networking subsystems with CPSO capability is developed and characterised as to their complexity and the precision of their predictions of the Do and Daccs metrics.

    Mechanisms and interfaces for software-extended coherent shared memory

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (p. 140-146). By David L. Chaiken. Ph.D.