838 research outputs found

    Approximate simulation of distributed-memory multithreaded multiprocessors

    Get PDF
    The performance of modern computer systems is increasingly limited by long latencies of accesses to their memory systems. Instruction-level multithreading is a technique to tolerate long latencies of memory accesses by switching from one instruction thread to another. The paper shows that the simulation-based performance evaluation of distributed-memory multithreaded multiprocessor systems can be significantly simplified by using approximate models, composed of only a few processors, but with some parameters adjusted to represent the behavior of the original system

    Exploiting Properties of CMP Cache Traffic in Designing Hybrid Packet/Circuit Switched NoCs

    Get PDF
    Chip multiprocessors with few to tens of processing cores are already commercially available. Increased scaling of technology is making it feasible to integrate even more cores on a single chip. Providing the cores with fast access to data is vital to overall system performance. When a core requires access to a piece of data, the core's private cache memory is searched first. If a miss occurs, the data is looked up in the next level(s) of the memory hierarchy, where often one or more levels of cache are shared between two or more cores. Communication between the cores and the slices of the on-chip shared cache is carried through the network-on-chip(NoC). Interestingly, the cache and NoC mutually affect the operation of each other; communication over the NoC affects the access latency of cache data, while the cache organization generates the coherence and data messages, thus affecting the communication patterns and latency over the NoC. This thesis considers hybrid packet/circuit switched NoCs, i.e., packet switched NoCs enhanced with the ability to configure circuits. The communication and performance benefit that come from using circuits is predicated on amortizing the time cost incurred for configuring the circuits. To address this challenge, NoC designs are proposed that take advantage of properties of the cache traffic, namely temporal locality and predictability, to amortize or hide the circuit configuration time cost. First, a coarse-grained circuit configuration policy is proposed that exploits the temporal locality in the cache traffic to periodically configure circuits for the heavily communicating nodes. This allows the design of a locality-aware cache that promotes temporal communication locality through data placement, while designing suitable data replacement and migration policies. Next, a fine-grained configuration policy, called Déjà Vu switching, is proposed for leveraging predictability of data messages by initiating a circuit configuration as soon as a cache hit is detected and before the data becomes available. Its benefit is demonstrated for saving interconnect energy in multi-plane NoCs. Finally, a more proactive configuration policy is proposed for fast caches, where circuit reservations are initiated by request messages, which can greatly improve communication latency and system performance

    Temperature Regulation in Multicore Processors Using Adjustable-Gain Integral Controllers

    Full text link
    This paper considers the problem of temperature regulation in multicore processors by dynamic voltage-frequency scaling. We propose a feedback law that is based on an integral controller with adjustable gain, designed for fast tracking convergence in the face of model uncertainties, time-varying plants, and tight computing-timing constraints. Moreover, unlike prior works we consider a nonlinear, time-varying plant model that trades off precision for simple and efficient on-line computations. Cycle-level, full system simulator implementation and evaluation illustrates fast and accurate tracking of given temperature reference values, and compares favorably with fixed-gain controllers.Comment: 8 pages, 6 figures, IEEE Conference on Control Applications 2015, Accepted Versio

    Shared memory with hidden latency on a family of mesh-like networks

    Get PDF

    A case for merging the ILP and DLP paradigms

    Get PDF
    The goal of this paper is to show that instruction level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single architecture to execute vectorizable code at a performance level that can not be achieved using either paradigm on its own. We will show that the combination of the two techniques yields very high performance at a low cost and a low complexity. We will show that this architecture can reach a performance equivalent to a superscalar processor that sustained 10 instructions per cycle. We will see that the machine exploiting both types of parallelism improves upon the ILP-only machine by factors of 1.5-1.8. We also present a study on the scalability of both paradigms and show that, when we increase resources to reach a 16-issue machine, the advantage of the ILP+DLP machine over the ILP-only machine increases up to 2.0-3.45. While the peak achieved IPC for the ILP machine is 4, the ILP+DLP machine exceeds 10 instructions per cycle.Peer ReviewedPostprint (published version

    Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling

    Get PDF
    Cache hierarchies are increasingly non-uniform, so for systems to scale efficiently, data must be close to the threads that use it. Moreover, cache capacity is limited and contended among threads, introducing complex capacity/latency tradeoffs. Prior NUCA schemes have focused on managing data to reduce access latency, but have ignored thread placement; and applying prior NUMA thread placement schemes to NUCA is inefficient, as capacity, not bandwidth, is the main constraint. We present CDCS, a technique to jointly place threads and data in multicores with distributed shared caches. We develop novel monitoring hardware that enables fine-grained space allocation on large caches, and data movement support to allow frequent full-chip reconfigurations. On a 64-core system, CDCS outperforms an S-NUCA LLC by 46% on average (up to 76%) in weighted speedup and saves 36% of system energy. CDCS also outperforms state-of-the-art NUCA schemes under different thread scheduling policies.National Science Foundation (U.S.) (Grant CCF-1318384)Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Jacobs Presidential Fellowship)United States. Defense Advanced Research Projects Agency (PERFECT Contract HR0011-13-2-0005

    Programming MPSoC platforms: Road works ahead

    Get PDF
    This paper summarizes a special session on multicore/multi-processor system-on-chip (MPSoC) programming challenges. The current trend towards MPSoC platforms in most computing domains does not only mean a radical change in computer architecture. Even more important from a SW developer´s viewpoint, at the same time the classical sequential von Neumann programming model needs to be overcome. Efficient utilization of the MPSoC HW resources demands for radically new models and corresponding SW development tools, capable of exploiting the available parallelism and guaranteeing bug-free parallel SW. While several standards are established in the high-performance computing domain (e.g. OpenMP), it is clear that more innovations are required for successful\ud deployment of heterogeneous embedded MPSoC. On the other hand, at least for coming years, the freedom for disruptive programming technologies is limited by the huge amount of certified sequential code that demands for a more pragmatic, gradual tool and code replacement strategy

    Planificación consciente de la contención y gestión de recursos en arquitecturas multicore emergentes

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, leída el 14-12-2021Chip multicore processors (CMPs) currently constitute the architecture of choice for mosto general-pùrpose computing systems, and they will likely continue to be dominant in the near future. Advances in technology have enabled to pack an increasing number of cores and bigger caches on the same chip. Nevertheless, contention on shared resources on CMPs -present since the advent of these architectures- still poses a big challenge. Cores in a CMP typically share a last-level cache (LLC) and other memory-related resources with the remaining cores, such as a DRAM controller and an interconnection network. This causes that co-running applications may intensively compete with each other for these shared resources, leading to substantial and uneven performance degradation...Los procesadores multinúcleo o CMPs (Chip Multicore Processors) son actualmente la arquitectura más usada por la mayoría de sistemas de computación de propósito general, y muy probablemente se mantendrían en esa posición dominante en el futuro cercano. Los avances tecnológicos han permitido integrar progresivamente en el mismo chip más cores y aumentar los tamaños de los distintos niveles de cache. No obstante, la contención de recursos compartidos en CMPs {presente desde la aparición de estas arquitecturas{ todavía representa un reto importante que afrontar. Los cores en un CMP comparten en la mayor parte de los diseños una cache de último nivel o LLC (Last-Level Cache) y otros recursos, como el controlador de DRAM o una red de interconexión. La existencia de dichos recursos compartidos provoca en ocasiones que cuando se ejecutan dos o más aplicaciones simultáneamente en el sistema, se produzca una degradación sustancial y potencialmente desigual del rendimiento entre aplicaciones...Fac. de InformáticaTRUEunpu
    • …
    corecore