
    MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

    Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE-to-L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.
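
    As a hedged illustration of the programming model described above, where every core sees the same multi-banked L1 scratchpad, here is a minimal data-parallel kernel sketch in C. The primitives core_id(), num_cores() and barrier(), and the L1 placement of l1_data, are assumptions for illustration, not the actual MemPool runtime API.

        /* Minimal sketch of a data-parallel kernel on a shared-L1 manycore.
         * core_id(), num_cores(), barrier() and the placement of l1_data are
         * hypothetical stand-ins for whatever the real runtime provides. */
        #include <stdint.h>

        #define N 4096
        extern int32_t l1_data[N];       /* assumed to reside in the shared L1 */
        extern uint32_t core_id(void);
        extern uint32_t num_cores(void);
        extern void barrier(void);

        void scale_vector(int32_t factor) {
            uint32_t id = core_id();
            uint32_t n  = num_cores();
            /* Stride by the core count: at any step, neighbouring cores touch
             * neighbouring words, which a word-interleaved banking scheme maps
             * to different banks, limiting conflicts. */
            for (uint32_t i = id; i < N; i += n)
                l1_data[i] *= factor;
            barrier();                   /* all cores now see the updated data */
        }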

    Exploiting data locality in cache-coherent NUMA systems

    The end of Dennard scaling has caused a stagnation of the clock frequency in computers. To overcome this issue, over the last two decades vendors have been integrating larger numbers of processing elements into their systems: interconnecting many nodes, including multiple chips per node, and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors: it is much slower, and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips. While still keeping a shared address space, where all processors can access the whole memory, solutions have come from integrating more memories: by using newer technologies like high-bandwidth memory (HBM) and non-volatile memory (NVM), by giving groups of cores (sockets, for example) faster access to some subset of the DRAM, or by combining several of these solutions. This has caused heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours. Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes made to memory from one CPU must be visible to the other CPUs, transparently to the programmer. These NUMA behaviours reduce the performance of applications and can pose a challenge to programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve data locality in NUMA systems and, therefore, the performance of applications in these computer systems. The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high-performance NUMA systems, and the performance counters collected on one machine are used to predict the best configurations on the other machine. The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers at a low hardware cost while giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results on performance, data-transfer reduction and energy costs. Finally, the third and last contribution consists of scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications on parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application, where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and produce a schedule for the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated on real ccNUMA systems with multiple NUMA regions.
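
    To make the locality problem concrete, the following is a minimal sketch of explicit NUMA-aware placement using the Linux libnuma API (compile with -lnuma). The thesis itself works at the level of schedulers, hardware prefetchers and runtimes rather than this manual API; node number 0 is an arbitrary choice for illustration.

        /* Sketch: co-locate a buffer and the thread that uses it on one NUMA
         * node, so accesses stay local instead of crossing the interconnect. */
        #include <numa.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void) {
            if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support on this system\n");
                return EXIT_FAILURE;
            }
            size_t bytes = 1 << 20;
            double *buf = numa_alloc_onnode(bytes, 0);  /* data on node 0 */
            numa_run_on_node(0);                        /* thread on node 0 */
            for (size_t i = 0; i < bytes / sizeof(double); i++)
                buf[i] = (double)i;                     /* all-local traffic */
            numa_free(buf, bytes);
            return EXIT_SUCCESS;
        }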

    Reducing Internet Latency: A Survey of Techniques and Their Merits

    Bob Briscoe, Anna Brunstrom, Andreas Petlund, David Hayes, David Ros, Ing-Jyh Tsang, Stein Gjessing, Gorry Fairhurst, Carsten Griwodz, Michael Welzl. Peer-reviewed preprint.

    RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)

    The original RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS codes reduce recovery cost. NAND flash solid-state disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than hard disk drives (HDDs), which remain more viable for hyperscalers.
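
    Since the tutorial's subject is redundancy, a small sketch of single-parity (RAID-4/5-style) protection may help: the parity strip is the XOR of the data strips, so any one lost strip can be rebuilt by XOR-ing the survivors. The strip size and disk count below are illustrative only.

        /* RAID-4/5-style single parity: parity = XOR of all data strips. */
        #include <stdint.h>
        #include <string.h>

        #define STRIP_BYTES 4096
        #define DATA_DISKS  4

        void compute_parity(const uint8_t data[DATA_DISKS][STRIP_BYTES],
                            uint8_t parity[STRIP_BYTES]) {
            memset(parity, 0, STRIP_BYTES);
            for (int d = 0; d < DATA_DISKS; d++)
                for (int i = 0; i < STRIP_BYTES; i++)
                    parity[i] ^= data[d][i];
        }

        /* Rebuild a failed strip from the surviving strips plus parity. */
        void rebuild_strip(const uint8_t data[DATA_DISKS][STRIP_BYTES],
                           const uint8_t parity[STRIP_BYTES],
                           int failed, uint8_t out[STRIP_BYTES]) {
            memcpy(out, parity, STRIP_BYTES);
            for (int d = 0; d < DATA_DISKS; d++)
                if (d != failed)
                    for (int i = 0; i < STRIP_BYTES; i++)
                        out[i] ^= data[d][i];
        }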

    CGAMES'2009


    Indexed dependence metadata and its applications in software performance optimisation

    To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units, with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases. We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports inter-component optimisation and enables generation of efficient data movement code. We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and a discussion of how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies.
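
    A rough sketch of the idea, under assumed names (the paper's real interface is C++ components; this is a simplified C rendering): alongside the execute function, the programmer exposes a mapping from an iteration index to the data that iteration reads, which a runtime can use to stage data ahead of execution, e.g. the double-buffered DMA mentioned for the Cell processor.

        /* Decoupled access/execute, sketched: metadata describes which input
         * elements iteration i of a 3-point stencil touches, separately from
         * the computation itself. Names are illustrative, not the paper's API. */
        #include <stddef.h>

        typedef struct {
            size_t first, count;   /* contiguous range of input elements used */
        } access_desc;

        /* Access side: iteration i reads in[i-1 .. i+1] (interior iterations). */
        static access_desc stencil_access(size_t i) {
            access_desc d = { i - 1, 3 };
            return d;
        }

        /* Execute side: pure computation, no knowledge of data movement. */
        static float stencil_execute(const float *in, size_t i) {
            return 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
        }

        /* A runtime can walk stencil_access() over a block of iterations, DMA
         * the union of the reported ranges into local store, then run
         * stencil_execute() on the staged copy while the next block streams in. */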

    Using Tracing To Enhance Data Cache Performance in CPUs: The creation of a Trace-Assisted Cache to increase cache hits and decrease runtime

    The processor-memory gap is widening every year with no prospect of reprieve. More and more latency is being added to program runtimes as memory cannot satisfy the demands of CPUs quickly enough. In the past, this has been alleviated through caches of increasing complexity or techniques like prefetching, to give the illusion of faster memory. However, these techniques have drawbacks because they are reactive or rely on incomplete information. In general, this leads to large amounts of latency in programs due to processor stalls. It is our contention that by tracing a program's data accesses and feeding this information back to the cache, overall program runtime can be reduced. This is achieved through a new piece of hardware called a Trace-Assisted Cache (TAC). This uses traces to gain foreknowledge of the memory requests the processor is likely to make, allowing them to be actioned before the processor requests the data, overlapping memory and computation instructions. Comparing the TAC against a standard CPU without a cache, we see improvements in runtimes of up to 65%. However, we see degraded performance of around 8% on average when compared to set-associative and direct-mapped caches, because the improvements are swamped by high overheads and synchronisation times between components. We also see that benchmarks with certain qualities, namely a balance of computation and memory instructions and data spread well out in memory, fare better using the TAC than other benchmarks on the same hardware. Overall, this demonstrates that whilst there is potential to reduce runtime by increasing the agency of the cache through trace assistance, a highly efficient implementation is required to be competitive; otherwise any potential gains are negated by the increase in overheads.
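
    A software analogue may clarify the mechanism (the TAC itself is hardware): replay the recorded address trace a fixed distance ahead of execution and issue prefetch hints, overlapping memory and computation. LOOKAHEAD and the trace layout are assumptions for illustration; __builtin_prefetch is the standard GCC/Clang hint.

        /* Trace-assisted prefetching, sketched in software: the trace gives
         * foreknowledge of future accesses, so data is requested before the
         * computation demands it. */
        #include <stddef.h>

        #define LOOKAHEAD 16              /* illustrative prefetch distance */

        double sum_traced(const double *heap, const size_t *trace, size_t n) {
            double acc = 0.0;
            for (size_t i = 0; i < n; i++) {
                if (i + LOOKAHEAD < n)    /* act on foreknowledge */
                    __builtin_prefetch(&heap[trace[i + LOOKAHEAD]], 0, 1);
                acc += heap[trace[i]];    /* demand access, ideally a hit */
            }
            return acc;
        }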

    Reconfigurable Instruction Cell Architecture Reconfiguration and Interconnects


    Run-time management for future MPSoC platforms

    In recent years, we are witnessing the dawning of the Multi-Processor System-on-Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications while reducing the overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that flexible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires flexibility, but should also combine high performance with low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior. The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with an amount of services or experiences. So from an application viewpoint, MPSoCs are designed to efficiently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a significant amount of memory. To keep up with ever-evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities, is also an important factor. Hence scalability, flexibility, real-time behavior, high performance, low power consumption and, finally, programmability are key components in realizing the success of MPSoC platforms. The run-time manager is logically located between the application layer and the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time, based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to flexibility, scalability and low-power operation. As it has an arbiter function between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services. First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a definition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion on how these trade-offs affect the key MPSoC requirements.
This design space definition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions. Consequently, we introduce a fast and efficient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores denoted as hierarchical configuration. Hierarchical configuration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment-supported mechanism that enables moving tasks between an ISP and fine-grained reconfigurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors. We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing a soft (reactive) QoS in a NoC. We introduce a novel, platform-independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a realistic, reproducible and flexible run-time manager testbench with respect to applications with multiple quality levels and implementation trade-offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The PSFF tool is, to the best of our knowledge, the first tool that generates multiple realistic application operating points, either based on profiling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate for real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems.
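
    As a hedged sketch of the quality-manager/resource-manager interaction described above (data layout and names are illustrative, not the thesis's API): the quality manager scans an application's Pareto-ordered operating points, such as those a tool like PSFF generates, and picks the highest-quality point that fits the resources the resource manager reports as free.

        /* Run-time quality management, sketched: choose the best feasible
         * operating point given currently free platform resources. */
        #include <stddef.h>

        typedef struct {
            int quality;   /* user-perceived quality of this operating point */
            int cycles;    /* processing budget the point requires           */
            int bytes;     /* memory the point requires                      */
        } op_point;

        /* Points are assumed sorted by descending quality (a Pareto set).
         * Returns the index of the chosen point, or -1 if nothing fits. */
        int select_point(const op_point *pts, size_t n,
                         int free_cycles, int free_bytes) {
            for (size_t i = 0; i < n; i++)
                if (pts[i].cycles <= free_cycles && pts[i].bytes <= free_bytes)
                    return (int)i;   /* first fit = best feasible quality */
            return -1;               /* reject the application or degrade */
        }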