9 research outputs found

    MEDEA: A Hybrid Shared-memory/Message-passing Multiprocessor NoC-based Architecture

    Get PDF
    The shared-memory model has been adopted, both for data exchange as well as synchronization using semaphores in almost every on-chip multiprocessor implementation, ranging from general purpose chip multiprocessors (CMPs) to domain specific multi-core graphics processing units (GPUs). Low-latency synchronization is desirable but is hard to achieve in practice due to the memory hierarchy. On the contrary, an explicit exchange of synchronization tokens among the processing elements through dedicated on-chip links would be beneficial for the overall system performance. In this paper we propose the Medea NoC-based framework, a hybrid shared-memory/message-passing approach. Medea has been modeled with a fast, cycle-accurate SystemC implementation enabling a fast system exploration varying several parameters like number and types of cores, cache size and policy and NoC features. In addition, every SystemC block has its RTL counterpart for physical implementation on FPGAs and ASICs. A parallel version of the Jacobi algorithm has been used as a test application to validate the metodology. Results confirm expectations about performance and effectiveness of system exploration and design

    MEDEA: A Hybrid Shared-memory/Message-passing Multiprocessor NoC-based Architecture

    Get PDF
    The shared-memory model has been adopted, both for data exchange as well as synchronization using semaphores in almost every on-chip multiprocessor implementation, ranging from general purpose chip multiprocessors (CMPs) to domain specific multi-core graphics processing units (GPUs). Low-latency synchronization is desirable but is hard to achieve in practice due to the memory hierarchy. On the contrary, an explicit exchange of synchronization tokens among the processing elements through dedicated on-chip links would be beneficial for the overall system performance. In this paper we propose the Medea NoC-based framework, a hybrid shared-memory/message-passing approach. Medea has been modeled with a fast, cycle-accurate SystemC implementation enabling a fast system exploration varying several parameters like number and types of cores, cache size and policy and NoC features. In addition, every SystemC block has its RTL counterpart for physical implementation on FPGAs and ASICs. A parallel version of the Jacobi algorithm has been used as a test application to validate the metodology. Results confirm expectations about performance and effectiveness of system exploration and desig

    Fast and Portable Locking for Multicore Architectures

    Get PDF
    International audienceThe scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server's cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an ×86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun Ultrasparc T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publication

    Heurísticas bioinspiradas para el problema de Floorplanning 3D térmico de dispositivos MPSoCs

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, leída el 20-06-2013Depto. de Arquitectura de Computadores y AutomáticaFac. de InformáticaTRUEunpu

    Scaling High-Performance Interconnect Architectures to Many-Core Systems.

    Full text link
    The ever-increasing demand for performance scaling has made multi-core (2-8 cores) chips prevalent in today’s computing systems and foreshadows the shift toward many-core (10s- 100s of cores) chips in the near future. Although the potential performance gains from many-core systems remain appealing, the widespread adoption of these systems hinges on their ability to scale performance while simultaneously satisfying Quality-of-Service (QoS) and energy-efficiency constraints. This work makes the case that the interconnect for these many-core systems has a significant impact on the aforementioned scalability issues. The impact of interconnects on many-core systems is illustrated by observing that the degree of the interconnect has a signicant effect on system scalability and demonstrating that the architecture of high-radix, many-core systems are feasible, energy-efficient, and high-performance. The feasibility of high-radix crossbars for many-core systems is first shown through a new circuit-level building block called the Swizzle-Switch which can operate at frequencies up to 1.5GHz for 128-bit, radix-64 crossbars. This work then shows how a many-core system called the Swizzle-Switch Network (SSN) can use the Swizzle-Switch as the central building block for a flat crossbar interconnect. The SSN is shown to be advantageous to traditional Network-on-Chip (NoC) for systems up to 64 cores. The SSN performance by 21% relative to a Mesh while also providing a 25% energy savings over the Mesh. The Swizzle-Switch is also leveraged as a building block for high-radix NoC topologies that can support many-core architectures. The Swizzle-Switch-based Flattened Butterfly topology is demonstrated to provide a 15% speedup and 10% energy savings over the Mesh. Finally, the impact that 3D stacking technology has on many-core scalability is evaluated for bus and crossbar interconnects. A 3D-optimized Swizzle-Switch Network is able to leverage frequency gains to achieve a 15-28% speedup over a 2D-Swizzle-Switch Network when using memory- intensive benchmarks. Additionally, a bus-based 64-core architecture is shown to provide an average speedup of 49× over a baseline uniprocessor system when using 3D technology.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/93980/1/ksewell_1.pd

    Composable accelerator-rich microprocessor enhanced for adaptivity and longevity

    Get PDF
    Abstract Accelerator-rich platforms demonstrate orders of magnitude improvement in performance and energy efficiency over software, yet they lack adaptivity to new algorithms and can see low accelerator utilization. To address these issues we propose CAMEL: Composable Accelerator-rich Microprocessor Enhanced for Longevity. CAMEL features programmable fabric (PF) to extend the use of ASIC composable accelerators in supporting algorithms that are beyond the scope of the baseline platform. Using a combination of hardware extensions and compiler support, we demonstrate on average 11.6X performance improvement and 13.9X energy savings across benchmarks that deviate from the original domain for our baseline platform

    Characterization of interconnection networks in CMPs using full-system simulation

    Get PDF
    Los computadores más recientes incluyen complejos chips compuestos de varios procesadores y una cantidad significativa de memoria cache. La tendencia actual consiste en conectar varios nodos, cada uno de ellos con un procesador y uno o más niveles de cache privada y/o compartida, utilizando una red de interconexión. La importancia de esta red está aumentando a medida que crece el número de nodos que se integran en un chip, ya que pueden aparecer cuellos de botella en la comunicación que reduzcan las prestaciones. Además, la red contribuye en gran medida al consumo de energía y área del chip. En este proyecto, comparamos el comportamiento de tres topologías: el anillo bidireccional, la malla y el toro. El anillo es una topología mínima con bajo coste en energía pero peor rendimiento debido a la mayor latencia de comunicación entre nodos. Por otro lado, el toro tiene mayor número de enlaces entre nodos y ofrece mejores prestaciones. La malla ha sido incluida como una opción intermedia altamente popular. Analizaremos también dos topologías de anillo adicionales que aprovechan la reducida área y complejidad del mismo: una con mayor ancho de banda y otra con routers de menor número de ciclos. Modelamos cuidadosamente todos los componentes del sistema (procesadores, jerarquía de memoria y red de interconexión) utilizando simulación de sistema completo. Ejecutamos aplicaciones reales en arquitecturas con 16 y 64 nodos, incluyendo tanto cargas paralelas como multiprogramadas (ejecución de varias aplicaciones independientes). Demostramos que la topología de la red afecta en gran medida al rendimiento en sistemas con 64 nodos. Con las topologías de anillo, los tiempos de ejecución son mucho mayores debido al aumento del número de saltos que le cuesta a un mensaje atravesar la red. El toro es la topología que ofrece mejor rendimiento, pero la elección más óptima sería la malla si tenemos en cuenta también energía y área. Por otro lado, para chips con 16 nodos, las diferencias en rendimiento son menores y un anillo con routers de 3 cyclos ofrece un tiempo de ejecución aceptable con el menor coste en área y energía. Nuestra aportación más significativa está relacionada con la distribución del tráfico en la red. Vemos que el tráfico no está distribuido uniformemente y que los nodos con mayores tasas de inyección varían con la aplicación. Hasta donde nosotros sabemos, no hay ningún trabajo de investigación previo que destaque este comportamiento

    Improving Energy and Area Scalability of the Cache Hierarchy in CMPs

    Full text link
    As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of larger core counts. Directory-based protocols constitute the most scalable alternative. A conventional directory, however, suffers from an inefficient use of storage and energy. First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single core. Second, although increasing directory size and associativity could boost system performance by reducing the coverage misses, it would come at the expense of area and energy consumption. This thesis focuses and exploits the important differences of behavior between private and shared blocks from the directory point of view. These differences claim for a separate management of both types of blocks at the directory. First, we propose the PS-Directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared Directory Cache, and uses a larger and slower second-level Private Directory Cache to track the large amount of private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption. In this thesis we also show that the shared/private ratio of entries in the directory varies across applications and across different execution phases within the applications, which encourages us to propose Dynamic Way Partitioning (DWP) Directory. DWP-Directory reduces the number of ways with storage for shared blocks and it allows this storage to be powered off or on at run-time according to the dynamic requirements of the applications following a repartitioning algorithm. Results show similar performance as a traditional directory with high associativity, and similar area requirements as recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power consumption savings. This dissertation also deals with the scalability issues in terms of power found in processor caches. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Results show significant dynamic power consumption savings. Finally, we propose an energy-efficient architectural design that can be effectively applied to any kind of set-associative cache memory, not only to processor caches. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that this filtering mechanism achieves energy consumption in set-associative caches similar to direct mapped ones. Experimental results show that the proposals presented in this thesis offer a good tradeoff among these three major design axes.Conforme se incrementa el número de núcleos en las nuevas generaciones de multiprocesadores en chip, los CMPs deben de escalar en prestaciones, área y consumo energético para cumplir con las demandas de un número núcleos mayor. Los protocolos basados en directorio constituyen la alternativa más escalable. Un directorio convencional, no obstante, sufre de una utilización ineficiente de almacenamiento y energía. En primer lugar, los grandes y poco escalables vectores de compartidores consumen una cantidad de energía de fuga y de área innecesaria, especialmente si se tiene en consideración que la mayoría de los bloques en un directorio solo se encuentran en la cache de un único núcleo. En segundo lugar, aunque incrementar el tamaño y la asociatividad del directorio aumentaría las prestaciones del sistema, esto supondría un incremento notable en el consumo energético. Esta tesis estudia las diferencias significativas entre el comportamiento de bloques privados y compartidos en el directorio, lo que nos lleva hacia una gestión separada para cada uno de los tipos de bloque. Proponemos el PS-Directory, una cache de directorio de dos niveles que mantiene el reducido número de las entradas compartidas, que son los que se acceden con más frecuencia, en una estructura pequeña de primer nivel (concretamente, la Shared Directory Cache) y que utiliza una estructura más grande y lenta en el segundo nivel (Private Directory Cache) para poder mantener la información de los bloques privados. Los resultados experimentales muestran que, comparado con un directorio convencional, el PS-Directory consigue mejorar las prestaciones a la vez que reduce el área de silicio y el consumo energético. Ya que el ratio compartido/privado de las entradas en el directorio varia entre aplicaciones y entre las diferentes fases de ejecución dentro de las aplicaciones, proponemos el Dynamic Way Partitioning (DWP) Directory. El DWP-Directory reduce el número de vías que almacenan entradas compartidas y permite que éstas se enciendan o apaguen en tiempo de ejecución según los requisitos dinámicos de las aplicaciones según un algoritmo de reparticionado. Los resultados muestran unas prestaciones similares a un directorio tradicional de alta asociatividad y un área similar a otros esquemas recientes del estado del arte. Adicionalmente, el DWP-Directory obtiene importantes reducciones de consumo estático y dinámico. Esta disertación también se enfrenta a los problemas de escalabilidad que se pueden encontrar en las memorias cache. En un acceso a la cache, se accede a cada vía del conjunto en paralelo, siendo así un acción costosa en energía. Esta tesis presenta la arquitectura PS-Cache, un diseño energéticamente eficiente que reduce el número de vías accedidas sin perjudicar las prestaciones. La PS-Cache utiliza la información del estado privado-compartido del bloque referenciado para reducir la energía, ya que tan solo accedemos a un subconjunto de las vías que mantienen los bloques del tipo solicitado. Los resultados muestran unos importantes ahorros de energía dinámica. Finalmente, proponemos otro diseño de arquitectura energéticamente eficiente que se puede aplicar a cualquier tipo de memoria cache asociativa por conjuntos. La propuesta, la Tag Filter (TF) Architecture, filtra las vías accedidas en el conjunto de la cache, de manera que solo se mira un número reducido de vías tanto en el array de etiquetas como en el de datos. Esto permite que nuestra propuesta reduzca el consumo de energía dinámico de las caches sin perjudicar su tiempo de acceso. Los resultados experimentales muestran que este mecanismo de filtrado es capaz de obtener un consumo energético en caches asociativas por conjunto similar de las caches de mapeado directo. Los resultados experimentales muestran que las propuestas presentadas en esta tesis consiguen un buen compromiso entre estos tres importantes pilares de diseño.Conforme s'incrementen el nombre de nuclis en les noves generacions de multiprocessadors en xip, els CMPs han d'escalar en prestacions, àrea i consum energètic per complir en les demandes d'un nombre de nuclis major. El protocols basats en directori són l'alternativa més escalable. Un directori convencional, no obstant, pateix una utilització ineficient d'emmagatzematge i energia. En primer lloc, els grans i poc escalables vectors de compartidors consumeixen una quantitat d'energia estàtica i d'àrea innecessària, especialment si es considera que la majoria dels blocs en un directori només es troben en la cache d'un sol nucli. En segon lloc, tot i que incrementar la grandària i l'associativitat del directori augmentaria les prestacions del sistema, això suposaria un increment notable en el consum d'energia. Aquesta tesis estudia les diferències significatives entre el comportament de blocs privats i compartits dins del directori, la qual cosa ens guia cap a una gestió separada per a cada un dels tipus de bloc. Proposem el PS-Directory, una cache de directori de dos nivells que manté el reduït nombre de les entrades de blocs compartits, que són els que s'accedeixen amb més freqüència, en una estructura menuda de primer nivell (concretament, la Shared Directory Cache) i que empra una estructura més gran i lenta en el segon nivell (Private Directory Cache) per poder mantenir la informació dels blocs privats. Els resultats experimentals mostren que, comparat amb un directori convencional, el PS-Directory aconsegueix millorar les prestacions a la vegada que redueix l'àrea de silici i el consum energètic. Ja que la ràtio compartit/privat de les entrades en el directori varia entre aplicacions i entre les diferents fases d'execució dins de les aplicacions, proposem el Dynamic Way Partitioning (DWP) Directory. DWP-Directory redueix el nombre de vies que emmagatzemen entrades compartides i permeten que aquest s'encengui o apagui en temps d'execució segons els requeriments dinàmics de les aplicacions seguint un algoritme de reparticionat. Els resultats mostren unes prestacions similars a un directori tradicional d'alta associativitat i una àrea similar a altres esquemes recents de l'estat de l'art. Adicionalment, el DWP-Directory obté importants reduccions de consum estàtic i dinàmic. Aquesta dissertació també s'enfronta als problemes d'escalabilitat que es poden tro- bar en les memòries cache. Les caches on-chip consumeixen una part significativa del consum total del sistema. Aquestes caches implementen un alt nivell d'associativitat. En un accés a la cache, s'accedeix a cada via del conjunt en paral·lel, essent així una acció costosa en energia. Aquesta tesis presenta l'arquitectura PS-Cache, un disseny energèticament eficient que redueix el nombre de vies accedides sense perjudicar les prestacions. La PS-Cache utilitza la informació de l'estat privat-compartit del bloc referenciat per a reduir energia, ja que només accedim al subconjunt de vies que mantenen blocs del tipus sol·licitat. Els resultats mostren uns importants estalvis d'energia dinàmica. Finalment, proposem un altre disseny d'arquitectura energèticament eficient que es pot aplicar a qualsevol tipus de memòria cache associativa per conjunts. La proposta, la Tag Filter (TF) Architecture, filtra les vies accedides en el conjunt de la cache, de manera que només un reduït nombre de vies es miren tant en el array d'etiquetes com en el de dades. Això permet que la nostra proposta redueixi el consum dinàmic energètic de les caches sense perjudicar el seu temps d'accés. Els resultats experimentals mostren que aquest mecanisme de filtre és capaç d'obtenir un consum energètic en caches associatives per conjunt similar al de les caches de mapejada directa. Els resultats experimentals mostren que les propostes presentades en aquesta tesis conseguixen un bon compromís entre aquestros tres importants pilars de diseny.Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551TESI

    Design and Implementation of Benes/Clos On-Chip Interconnection Networks

    Full text link
    Networks-on-Chip (NoCs) have emerged as the key on-chip communication architecture for multiprocessor systems-on-chip and chip multiprocessors. Single-hop non-blocking networks have the advantage of providing uniform latency and throughput, which is important for cachecoherent NoC systems. Existing work shows that Benes networks have much lower transistor count and smaller circuit area but longer delay than crossbars. To reduce the delay, we propose to design the Clos network built with larger size switches. Using less than half number of stages than the Benes network, the Clos network with 4x4 switches can significantly reduce the delay. This dissertation focuses on designing high performance Benes/Clos on-chip interconnection networks and implementing the switch setting circuits for these networks. The major contributions are summarized below: The circuit designs of both Benes and Clos networks in different sizes are conducted considering two types of implementation of the configurable switch: with NMOS transistors only and full transmission gates (TGs). The layout and simulation results under 45nm technology show that TG-based Benes networks have much better delay and power performance than their NMOS-based counterparts, though more transistor resources are needed in TG-based designs. Clos networks achieve average 60% lower delay than Benes networks with even smaller area and power consumption. The Lee’s switch setting algorithm is fully implemented in RTL and synthesized. We have refined the algorithm in data structure and initialization/updating of relation values to make it suitable for hardware implementation. The simulation and synthesis results of the switching setting circuits for 4x4 to 64x64 Benes networks under 65nm technology confirm that the trend of delay and area results of the circuit is consistent with that of the Lee’s algorithm. To the best of our knowledge, this is the first complete hardware implementation of the parallel switch setting algorithm which can handle all types of permutations including partial ones. The results in this dissertation confirm that the Benes/Clos networks are promising solution to implement on-chip interconnection network
    corecore