368 research outputs found

    Boosting performance of transactional memory through transactional read tracking and set associative locks

    Get PDF
    Multi-core processors have become so prevalent in server, desktop, and even embedded systems that they are considered the norm for modem computing systems. The trend is likely toward many-core processors with many more than just 2, 4, or 8 cores per CPU. To benefit from the increasing number of cores per chip, application developers have to develop parallel programs [1]. Traditional lock-based programming is too difficult and error prone for most of programmers and is the domain of experts. Deadlock, race, and other synchronization bugs are some of the challenges of lock-based programming. To make parallel programming mainstream, it is necessary to adapt parallel programming by the majority of programmers and not just experts, and thus simplifying parallel programming has become an important challenge. Transactional Memory (TM) is a promising programming model for managing concurrent accesses to the shared memory locations. Transactional memory allows a programmer to specify a section of a code to be "'transactional", and the underlying system guarantees atomic execution of the code. This simplifies parallel programming and reduces the possibility of synchronization bugs. This thesis develops several software- and hardware-based techniques to improve performance of existing transactional memory systems. The first technique is Transactional Read Tracking (TRT). TRT is a software-based approach that employs a locking mechanism for transactional read and write operations. The performance of TRT depends on memory access patterns of applications. In some cases, TRT falls behind the baseline scheme. To further improve performance of TRT, we introduce two hybrid methods that dynamically switches between TRT and the baseline scheme based on applications’ behavior. The second optimization technique is Set Associative Lock (SAL). Memory locations are mapped to a lock table in order to synchronize accesses to the shared memory locations. Direct mapped lock tables usually result in collision which leads to false aborts. In SAL, we increase associativity of the lock table to reduce false abort. While SAL improves performance in most of the applications, in some cases, it increases execution time due to overhead of lock tables in software. To cope with this problem, we propose Hardware-SAL (HW-SAL) which moves the set associative lock table to the hardware. As such, true power of set associativity will be harnessed without sacrificing performance

    On the maturity of parallel applications for asymmetric multi-core processors

    Get PDF
    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. By maintaining two types of cores (fast and slow) AMCs are able to provide high performance under the facility power budget. This paper performs the first extensive evaluation of how portable are the current HPC applications for such supercomputing systems. Specifically we evaluate several execution models on an ARM big.LITTLE AMC using the PARSEC benchmark suite that includes representative highly parallel applications. We compare schedulers at the user, OS and runtime levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system level as it improves the baseline by 23%, while the heterogeneous-aware OS scheduling solution improves the baseline by 10%.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697 and No. 779877. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.Peer ReviewedPostprint (author's final draft

    An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for Securing Internet-of-Things Applications

    Full text link
    This paper presents the first hardware implementation of the Datagram Transport Layer Security (DTLS) protocol to enable end-to-end security for the Internet of Things (IoT). A key component of this design is a reconfigurable prime field elliptic curve cryptography (ECC) accelerator, which is 238x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Our full hardware implementation of the DTLS 1.3 protocol provides 438x improvement in energy-efficiency over software, along with code size and data memory usage as low as 8 KB and 3 KB respectively. The cryptographic accelerators are coupled with an on-chip low-power RISC-V processor to benchmark applications beyond DTLS with up to two orders of magnitude energy savings. The test chip, fabricated in 65 nm CMOS, demonstrates hardware-accelerated DTLS sessions while consuming 44.08 uJ per handshake, and 0.89 nJ per byte of encrypted data at 16 MHz and 0.8 V.Comment: Published in IEEE Journal of Solid-State Circuits (JSSC

    Hardware/Software Co-design for Multicore Architectures

    Get PDF
    Siirretty Doriast

    Energy Efficiency and Performance in Multiprocessors Systems on Chip

    Get PDF
    As process technology shrinks, the transistor count on CPUs has increased. The breakdown of Dennard scaling has led to diminishing returns in terms of performance per power. A trend which promises to impact future CPU designs. This breakdown is due in part to the increase in transistor leakage driven static power. We, now, have constrained energy and power budgets. Thus, energy consumption has to be justified by an increased in performance. Simultaneously, architects have shifted to chip multiprocessors(CMPs) designs with large shared last level cache(LLC) to mitigate the cost of long latency off-chip memory access. A primary reason for that shift is the power efficiency of CMPs. Additionally, technology scaling has allowed the integration of platform components on the chip; a design referred to as multiprocessors system on chip (MpSoC). This integration improves the system performance as the communication latency between the components is reduced. Memory subsystems are essential to CPUs performance. Larger caches provide the CPU faster access to a larger data set. Consequently, the size of last level caches have increased to become a significant leakage power dissipation source. We propose a technique to facilitate power gating a partition of the LLC by migrating the high temporal blocks to a live partition; Thus reducing the performance impact. Given the high latency of memory subsystems, prefetching improves CPU performance by speculating future memory accesses and requesting the data ahead of the demand. In the context of CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution effects. Furthermore, given the current constraint energy environment, there is a need for lightweight prefetchers with high accuracy. To this end, we present BFetch a lightweight and accurate prefetcher driven by control flow predictions and effective address speculation. MpSoCs have mostly been used in mobile devices. The energy constraint is more pronounced in MpSoCs-based, battery powered mobile devices. The need to address the energy consumption in MpSoCs is further accentuated by the proliferation of mobile devices. This dissertation presents a framework to optimize the energy in MpSoCs. The proposed framework minimizes the energy consumption while meeting performance and power budgets constraints. We first apply this framework to the CPU then extend it to accommodate the GPU

    Compiler and Runtime for Memory Management on Software Managed Manycore Processors

    Get PDF
    abstract: We are expecting hundreds of cores per chip in the near future. However, scaling the memory architecture in manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet coherent cache architectures are believed will not scale to hundreds and thousands of cores. In addition, caches and coherence logic already take 20-50% of the total power consumption of the processor and 30-60% of die area. Therefore, a more scalable architecture is needed for manycore architectures. Software Managed Manycore (SMM) architectures emerge as a solution. They have scalable memory design in which each core has direct access to only its local scratchpad memory, and any data transfers to/from other memories must be done explicitly in the application using Direct Memory Access (DMA) commands. Lack of automatic memory management in the hardware makes such architectures extremely power-efficient, but they also become difficult to program. If the code/data of the task mapped onto a core cannot fit in the local scratchpad memory, then DMA calls must be added to bring in the code/data before it is required, and it may need to be evicted after its use. However, doing this adds a lot of complexity to the programmer's job. Now programmers must worry about data management, on top of worrying about the functional correctness of the program - which is already quite complex. This dissertation presents a comprehensive compiler and runtime integration to automatically manage the code and data of each task in the limited local memory of the core. We firstly developed a Complete Circular Stack Management. It manages stack frames between the local memory and the main memory, and addresses the stack pointer problem as well. Though it works, we found we could further optimize the management for most cases. Thus a Smart Stack Data Management (SSDM) is provided. In this work, we formulate the stack data management problem and propose a greedy algorithm for the same. Later on, we propose a general cost estimation algorithm, based on which CMSM heuristic for code mapping problem is developed. Finally, heap data is dynamic in nature and therefore it is hard to manage it. We provide two schemes to manage unlimited amount of heap data in constant sized region in the local memory. In addition to those separate schemes for different kinds of data, we also provide a memory partition methodology.Dissertation/ThesisPh.D. Computer Science 201

    Exploiting asymmetric multi-core systems with flexible system software

    Get PDF
    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite that includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements up to 15x, averaging to 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) of the game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% and at the same time it maintains temperature stable.Los procesadores multinúcleos asimétricos (AMC) son una solución arquitectónica exitosa para dispositivos móviles y supercomputadores. Estas arquitecturas combinan diferentes tipos de núcleos de procesamiento diseñados con diferentes propiedades de rendimiento y potencia. Al mantener dos o más tipos de núcleos, los AMCs pueden proporcionar un alto rendimiento con un consumo bajo de energía de las infraestructuras. Sin embargo, existen importantes desafíos al usar los AMC, como la programación y el equilibrio de carga. Esta tesis explora inicialmente el potencial de los AMC al ejecutar aplicaciones actuales de Computacion de Alto Rendimiento (HPC) y busca el modelo de ejecución más apropiado para ellas. Específicamente evaluamos varios modelos de ejecución en un procesador asimétrico Arm big.LITTLE utilizando las aplicaciones PARSEC que son aplicaciones representativas de HPC. En este trabajo se compara la programación en los niveles de usuario, sistema operativo y librería y evaluamos el impacto de estas opciones en el conocido problema de equilibrar la carga entre los AMCs. Nuestros resultados demuestran que la programación es más efectiva cuando se lleva a cabo en el nivel del runtime, ya que mejora la programación del nivel de usuario en un 23%, mientras que la solución de programación del sistema operativo heterogéneo mejora la programación del nivel de usuario en un 10%. Siguiendo este resultado, esta tesis se centra en aumentar el rendimiento de los sistemas AMC mejorando la programación al nivel de librería. La programación en este nivel se proporciona mediante el uso de Modelos de Programación Paralelos Basados en Tareas (MPBT). Estos modelos de programación ofrecen flexibilidad de programación, ya que consisten en una interfaz y un runtime para administrar los recursos e hilos subyacentes. En esta tesis, mejoramos la programación con MPBT al proporcionar tres nuevos planificadores de tareas para AMCs. Estos planificadores dinámicos reducen el tiempo total de ejecución ya sea detectando la camino más largo o el camino crítico del grafo de dependencia de tareas de la aplicación, que es generado dinámicamente. En nuestra evaluación, comparamos estos planificadores con un planificador heterogéneo existente y demonstramos su mejora sobre un planificador FIFO. Mostramos que los planificadores heterogéneos mejoran el planificador FIFO en hasta 1.45x en un AMC real de 8 núcleos y hasta 2.1x en un AMC simulado de 32 núcleos. Otra contribución en los MPBT es la adaptabilidad al paralelismo de grano fino. El creciente número de núcleos en los chip multinúcleos modernos está empujando la investigación hacia el uso de cargas de trabajo de grano fino, que es un desafío importante para los MPBT. Nuestro estudio observa que la creación de tareas bloquea la ejecución con cargas de trabajo de grano fino con MPBT. Cuando el número de núcleos aumenta, el tiempo empleado en generar tareas pasa a ser más crítico para toda la ejecución. Nuestra solución es TaskGenX, que minimiza los costes de creación de tareas y se basa en una extensión del runtime y en un hardware dedicado. En el runtime, TaskGenX desacopla la creación de tareas de las otras actividades del runtime, ejecutando esta actividad en un hardware especializado. Evaluamos 11 aplicaciones de HPC con TaskGenX en sistemas simétricos y AMC y obtenemos mejoras de rendimiento de hasta 15x, con un promedio de 3.1x sobre la implementación de referencia. Finalmente, esta tesis presenta un planificador de CPU con el objetivo de aumentar los fotogramas por segundo (FPS) para juegos en dispositivos móviles con sistemas AMC. Diseñamos e implementamos el planificador de Real-Time Scheduler (RTS) en Android. El RTS proporciona una política de programación eficiente que tiene en cuenta la temperatura actual del sistema para realizar la migración de tareas. La solución RTS aumenta la FPS mediana de los mecanismos de referenci

    Exploiting asymmetric multi-core systems with flexible system software

    Get PDF
    Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. These architectures combine different types of processing cores designed at different performance and power optimization points, thus exposing a performance-power trade-off. By maintaining two types of cores, AMCs are able to provide high performance under the facility power budget. However, there are significant challenges when using AMCs such as scheduling and load balancing. This thesis initially explores the potential of AMCs when executing current HPC applications and searches for the most appropriate execution model. Specifically we evaluate several execution models on an Arm big.LITTLE AMC using the PARSEC benchmark suite that includes representative HPC applications. We compare schedulers at the user, OS and runtime system levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing the load across AMCs. Our results demonstrate that scheduling is more effective when it takes place in the runtime system as it improves the user-level scheduling by 23%, while the heterogeneous-aware OS scheduling solution improves the user-level scheduling by 10%. Following this outcome, this thesis focuses on increasing performance of AMC systems by improving scheduling in the runtime system level. Scheduling in the runtime system level is provided by the use of task-based parallel programming models. These programming models offer programming flexibility as they consist of an interface and a runtime system to manage the underlying resources and threads. In this thesis we improve scheduling with task-based programming models by providing three novel task schedulers for AMCs. These dynamic scheduling policies reduce total execution time either by detecting the longest or the critical path of the dynamic task dependency graph of the application. They use dynamic scheduling and information discoverable during execution, fact that makes them implementable and functional without the need of off-line profiling. In our evaluation we compare these scheduling approaches with an existing state-of the art heterogeneous scheduler and we track their improvement over a FIFO baseline scheduler. We show that the heterogeneous schedulers improve the baseline by up to 1.45x on a real 8-core AMC and up to 2.1x on a simulated 32-core AMC. Another enhancement we provide in task-based programming models is the adaptability to fine grained parallelism. The increasing number of cores on modern CMPs is pushing research towards the use of fine grained workloads, which is an important challenge for task-based programming models. Our study makes the observation that task creation becomes a bottleneck when executing fine grained workloads with task-based programming models. As the number of cores increases, the time spent generating tasks is becoming more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX minimizes task creation overheads and relies both on the runtime system and a dedicated hardware. On the runtime system side, TaskGenX decouples the task creation from the other runtime activities. It then transfers this part of the runtime to a specialized hardware. From our evaluation using 11 HPC workloads on both symmetric and AMC systems, we obtain performance improvements up to 15x, averaging to 3.1x over the baseline. Finally, this thesis presents a showcase for a real-time CPU scheduler with the goal to increase the frames per second (FPS) of the game-play on mobile devices with AMC systems. We design and implement the RTS scheduler in the Android framework. RTS provides an efficient scheduling policy that takes into account the current temperature of the system to perform task migration. RTS solution increases the median FPS of the baseline mechanisms by up to 7.5% and at the same time it maintains temperature stable.Los procesadores multinúcleos asimétricos (AMC) son una solución arquitectónica exitosa para dispositivos móviles y supercomputadores. Estas arquitecturas combinan diferentes tipos de núcleos de procesamiento diseñados con diferentes propiedades de rendimiento y potencia. Al mantener dos o más tipos de núcleos, los AMCs pueden proporcionar un alto rendimiento con un consumo bajo de energía de las infraestructuras. Sin embargo, existen importantes desafíos al usar los AMC, como la programación y el equilibrio de carga. Esta tesis explora inicialmente el potencial de los AMC al ejecutar aplicaciones actuales de Computacion de Alto Rendimiento (HPC) y busca el modelo de ejecución más apropiado para ellas. Específicamente evaluamos varios modelos de ejecución en un procesador asimétrico Arm big.LITTLE utilizando las aplicaciones PARSEC que son aplicaciones representativas de HPC. En este trabajo se compara la programación en los niveles de usuario, sistema operativo y librería y evaluamos el impacto de estas opciones en el conocido problema de equilibrar la carga entre los AMCs. Nuestros resultados demuestran que la programación es más efectiva cuando se lleva a cabo en el nivel del runtime, ya que mejora la programación del nivel de usuario en un 23%, mientras que la solución de programación del sistema operativo heterogéneo mejora la programación del nivel de usuario en un 10%. Siguiendo este resultado, esta tesis se centra en aumentar el rendimiento de los sistemas AMC mejorando la programación al nivel de librería. La programación en este nivel se proporciona mediante el uso de Modelos de Programación Paralelos Basados en Tareas (MPBT). Estos modelos de programación ofrecen flexibilidad de programación, ya que consisten en una interfaz y un runtime para administrar los recursos e hilos subyacentes. En esta tesis, mejoramos la programación con MPBT al proporcionar tres nuevos planificadores de tareas para AMCs. Estos planificadores dinámicos reducen el tiempo total de ejecución ya sea detectando la camino más largo o el camino crítico del grafo de dependencia de tareas de la aplicación, que es generado dinámicamente. En nuestra evaluación, comparamos estos planificadores con un planificador heterogéneo existente y demonstramos su mejora sobre un planificador FIFO. Mostramos que los planificadores heterogéneos mejoran el planificador FIFO en hasta 1.45x en un AMC real de 8 núcleos y hasta 2.1x en un AMC simulado de 32 núcleos. Otra contribución en los MPBT es la adaptabilidad al paralelismo de grano fino. El creciente número de núcleos en los chip multinúcleos modernos está empujando la investigación hacia el uso de cargas de trabajo de grano fino, que es un desafío importante para los MPBT. Nuestro estudio observa que la creación de tareas bloquea la ejecución con cargas de trabajo de grano fino con MPBT. Cuando el número de núcleos aumenta, el tiempo empleado en generar tareas pasa a ser más crítico para toda la ejecución. Nuestra solución es TaskGenX, que minimiza los costes de creación de tareas y se basa en una extensión del runtime y en un hardware dedicado. En el runtime, TaskGenX desacopla la creación de tareas de las otras actividades del runtime, ejecutando esta actividad en un hardware especializado. Evaluamos 11 aplicaciones de HPC con TaskGenX en sistemas simétricos y AMC y obtenemos mejoras de rendimiento de hasta 15x, con un promedio de 3.1x sobre la implementación de referencia. Finalmente, esta tesis presenta un planificador de CPU con el objetivo de aumentar los fotogramas por segundo (FPS) para juegos en dispositivos móviles con sistemas AMC. Diseñamos e implementamos el planificador de Real-Time Scheduler (RTS) en Android. El RTS proporciona una política de programación eficiente que tiene en cuenta la temperatura actual del sistema para realizar la migración de tareas. La solución RTS aumenta la FPS mediana de los mecanismos de referenciaPostprint (published version
    • …
    corecore