Search CORE

50 research outputs found

Understanding and Mitigating Multicore Performance Issues on theAMD Opteron Architecture

Author
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date
Field of study

Memory hierarchies for future HPC architectures

Author: García Flores Víctor
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2017
Field of study

Efficiently managing the memory subsystem of modern multi/manycore architectures is increasingly becoming a challenge as systems grow in complexity and heterogeneity. In the field of high performance computing (HPC) in particular, where massively parallel architectures are used and input sets of several terabytes are common, careful management of the memory hierarchy is crucial to exploit the full computing power of these systems. The goal of this thesis is to provide computer architects with valuable information to guide the design of future systems, and in particular of those more widely used in the field of HPC, i.e., symmetric multicore processors (SMPs) and GPUs. With that aim, we present an analysis of some of the inefficiencies and shortcomings of current memory management techniques and propose two novel schemes leveraging the opportunities that arise from the use of new and emerging programming models and computing paradigms. The first contribution of this thesis is a block prefetching mechanism for task-based programming models. Using a task-based programming model simplifies parallel programming and allows for better resource utilization in the supercomputers used in the field of HPC, while enabling sophisticated memory management techniques. The scheme proposed relies on a memory-aware runtime system to guide prefetching while avoiding the main drawbacks of traditional prefetching mechanisms, i.e., cache pollution and lack of timeliness. It leverages the information provided by the user about tasks¿ input data to prefetch contiguous blocks of memory that are certain to be useful. The proposed scheme targets SMPs with large cache hierarchies and uses heuristics to dynamically decide the best cache level to prefetch into without evicting useful data. The focus of this thesis then turns to heterogeneous architectures combining GPUs and traditional multicore processors. The current trend towards tighter coupling of GPU and CPU enables new collaborative computations that tax the memory subsystem in a different manner than previous heterogeneous computations did, and requires careful analysis to understand the trade-offs that are to be expected when designing future memory organizations. The second contribution is an in-depth analysis on the impact of sharing the last-level cache between GPU and CPU cores on a system where the GPU is integrated on the same die as the CPU. The analysis focuses on the effect that a shared cache can have on collaborative computations where GPU and CPU threads concurrently work on a problem and share data at fine granularities. The results presented here show that sharing the last-level cache is largely beneficial as it allows for better resource utilization. In addition, the evaluation shows that collaborative computations benefit significantly from the faster CPU-GPU communication and higher cache hit rates that a shared cache level provides. The final contribution of this thesis analyzes the inefficiencies and drawbacks of demand paging as currently implemented in discrete GPUs by NVIDIA. Then, it proposes a novel memory organization and dynamic migration scheme that allows for efficient data sharing between GPU and CPU, specially when executing collaborative computations where data is migrated back and forth between the two separate memories. This scheme migrates data at cache line granularities transparently to the user and operating system, avoiding false sharing and the unnecessary data transfers that occur with demand paging. The results show that the proposed scheme is able to outperform the baseline system by reducing the migration latency of data that is copied multiple times between the two memories. In addition, analysis of different interconnect latencies shows that fine-grained data sharing between GPU and CPU is feasible as long as future interconnect technologies achieve four to five times lower round-trip times than PCI-Express 3.0.La gestión eficiente del subsistema de memoria se ha convertido en un problema complejo a la vez que los sistemas crecen en complejidad y heterogeneidad. En el campo de la computación de altas prestaciones (HPC) en particular, donde arquitecturas masivamente paralelas son usadas y entradas de varios terabytes son comunes, una gestión cuidadosa de la jerarquía de memoria es crucial para conseguir explotar todo el potencial de estas arquitecturas. El objetivo de esta tesis es proporcionar a los arquitectos de computadores información valiosa para el diseño de los sistemas del futuro, y en concreto de los más comúnmente usados en el campo de HPC, los procesadores multinúcleo simétricos (SMP) y las tarjetas gráficas (GPU). Para ello, presentamos un análisis de las ineficiencias y los inconvenientes de los sistemas de gestión de memoria actuales, y proponemos dos técnicas nuevas que aprovechan las oportunidades surgidas del uso de nuevos y emergentes modelos de programación y paradigmas de computación. La primera contribución de esta tesis es un mecanismo de prefetch de bloques para modelos de programación basados en tareas. Usando modelos de programación orientados a tareas simplifica la programación paralela y permite hacer un mejor uso de los recursos en los supercomputadores usados en HPC, mientras permiten el uso de sofisticados mecanismos de gestión de memoria. La técnica propuesta se basa en un sistema de runtime para guiar el prefetch de datos mientras evita los principales inconvenientes tradicionalmente asociados con prefetching, la polución de cache y la medida incorrecta de los tiempos. El mecanismo utiliza la información sobre las entradas de las tareas proporcionada por el usuario para prefetchear bloques contiguos de memoria sobre los que hay certeza que serán utilizados. El mecanismo está dirigido a arquitecturas SMP con amplias jerarquías de cache, y usa heurísticas para decidir dinámicamente en qué nivel de caché colocar los datos sin desplazar datos útiles. El focus de la tesis gira luego a arquitecturas heterogéneas que combinan GPUs con procesadores multinúcleo tradicionales. La actual tendencia a unir GPU y CPU permite el uso de una nueva serie de computaciones colaborativas que afectan al subsistema de memoria de forma diferente que las computaciones heterogéneas anteriores, y requiere de un cuidadoso análisis para entender las consecuencias que esto tiene en el diseño de las organizaciones de memoria futuras. La segunda contribución de la tesis es un análisis detallado del impacto que supone compartir el último nivel de cache entre núcleos de GPU y CPU en sistemas donde la GPU está integrada en el mismo chip que la CPU. El análisis se centra en el efecto que la cache compartida tiene en colaboraciones colaborativas donde hilos de GPU y CPU trabajan concurrentemente en un problema y comparten datos a grano fino. Los resultados presentados en esta tesis muestran que compartir el último nivel de cache es mayormente beneficioso ya que permite un mejor uso de los recursos. Además, la evaluación muestra que las computaciones colaborativas se benefician en gran medida de la comunicación más rápida entre GPU y CPU y las mayores tasas de acierto de cache que un nivel de cache compartido proporcionan

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Leveraging performance of 3D finite difference schemes in large scientific computing simulations

Author: De la Cruz Raúl
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2015
Field of study

Gone are the days when engineers and scientists conducted most of their experiments empirically. During these decades, actual tests were carried out in order to assess the robustness and reliability of forthcoming product designs and prove theoretical models. With the advent of the computational era, scientific computing has definetely become a feasible solution compared with empirical methods, in terms of effort, cost and reliability. Large and massively parallel computational resources have reduced the simulation execution times and have improved their numerical results due to the refinement of the sampled domain. Several numerical methods coexist for solving the Partial Differential Equations (PDEs). Methods such as the Finite Element (FE) and the Finite Volume (FV) are specially well suited for dealing with problems where unstructured meshes are frequent. Unfortunately, this flexibility is not bestowed for free. These schemes entail higher memory latencies due to the handling of irregular data accesses. Conversely, the Finite Difference (FD) scheme has shown to be an efficient solution for problems where the structured meshes suit the domain requirements. Many scientific areas use this scheme due to its higher performance. This thesis focuses on improving FD schemes to leverage the performance of large scientific computing simulations. Different techniques are proposed such as the Semi-stencil, a novel algorithm that increases the FLOP/Byte ratio for medium- and high-order stencils operators by reducing the accesses and endorsing data reuse. The algorithm is orthogonal and can be combined with techniques such as spatial- or time-blocking, adding further improvement. New trends on Symmetric Multi-Processing (SMP) systems -where tens of cores are replicated on the same die- pose new challenges due to the exacerbation of the memory wall problem. In order to alleviate this issue, our research is focused on different strategies to reduce pressure on the cache hierarchy, particularly when different threads are sharing resources due to Simultaneous Multi-Threading (SMT). Several domain decomposition schedulers for work-load balance are introduced ensuring quasi-optimal results without jeopardizing the overall performance. We combine these schedulers with spatial-blocking and auto-tuning techniques, exploring the parametric space and reducing misses in last level cache. As alternative to brute-force methods used in auto-tuning, where a huge parametric space must be traversed to find a suboptimal candidate, performance models are a feasible solution. Performance models can predict the performance on different architectures, selecting suboptimal parameters almost instantly. In this thesis, we devise a flexible and extensible performance model for stencils. The proposed model is capable of supporting multi- and many-core architectures including complex features such as hardware prefetchers, SMT context and algorithmic optimizations. Our model can be used not only to forecast execution time, but also to make decisions about the best algorithmic parameters. Moreover, it can be included in run-time optimizers to decide the best SMT configuration based on the execution environment. Some industries rely heavily on FD-based techniques for their codes. Nevertheless, many cumbersome aspects arising in industry are still scarcely considered in academia research. In this regard, we have collaborated in the implementation of a FD framework which covers the most important features that an HPC industrial application must include. Some of the node-level optimization techniques devised in this thesis have been included into the framework in order to contribute in the overall application performance. We show results for a couple of strategic applications in industry: an atmospheric transport model that simulates the dispersal of volcanic ash and a seismic imaging model used in Oil & Gas industry to identify hydrocarbon-rich reservoirs.Atrás quedaron los días en los que ingenieros y científicos realizaban sus experimentos empíricamente. Durante esas décadas, se llevaban a cabo ensayos reales para verificar la robustez y fiabilidad de productos venideros y probar modelos teóricos. Con la llegada de la era computacional, la computación científica se ha convertido en una solución factible comparada con métodos empíricos, en términos de esfuerzo, coste y fiabilidad. Los supercomputadores han reducido el tiempo de las simulaciones y han mejorado los resultados numéricos gracias al refinamiento del dominio. Diversos métodos numéricos coexisten para resolver las Ecuaciones Diferenciales Parciales (EDPs). Métodos como Elementos Finitos (EF) y Volúmenes Finitos (VF) están bien adaptados para tratar problemas donde las mallas no estructuradas son frecuentes. Desafortunadamente, esta flexibilidad no se confiere de forma gratuita. Estos esquemas conllevan latencias más altas debido al acceso irregular de datos. En cambio, el esquema de Diferencias Finitas (DF) ha demostrado ser una solución eficiente cuando las mallas estructuradas se adaptan a los requerimientos. Esta tesis se enfoca en mejorar los esquemas DF para impulsar el rendimiento de las simulaciones en la computación científica. Se proponen diferentes técnicas, como el Semi-stencil, un nuevo algoritmo que incrementa el ratio de FLOP/Byte para operadores de stencil de orden medio y alto reduciendo los accesos y promoviendo el reuso de datos. El algoritmo es ortogonal y puede ser combinado con técnicas como spatial- o time-blocking, añadiendo mejoras adicionales. Las nuevas tendencias hacia sistemas con procesadores multi-simétricos (SMP) -donde decenas de cores son replicados en el mismo procesador- plantean nuevos retos debido a la exacerbación del problema del ancho de memoria. Para paliar este problema, nuestra investigación se centra en estrategias para reducir la presión en la jerarquía de cache, particularmente cuando diversos threads comparten recursos debido a Simultaneous Multi-Threading (SMT). Introducimos diversos planificadores de descomposición de dominios para balancear la carga asegurando resultados casi óptimos sin poner en riesgo el rendimiento global. Combinamos estos planificadores con técnicas de spatial-blocking y auto-tuning, explorando el espacio paramétrico y reduciendo los fallos en la cache de último nivel. Como alternativa a los métodos de fuerza bruta usados en auto-tuning donde un espacio paramétrico se debe recorrer para encontrar un candidato, los modelos de rendimiento son una solución factible. Los modelos de rendimiento pueden predecir el rendimiento en diferentes arquitecturas, seleccionando parámetros suboptimos casi de forma instantánea. En esta tesis, ideamos un modelo de rendimiento para stencils flexible y extensible. El modelo es capaz de soportar arquitecturas multi-core incluyendo características complejas como prefetchers, SMT y optimizaciones algorítmicas. Nuestro modelo puede ser usado no solo para predecir los tiempos de ejecución, sino también para tomar decisiones de los mejores parámetros algorítmicos. Además, puede ser incluido en optimizadores run-time para decidir la mejor configuración SMT. Algunas industrias confían en técnicas DF para sus códigos. Sin embargo, no todos los aspectos que aparecen en la industria han sido sometidos a investigación. En este aspecto, hemos diseñado e implementado desde cero una infraestructura DF que cubre las características más importantes que una aplicación industrial debe incluir. Algunas de las técnicas de optimización propuestas en esta tesis han sido incluidas para contribuir en el rendimiento global a nivel industrial. Mostramos resultados de un par de aplicaciones estratégicas para la industria: un modelo de transporte atmosférico que simula la dispersión de ceniza volcánica y un modelo de imagen sísmica usado en la industria del petroleo y gas para identificar reservas ricas en hidrocarburo

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Energy-Efficient Acceleration of Asynchronous Programs.

Author: Chadha Gaurav
Publication venue
Publication date
Field of study

Asynchronous or event-driven programming has become the dominant programming model in the last few years. In this model, computations are posted as events to an event queue from where they get processed asynchronously by the application. A huge fraction of computing systems built today use asynchronous programming. All the Web 2.0 JavaScript applications (e.g., Gmail, Facebook) use asynchronous programming. There are now more than two million mobile applications available between the Apple App Store and Google Play, which are all written using asynchronous programming. Distributed servers (e.g., Twitter, LinkedIn, PayPal) built using actor-based languages (e.g., Scala) and platforms such as node.js rely on asynchronous events for scalable communication. Internet-of-Things (IoT), embedded systems, sensor networks, desktop GUI applications, etc., all rely on the asynchronous programming model. Despite the ubiquity of asynchronous programs, their unique execution characteristics have been largely ignored by conventional processor architectures, which have remained heavily optimized for synchronous programs. Asynchronous programs are characterized by short events executing varied tasks. This results in a large instruction footprint with little cache locality, severely degrading cache performance. Also, event execution has few repeatable patterns causing poor branch prediction. This thesis proposes novel processor optimizations exploiting the unique execution characteristics of asynchronous programs for performance optimization and energy-efficiency. These optimizations are designed to make the underlying hardware aware of discrete events and thereafter, exploit the latent Event-Level Parallelism present in these applications. Through speculative pre-execution of future events, cache addresses and branch outcomes are recorded and later used for improving cache and branch predictor performance. A hardware instruction prefetcher specialized for asynchronous programs is also proposed as a comparative design direction.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120780/1/gauravc_1.pd

Deep Blue Documents at the University of Michigan

Automatic skeleton-driven performance optimizations for transactional memory

Author: Goes Luís Fabrício Wanderley
Publication venue: The University of Edinburgh
Publication date: 01/01/2012
Field of study

The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse -grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer has still to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races. TM programming thus frees the application from deadlocks and enables the exploitation of coarse grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton -based and transactional programming a natural step to improve programmability since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition to it, a skeleton framework is also amenable to skeleton - driven iii performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance optimizations such as thread mapping and memory page allocation. In order to do it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five applications from the STAMP benchmark show that the OpenSkel framework with the extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %, over a baseline version for a 16 -core UMA platform and up to 162 %, with an average of 91 %, for a 32 -core NUMA platform

Edinburgh Research Archive

Real-Time Prefetching on Shared-Memory Multi-Core Systems

Author: Garside Jamie
Publication venue: University of York
Publication date: 24/07/2015
Field of study

In recent years, there has been a growing trend towards using multi-core processors in real-time systems to cope with the rising computation requirements of real-time tasks. Coupled with this, the rising memory requirements of these tasks pushes demand beyond what can be provided by small, private on-chip caches, requiring the use of larger, slower off-chip memories such as DRAM. Due to the cost, power requirements and complexity of these memories, they are typically shared between all of the tasks within the system. In order for the execution time of these tasks to be bounded, the response time of the memory and the interference from other tasks also needs to be bounded. While there is a great amount of current research on bounding this interference, one popular method is to effectively partition the available memory bandwidth between the processors in the system. Of course, as the number of processors increases, so does the worst-case blocking, and worst-case blocking times quickly increase with the number of processors. It is difficult to further optimise the arbitration scheme; instead, this scaling problem needs to be approached from another angle. Prefetching has previously been shown to improve the execution time of tasks by speculatively issuing memory accesses ahead of time for items which may be useful in the near future, although these prefetchers are typically not used in real-time systems due to their unpredictable nature. Instead, this work presents a framework by which a prefetcher can be safely used alongside a composable memory arbiter, a predictable prefetching scheme, and finally a method by which this predictable prefetcher can be used to improve the worst-case execution time of a running task

White Rose E-theses Online

Characteristics of Web-based textual communications

Author: Küçükyılmaz Tayfun
Publication venue: Bilkent University
Publication date: 01/01/2012
Field of study

Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University 2012.Thesis (Ph. D.) -- Bilkent University, 2012.Includes bibliographical references.In this thesis, we analyze different aspects of Web-based textual communications and argue that all such communications share some common properties. In order to provide practical evidence for the validity of this argument, we focus on two common properties by examining these properties on various types of Web-based textual communications data. These properties are: All Web-based communications contain features attributable to their author and reciever; and all Web-based communications exhibit similar heavy tailed distributional properties. In order to provide practical proof for the validity of our claims, we provide three practical, real life research problems and exploit the proposed common properties of Web-based textual communications to find practical solutions to these problems. In this work, we first provide a feature-based result caching framework for real life search engines. To this end, we mined attributes from user queries in order to classify queries and estimate a quality metric for giving admission and eviction decisions for the query result cache. Second, we analyzed messages of an online chat server in order to predict user and mesage attributes. Our results show that several user- and message-based attributes can be predicted with significant occuracy using both chat message- and writing-style based features of the chat users. Third, we provide a parallel framework for in-memory construction of term partitioned inverted indexes. In this work, in order to minimize the total communication time between processors, we provide a bucketing scheme that is based on term-based distributional properties of Web page contents.Küçükyılmaz, TayfunPh.D

Bilkent University Institutional Repository

Extending the reach of microprocessors : column and curious caching

Author: Chiou Derek
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999
Field of study

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 162-167).by Derek T. Chiou.Ph.D

DSpace@MIT

Architectural Enhancements for Data Transport in Datacenter Systems

Author: Golestani Hossein
Publication venue
Publication date: 01/01/2021
Field of study

Datacenter systems run myriad applications, which frequently communicate with each other and/or Input/Output (I/O) devices—including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, the I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient “Software Data Planes”, which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions. In the first step, I characterize challenges of modern software data planes, which bypass the operating system kernel to avoid associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads of spinning on them limit their potential benefits. Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. Design principles of HyperPlane are: (1) not iterating on empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and enjoying theoretical merits of shared queues. HyperPlane is realized with a programming model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in terms of throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads. Finally, I focus on the data transfer aspect in software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats—namely, direct (regular), indirect (Virtio), and multi-consumer queues. I show that with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd

Deep Blue Documents at the University of Michigan

Runahead threads

Author: Ramírez García Tanausu
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2010
Field of study

Los temas de investigación sobre multithreading han ganado mucho interés en la arquitectura de computadores con la aparición de procesadores multihilo y multinucleo. Los procesadores SMT (Simultaneous Multithreading) son uno de estos nuevos paradigmas, combinando la capacidad de emisión de múltiples instrucciones de los procesadores superscalares con la habilidad de explotar el paralelismo a nivel de hilos (TLP). Así, la principal característica de los procesadores SMT es ejecutar varios hilos al mismo tiempo para incrementar la utilización de las etapas del procesador mediante la compartición de recursos.Los recursos compartidos son el factor clave de los procesadores SMT, ya que esta característica conlleva tratar con importantes cuestiones pues los hilos también compiten por estos recursos en el núcleo del procesador. Si bien distintos grupos de aplicaciones se benefician de disponer de SMT, las diferentes propiedades de los hilos ejecutados pueden desbalancear la asignación de recursos entre los mismos, disminuyendo los beneficios de la ejecución multihilo. Por otro lado, el problema con la memoria está aún presente en los procesadores SMT. Estos procesadores alivian algunos de los problemas de latencia provocados por la lentitud de la memoria con respecto a la CPU. Sin embargo, hilos con grandes cargas de trabajo y con altas tasas de fallos en las caches son unas de las mayores dificultades de los procesadores SMT. Estos hilos intensivos en memoria tienden a crear importantes problemas por la contención de recursos. Por ejemplo, pueden llegar a bloquear recursos críticos debido a operaciones de larga latencia impidiendo no solo su ejecución, sino el progreso de la ejecución de los otros hilos y, por tanto, degradando el rendimiento general del sistema.El principal objetivo de esta tesis es aportar soluciones novedosas a estos problemas y que mejoren el rendimiento de los procesadores SMT. Para conseguirlo, proponemos los Runahead Threads (RaT) aplicando una ejecución especulativa basada en runahead. RaT es un mecanismo alternativo a las políticas previas de gestión de recursos las cuales usualmente restringían a los hilos intensivos en memoria para conseguir más productividad.La idea clave de RaT es transformar un hilo intensivo en memoria en un hilo ligero en el uso de recursos que progrese especulativamente. Así, cuando un hilo sufre de un acceso de larga latencia, RaT transforma dicho hilo en un hilo de runahead mientras dicho fallo está pendiente. Los principales beneficios de esta simple acción son varios. Mientras un hilo está en runahead, éste usa los diferentes recursos compartidos sin monopolizarlos o limitarlos con respecto a los otros hilos. Al mismo tiempo, esta ejecución especulativa realiza prebúsquedas a memoria que se solapan con el fallo principal, por tanto explotando el paralelismo a nivel de memoria y mejorando el rendimiento.RaT añade muy poco hardware extra y complejidad en los procesadores SMT con respecto a su implementación. A través de un mecanismo de checkpoint y lógica de control adicional, podemos dotar a los contextos hardware con la capacidad de ejecución en runahead. Por medio de RaT, contribuímos a aliviar simultaneamente dos problemas en el contexto de los procesadores SMT. Primero, RaT reduce el problema de los accesos de larga latencia en los SMT mediante el paralelismo a nivel de memoria (MLP). Un hilo prebusca datos en paralelo en vez de estar parado debido a un fallo de L2 mejorando su rendimiento individual. Segundo, RaT evita que los hilos bloqueen recursos bajo fallos de larga latencia. RaT asegura que el hilo intensivo en memoria recicle más rápido los recursos compartidos que usa debido a la naturaleza de la ejecución especulativa.La principal limitación de RaT es que los hilos especulativos pueden ejecutar instrucciones extras cuando no realizan prebúsqueda e innecesariamente consumir recursos de ejecución en el procesador SMT. Este inconveniente resulta en hilos de runahead ineficientes pues no contribuyen a la ganancia de rendimiento e incrementan el consumo de energía debido al número extra de instrucciones especulativas. Por consiguiente, en esta tesis también estudiamos diferentes soluciones dirigidas a solventar esta desventaja del mecanismo RaT. El resultado es un conjunto de soluciones complementarias para mejorar la eficiencia de RaT en términos de consumo de potencia y gasto energético.Por un lado, mejoramos la eficiencia de RaT aplicando ciertas técnicas basadas en el análisis semántico del código ejecutado por los hilos en runahead. Proponemos diferentes técnicas que analizan y controlan la utilidad de ciertos patrones de código durante la ejecución en runahead. Por medio de un análisis dinámico, los hilos en runahead supervisan la utilidad de ejecutar los bucles y subrutinas dependiendo de las oportunidades de prebúsqueda. Así, RaT decide cual de estas estructuras de programa ejecutar dependiendo de la información de utilidad obtenida, decidiendo entre parar o saltar el bucle o la subrutina para reducir el número de las instrucciones no útiles. Entre las técnicas propuestas, conseguimos reducir las instrucciones especulativas y la energía gastada mientras obtenemos rendimientos similares a la técnica RaT original.Por otro lado, también proponemos lo que denominamos hilos de runahead eficientes. Esta propuesta se basa en una técnica más fina que cubre todo el rango de ejecución en runahead, independientemente de las características del programa ejecutado. La idea principal es averiguar "cuando" y "durante cuanto" un hilo en runahead debe ser ejecutado prediciendo lo que denominamos distancia útil de runahead. Los resultados muestran que la mejor de estas propuestas basadas en la predicción de la distancia de runahead reducen significativamente el número de instrucciones extras así como también el consumo de potencia. Asimismo, conseguimos mantener los beneficios de rendimiento de los hilos en runahead, mejorando de esta forma la eficiencia energética de los procesadores SMT usando el mecanismo RaT.La evolución de RaT desarrollada durante toda esta investigación nos proporciona no sólo una propuesta orientada a un mayor rendimiento sino también una forma eficiente de usar los recursos compartidos en los procesadores SMT en presencia de operaciones de memoria de larga latencia.Dado que los diseños SMT en el futuro estarán orientados a optimizar una combinación de rendimiento individual en las aplicaciones, la productividad y el consumo de energía, los mecanismos basados en RaT aquí propuestos son interesantes opciones que proporcionan un mejor balance de rendimiento y energía que las propuestas previas en esta área.Research on multithreading topics has gained a lot of interest in the computer architecture community due to new commercial multithreaded and multicore processors. Simultaneous Multithreading (SMT) is one of these relatively new paradigms, which combines the multiple instruction issue features of superscalar processors with the ability of multithreaded architectures to exploit thread level parallelism (TLP). The main feature of SMT processors is to execute multiple threads that increase the utilization of the pipeline by sharing many more resources than in other types of processors.Shared resources are the key of simultaneous multithreading, what makes the technique worthwhile.This feature also entails important challenges to deal with because threads also compete for resources in the processor core. On the one hand, although certain types and mixes of applications truly benefit from SMT, the different features of threads can unbalance the resource allocation among threads, diminishing the benefit of multithreaded execution. On the other hand, the memory wall problem is still present in these processors. SMT processors alleviate some of the latency problems arisen by main memory's slowness relative to the CPUs. Nevertheless, threads with high cache miss rates that use large working sets are one of the major pitfalls of SMT processors. These memory intensive threads tend to use processor and memory resources poorly creating the highest resource contention problems. Memory intensive threads can clog up shared resources due to long latency memory operations without making progress on a SMT processor, thereby hindering overall system performance.The main goal of this thesis is to alleviate these shortcomings on SMT scenarios. To accomplish this, the key contribution of this thesis is the application of the paradigm of Runahead execution in the design of multithreaded processors by Runahead Threads (RaT). RaT shows to be a promising alternative to prior SMT resource management mechanisms which usually restrict memory bound threads in order to get higher throughputs.The idea of RaT is to transform a memory intensive thread into a light-consumer resource thread by allowing that thread to progress speculatively. Therefore, as soon as a thread undergoes a long latency load, RaT transforms the thread to a runahead thread while it has that long latency miss outstanding. The main benefits of this simple action performed by RaT are twofold. While being a runahead thread, this thread uses the different shared resources without monopolizing or limiting the available resources for other threads. At the same time, this fast speculative thread issues prefetches that overlap other memory accesses with the main miss, thereby exploiting the memory level parallelism.Regarding implementation issues, RaT adds very little extra hardware cost and complexity to an existing SMT processor. Through a simple checkpoint mechanism and little additional control logic, we can equip the hardware contexts with the runahead thread capability. Therefore, by means of runahead threads, we contribute to alleviate simultaneously the two shortcomings in the context of SMT processor improving the performance. First, RaT alleviates the long latency load problem on SMT processors by exposing memory level parallelism (MLP). A thread prefetches data in parallel (if MLP is available) improving its individual performance rather than be stalled on an L2 miss. Second, RaT prevents threads from clogging resources on long latency loads. RaT ensures that the L2-missing thread recycles faster the shared resources it uses by the nature of runahead speculative execution. This avoids memory intensive threads clogging the important processor resources up.The main limitation of RaT though is that runahead threads can execute useless instructions and unnecessarily consume execution resources on the SMT processor when there is no prefetching to be exploited. This drawback results in inefficient runahead threads which do not contribute to the performance gain and increase dynamic energy consumption due to the number of extra speculatively executed instructions. Therefore, we also propose different solutions aimed at this major disadvantage of the Runahead Threads mechanism. The result of the research on this line is a set of complementary solutions to enhance RaT in terms of power consumption and energy efficiency.On the one hand, code semantic-aware Runahead threads improve the efficiency of RaT using coarse-grain code semantic analysis at runtime. We provide different techniques that analyze the usefulness of certain code patterns during runahead thread execution. The code patterns selected to perform that analysis are loops and subroutines. By means of the proposed coarse grain analysis, runahead threads oversee the usefulness of loops or subroutines depending on the prefetches opportunities during their executions. Thus, runahead threads decide which of these particular program structures execute depending on the obtained usefulness information, deciding either stall or skip the loop or subroutine executions to reduce the number of useless runahead instructions. Some of the proposed techniques reduce the speculative instruction and wasted energy while achieving similar performance to RaT.On the other hand, the efficient Runahead thread proposal is another contribution focused on improving RaT efficiency. This approach is based on a generic technique which covers all runahead thread executions, independently of the executed program characteristics as code semantic-aware runahead threads are. The key idea behind this new scheme is to find out --when' and --how long' a thread should be executed in runahead mode by predicting the useful runahead distance. The results show that the best of these approaches based on the runahead distance prediction significantly reduces the number of extra speculative instructions executed in runahead threads, as well as the power consumption. Likewise, it maintains the performance benefits of the runahead threads, thereby improving the energy-efficiency of SMT processors using the RaT mechanism.The evolution of Runahead Threads developed in this research provides not only a high performance but also an efficient way of using shared resources in SMT processors in the presence of long latency memory operations. As designers of future SMT systems will be increasingly required to optimize for a combination of single thread performance, total throughput, and energy consumption, RaT-based mechanisms are promising options that provide better performance and energy balance than previous proposals in the field

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Secretaría de Estado de Cultura