235 research outputs found

    Performance analysis methods for understanding scaling bottlenecks in multi-threaded applications

    Get PDF
    In dit proefschrift stellen we drie nieuwe methodes voor om de prestatie van meerdradige programma's te analyseren. Onze eerste methode, criticality stacks, is bruikbaar voor het analyseren van onevenwicht tussen draden. Om deze stacks te construeren stellen we een nieuwe criticaliteitsmetriek voor, die de uitvoeringstijd van een applicatie opsplitst in een deel voor iedere draad. Hoe groter dit deel is voor een draad, hoe kritischer deze draad is voor de applicatie. De tweede methode, bottle graphs, stelt iedere draad van een meerdradig programma voor als een rechthoek in een grafiek. De hoogte van de rechthoek wordt berekend door middel van onze criticaliteitsmetriek, en de breedte stelt het parallellisme van een draad voor. Rechthoeken die bovenaan in de grafiek zitten, als het ware in de hals van de fles, hebben een beperkt parallellisme, waardoor we ze beschouwen als “bottlenecks” voor de applicatie. Onze derde methode, speedup stacks, toont de bereikte speedup van een applicatie en de verschillende componenten die speedup beperken in een gestapelde grafiek. De intuïtie achter dit concept is dat door het reduceren van de invloed van een bepaalde component, de speedup van een applicatie proportioneel toeneemt met de grootte van die component in de stapel

    Hardware thread scheduling algorithms for single-ISA asymmetric CMPs

    Get PDF
    Through the past several decades, based on the Moore's law, the semiconductor industry was doubling the number of transistors on the single chip roughly every eighteen months. For a long time this continuous increase in transistor budget drove the increase in performance as the processors continued to exploit the instruction level parallelism (ILP) of the sequential programs. This pattern hit the wall in the early years of the twentieth century when designing larger and more complex cores became difficult because of the power and complexity reasons. Computer architects responded by integrating many cores on the same die thereby creating Chip Multicore Processors (CMP). In the last decade, the computing technology experienced tremendous developments, Chip Multiprocessors (CMP) expanded from the symmetric and homogeneous to the asymmetric or heterogeneous Multiprocessors. Having cores of different types in a single processor enables optimizing performance, power and energy efficiency for a wider range of workloads. It enables chip designers to employ specialization (that is, we can use each type of core for the type of computation where it delivers the best performance/energy trade-off). The benefits of Asymmetric Chip Multiprocessors (ACMP) are intuitive as it is well known that different workloads have different resource requirements. The CMPs improve the performance of applications by exploiting the Thread Level Parallelism (TLP). Parallel applications relying on multiple threads must be efficiently managed and dispatched for execution if the parallelism is to be properly exploited. Since more and more applications become multi-threaded we expect to find a growing number of threads executing on a machine. Consequently, the operating system will require increasingly larger amounts of CPU time to schedule these threads efficiently. Thus, dynamic thread scheduling techniques are of paramount importance in ACMP designs since they can make or break performance benefits derived from the asymmetric hardware or parallel software. Several thread scheduling methods have been proposed and applied to ACMPs. In this thesis, we first study the state of the art thread scheduling techniques and identify the main reasons limiting the thread level parallelism in an ACMP systems. We propose three novel approaches to schedule and manage threads and exploit thread level parallelism implemented in hardware, instead of perpetuating the trend of performing more complex thread scheduling in the operating system. Our first goal is to improve the performance of an ACMP systems by improving thread scheduling at the hardware level. We also show that the hardware thread scheduling reduces the energy consumption of an ACMP systems by allowing better utilization of the underlying hardware.A través de las últimas décadas, con base en la ley de Moore, la industria de semiconductores duplica el número de transistores en el chip alrededor de una vez cada dieciocho meses. Durante mucho tiempo, este aumento continuo en el número de transistores impulsó el aumento en el rendimiento de los procesadores solo explotando el paralelismo a nivel de instrucción (ILP) y el aumento de la frecuencia de los procesadores, permitiendo un aumento del rendimiento de los programas secuenciales. Este patrón llego a su limite en los primeros años del siglo XX, cuando el diseño de procesadores más grandes y complejos se convirtió en una tareá difícil debido a las debido al consumo requerido. La respuesta a este problema por parte de los arquitectos fue la integración de muchos núcleos en el mismo chip creando así chip multinúcleo Procesadores (CMP). En la última década, la tecnología de la computación experimentado enormes avances, sobre todo el en chip multiprocesadores (CMP) donde se ha pasado de diseños simetricos y homogeneous a sistemas asimétricos y heterogeneous. Tener núcleos de diferentes tipos en un solo procesador permite optimizar el rendimiento, la potencia y la eficiencia energética para una amplia gama de cargas de trabajo. Permite a los diseñadores de chips emplear especialización (es decir, podemos utilizar un tipo de núcleo diferente para distintos tipos de cálculo dependiendo del trade-off respecto del consumo y rendimiento). Los beneficios de la asimétrica chip multiprocesadores (ACMP) son intuitivos, ya que es bien sabido que diferentes cargas de trabajo tienen diferentes necesidades de recursos. Los CMP mejoran el rendimiento de las aplicaciones mediante la explotación del paralelismo a nivel de hilo (TLP). En las aplicaciones paralelas que dependen de múltiples hilos, estos deben ser manejados y enviados para su ejecución, y el paralelismo se debe explotar de manera eficiente. Cada día hay mas aplicaciones multi-hilo, por lo tanto encotraremos un numero mayor de hilos que se estaran ejecutando en la máquina. En consecuencia, el sistema operativo requerirá cantidades cada vez mayores de tiempo de CPU para organizar y ejecutar estos hilos de manera eficiente. Por lo tanto, las técnicas de optimizacion dinámica para la organizacion de la ejecucion de hilos son de suma importancia en los diseños ACMP ya que pueden incrementar o dsiminuir el rendimiento del hardware asimétrico o del software paralelo. Se han propuesto y aplicado a ACMPs varios métodos de organizar y ejecutar los hilos. En esta tesis, primero estudiamos el estado del arte en las técnicas para la gestionar la ejecucion de los hilos y hemos identificado las principales razones que limitan el paralelismo en sistemas ACMP. Proponemos tres nuevos enfoques para programar y gestionar los hilos y explotar el paralelismo a nivel de hardware, en lugar de perpetuar la tendencia actual de dejar esta gestion cada vez maas compleja al sistema operativo. Nuestro primer objetivo es mejorar el rendimiento de un sistema ACMP mediante la mejora en la gestion de los hilos a nivel de hardware. También mostramos que la gestion del los hilos a nivel de hardware reduce el consumo de energía de un sistemas de ACMP al permitir una mejor utilización del hardware subyacente

    Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming

    Get PDF
    Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware. This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model. This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP -- an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs.~ a serial processor and 8.10x vs.~parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms

    Intelligent Management of Inter-Thread Synchronization Dependencies for Concurrent Programs.

    Full text link
    Power dissipation limits and design complexity have made the microprocessor industry less successful in improving the performance of monolithic processors, even though semiconductor technology continues to scale. Consequently, chip multiprocessors (CMPs) have become a standard for all ranges of computing from cellular phones to high-performance servers. As sufficient thread level parallelism (TLP) is necessary to exploit the computational power provided by CMPs, most performance-aware programmers need to parallelize their programs. For shared memory multi-threaded programs, synchronization mechanisms such as mutexes, barriers, and condition variables, are used to enforce the threads to interact with each other in the way the programmers intended. However, employing synchronization operations in both correct and efficient way at the same time is extremely difficult, and there have been trade-offs between programmability and efficiency of using synchronizations. This thesis proposes a collection of works that increase the programmability and efficiency of concurrent programs by intelligently managing the synchronization operations. First, we focus on mutex locks and unlocks. Many concurrency bug detection tools and automated bug fixers rely on the precise identification of critical sections guarded by lock/unlock operations. We suggest a practical lock/unlock pairing mechanism that combines static analysis with dynamic instrumentation to identify critical sections in POSIX multi-threaded C/C++ programs. Second, we present Dynamic Core Boosting (DCB) to accelerate critical paths in multi-thread programs. Inter-thread dependencies through synchronizations form critical paths. These critical paths are major performance bottlenecks for concurrent programs, and they are exacerbated by workload imbalances in performance asymmetric CMPs. DCB coordinates its compiler, runtime subsystem, and architecture to mitigates such performance bottlenecks. Finally, we propose exploiting synchronization operations for better energy efficiency through dynamic power management.PhDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/108886/1/netforce_1.pd

    Porting the Sisal functional language to distributed-memory multiprocessors

    Get PDF
    Parallel computing is becoming increasingly ubiquitous in recent years. The sizes of application problems continuously increase for solving real-world problems. Distributed-memory multiprocessors have been regarded as a viable architecture of scalable and economical design for building large scale parallel machines. While these parallel machines can provide computational capabilities, programming such large-scale machines is often very difficult due to many practical issues including parallelization, data distribution, workload distribution, and remote memory latency. This thesis proposes to solve the programmability and performance issues of distributed-memory machines using the Sisal functional language. The programs written in Sisal will be automatically parallelized, scheduled and run on distributed-memory multiprocessors with no programmer intervention. Specifically, the proposed approach consists of the following steps. Given a program written in Sisal, the front end Sisal compiler generates a directed acyclic graph(DAG) to expose parallelism in the program. The DAG is partitioned and scheduled based on loop parallelism. The scheduled DAG is then translated to C programs with machine specific parallel constructs. The parallel C programs are finally compiled by the target machine specific compilers to generate executables. A distributed-memory parallel machine, the 80-processor ETL EM-X, has been chosen to perform experiments. The entire procedure has been implemented on the EMX multiprocessor. Four problems are selected for experiments: bitonic sorting, search, dot-product and Fast Fourier Transform. Preliminary execution results indicate that automatic parallelization of the Sisal programs based on loop parallelism is effective. The speedup for these four problems is ranging from 17 to 60 on a 64-processor EM-X. Preliminary experimental results further indicate that programming distributed-memory multiprocessors using a functional language indeed frees the programmers from lowl-evel programming details while allowing them to focus on algorithmic performance improvement
    corecore