34 research outputs found

    Complementing user-level coarse-grain parallelism with implicit speculative parallelism

    Get PDF
    Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Parallel programming is thus here to stay, and programmers have to embrace it if they are to exploit such systems for their applications. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort and performance gain. Some parallel applications show limited or no scaling beyond a certain number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This thesis proposes using the cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocols that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. We show that complementing parallel programs with implicit speculative mechanisms offers significant performance improvements for a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees the programmer from the additional effort of explicitly partitioning the work into finer and properly synchronized tasks. Our results show that, for a many-core comprising 128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 44% on average for the 4-core cluster and by 31% on average for the 2-core cluster. We also show that this approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of a static oracle selection of threads. To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. We evaluate the amenability of the proposed explicit-plus-implicit threads scheme to traditional power management techniques for multithreaded applications and identify room for improvement. We thus augment prior schemes and introduce a novel multithreaded power management scheme that accounts for implicit threads and aims to minimize the Energy·Delay² (ED²) product. Our scheme comprises two components: a “local” component that tries to adapt to the different program phases on a per-explicit-thread basis, taking into account implicit thread behavior, and a “global” component that augments the local components with information regarding inter-thread synchronization. Experimental results show a reduction in ED² of 8% compared to having no power management, with an average reduction in power of 15% that comes at a minimal performance loss of less than 3% on average.
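    Since ED² = E · t² and E = P · t, minimizing ED² amounts to minimizing P · t³, i.e., runtime is weighted cubically against power. As a minimal illustration of metric-driven DVFS selection (the operating points below are made-up numbers, not the thesis's power manager), the following C sketch picks the frequency setting with the lowest ED²:

        #include <stdio.h>

        /* Hypothetical DVFS operating points for one program phase:
         * frequency, power draw, and predicted runtime (illustrative). */
        typedef struct { double freq_ghz, power_w, time_s; } op_point;

        int main(void) {
            op_point pts[] = {
                {1.0, 20.0, 12.0},
                {1.5, 32.0,  9.0},
                {2.0, 50.0,  7.5},
            };
            int n = 3, best = 0;
            double best_ed2 = -1.0;
            for (int i = 0; i < n; i++) {
                double energy = pts[i].power_w * pts[i].time_s;        /* E = P * t     */
                double ed2 = energy * pts[i].time_s * pts[i].time_s;   /* ED2 = E * t^2 */
                printf("%.1f GHz: E = %5.1f J, ED2 = %8.1f J*s^2\n",
                       pts[i].freq_ghz, energy, ed2);
                if (best_ed2 < 0 || ed2 < best_ed2) { best_ed2 = ed2; best = i; }
            }
            printf("ED2-optimal operating point: %.1f GHz\n", pts[best].freq_ghz);
            return 0;
        }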

    Hardware thread scheduling algorithms for single-ISA asymmetric CMPs

    Get PDF
    Over the past several decades, in line with Moore's law, the semiconductor industry has doubled the number of transistors on a single chip roughly every eighteen months. For a long time this continuous increase in transistor budget drove the increase in performance, as processors continued to exploit the instruction-level parallelism (ILP) of sequential programs. This pattern hit a wall in the early years of the twenty-first century, when designing larger and more complex cores became difficult for power and complexity reasons. Computer architects responded by integrating many cores on the same die, thereby creating Chip Multicore Processors (CMP). In the last decade, computing technology experienced tremendous developments, and Chip Multiprocessors (CMP) expanded from symmetric and homogeneous to asymmetric or heterogeneous multiprocessors. Having cores of different types in a single processor enables optimizing performance, power, and energy efficiency for a wider range of workloads. It enables chip designers to employ specialization (that is, each type of core can be used for the type of computation where it delivers the best performance/energy trade-off). The benefits of Asymmetric Chip Multiprocessors (ACMP) are intuitive, as it is well known that different workloads have different resource requirements. CMPs improve the performance of applications by exploiting Thread-Level Parallelism (TLP). Parallel applications relying on multiple threads must be efficiently managed and dispatched for execution if the parallelism is to be properly exploited. Since more and more applications become multi-threaded, we expect to find a growing number of threads executing on a machine. Consequently, the operating system will require increasingly large amounts of CPU time to schedule these threads efficiently. Thus, dynamic thread scheduling techniques are of paramount importance in ACMP designs, since they can make or break the performance benefits derived from the asymmetric hardware or parallel software. Several thread scheduling methods have been proposed and applied to ACMPs. In this thesis, we first study state-of-the-art thread scheduling techniques and identify the main reasons limiting thread-level parallelism in ACMP systems. We propose three novel approaches to schedule and manage threads and exploit thread-level parallelism implemented in hardware, instead of perpetuating the trend of performing ever more complex thread scheduling in the operating system. Our first goal is to improve the performance of ACMP systems by improving thread scheduling at the hardware level. We also show that hardware thread scheduling reduces the energy consumption of ACMP systems by allowing better utilization of the underlying hardware.
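    To illustrate asymmetry-aware scheduling (a common software-level heuristic shown only as a sketch, not the hardware mechanism this thesis proposes), the C fragment below maps the threads with the highest estimated big-core speedup onto the big cores, based on sampled per-core-type IPC:

        #include <stdio.h>
        #include <stdlib.h>

        /* Illustrative mapping heuristic: threads are ranked by the speedup
         * they are estimated to gain on a big core (IPC_big / IPC_little;
         * the sampled values below are made up); top-ranked threads get
         * the big cores. */
        typedef struct { int id; double ipc_big, ipc_little; } thread_info;

        static int by_speedup_desc(const void *a, const void *b) {
            const thread_info *x = a, *y = b;
            double sx = x->ipc_big / x->ipc_little;
            double sy = y->ipc_big / y->ipc_little;
            return (sx < sy) - (sx > sy);   /* descending order */
        }

        int main(void) {
            thread_info t[] = {
                {0, 2.4, 1.0}, {1, 1.3, 1.1}, {2, 3.0, 1.2}, {3, 1.8, 1.5},
            };
            int n = 4, big_cores = 2;
            qsort(t, n, sizeof t[0], by_speedup_desc);
            for (int i = 0; i < n; i++)
                printf("thread %d -> %s core\n", t[i].id,
                       i < big_cores ? "big" : "little");
            return 0;
        }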

    Quantitative Performance Analysis of the SPEC OMPM2001 Benchmarks

    Get PDF

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    Get PDF
    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With the growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for the systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g., due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are now part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause, as well as the critical-path share of each program region. It thereby determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their causes, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated to show the applicability of the analysis.
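    To give a flavor of the pattern-based detection, the C sketch below checks event pairs for one classic wait-state pattern, the "late sender": if a receive is posted before the matching send starts, the gap between the two timestamps is charged as waiting time (the trace records and timestamps are made up for illustration):

        #include <stdio.h>

        /* One matched send/receive pair from an event trace: the timestamps
         * at which the send and the receive operations started. */
        typedef struct { double send_start, recv_start; } msg_event;

        int main(void) {
            msg_event trace[] = {
                {10.0, 8.5},   /* recv posted at 8.5, send at 10.0 -> late sender */
                { 4.0, 6.0},   /* send precedes recv -> no waiting time           */
            };
            double total_wait = 0.0;
            for (int i = 0; i < 2; i++) {
                double wait = trace[i].send_start - trace[i].recv_start;
                if (wait > 0) {
                    printf("msg %d: late sender, %.1f s waiting time\n", i, wait);
                    total_wait += wait;
                }
            }
            printf("total waiting time attributed: %.1f s\n", total_wait);
            return 0;
        }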

    (Korean-language title; original characters lost in encoding)

    Get PDF
    Hardware with advanced functionalities and/or improved performance and efficiency has been introduced into modern computer systems. However, several challenges come with such emerging hardware. First, the characteristics of emerging hardware are unknown, yet deriving useful properties through characterization studies is hard because emerging hardware affects applications with different characteristics differently. Second, sole use of emerging hardware is suboptimal, but coordinating emerging hardware with other techniques is hard due to the large and complex system state space. To address these problems, we first conduct in-depth characterization studies of emerging hardware based on applications with various characteristics. Guided by the observations from our characterization studies, we propose a set of system software techniques to effectively leverage emerging hardware. These techniques combine emerging hardware with other techniques to improve the performance, efficiency, and fairness of computer systems, based on efficient optimization algorithms. First, we investigate system software techniques to effectively manage hardware-based last-level cache (LLC) and memory bandwidth partitioning functionalities. For effective memory bandwidth partitioning on commodity servers, we propose HyPart, a hybrid technique for practical memory bandwidth partitioning on commodity servers. HyPart combines the three widely used memory bandwidth partitioning techniques (i.e., thread packing, clock modulation, and Intel MBA) in a coordinated manner, considering the characteristics of the target applications. We demonstrate the effectiveness of HyPart through quantitative evaluation. We also propose CoPart, coordinated partitioning of LLC and memory bandwidth for fairness-aware workload consolidation on commodity servers. We first characterize the impact of LLC and memory bandwidth partitioning on the performance and fairness of consolidated workloads. Guided by this characterization, we design and implement CoPart. CoPart dynamically profiles the characteristics of the consolidated workloads and partitions LLC and memory bandwidth in a coordinated manner to maximize the fairness of the consolidated workloads. Through quantitative evaluation with various workloads and system configurations, we demonstrate that CoPart significantly improves the overall fairness of the consolidated workloads. Second, we investigate a system software technique to effectively leverage hardware-based power capping functionality. We first characterize the performance impact of the two key system knobs for power capping (i.e., the concurrency level of the target applications and cross-component power allocation). Guided by the characterization results, we design and implement RPPC, a holistic runtime system for maximizing performance under power capping. RPPC dynamically controls the key system knobs in a cooperative manner, considering the characteristics (e.g., scalability and memory intensity) of the target applications. Our evaluation results show that RPPC significantly improves performance under power capping across various application and system configurations. Third, we investigate system software techniques for effective dynamic concurrency control on many-core systems and heterogeneous multiprocessing systems.
We propose RMC, an integrated runtime system for adaptive many-core computing. RMC combines the two widely used dynamic concurrency control techniques (i.e., thread packing and dynamic threading) in a coordinated manner to exploit the advantages of both. RMC quickly controls the concurrency level of the target applications through thread packing to improve performance and efficiency, and further improves them by determining the optimal thread count through dynamic threading. Our quantitative experiments show that RMC outperforms the existing dynamic concurrency control techniques in terms of performance and energy efficiency. In addition, we propose PALM, progress- and locality-aware adaptive task migration for efficient thread packing. We first conduct an in-depth performance analysis of thread packing with various synchronization-intensive benchmarks and system configurations and find the root causes of the performance pathologies of thread packing. Based on the characterization results, we design and implement PALM, which supports both symmetric and heterogeneous multiprocessing systems. For efficient thread packing, PALM solves three key problems: progress-aware task migration, locality-aware task migration, and scheduling-period control. Our quantitative evaluation shows that PALM achieves substantially higher performance and energy efficiency than conventional thread packing. We also present case studies in which PALM considerably improves the efficiency of dynamic server consolidation and the performance under power capping.
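    For reference, the thread packing primitive itself can be sketched at user level with Linux CPU affinity masks (a simplified illustration, not PALM's migration mechanism): all threads are confined to the first `width` cores, and a runtime can grow or shrink `width` on the fly without changing the thread count:

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>
        #include <string.h>

        /* Confine a thread to the first `width` cores (Linux, GNU extension). */
        static int pack_thread(pthread_t tid, int width) {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int cpu = 0; cpu < width; cpu++)
                CPU_SET(cpu, &set);
            return pthread_setaffinity_np(tid, sizeof set, &set);
        }

        int main(void) {
            /* Pack the calling thread onto cores 0-3. */
            int err = pack_thread(pthread_self(), 4);
            if (err)
                fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
            else
                printf("thread packed onto cores 0-3\n");
            return 0;
        }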

    The NASA SBIR product catalog

    Get PDF
    The purpose of this catalog is to assist small business firms in making the community aware of products emerging from their efforts in the Small Business Innovation Research (SBIR) program. It contains descriptions of some products that have advanced into Phase 3 and others that are identified as prospective products. Both lists of products in this catalog are based on information supplied by NASA SBIR contractors in response to an invitation to be represented in this document. Generally, all products suggested by the small firms were included in order to meet the goals of information exchange for SBIR results. Of the 444 SBIR contractors NASA queried, 137 provided information on 219 products. The catalog presents the product information in the technology areas listed in the table of contents. Within each area, the products are listed in alphabetical order by product name and are given identifying numbers. Also included is an alphabetical listing of the companies whose products are described. This listing cross-references the product list and provides information on the business activity of each firm. In addition, there are three indexes: one listing the firms by state, one listing the products by the NASA Center that managed the SBIR project, and one listing the products by the relevant Technical Topic in NASA's annual program solicitation under which each SBIR project was selected.

    Efficient distributed load balancing for parallel algorithms

    Get PDF
    With the advent of massively parallel processing technology, exploiting the power offered by hundreds, or even thousands, of processors is far from a trivial task. Computing with multi-processor, multi-core, or many-core systems adds a number of additional challenges related to the cooperation and communication of multiple processing units. The uneven distribution of data among the various processors, i.e. load imbalance, represents one of the major problems in data-parallel applications. Without good load distribution strategies, we cannot reach good speedup, and thus good efficiency. Load balancing strategies can be classified in several ways, according to the methods used to balance the workload. For instance, dynamic load balancing algorithms make scheduling decisions during execution and commonly result in better performance than static approaches, where task assignment is done before execution. Even more important is the difference between centralized and distributed load balancing approaches. In fact, although centralized algorithms have a wider view of the computation, and hence may exploit smarter balancing techniques, they suffer from global synchronization and communication bottlenecks involving the master node, which does not assure scalability with the number of processors. This dissertation studies the impact of different load balancing strategies. In particular, one of the key observations driving our work is that distributed algorithms work better than centralized ones in the context of load balancing for multi-processors (and likewise for multi-cores and many-cores). We first show a centralized approach for load balancing, then we propose several distributed approaches for problems with different parallelization, workload distribution, and communication patterns. We try to efficiently combine several approaches to improve performance, in particular using predictive metrics to obtain a per-task compute-time estimation, using adaptive subdivision, improving dynamic load balancing, and addressing distributed balancing schemes. The main challenge tackled in this thesis has been to combine all these approaches together in new and efficient load balancing schemes. We assess the proposed balancing techniques, starting from centralized approaches and moving to distributed ones, in distinctive real-case scenarios: mesh-like computation, parallel ray tracing, and agent-based simulations. Moreover, we test our algorithms on parallel hardware such as clusters of workstations and multi-core processors, also exploiting SIMD vector instruction sets. Finally, we conclude the thesis with several remarks about the impact of distributed techniques, the effect of the communication pattern and workload distribution, the use of cost estimation for adaptive partitioning, the trade-off between speed and accuracy in prediction-based approaches, the effectiveness of work stealing combined with sorting, and a non-trivial way to exploit hybrid CPU-GPU computations.
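    To make the work-stealing remark concrete, the toy C sketch below uses mutex-guarded per-worker deques (deliberately simplified; not the dissertation's algorithm): owners pop tasks from the bottom of their own deque, while idle workers steal from the top of a randomly chosen victim's deque:

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define WORKERS 4
        #define CAP 1024

        /* Per-worker task deque: owner pops at the bottom, thieves steal
         * at the top. A real implementation would use a lock-free deque. */
        typedef struct {
            int tasks[CAP];
            int top, bottom;
            pthread_mutex_t lock;
        } deque;

        static deque dq[WORKERS];

        static int pop_bottom(int w, int *task) {
            pthread_mutex_lock(&dq[w].lock);
            int ok = dq[w].bottom > dq[w].top;
            if (ok) *task = dq[w].tasks[--dq[w].bottom];
            pthread_mutex_unlock(&dq[w].lock);
            return ok;
        }

        static int steal_top(int victim, int *task) {
            pthread_mutex_lock(&dq[victim].lock);
            int ok = dq[victim].bottom > dq[victim].top;
            if (ok) *task = dq[victim].tasks[dq[victim].top++];
            pthread_mutex_unlock(&dq[victim].lock);
            return ok;
        }

        static void *worker(void *arg) {
            int w = (int)(long)arg, task, idle_tries = 0;
            while (idle_tries < 100) {                     /* toy termination rule */
                if (pop_bottom(w, &task) ||
                    steal_top(rand() % WORKERS, &task))    /* may pick itself */
                    idle_tries = 0;                        /* "execute" task here */
                else
                    idle_tries++;
            }
            return NULL;
        }

        int main(void) {
            pthread_t tid[WORKERS];
            for (int w = 0; w < WORKERS; w++) {
                pthread_mutex_init(&dq[w].lock, NULL);
                for (int i = 0; i < 64 * (w + 1); i++)     /* imbalanced load */
                    dq[w].tasks[dq[w].bottom++] = i;
            }
            for (int w = 0; w < WORKERS; w++)
                pthread_create(&tid[w], NULL, worker, (void *)(long)w);
            for (int w = 0; w < WORKERS; w++)
                pthread_join(tid[w], NULL);
            printf("workers done\n");
            return 0;
        }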

    SIMD@OpenMP: a programming model approach to leverage SIMD features

    Get PDF
    SIMD instruction sets are a key feature in current general-purpose and high-performance architectures. SIMD instructions apply the same operation in parallel to a group of data, commonly known as a vector. A single SIMD/vector instruction can thus replace a sequence of scalar instructions. Consequently, the number of instructions can be greatly reduced, leading to improved execution times. However, SIMD instructions are not widely exploited by the vast majority of programmers. In many cases, taking advantage of these instructions relies on the compiler. Nevertheless, compilers struggle with the automatic vectorization of codes, so advanced programmers are compelled to exploit SIMD units by hand, using low-level hardware-specific intrinsics. This approach is cumbersome, error prone, and not portable across SIMD architectures. This thesis targets OpenMP to tackle the underuse of SIMD instructions from three main areas of the programming model: language constructs, compiler code optimizations, and runtime algorithms. We choose the Intel Xeon Phi coprocessor (Knights Corner) and its 512-bit SIMD instruction set for our evaluation process. We make four contributions aimed at improving the exploitation of SIMD instructions in this scope. Our first contribution describes a compiler vectorization infrastructure suitable for OpenMP. This infrastructure targets for-loops and whole functions. We define a set of attributes for expressions that determine how the code is vectorized. Our vectorization infrastructure also implements support for several advanced vector features. This infrastructure is proven to be effective in the vectorization of complex codes, and it is the basis upon which we build the following two contributions. The second contribution introduces a proposal to extend OpenMP 3.1 with SIMD parallelism. Essential parts of this work have become key features of the SIMD proposal included in OpenMP 4.0. We define the "simd" and "simd for" directives that allow programmers to describe the SIMD parallelism of loops and whole functions. Furthermore, we propose a set of optional clauses that lead the compiler to generate more efficient vector code. These SIMD extensions improve programming efficiency when exploiting SIMD resources. Our evaluation on the Intel Xeon Phi coprocessor shows that our SIMD proposal allows the compiler to efficiently vectorize codes that are poorly vectorized, or not vectorized at all, by the Intel C/C++ compiler. In the third contribution, we propose a vector code optimization that enhances overlapped vector loads. These vector loads redundantly read scalar elements from memory that were already loaded by other vector loads. Our vector code optimization improves the memory usage of these accesses by building a vector register cache and exploiting register-to-register instructions. Our proposal also includes a new clause (overlap) in the context of our SIMD extensions for OpenMP. This new clause allows enabling, disabling, and tuning this optimization on demand. The last contribution tackles the exploitation of SIMD instructions in the OpenMP barrier and reduction primitives. We propose a new combined barrier-and-reduction tree scheme specifically designed to make the most of SIMD instructions. Our barrier algorithm takes advantage of simultaneous multithreading (SMT) technology and utilizes SIMD memory instructions in the synchronization process.
The four contributions of this thesis are an important step toward a more common and generalized use of SIMD instructions. Our work is having an outstanding impact on the whole OpenMP community, ranging from users of the programming model to compiler and runtime implementations. Our proposals in the context of OpenMP improve the programmability of the model and, through better use of SIMD, reduce the overhead of runtime services and the execution time of applications.
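    For readers unfamiliar with the standardized outcome, the C fragment below uses the SIMD directives as they appear in OpenMP 4.0 ("declare simd" for whole functions, "simd" with optional clauses such as "safelen" for loops); note that the thesis's original "simd for" spelling differs slightly from the final standard. Compile with an OpenMP-4.0-aware compiler, e.g. gcc -fopenmp-simd:

        #include <stdio.h>

        #define N 1024

        /* Ask the compiler to generate a vector variant of this function. */
        #pragma omp declare simd
        static float saxpy_elem(float a, float x, float y) {
            return a * x + y;
        }

        int main(void) {
            float x[N], y[N];
            for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

            /* Vectorize the loop; safelen(16) bounds the legal vector
             * length, an example of a clause that guides the compiler
             * toward more efficient vector code. */
            #pragma omp simd safelen(16)
            for (int i = 0; i < N; i++)
                y[i] = saxpy_elem(2.0f, x[i], y[i]);

            printf("y[10] = %.1f\n", y[10]);   /* 2*10 + 1 = 21.0 */
            return 0;
        }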

    The 1991 3rd NASA Symposium on VLSI Design

    Get PDF
    Papers from the symposium are presented from the following sessions: (1) featured presentations 1; (2) very large scale integration (VLSI) circuit design; (3) VLSI architecture 1; (4) featured presentations 2; (5) neural networks; (6) VLSI architectures 2; (7) featured presentations 3; (8) verification 1; (9) analog design; (10) verification 2; (11) design innovations 1; (12) asynchronous design; and (13) design innovations 2.