91 research outputs found

    Speculative parallelization

    The most promising technique for automatically parallelizing loops when the system cannot determine dependences at compile time is speculative parallelization. Also called thread-level speculation, this technique assumes optimistically that the system can execute all iterations of a given loop in parallel. A hardware or software monitor divides the iterations into blocks and assigns them to different threads, one per processor, with no prior dependence analysis. If the system discovers a dependence violation at runtime, it stops the incorrectly computed work and restarts it with correct values. Of course, the more parallel the loop, the more benefits this technique delivers. To better understand how speculative parallelization works, it is necessary to distinguish between private and shared variables. Informally speaking, private variables are those that the program always modifies in each iteration before using them. On the other hand, values stored in shared variables are used in different iterations
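
The informal distinction between private and shared variables can be illustrated with a small access-trace classifier. This is a minimal sketch in Python, not code from any speculative runtime: the function `classify` and the trace format (one list of `("r" | "w", name)` pairs per iteration) are hypothetical names introduced here. It treats a variable as private when every iteration writes it before reading it, and as shared otherwise.

```python
def classify(trace):
    """Split variable names into (private, shared) sets.

    trace: list of iterations; each iteration is a list of
    ("r" | "w", name) accesses in program order (assumed format).
    """
    shared = set()
    for accesses in trace:
        written = set()  # names already written in this iteration
        for op, name in accesses:
            if op == "w":
                written.add(name)
            elif name not in written:
                # Read before any write in this iteration: the value
                # must come from outside the iteration, so it is shared.
                shared.add(name)
    names = {name for accesses in trace for _, name in accesses}
    return names - shared, shared


# Trace of three iterations of a loop body like:
#     tmp = ...; acc = acc + tmp
one_iteration = [("w", "tmp"), ("r", "tmp"), ("r", "acc"), ("w", "acc")]
private, shared = classify([one_iteration] * 3)
print("private:", sorted(private))
print("shared:", sorted(shared))
```

Here `tmp` ends up private and `acc` shared, so only accesses to `acc` would need monitoring for dependence violations under this simplified model.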

    Trasgo: a nested-parallel programming system

    Programming models of pure nested-parallelism are appealing due to their ease of programming and good analysis and debugging properties. Although their simple synchronization structure is appropriate to represent abstract parallel algorithms, it does not take into account many implementation issues. In this work we present Trasgo, a programming system based on high-level, nested-parallel specifications. We show how it allows complex combinations of data and task parallelism to be expressed easily with a common scheme, hiding the layout and scheduling details. The approach allows the development of a modular compiler where automatic transformation techniques may exploit lower-level and more complex synchronization structures, overcoming the limitations of pure nested-parallel programming. This article presents an overview of the features of Trasgo and its architecture. We present some performance results using well-known parallel algorithms, and a roadmap of improvements and new features to be added to Trasgo. This research is partly supported by the Ministerio de Educación y Ciencia, Spain (TIN2007-62302), Ministerio de Industria, Spain (FIT-350101-2007-27, FIT-350101-2006-46, TSI-020302-2008-89, CENIT MARTA, CENIT OASIS), Junta de Castilla y León, Spain (VA094A08), and also by the Dutch government STW/PROGRESS project DES.6397. Part of this work was carried out under the HPC-EUROPA project (RII3-CT-2003-506079), with the support of the European Community–Research Infrastructure Action under the FP6 “Structuring the European Research Area” program.

    Using the Xeon Phi platform to run speculatively-parallelized codes

    Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software library for thread-level speculative parallelization in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used as benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with respect to conventional multicore systems for the execution of speculatively parallelized code. 2018-04-01. Castilla-Leon Regional Government (VA172A12-2); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, HomProg-HetSys project TIN2014-58876-P, CAPAP-H5 network TIN2014-53522-REDT).

    Using SPEC CPU2006 to evaluate the sequential and parallel code generated by commercial and open-source compilers

    The role of the compiler is fundamental to exploit the hardware capabilities of a system running a particular application, minimizing the sequential execution time and, in some cases, offering the possibility of parallelizing part of the code automatically. This paper relies on the SPEC CPU2006 v1.1 benchmark suite to evaluate the performance of the code generated by three widely-used compilers (Intel C++/Fortran Compiler 11.0, Sun Studio 12 and GCC 4.3.2). Performance is measured in terms of base speed for reference problem sizes. Both the sequential and the automatic parallel performance obtained are analyzed, using different hardware architectures and configurations. The study includes a detailed description of the different problems that arise while compiling SPEC CPU2006 benchmarks with these tools, information that is difficult to obtain elsewhere. Bearing in mind that performance is a moving target in the field of compilers, our evaluation shows that the sequential code generated by both the Sun and Intel compilers for the SPEC CPU2006 integer benchmarks presents a similar performance, while the floating-point code generated by the Intel compiler is faster than that of its competitors. With respect to the auto-parallelization options offered by the Intel and Sun compilers, our study shows that their benefits only apply to some floating-point benchmarks, with an average speedup of 1.2× with four processors. Meanwhile, the GCC suite evaluated is not capable of compiling the SPEC CPU2006 benchmark with auto-parallelization options enabled. This research was partly supported by the Ministerio de Educación, Spain (TIN2007-62302), Ministerio de Industria, Spain (FIT-350101-2007-27, FIT-350101-2006-46, TSI-020302-2008-89, CENIT MARTA, CENIT OASIS), Junta de Castilla y León, Spain (VA094A08), and also by the Dutch government STW/PROGRESS project DES.6397. Part of this work was carried out under the HPC-EUROPA project (RII3-CT-2003-506079), with the support of the European Community–Research Infrastructure Action under the FP6 “Structuring the European Research Area” Programme.
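
To put the reported average speedup of 1.2× on four processors in perspective, the short Python sketch below computes the corresponding parallel efficiency and, under the assumption that Amdahl's law applies (an assumption made here, not a claim of the paper), the fraction of execution time that would have to be parallelized to obtain that speedup. The function names are illustrative.

```python
def efficiency(speedup, processors):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup / processors


def amdahl_parallel_fraction(speedup, processors):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / N) to obtain p."""
    return (1 - 1 / speedup) / (1 - 1 / processors)


print(f"efficiency: {efficiency(1.2, 4):.2f}")
print(f"implied parallel fraction: {amdahl_parallel_fraction(1.2, 4):.2f}")
```

With these inputs the efficiency is 0.30 and the implied parallel fraction is about 0.22, which is consistent with auto-parallelization reaching only a small part of each benchmark.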

    Extending a hierarchical tiling arrays library to support sparse data partitioning

    Layout methods for dense and sparse data are often seen as two separate problems with their own particular techniques. However, they are based on the same basic concepts. This paper studies how to integrate automatic data-layout and partition techniques for both dense and sparse data structures. In particular, we show how to include support for sparse matrices or graphs in Hitmap, a library for hierarchical tiling and automatic mapping of arrays. The paper shows that it is possible to offer a single interface to work with both dense and sparse data structures. Thus, the programmer can use a single and homogeneous programming style, reducing the development effort and simplifying the use of sparse data structures in parallel computations. Our experimental evaluation shows that this integration of techniques can be effectively done without compromising performance. This research is partly supported by the Ministerio de Industria, Spain (CENIT MARTA, CENIT OASIS, CENIT OCEANLIDER), Ministerio de Ciencia y Tecnología, Spain (CAPAP-H3 network, TIN2010-12011-E, TIN2011-25639), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission (Capacities Area, Research Infrastructures Initiative).

    7th International Symposium on High-Level Parallel Programming and Applications (HLPP 2014)

    Software-based thread-level speculation (TLS) is a software technique that optimistically executes in parallel loops whose fully-parallel semantics cannot be guaranteed at compile time. Modern TLS libraries allow arbitrary data structures to be handled speculatively. This desired feature comes at the high cost of local store and/or remote recovery times: the easier the local store, the harder the remote recovery. Unfortunately, both times are on the critical path of any TLS system. In this paper we propose a solution that performs local store in constant time, while recovering values in a time that is in the order of T, where T is the number of threads. As we will see, this solution, together with some additional improvements, makes the difference between slowdowns and noticeable speedups in the speculative parallelization of non-synthetic, pointer-based applications on a real system. Our experimental results show a gain of 3.58× to 28× with respect to the baseline system, and a relative efficiency of up to, on average, 65% with respect to a TLS implementation specifically tailored to the benchmarks used. This research is partly supported by the Castilla-Leon Regional Government (VA172A12-2); Ministerio de Industria, Spain (CENIT OCEANLIDER); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H3 network TIN2010-12011-E, CAPAP-H4 network TIN2011-15734-E).

    The BonaFide C Analyzer: automatic loop-level characterization and coverage measurement

    The advent of multicore technologies has increased the interest in parallelization techniques for existing sequential applications. These techniques require detecting loops that are good candidates for parallelization and classifying all variables of these loops according to their use, a task that is surprisingly hard to carry out manually. In this paper, we introduce the BonaFide C Analyzer, an XML-based framework that combines static analysis of source code with profiling information to generate complete reports regarding all loops in a C application, including loop coverage, loop suitability for parallelization, a classification of all variables inside loops based on their accesses, and other hurdles that restrict the parallelization. This information makes it possible to analyze how particular language constructs are used in real-world applications, and helps the programmer to parallelize the code. To show the features of the framework, we present the results of an in-depth loop characterization of C applications that are part of the SPEC CPU2006 benchmark suite. Our study shows that 47.72% of the loops present in the applications analyzed are potentially parallelizable with existing parallel programming models such as OpenMP, while an additional 37.7% of loops could be run in parallel with the help of runtime speculative parallelization techniques. This research is partly supported by the Castilla-Leon Regional Government (VA172A12-2); Ministerio de Industria, Spain (CENIT OCEANLIDER); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H3 network TIN2010-12011-E, CAPAP-H4 network TIN2011-15734-E). Sergio Aldea is supported by a research grant (EDU/1204/2010) of Consejería de Educación, Junta de Castilla y León, Spain, and the European Social Fund.

    uBench: exposing the impact of CUDA block geometry in terms of performance

    The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture. The reason is that thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of parameters and the underlying hardware. This paper presents uBench, a complete suite of micro-benchmarks designed to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible to focus on a single effect derived from the hardware and thread-block parameter choice. As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures. This research is partly supported by the Ministerio de Industria, Spain (CENIT OCEANLIDER), MINECO (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H network TIN2010-12011-E and TIN2011-15734-E), Junta de Castilla y León (VA172A12-2), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission (Capacities Area, Research Infrastructures Initiative).
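
One reason block shape matters, independent of block size, is how the 32 threads of a warp map onto a 2D data layout. The Python sketch below is not part of uBench; it is a small illustration with hypothetical function names that enumerates the 2D shapes available for a fixed number of threads per block and counts how many rows of a row-major array the first warp touches, given that CUDA linearizes thread indices with `x` varying fastest.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def block_shapes(threads_per_block):
    """All (x, y) block shapes with x * y == threads_per_block."""
    return [
        (x, threads_per_block // x)
        for x in range(1, threads_per_block + 1)
        if threads_per_block % x == 0
    ]


def rows_touched_by_first_warp(block_x):
    """Rows spanned by linear thread ids 0..31 when x varies fastest.

    Assumes the block holds at least WARP_SIZE threads.
    """
    return (WARP_SIZE - 1) // block_x + 1


for x, y in block_shapes(256):
    print(f"{x:>3} x {y:<3} -> first warp spans "
          f"{rows_touched_by_first_warp(x)} row(s)")
```

Shapes whose `x` dimension is at least 32 keep a warp within a single row, which tends to favor coalesced accesses; narrower shapes spread the same warp across several rows.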

    Multi-Device Controllers: A Library To Simplify The Parallel Heterogeneous Programming

    Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, and hybrid programming that mixes accelerators and CPU cores. However, in many cases, portability compromises the efficiency on different devices, and there are details concerning the coordination of different types of devices that should still be tackled by the programmer. In this work, we introduce the Multi-Controller, an abstract entity implemented in a library that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU cores. Our proposal improves on state-of-the-art solutions, simplifying data partition, mapping and the transparent deployment of both simple generic kernels portable across different device types and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA’s GPUs, or OpenMP for CPU cores). The run-time system automatically selects and deploys the most appropriate implementation of each kernel for each device, managing data movements and hiding the launch details. The results of an experimental study with five case studies indicate that our abstraction allows the development of flexible and highly efficient programs that adapt to the heterogeneous environment. 2020-01-01 2020-01-01. MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
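
The idea of a run-time system that picks the most appropriate implementation of each kernel for each device can be sketched as a registry with a generic fallback. The Python below is an illustration only and does not reproduce the Multi-Controller library's API: the class `KernelRegistry`, its methods, and the device-type strings are all hypothetical.

```python
class KernelRegistry:
    """Maps (kernel name, device type) to an implementation (hypothetical)."""

    GENERIC = "generic"

    def __init__(self):
        self._impls = {}

    def register(self, kernel, device_type, fn):
        self._impls[(kernel, device_type)] = fn

    def select(self, kernel, device_type):
        """Prefer a device-specific version, else the portable one."""
        specialized = self._impls.get((kernel, device_type))
        if specialized is not None:
            return specialized
        return self._impls[(kernel, self.GENERIC)]


registry = KernelRegistry()
registry.register("scale", KernelRegistry.GENERIC,
                  lambda data, k: [k * v for v in data])
# Stand-in for a vendor-optimized version; here it only tags its result.
registry.register("scale", "gpu",
                  lambda data, k: ("gpu", [k * v for v in data]))

for device in ("gpu", "cpu"):
    kernel = registry.select("scale", device)
    print(device, kernel([1, 2, 3], 2))
```

The caller names the kernel once; the choice between the specialized and the portable version is made per device at launch time.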

    New data structures to handle speculative parallelization at runtime

    Software-based thread-level speculation (TLS) is a software technique that optimistically executes in parallel loops whose fully-parallel semantics cannot be guaranteed at compile time. Modern TLS libraries allow arbitrary data structures to be handled speculatively. This desired feature comes at the high cost of local store and/or remote recovery times: the easier the local store, the harder the remote recovery. Unfortunately, both times are on the critical path of any TLS system. In this paper we propose a solution that performs local store in constant time, while recovering values in a time that is in the order of T, where T is the number of threads. As we will see, this solution, together with some additional improvements, makes the difference between slowdowns and noticeable speedups in the speculative parallelization of non-synthetic, pointer-based applications on a real system. Our experimental results show a gain of 3.58× to 28× with respect to the baseline system, and a relative efficiency of up to, on average, 65% with respect to a TLS implementation specifically tailored to the benchmarks used. Castilla-Leon Regional Government (VA172A12-2); Ministerio de Industria, Spain (CENIT OCEANLIDER); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H3 network TIN2010-12011-E, CAPAP-H4 network TIN2011-15734-E).
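
The trade-off between constant-time local stores and recovery in time proportional to T can be illustrated with one hash table per thread. The Python sketch below is an assumption about one way to obtain those bounds, not the data structure proposed in the paper: the class `VersionStore` and its methods are hypothetical, and threads are assumed to be numbered in the sequential order of the iteration blocks they execute.

```python
class VersionStore:
    """Per-thread speculative versions of shared data (illustrative)."""

    def __init__(self, num_threads, memory):
        self.memory = memory  # committed, non-speculative values
        # One private table per thread: no searching on store.
        self.local = [dict() for _ in range(num_threads)]

    def store(self, tid, addr, value):
        """Local store: a single hash-table insert, O(1) expected."""
        self.local[tid][addr] = value

    def load(self, tid, addr):
        """Recover the most recent value visible to thread `tid`.

        Checks the thread's own table, then each predecessor from the
        closest to the farthest: at most T table lookups.
        """
        for t in range(tid, -1, -1):
            if addr in self.local[t]:
                return self.local[t][addr]
        return self.memory[addr]


vs = VersionStore(num_threads=4, memory={"x": 0})
vs.store(1, "x", 10)
print(vs.load(3, "x"))  # value forwarded from thread 1
print(vs.load(0, "x"))  # thread 0 has no predecessor version
```

A real TLS runtime would also need synchronization, dependence-violation detection, and commit logic, all of which are omitted here.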