
    Speculative dynamic vectorization

    Traditional vector architectures have been shown to be very effective for regular codes, where the compiler can detect data-level parallelism. However, SIMD parallelism is also present in irregular or pointer-rich codes, where the compiler has only a limited ability to discover it. In this paper we propose a microarchitecture extension to exploit SIMD parallelism speculatively. The idea is to predict, based on history information, when certain operations are likely to be vectorizable; in that case, these scalar instructions are executed in vector mode. These vector instructions operate on several elements (vector operands) that are anticipated to be their input operands, and produce a number of outputs that are stored in a vector register for use by later instructions. Verifying the correctness of the applied vectorization eventually changes the status of a given vector element from speculative to non-speculative or, alternatively, triggers a recovery action in case of misspeculation. The proposed microarchitecture extension, applied to a 4-way issue superscalar processor with one wide bus, is 19% faster than the same processor with 4 scalar buses to the L1 data cache. This speedup is mainly due to 1) the reduction in the number of memory accesses, 15% for SpecInt and 20% for SpecFP, 2) the transformation of scalar arithmetic instructions into their vector counterparts, 28% for SpecInt and 23% for SpecFP, and 3) the exploitation of control independence for mispredicted branches.
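
    As a rough illustration of the history-based prediction step, the sketch below keeps a small table indexed by the PC of a scalar load and flags the load as a vectorization candidate once its address stride has repeated a few times. This is a minimal sketch of one plausible predictor; the names (VectorizePredictor, VLEN, CONFIDENT) and the stride-only heuristic are assumptions for illustration, not details taken from the paper.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr int VLEN = 4;        // speculative vector length (assumed)
constexpr int CONFIDENT = 3;   // stable strides needed before predicting

struct HistoryEntry {
    uint64_t last_addr = 0;
    int64_t  stride = 0;
    int      confidence = 0;
};

class VectorizePredictor {
    std::unordered_map<uint64_t, HistoryEntry> table;  // keyed by load PC
public:
    // Called for every executed scalar load; returns true when the
    // instruction is predicted vectorizable based on its history.
    bool observe(uint64_t pc, uint64_t addr) {
        HistoryEntry& e = table[pc];
        int64_t s = static_cast<int64_t>(addr - e.last_addr);
        e.confidence = (s == e.stride) ? e.confidence + 1 : 0;
        e.stride = s;
        e.last_addr = addr;
        return e.confidence >= CONFIDENT;
    }
    // Addresses the speculative vector instruction would access next.
    void predict(uint64_t pc, uint64_t out[VLEN]) const {
        const HistoryEntry& e = table.at(pc);
        for (int i = 0; i < VLEN; ++i)
            out[i] = e.last_addr + (i + 1) * e.stride;
    }
};
```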

    Speculative Vectorization for Superscalar Processors

    Traditional vector architectures have been shown to be very effective at executing regular codes in which the compiler can detect data-level parallelism, i.e., repeating the same computation over different elements of the same code-level data structure. A skilled programmer can easily create efficient vector code from regular applications. Unfortunately, this vectorization can be difficult if applications are not regular or if the programmer does not have exact knowledge of the underlying architecture. The compiler has only partial knowledge of the program (i.e., limited knowledge of the values of its variables). Because of this, it generates code that is safe for any possible scenario according to its knowledge, and thus it may lose significant opportunities to exploit SIMD parallelism. In addition, there is the problem of legacy codes compiled for former versions of the ISA with no SIMD extensions, which are therefore unable to exploit the SIMD extensions incorporated into newer ISA versions. In this dissertation, we describe a mechanism that detects and exploits DLP at runtime by speculatively creating vector instructions that prefetch and precompute data for future instances of their scalar counterparts. This process is called Speculative Dynamic Vectorization. A more in-depth study of this technique reveals a very positive characteristic: the mechanism can easily be tailored to alleviate the main drawbacks of current superscalar processors, particularly branch mispredictions and the memory gap. In this dissertation, we describe how to rearrange the basic Speculative Dynamic Vectorization mechanism to alleviate the branch misprediction penalty by reusing control-flow-independent instructions. The memory gap problem is addressed with a set of mechanisms that exploit the stall cycles due to L2 misses in order to virtually enlarge the instruction window. Finally, further refinements of the basic Speculative Dynamic Vectorization mechanism are presented to improve its performance at a reasonable cost.
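
    The element-wise verification underlying this scheme can be pictured as follows: each speculatively precomputed vector element remembers the input it was computed from, and when the corresponding scalar instance actually executes, a match promotes the element to non-speculative while a mismatch triggers recovery. A minimal sketch, with all field and function names assumed for illustration:

```cpp
#include <array>
#include <cstdint>

constexpr int VLEN = 4;  // assumed speculative vector length

struct SpecElement {
    uint64_t predicted_input = 0;  // operand value assumed at vector issue
    uint64_t result = 0;           // precomputed output
    bool     speculative = true;
};

struct SpecVectorReg {
    std::array<SpecElement, VLEN> elems;

    // Returns true if the precomputed element can be reused; a false
    // return stands in for the recovery action (re-execute the scalar).
    bool validate(int idx, uint64_t actual_input, uint64_t& result_out) {
        SpecElement& e = elems[idx];
        if (e.predicted_input == actual_input) {
            e.speculative = false;   // promote: inputs matched
            result_out = e.result;
            return true;
        }
        return false;                // misspeculation: recover
    }
};
```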

    pocl: A Performance-Portable OpenCL Implementation

    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, enabling support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths, and static multi-issue. Unlike previous similar techniques that work at the source level, the parallel region formation retains the information about the data parallelism using the LLVM IR and its metadata infrastructure. This information can be exploited by later generic compiler passes for efficient parallelization. The proposed open-source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications, when compiled using pocl, were faster than, or nearly as fast as, the best proprietary OpenCL implementation for the platform at hand.
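
    Parallel region formation in this style of kernel compiler is commonly explained through the work-item loop built around the kernel body. The C++ analogue below is only illustrative of that idea, not pocl's actual API: the per-work-item body becomes the body of a countable loop over the local work-group, which downstream passes may then vectorize or map onto other fine-grained parallel resources.

```cpp
#include <cstddef>

// Original OpenCL kernel body, expressed as a per-work-item function.
inline void kernel_body(size_t gid, const float* a, const float* b, float* c) {
    c[gid] = a[gid] + b[gid];
}

void run_work_group(size_t group_id, size_t local_size,
                    const float* a, const float* b, float* c) {
    size_t base = group_id * local_size;
    // The work-item loop: in the real IR this loop carries metadata
    // marking it parallel, so generic passes (e.g. a loop vectorizer)
    // can map its iterations onto SIMD lanes or static multi-issue.
    for (size_t lid = 0; lid < local_size; ++lid)
        kernel_body(base + lid, a, b, c);
}
```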

    DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

    Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems from these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet, the programming paradigms, cluster resource management, data formats and representations, and execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage, aimed at increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines, describe the overall DAPHNE system architecture and its key components, and present the design of a vectorized execution engine for computational storage, HW accelerators, and local and distributed operations. Preliminary experiments comparing DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.
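
    To make the idea of a vectorized execution engine concrete, the sketch below pushes tiles of data through a fused pipeline of operators, one tile at a time, so intermediates stay cache-resident and tiles form a natural unit for scheduling. This is a hedged sketch of the general technique, not DAPHNE's actual API; all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// An operator in the fused pipeline transforms one tile in place.
using TileOp = std::function<void(float* tile, size_t n)>;

void run_pipeline(std::vector<float>& data, size_t tile_size,
                  const std::vector<TileOp>& ops) {
    for (size_t off = 0; off < data.size(); off += tile_size) {
        size_t n = std::min(tile_size, data.size() - off);
        for (const TileOp& op : ops)   // fused: all ops run on one tile
            op(data.data() + off, n);  // before the next tile is touched
    }
}
```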

    PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs

    We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming.

    Cost-Based Optimization of Integration Flows

    Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas, such as real-time ETL and data synchronization between operational systems. Because of increasing data volumes, highly distributed IT infrastructures, and high requirements for data consistency and freshness of query results, many instances of integration flows are executed over time. Due to this high load and to blocking synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation.
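
    The on-demand trigger can be sketched as follows: statistics are folded in incrementally after every flow instance, and re-optimization fires only when the observed cost drifts beyond a threshold from the cost the current plan was chosen under. The threshold value, class, and method names below are assumptions for illustration, not the paper's interface.

```cpp
#include <cmath>
#include <cstddef>

class FlowOptimizer {
    double assumed_cost_;          // cost estimate behind the current plan
    double observed_avg_ = 0.0;    // incrementally maintained statistic
    size_t instances_ = 0;
    static constexpr double DRIFT = 0.2;   // 20% deviation triggers re-opt

public:
    explicit FlowOptimizer(double assumed) : assumed_cost_(assumed) {}

    // Called after each flow instance with its measured execution cost;
    // updates the running average and re-optimizes only on demand.
    void record(double cost) {
        observed_avg_ += (cost - observed_avg_) / ++instances_;
        if (std::fabs(observed_avg_ - assumed_cost_) > DRIFT * assumed_cost_)
            reoptimize();
    }

private:
    void reoptimize() {
        // Placeholder: re-enumerate plans against the current statistics.
        assumed_cost_ = observed_avg_;
        instances_ = 0;
    }
};
```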

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and that emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
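
    As a rough picture of such a C++-embedded specification language, the sketch below builds a dataflow graph of statements, each with an iteration domain and the data spaces it reads and writes, plus one toy transformation test. Everything here (Statement, DataflowGraph, the fusable check) is a hypothetical illustration of the style, not the actual DSL.

```cpp
#include <string>
#include <vector>

// A graph node: one computational statement with its polyhedral
// iteration domain and its data mappings (reads/writes).
struct Statement {
    std::string name;
    std::string domain;                 // e.g. "{ [i] : 0 <= i < N }"
    std::vector<std::string> reads;     // data spaces consumed
    std::vector<std::string> writes;    // data spaces produced
};

struct DataflowGraph {
    std::vector<Statement> stmts;

    DataflowGraph& statement(Statement s) {       // builder-style DSL
        stmts.push_back(std::move(s));
        return *this;
    }
    // Toy transformation check: statements over the same domain may be
    // fused, shrinking the temporary storage between them.
    bool fusable(const Statement& a, const Statement& b) const {
        return a.domain == b.domain;
    }
};

// Usage: a producer/consumer pair over the same iteration space, where
// fusing S0 and S1 would let temporary T shrink to a scalar.
DataflowGraph example() {
    DataflowGraph g;
    g.statement({"S0", "{ [i] : 0 <= i < N }", {"A"}, {"T"}})
     .statement({"S1", "{ [i] : 0 <= i < N }", {"T"}, {"B"}});
    return g;
}
```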