
    Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

    With its ease of programming, flexibility, and efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed computing and has since been extended to various architectures, e.g., multi-core CPUs, GPUs, and FPGAs. In this work, we focus on optimizing the MapReduce framework on the Xeon Phi, Intel's latest product based on the Many Integrated Core (MIC) architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. We exploit advanced features of the Xeon Phi to achieve high performance. To take advantage of the SIMD vector processing units, we propose a vectorization-friendly technique for the map phase that assists auto-vectorization, and we develop SIMD hash computation algorithms. Furthermore, we use MIMD hyper-threading to pipeline the map and reduce phases and improve resource utilization. For some applications, we also replace multiple per-thread local arrays with low-cost atomic operations on a single global array, which improves thread scalability and data locality thanks to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guidelines to users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with Phoenix++, a state-of-the-art multi-core-based MapReduce framework. On six real-world applications, our optimized framework is 1.2X to 38X faster than Phoenix++ on the Xeon Phi.
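    The abstract names the shared-global-array technique without showing it. As a rough illustration, a hypothetical histogram-style workload in C++ (not the authors' framework), with per-thread local arrays replaced by atomic increments on one global array, might look like this:

        // Sketch: instead of per-thread local arrays merged at the end,
        // all threads update one shared array with atomic operations.
        // Hypothetical histogram example, not the paper's framework.
        #include <algorithm>
        #include <atomic>
        #include <cstdio>
        #include <thread>
        #include <vector>

        constexpr int kBuckets = 256;
        std::atomic<long> global_hist[kBuckets]; // static storage, zero-initialized

        void map_chunk(const std::vector<unsigned char>& data,
                       size_t begin, size_t end) {
            for (size_t i = begin; i < end; ++i)
                // A relaxed fetch_add suffices for a pure count; the
                // coherent caches keep the contended lines consistent.
                global_hist[data[i]].fetch_add(1, std::memory_order_relaxed);
        }

        int main() {
            std::vector<unsigned char> data(1 << 20, 42);
            const unsigned n = std::max(1u, std::thread::hardware_concurrency());
            std::vector<std::thread> workers;
            const size_t chunk = data.size() / n;
            for (unsigned t = 0; t < n; ++t)
                workers.emplace_back(map_chunk, std::cref(data), t * chunk,
                                     t + 1 == n ? data.size() : (t + 1) * chunk);
            for (auto& w : workers) w.join();
            std::printf("bucket 42: %ld\n", global_hist[42].load());
        }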

    Impact of Thread Scheduling on Modern GPUs

    The Graphics Processing Unit (GPU) has become an increasingly important component in high-performance computing systems, as it accelerates data- and compute-intensive applications significantly at lower cost and power. The GPU achieves high performance by executing a massive number of threads in parallel in an SPMD (Single Program Multiple Data) fashion. Threads are grouped into workgroups by the programmer, and workgroups are then assigned to compute cores on the GPU by hardware. Once assigned, a workgroup is further divided by hardware into wavefronts of a fixed number of threads, which execute in a SIMD (Single Instruction Multiple Data) fashion. In this thesis, we investigate the impact of thread scheduling (at the workgroup and wavefront level) on overall hardware utilization and performance. We implement four thread schedulers: a two-level wavefront scheduler, a lookahead wavefront scheduler, a combined two-level + lookahead wavefront scheduler, and a block workgroup scheduler. We implement and test these schedulers on Multi2Sim, a detailed cycle-accurate architectural simulator, targeting AMD's latest Graphics Core Next (GCN) architecture. Our extensive evaluation and analysis show that some of these alternate mechanisms improve cache hit rate by an average of 30% compared to the baseline round-robin scheduler, drastically reducing the number of stalls caused by long-latency memory operations. We also observe that some of these schedulers improve overall performance by an average of 17% compared to the baseline.
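    As a rough sketch of the two-level idea described above (illustrative C++ only, not the thesis's Multi2Sim implementation), a scheduler can rotate round-robin through a small active pool and demote wavefronts that stall on memory:

        // Illustrative two-level wavefront scheduler: issue round-robin
        // from an active pool; demote a wavefront that stalls on a
        // long-latency memory access so a ready one can hide the latency.
        #include <algorithm>
        #include <cstdio>
        #include <deque>

        struct Wavefront {
            int id;
            bool stalled_on_memory = false; // set by the memory model
        };

        class TwoLevelScheduler {
            std::deque<Wavefront*> active;  // small pool, issued round-robin
            std::deque<Wavefront*> pending; // wavefronts waiting on memory
        public:
            void add(Wavefront* w) { active.push_back(w); }

            // Choose the next wavefront to issue this cycle (or nullptr).
            Wavefront* pick() {
                for (size_t tries = active.size(); tries > 0; --tries) {
                    Wavefront* w = active.front();
                    active.pop_front();
                    if (w->stalled_on_memory) {
                        pending.push_back(w); // demote until the load returns
                    } else {
                        active.push_back(w);  // plain round-robin rotation
                        return w;
                    }
                }
                return nullptr; // every wavefront is waiting on memory
            }

            // The memory model calls this when a request completes.
            void wake(Wavefront* w) {
                auto it = std::find(pending.begin(), pending.end(), w);
                if (it == pending.end()) return;
                pending.erase(it);
                w->stalled_on_memory = false;
                active.push_back(w);
            }
        };

        int main() {
            Wavefront a{0}, b{1};
            TwoLevelScheduler s;
            s.add(&a);
            s.add(&b);
            a.stalled_on_memory = true;                 // a misses in the cache
            std::printf("issue wf %d\n", s.pick()->id); // b issues while a waits
            s.wake(&a);                                 // a's data has returned
            std::printf("issue wf %d\n", s.pick()->id);
        }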

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and to develop tools tailored to the needs of GPU programmers. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless, it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples in which the technique has been applied with considerable success. (Submitted to Parallel Computing, Elsevier.)
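    PyCUDA and PyOpenCL are Python toolkits; to keep the code samples in this listing in one language, the same RTCG technique is sketched below in C++ using NVIDIA's NVRTC run-time compiler. The kernel body and names are invented for illustration, and error checks are collapsed:

        // Run-time code generation without PyCUDA: assemble kernel source
        // as a string, compile it with NVRTC at run time, and load the
        // resulting PTX through the CUDA driver API.
        #include <cuda.h>
        #include <nvrtc.h>
        #include <cstdio>
        #include <string>
        #include <vector>

        int main() {
            // The kernel text can be generated per problem instance,
            // e.g. with a data type or unroll factor chosen at run time.
            std::string src =
                "extern \"C\" __global__ void scale(float* a, float s) {\n"
                "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
                "  a[i] *= s;\n"
                "}\n";

            nvrtcProgram prog;
            nvrtcCreateProgram(&prog, src.c_str(), "scale.cu", 0, nullptr, nullptr);
            nvrtcCompileProgram(prog, 0, nullptr);

            size_t n = 0;
            nvrtcGetPTXSize(prog, &n);
            std::vector<char> ptx(n);
            nvrtcGetPTX(prog, ptx.data());
            nvrtcDestroyProgram(&prog);

            cuInit(0);
            CUdevice dev;   cuDeviceGet(&dev, 0);
            CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
            CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
            CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
            std::printf("kernel compiled and loaded at run time\n");
            // cuLaunchKernel(fn, ...) would follow, exactly as for a
            // statically compiled kernel.
        }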

    Simulating the Timing and Energy Consumption of Accelerated Computing

    As the growth in the sequential processing performance of general-purpose central processing units has slowed dramatically, computer systems have been moving towards increasingly parallel and heterogeneous architectures. Modern graphics processing units have emerged as one of the first affordable platforms for data-parallel processing. Due to their closed nature, it has been difficult for software developers to observe the performance and energy-efficiency characteristics of applications executing on graphics processing units. In this thesis, we explore different tools and methods for observing the execution of accelerated processing on graphics processing units. We find that hardware vendors provide interfaces for observing the timing of events on the host platform and, to some extent, aggregated performance metrics of execution on the graphics processing unit. More fine-grained details of execution, however, are currently available only through graphics processing unit simulators. As a proof of concept, we study a functional graphics processing unit simulator as a tool for understanding the energy efficiency of accelerated processing. The presented energy estimation model and simulation method have been validated against a face detection application; the difference between the estimated and measured dynamic energy consumption was 5.4%. Functional simulators thus appear accurate enough for observing the energy efficiency of GPU-accelerated processing in certain use cases.
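    The abstract does not reproduce the energy model itself. Instruction-level models built on functional simulators commonly take the form E_dyn = sum over operation types of (count x energy per operation); the sketch below uses invented coefficients purely for illustration and is not the thesis's model:

        // Hypothetical instruction-level dynamic energy estimate of the
        // kind a functional GPU simulator enables: per-operation energy
        // costs multiplied by executed-operation counts. All numbers are
        // invented for illustration.
        #include <cstdio>
        #include <map>
        #include <string>

        int main() {
            // Energy per operation in nanojoules (made-up values).
            std::map<std::string, double> energy_nj = {
                {"fadd", 0.9}, {"fmul", 1.1}, {"load", 6.0}, {"store", 6.5}};

            // Operation counts as a functional simulator would report them.
            std::map<std::string, long long> counts = {
                {"fadd", 120'000'000}, {"fmul", 80'000'000},
                {"load",  30'000'000}, {"store", 10'000'000}};

            double total_j = 0.0;
            for (const auto& [op, n] : counts)
                total_j += n * energy_nj.at(op) * 1e-9; // nJ -> J

            std::printf("estimated dynamic energy: %.3f J\n", total_j);
        }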

    Decoupling algorithms from schedules for easy optimization of image processing pipelines

    Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating the computations that define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions, and imaging pipelines are compositions of such functions. Programmers separately specify scheduling strategies for the functions composing the algorithm, which lets them efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language called Halide and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations. (Funding: National Science Foundation (U.S.) Grants 0964004, 0964218, and 0832997; United States Dept. of Energy Award DE-SC0005288; Cognex Corporation; Adobe Systems.)
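    The decoupling is easiest to see in Halide code itself. The following two-stage blur, adapted from the canonical example in the Halide documentation, states the algorithm once and the schedule separately:

        // Algorithm vs. schedule in Halide, following the classic
        // two-stage blur example. Requires the Halide library.
        #include "Halide.h"
        using namespace Halide;

        int main() {
            ImageParam input(Float(32), 2);
            Func blur_x("blur_x"), blur_y("blur_y");
            Var x("x"), y("y"), xi("xi"), yi("yi");

            // The algorithm: pure functions over an infinite integer
            // domain, with no storage or boundary handling.
            blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
            blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

            // The schedule: tiling, vectorization, and parallelism are
            // chosen separately, without touching the lines above.
            blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
            blur_x.compute_at(blur_y, x).vectorize(x, 8);

            blur_y.compile_jit();
            return 0;
        }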

    Investigating Single Precision Floating General Matrix Multiply in Heterogeneous Hardware

    The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and OpenCL's availability through the Intel High Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle for exploration, we investigate the utility of several of the most notable matrix multiplication optimizations to better understand the performance portability of OpenCL and the implications of such optimizations on this and future heterogeneous architectures. Our results give targeted insights into the applicability of best practices that were developed for existing architectures when they are used on emerging heterogeneous systems.
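    Of the optimizations such studies typically evaluate, local-memory tiling is the most widely cited. The kernel below is a generic tiled SGEMM in OpenCL C, embedded as a C++ host-side string (host boilerplate such as clCreateProgramWithSource omitted); it is not the paper's HARPv2 code and assumes row-major square matrices with N a multiple of the tile size:

        // Generic local-memory-tiled SGEMM kernel: each workgroup stages
        // one TS x TS tile of A and B in fast local memory and reuses it
        // TS times before fetching the next tile from global memory.
        const char* kSgemmTiled = R"CLC(
        #define TS 16
        __kernel void sgemm_tiled(const int N,
                                  __global const float* A,
                                  __global const float* B,
                                  __global float* C) {
            const int lr = get_local_id(0), lc = get_local_id(1);
            const int row = TS * get_group_id(0) + lr;
            const int col = TS * get_group_id(1) + lc;

            __local float Asub[TS][TS];
            __local float Bsub[TS][TS];

            float acc = 0.0f;
            for (int t = 0; t < N / TS; ++t) {
                // Cooperative load of one tile of A and one of B.
                Asub[lr][lc] = A[row * N + (TS * t + lc)];
                Bsub[lr][lc] = B[(TS * t + lr) * N + col];
                barrier(CLK_LOCAL_MEM_FENCE);
                for (int k = 0; k < TS; ++k)
                    acc += Asub[lr][k] * Bsub[k][lc];
                barrier(CLK_LOCAL_MEM_FENCE); // protect tiles before reuse
            }
            C[row * N + col] = acc;
        }
        )CLC";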