1,479 research outputs found

    Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

    Full text link
    With the ease-of-programming, flexibility and yet efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed-computing, and has been extended to various architectures, e,g, multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization friendly technique for the map phase to assist the auto-vectorization as well as develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce to improve the resource utilization. We also eliminate multiple local arrays but use low cost atomic operations on the global array for some applications, which can improve the thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guideline for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi

    Towards co-designed optimizations in parallel frameworks: A MapReduce case study

    Full text link
    The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks whose aim is to tackle the challenges of data deluge. A drawback of a such multi-layered hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the semantic distance between an application and its generated code. However, parallel software frameworks contain inherent semantic information that general purpose compilers are not designed to exploit. This paper presents a case study demonstrating how the specific semantic information of the MapReduce paradigm can be exploited on multicore architectures. MR4J has been implemented in Java and evaluated against hand-optimized C and C++ equivalents. The initial observed results led to the design of a semantically aware optimizer that runs automatically without requiring modification to application code. The optimizer is able to speedup the execution time of MR4J by up to 2.0x. The introduced optimization not only improves the performance of the generated code, during the map phase, but also reduces the pressure on the garbage collector. This demonstrates how semantic information can be harnessed without sacrificing sound software engineering practices when using parallel software frameworks.Comment: 8 page

    Parallel Discrete Event Simulation with Erlang

    Full text link
    Discrete Event Simulation (DES) is a widely used technique in which the state of the simulator is updated by events happening at discrete points in time (hence the name). DES is used to model and analyze many kinds of systems, including computer architectures, communication networks, street traffic, and others. Parallel and Distributed Simulation (PADS) aims at improving the efficiency of DES by partitioning the simulation model across multiple processing elements, in order to enabling larger and/or more detailed studies to be carried out. The interest on PADS is increasing since the widespread availability of multicore processors and affordable high performance computing clusters. However, designing parallel simulation models requires considerable expertise, the result being that PADS techniques are not as widespread as they could be. In this paper we describe ErlangTW, a parallel simulation middleware based on the Time Warp synchronization protocol. ErlangTW is entirely written in Erlang, a concurrent, functional programming language specifically targeted at building distributed systems. We argue that writing parallel simulation models in Erlang is considerably easier than using conventional programming languages. Moreover, ErlangTW allows simulation models to be executed either on single-core, multicore and distributed computing architectures. We describe the design and prototype implementation of ErlangTW, and report some preliminary performance results on multicore and distributed architectures using the well known PHOLD benchmark.Comment: Proceedings of ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC 2012) in conjunction with ICFP 2012. ISBN: 978-1-4503-1577-

    Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors

    Full text link
    In this paper we consider the problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles. This is a common problem that arises in many agent-based simulation studies, and is of central importance in the context of High Level Architecture (HLA), where it is at the core of the Data Distribution Management (DDM) service. Several realizations of the DDM service have been proposed; however, many of them are either inefficient or inherently sequential. These are serious limitations since multicore processors are now ubiquitous, and DDM algorithms -- being CPU-intensive -- could benefit from additional computing power. We propose a parallel version of the Sort-Based Matching algorithm for shared-memory multiprocessors. Sort-Based Matching is one of the most efficient serial algorithms for the DDM problem, but is quite difficult to parallelize due to data dependencies. We describe the algorithm and compute its asymptotic running time; we complete the analysis by assessing its performance and scalability through extensive experiments on two commodity multicore systems based on a dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.Comment: Proceedings of the 21-th ACM/IEEE International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper Award @DS-RT 201

    High-level programming of stencil computations on multi-GPU systems using the SkelCL library

    Get PDF
    The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers competitive performance to hand-tuned OpenCL code
    • …
    corecore