23,960 research outputs found

    Architectural support for task dependence management with flexible software scheduling

    Get PDF
    The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU and to still perform task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.Peer ReviewedPostprint (author's final draft

    OpenACC Based GPU Parallelization of Plane Sweep Algorithm for Geometric Intersection

    Get PDF
    Line segment intersection is one of the elementary operations in computational geometry. Complex problems in Geographic Information Systems (GIS) like finding map overlays or spatial joins using polygonal data require solving segment intersections. Plane sweep paradigm is used for finding geometric intersection in an efficient manner. However, it is difficult to parallelize due to its in-order processing of spatial events. We present a new fine-grained parallel algorithm for geometric intersection and its CPU and GPU implementation using OpenMP and OpenACC. To the best of our knowledge, this is the first work demonstrating an effective parallelization of plane sweep on GPUs. We chose compiler directive based approach for implementation because of its simplicity to parallelize sequential code. Using Nvidia Tesla P100 GPU, our implementation achieves around 40X speedup for line segment intersection problem on 40K and 80K data sets compared to sequential CGAL library

    Scaling Monte Carlo Tree Search on Intel Xeon Phi

    Full text link
    Many algorithms have been parallelized successfully on the Intel Xeon Phi coprocessor, especially those with regular, balanced, and predictable data access patterns and instruction flows. Irregular and unbalanced algorithms are harder to parallelize efficiently. They are, for instance, present in artificial intelligence search algorithms such as Monte Carlo Tree Search (MCTS). In this paper we study the scaling behavior of MCTS, on a highly optimized real-world application, on real hardware. The Intel Xeon Phi allows shared memory scaling studies up to 61 cores and 244 hardware threads. We compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling) approaches. Interestingly, we find that a straightforward thread pool with a work-sharing FIFO queue shows the best performance. A crucial element for this high performance is the controlling of the grain size, an approach that we call Grain Size Controlled Parallel MCTS. Our subsequent comparing with the Xeon CPUs shows an even more comprehensible distinction in performance between different threading libraries. We achieve, to the best of our knowledge, the fastest implementation of a parallel MCTS on the 61 core Intel Xeon Phi using a real application (47 relative to a sequential run).Comment: 8 pages, 9 figure

    The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

    Get PDF
    We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221
    corecore