Search CORE

33 research outputs found

Scaling Monte Carlo Tree Search on Intel Xeon Phi

Author: Herik Jaap van den
Mirsoleimani S. Ali
Plaat Aske
Vermaseren Jos
Publication venue
Publication date: 15/07/2015
Field of study

Many algorithms have been parallelized successfully on the Intel Xeon Phi coprocessor, especially those with regular, balanced, and predictable data access patterns and instruction flows. Irregular and unbalanced algorithms are harder to parallelize efficiently. They are, for instance, present in artificial intelligence search algorithms such as Monte Carlo Tree Search (MCTS). In this paper we study the scaling behavior of MCTS, on a highly optimized real-world application, on real hardware. The Intel Xeon Phi allows shared memory scaling studies up to 61 cores and 244 hardware threads. We compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling) approaches. Interestingly, we find that a straightforward thread pool with a work-sharing FIFO queue shows the best performance. A crucial element for this high performance is the controlling of the grain size, an approach that we call Grain Size Controlled Parallel MCTS. Our subsequent comparing with the Xeon CPUs shows an even more comprehensible distinction in performance between different threading libraries. We achieve, to the best of our knowledge, the fastest implementation of a parallel MCTS on the 61 core Intel Xeon Phi using a real application (47 relative to a sequential run).Comment: 8 pages, 9 figure

arXiv.org e-Print Archive

Crossref

The Cilkview scalability analyzer

Author: Charles E. Leiserson
William M. Leiserson
Yuxiong He
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

The Cilkview scalability analyzer is a software tool for profiling, estimating scalability, and benchmarking multithreaded Cilk++ ap-plications. Cilkview monitors logical parallelism during an instru-mented execution of the Cilk++ application on a single process-ing core. As Cilkview executes, it analyzes logical dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of pro-cessing cores. In addition, Cilkview analyzes scheduling overhead using the concept of a “burdened dag, ” which allows it to diagnose performance problems in the application due to an insufficient grain size of parallel subcomputations. Cilkview employs the Pin dynamic-instrumentation framework to collect metrics during a serial execution of the application code. It operates directly on the optimized code rather than on a debug version. Metadata embedded by the Cilk++ compiler in the binary executable identifies the parallel control constructs in the executing application. This approach introduces little or no overhead to the program binary in normal runs. Cilkview can perform real-time scalability benchmarking auto-matically, producing gnuplot-compatible output that allows devel-opers to compare an application’s performance with the tool’s pre-dictions. If the program performs beneath the range of expectation, the programmer can be confident in seeking a cause such as insuf-ficient memory bandwidth, false sharing, or contention, rather than inadequate parallelism or insufficient grain size

CiteSeerX

Crossref

Parallel Computation of the Minimal Elements of a Poset

Author: Leiserson Charles E.
Li Liyun
Maza Marc Moreno
Xie Yuzhen
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

Computing the minimal elements of a partially ordered finite set (poset) is a fundamental problem in combinatorics with numerous applications such as polynomial expression optimization, transversal hypergraph generation and redundant component removal, to name a few. We propose a divide-and-conquer algorithm which is not only cache-oblivious but also can be parallelized free of determinacy races. We have implemented it in Cilk++ targeting multicores. For our test problems of sufficiently large input size our code demonstrates a linear speedup on 32 cores.National Science Foundation (U.S.). (Grant number CNS-0615215)National Science Foundation (U.S.). (Grant number CCF- 0621511

DSpace@MIT

Crossref

Coz: Finding Code that Counts with Causal Profiling

Author: Ammons G.
Hollingsworth J. K.
Kambadur M.
Kambadur M.
Miller B. P.
Thompson K.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 12/08/2016
Field of study

Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.Comment: Published at SOSP 2015 (Best Paper Award

arXiv.org e-Print Archive

Crossref

Efficient Evaluation of Large Polynomials

Author: Li Liyun
Publication venue: Scholarship@Western
Publication date: 01/01/2010
Field of study

In scientific computing, it is often required to evaluate a polynomial expression (or a matrix depending on some variables) at many points which are not known in advance or with coordinates containing “symbolic expressions”. In these circumstances, standard evaluation schemes, such as those based on Fast Fourier Transforms do not apply. Given a polynomial f expressed as the sum of its terms, we propose an algorithm which generates a representation of f optimizing the process of evaluating f at some points. In addition, this evaluation of f can be done efficiently in terms of data locality and parallelism. We have implemented our algorithm in the Cilk++ concurrency platform and our implementation achieves nearly linear speedup on 16 cores with large enough input. For some large polynomials, the generated schedule can be evaluated at least 10 times faster than the schedules produced by other available software solutions. Moreover, our code can handle much larger input polynomials

Scholarship@Western

A Fast Causal Profiler for Task Parallel Programs

Author: Nagarakatte Santosh
Yoga Adarsh
Publication venue
Publication date: 02/07/2017
Field of study

This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel using multi-cores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to in- crease parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.Comment: 11 page

arXiv.org e-Print Archive

Crossref

On-the-fly pipeline parallelism

Author: Lee I-Ting Angelina
Leiserson Charles E.
Schardl Tao Benjamin
Sukha Jim
Zhang Zhunping
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013
Field of study

Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T[subscript 1] work and T[subscript ∞] span (critical-path length), Piper executes the computation on P processors in T[subscript P]≤ T[subscript 1]/P + O(T[subscript ∞] + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.National Science Foundation (U.S.) (Grant CNS-1017058)National Science Foundation (U.S.) (Grant CCF-1162148)National Science Foundation (U.S.). Graduate Research Fellowshi

CiteSeerX

DSpace@MIT

Crossref

Structured parallel programming for Monte Carlo Tree Search

Author: Mirsoleimani S.A.
Publication venue
Publication date: 17/06/2020
Field of study

The thesis is part of a bigger project, the HEPGAME (High Energy Physics Game). The main objective for HEPGAME is the utilization of AI solutions, particularly by using MCTS for simplification of HEP calculations. One of the issues is solving mathematical expressions of interest with millions of terms. These calculations can be solved with the FORM program, which is software for symbolic manipulation. Since these calculations are computationally intensive and take a large amount of time, the FORM program was parallelized to solve them in a reasonable amount of time.Therefore, any new algorithm based on MCTS, should also be parallelized. This requirement was behind the problem statement of the thesis: “How do we design a structured pattern-based parallel programming approach for efficient parallelism of MCTS for both multi-core and manycore shared-memory machines?”.To answer this question, the thesis approached the MCTS parallelization problem in three levels: (1) implementation level, (2) data structure level, and (3) algorithm level.In the implementation level, we proposed task-level parallelization over thread-level parallelization. Task-level parallelization provides us with efficient parallelism for MCTS to utilize cores on both multi-core and manycore machines.In the data structure level, we presented a lock-free data structure that guarantees the correctness. A lock-free data structure (1) removes the synchronization overhead when a parallel program needs many tasks to feed its cores and (2) improves both performance and scalability.In the algorithm level, we first explained how to use pipeline pattern for parallelization of MCTS to overcome search overhead. Then, through a step by step approach, we were able to propose and detail the structured parallel programming approach for Monte Carlo Tree Search.Algorithms and the Foundations of Software technolog

Leiden University Scholary Publications