Search CORE

170 research outputs found

Scaling Monte Carlo Tree Search on Intel Xeon Phi

Author: Herik Jaap van den
Mirsoleimani S. Ali
Plaat Aske
Vermaseren Jos
Publication venue
Publication date: 15/07/2015
Field of study

Many algorithms have been parallelized successfully on the Intel Xeon Phi coprocessor, especially those with regular, balanced, and predictable data access patterns and instruction flows. Irregular and unbalanced algorithms are harder to parallelize efficiently. They are, for instance, present in artificial intelligence search algorithms such as Monte Carlo Tree Search (MCTS). In this paper we study the scaling behavior of MCTS, on a highly optimized real-world application, on real hardware. The Intel Xeon Phi allows shared memory scaling studies up to 61 cores and 244 hardware threads. We compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling) approaches. Interestingly, we find that a straightforward thread pool with a work-sharing FIFO queue shows the best performance. A crucial element for this high performance is the controlling of the grain size, an approach that we call Grain Size Controlled Parallel MCTS. Our subsequent comparing with the Xeon CPUs shows an even more comprehensible distinction in performance between different threading libraries. We achieve, to the best of our knowledge, the fastest implementation of a parallel MCTS on the 61 core Intel Xeon Phi using a real application (47 relative to a sequential run).Comment: 8 pages, 9 figure

arXiv.org e-Print Archive

Crossref

Contention in multicore hardware shared resources: Understanding of the state of the art

Author: Abella Ferrer Jaume
Cazorla Almeida Francisco Javier
Fernández Gabriel
Quiñones Eduardo
Rochange Christine
Vardanega Tullio
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2014
Field of study

The real-time systems community has over the years devoted considerable attention to the impact on execution timing that arises from contention on access to hardware shared resources. The relevance of this problem has been accentuated with the arrival of multicore processors. From the state of the art on the subject, there appears to be considerable diversity in the understanding of the problem and in the “approach” to solve it. This sparseness makes it difficult for any reader to form a coherent picture of the problem and solution space. This paper draws a tentative taxonomy in which each known approach to the problem can be categorised based on its specific goals and assumptions.Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

Dagstuhl Research Online Publication Server

Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

Author: Blelloch G. E.
Blelloch G. E.
Cormen T. H.
Da Zheng D. M.
Dasari N. S.
Gonzalez J. E.
Greenlaw R.
Karp R. M.
Low Y.
Maon Y.
Ramachandran V.
Shiloach Y.
Zhou W.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/07/2019
Field of study

There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature report results on much smaller graphs, and the ones for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly-available as the Graph-Based Benchmark Suite (GBBS).Comment: This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 201

arXiv.org e-Print Archive

Crossref

DSpace@MIT

Concurrent Secure Computation with Optimal Query Complexity

Author: Abhishek Jain
Ran Canetti
Vipul Goyal
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 17/06/2015
Field of study

The multiple ideal query (MIQ) model [Goyal, Jain, and Ostrovsky, Crypto\u2710] offers a relaxed notion of security for concurrent secure computation, where the simulator is allowed to query the ideal functionality multiple times per session (as opposed to just once in the standard definition). The model provides a quantitative measure for the degradation in security under concurrent self-composition, where the degradation is measured by the number of ideal queries. However, to date, all known MIQ-secure protocols guarantee only an overall average bound on the number of queries per session throughout the execution, thus allowing the adversary to potentially fully compromise some sessions of its choice. Furthermore, [Goyal and Jain, Eurocrypt\u2713] rule out protocols where the simulator makes only an adversary-independent constant number of ideal queries per session. We show the first MIQ-secure protocol with worst-case per-session guarantee. Specifically, we show a protocol for any functionality that matches the [GJ13] bound: The simulator makes only a constant number of ideal queries in every session. The constant depends on the adversary but is independent of the security parameter. As an immediate corollary of our main result, we obtain the first password authenticated key exchange (PAKE) protocol for the fully concurrent, multiple password setting in the standard model with no set-up assumptions

CiteSeerX

Cryptology ePrint Archive

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Author: Buluc Aydin
Owens John D.
Yang Carl
Publication venue
Publication date: 14/11/2020
Field of study

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.Comment: 50 pages, 14 figures, 14 table

arXiv.org e-Print Archive

eScholarship - University of California

Maintaining Acyclicity of Concurrent Graphs*

Author: Peri Sathya
Sa M
Singhal N
Publication venue: 'Cornell University Library'
Publication date: 01/01/2016
Field of study

In this paper, we consider the problem of preserving acyclicity in a directed graph (for shared memory architecture) that is concurrently being updated by threads adding/deleting vertices and edges. To the best of our knowledge, no previous paper has presented a con- current graph data structure. We implement the concurrent directed graph data-structure as a concurrent adjacency list representation. We extend the lazy list implementation of concurrent linked lists for maintaining concurrent adjacency lists. There exists a number of graph applications which require the acyclic invariant in a directed graph. One such example is Serialization Graph Testing Algorithm used in databases and transactional memory. We present two concurrent algorithms for maintaining acyclicity in a concurrent graph: (i) Based on obstruction-free snapshots (ii) Using wait-free reachability. We compare the performance of these algorithms against the coarse-grained locking strategy, commonly used technique for allowing concurrent updates. We present the speedup obtained by these algorithms over sequential execution. As a future direction, we plan to extend this data structure for other progress conditions

Research Archive of Indian Institute of Technology Hyderabad

Scheduling in Transactional Memory Systems: Models, Algorithms, and Evaluations

Author: Sharma Gokarna
Publication venue: LSU Digital Commons
Publication date: 01/01/2014
Field of study

Transactional memory provides an alternative synchronization mechanism that removes many limitations of traditional lock-based synchronization so that concurrent program writing is easier than lock-based code in modern multicore architectures. The fundamental module in a transactional memory system is the transaction which represents a sequence of read and write operations that are performed atomically to a set of shared resources; transactions may conflict if they access the same shared resources. A transaction scheduling algorithm is used to handle these transaction conflicts and schedule appropriately the transactions. In this dissertation, we study transaction scheduling problem in several systems that differ through the variation of the intra-core communication cost in accessing shared resources. Symmetric communication costs imply tightly-coupled systems, asymmetric communication costs imply large-scale distributed systems, and partially asymmetric communication costs imply non-uniform memory access systems. We made several theoretical contributions providing tight, near-tight, and/or impossibility results on three different performance evaluation metrics: execution time, communication cost, and load, for any transaction scheduling algorithm. We then complement these theoretical results by experimental evaluations, whenever possible, showing their benefits in practical scenarios. To the best of our knowledge, the contributions of this dissertation are either the first of their kind or significant improvements over the best previously known results

Louisiana State University