170 research outputs found
Scaling Monte Carlo Tree Search on Intel Xeon Phi
Many algorithms have been parallelized successfully on the Intel Xeon Phi
coprocessor, especially those with regular, balanced, and predictable data
access patterns and instruction flows. Irregular and unbalanced algorithms are
harder to parallelize efficiently. They are, for instance, present in
artificial intelligence search algorithms such as Monte Carlo Tree Search
(MCTS). In this paper we study the scaling behavior of MCTS, on a highly
optimized real-world application, on real hardware. The Intel Xeon Phi allows
shared memory scaling studies up to 61 cores and 244 hardware threads. We
compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling)
approaches. Interestingly, we find that a straightforward thread pool with a
work-sharing FIFO queue shows the best performance. A crucial element for this
high performance is the controlling of the grain size, an approach that we call
Grain Size Controlled Parallel MCTS. Our subsequent comparing with the Xeon
CPUs shows an even more comprehensible distinction in performance between
different threading libraries. We achieve, to the best of our knowledge, the
fastest implementation of a parallel MCTS on the 61 core Intel Xeon Phi using a
real application (47 relative to a sequential run).Comment: 8 pages, 9 figure
Contention in multicore hardware shared resources: Understanding of the state of the art
The real-time systems community has over the years devoted considerable attention to the impact on execution timing that arises from contention on access to hardware shared resources. The relevance of this problem has been accentuated with the arrival of multicore processors. From the state of the art on the subject, there appears to be considerable diversity in the understanding of the problem and in the “approach” to solve it. This sparseness makes it difficult for any reader to form a coherent picture of the problem and solution space. This paper draws a tentative taxonomy in which each known approach to the problem can be categorised based on its specific goals and assumptions.Postprint (published version
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature report results on much smaller graphs, and the ones for
the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS).Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
Concurrent Secure Computation with Optimal Query Complexity
The multiple ideal query (MIQ) model [Goyal, Jain, and Ostrovsky, Crypto\u2710] offers a relaxed notion of security for concurrent secure computation, where the simulator is allowed to query the ideal functionality multiple times per session (as opposed to just once in the standard definition). The model provides a quantitative measure for the degradation in security under concurrent self-composition, where the degradation is measured by the number of ideal queries. However, to date, all known MIQ-secure protocols guarantee only an overall average bound on the number of queries per session throughout the execution, thus allowing the adversary to potentially fully compromise some sessions of its choice. Furthermore, [Goyal and Jain, Eurocrypt\u2713] rule out protocols where the simulator makes only an adversary-independent constant number of ideal queries per session.
We show the first MIQ-secure protocol with worst-case per-session guarantee. Specifically, we show a protocol for any functionality that matches the [GJ13] bound: The simulator makes only a constant number of ideal queries in every session. The constant depends on the adversary but is independent of the security parameter.
As an immediate corollary of our main result, we obtain the first password authenticated key exchange (PAKE) protocol for the fully concurrent, multiple password setting in the standard model with no set-up assumptions
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are challenging to
implement on new parallel hardware such as GPUs because of three challenges:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.Comment: 50 pages, 14 figures, 14 table
Maintaining Acyclicity of Concurrent Graphs*
In this paper, we consider the problem of preserving acyclicity in a directed graph (for
shared memory architecture) that is concurrently being updated by threads adding/deleting
vertices and edges. To the best of our knowledge, no previous paper has presented a con-
current graph data structure. We implement the concurrent directed graph data-structure
as a concurrent adjacency list representation. We extend the lazy list implementation of
concurrent linked lists for maintaining concurrent adjacency lists. There exists a number of
graph applications which require the acyclic invariant in a directed graph. One such example
is Serialization Graph Testing Algorithm used in databases and transactional memory. We
present two concurrent algorithms for maintaining acyclicity in a concurrent graph: (i) Based
on obstruction-free snapshots (ii) Using wait-free reachability. We compare the performance
of these algorithms against the coarse-grained locking strategy, commonly used technique
for allowing concurrent updates. We present the speedup obtained by these algorithms over
sequential execution. As a future direction, we plan to extend this data structure for other
progress conditions
Scheduling in Transactional Memory Systems: Models, Algorithms, and Evaluations
Transactional memory provides an alternative synchronization mechanism that removes many limitations of traditional lock-based synchronization so that concurrent program writing is easier than lock-based code in modern multicore architectures. The fundamental module in a transactional memory system is the transaction which represents a sequence of read and write operations that are performed atomically to a set of shared resources; transactions may conflict if they access the same shared resources. A transaction scheduling algorithm is used to handle these transaction conflicts and schedule appropriately the transactions. In this dissertation, we study transaction scheduling problem in several systems that differ through the variation of the intra-core communication cost in accessing shared resources. Symmetric communication costs imply tightly-coupled systems, asymmetric communication costs imply large-scale distributed systems, and partially asymmetric communication costs imply non-uniform memory access systems. We made several theoretical contributions providing tight, near-tight, and/or impossibility results on three different performance evaluation metrics: execution time, communication cost, and load, for any transaction scheduling algorithm. We then complement these theoretical results by experimental evaluations, whenever possible, showing their benefits in practical scenarios. To the best of our knowledge, the contributions of this dissertation are either the first of their kind or significant improvements over the best previously known results
- …