9 research outputs found
Activity recognition from videos with parallel hypergraph matching on GPUs
In this paper, we propose a method for activity recognition from videos based
on sparse local features and hypergraph matching. We benefit from special
properties of the temporal domain in the data to derive a sequential and fast
graph matching algorithm for GPUs.
Traditionally, graphs and hypergraphs are frequently used to recognize
complex and often non-rigid patterns in computer vision, either through graph
matching or point-set matching with graphs. Most formulations resort to the
minimization of a difficult discrete energy function mixing geometric or
structural terms with data attached terms involving appearance features.
Traditional methods solve this minimization problem approximately, for instance
with spectral techniques.
In this work, instead of solving the problem approximatively, the exact
solution for the optimal assignment is calculated in parallel on GPUs. The
graphical structure is simplified and regularized, which allows to derive an
efficient recursive minimization algorithm. The algorithm distributes
subproblems over the calculation units of a GPU, which solves them in parallel,
allowing the system to run faster than real-time on medium-end GPUs
Efficient Synchronization Primitives for GPUs
In this paper, we revisit the design of synchronization
primitives---specifically barriers, mutexes, and semaphores---and how they
apply to the GPU. Previous implementations are insufficient due to the
discrepancies in hardware and programming model of the GPU and CPU. We create
new implementations in CUDA and analyze the performance of spinning on the GPU,
as well as a method of sleeping on the GPU, by running a set of memory-system
benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class
GPUs from NVIDIA. From our results we define higher-level principles that are
valid for generic many-core processors, the most important of which is to limit
the number of atomic accesses required for a synchronization operation because
atomic accesses are slower than regular memory accesses. We use the results of
the benchmarks to critique existing synchronization algorithms and guide our
new implementations, and then define an abstraction of GPUs to classify any GPU
based on the behavior of the memory system. We use this abstraction to create
suitable implementations of the primitives specifically targeting the GPU, and
analyze the performance of these algorithms on Tesla and Fermi. We then predict
performance on future GPUs based on characteristics of the abstraction. We also
examine the roles of spin waiting and sleep waiting in each primitive and how
their performance varies based on the machine abstraction, then give a set of
guidelines for when each strategy is useful based on the characteristics of the
GPU and expected contention.Comment: 13 pages with appendix, several figures, plans to submit to CompSci
conference in early 201
Towards Chapel-based Exascale Tree Search Algorithms: dealing with multiple GPU accelerators
International audienceTree-based search algorithms applied to combinatorial optimization problems are highly irregular and time consuming when it comes to solving big instances. Solving such instances efficiently requires the use of massively parallel distributed-memory supercomputers. According to recent Top 500 trends, the degree of parallelism in these supercomputers continues to increase in size and complexity, with millions of heterogeneous (mainly CPU-GPU) cores. Harnessing this scale of computing resources raises at least three challenging issues which are described and addressed in this paper. Indeed, as a step towards exascale computing, we revisit the design and implementation of tree search algorithms dealing with multiple GPUs, in addition to scalability and productivity-awareness using Chapel. The proposed algorithm exploits Chapel's distributed iterators by combining a partial search strategy with pre-compiled CUDA kernels for more efficient exploitation of the intra-node parallelism. Extensive experimentation on big N-Queens problem instances using 24 GPUs shows that up to 90% of the linear speedup can be achieved
GSI: GPU-friendly Subgraph Isomorphism
Subgraph isomorphism is a well-known NP-hard problem that is widely used in
many applications, such as social network analysis and query over the knowledge
graph. Due to the inherent hardness, its performance is often a bottleneck in
various real-world applications. Therefore, we address this by designing an
efficient subgraph isomorphism algorithm leveraging features of GPU
architecture, such as massive parallelism and memory hierarchy. Existing
GPU-based solutions adopt a two-step output scheme, performing the same join
process twice in order to write intermediate results concurrently. They also
lack GPU architecture-aware optimizations that allow scaling to large graphs.
In this paper, we propose a GPU-friendly subgraph isomorphism algorithm, GSI.
Different from existing edge join-based GPU solutions, we propose a
Prealloc-Combine strategy based on the vertex-oriented framework, which avoids
joining-twice in existing solutions. Also, a GPU-friendly data structure
(called PCSR) is proposed to represent an edge-labeled graph. Extensive
experiments on both synthetic and real graphs show that GSI outperforms the
state-of-the-art algorithms by up to several orders of magnitude and has good
scalability with graph size scaling to hundreds of millions of edges.Comment: 15 pages, 17 figures, conferenc
Parallelizing Maximal Clique Enumeration on GPUs
We present a GPU solution for exact maximal clique enumeration (MCE) that
performs a search tree traversal following the Bron-Kerbosch algorithm. Prior
works on parallelizing MCE on GPUs perform a breadth-first traversal of the
tree, which has limited scalability because of the explosion in the number of
tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing
depth-first traversal of independent subtrees in parallel. Since MCE suffers
from high load imbalance and memory capacity requirements, we propose a worker
list for dynamic load balancing, as well as partial induced subgraphs and a
compact representation of excluded vertex sets to regulate memory consumption.
Our evaluation shows that our GPU implementation on a single GPU outperforms
the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x
(up to 16.7x), and scales efficiently to multiple GPUs. Our code has been
open-sourced to enable further research on accelerating MCE
Solving large permutation flow-shop scheduling problems on GPU-accelerated supercomputers
Makespan minimization in permutation flow-shop scheduling is a well-known hard combinatorial optimization problem. Among the 120 standard benchmark instances proposed by E. Taillard in 1993, 23 have remained unsolved for almost three decades. In this paper, we present our attempts to solve these instances to optimality using parallel Branch-and-Bound tree search on the GPU-accelerated Jean Zay supercomputer. We report the exact solution of 11 previously unsolved problem instances and improved upper bounds for 8 instances. The solution of these problems requires both algorithmic improvements and leveraging the computing power of peta-scale high-performance computing platforms. The challenge consists in efficiently performing parallel depth-first traversal of a highly irregular, fine-grained search tree on distributed systems composed of hundreds of massively parallel accelerator devices and multi-core processors. We present and discuss the design and implementation of our permutation-based B&B and experimentally evaluate its parallel performance on up to 384 V100 GPUs (2 million CUDA cores) and 3840 CPU cores. The optimality proof for the largest solved instance requires about 64 CPU-years of computation-using 256 GPUs and over 4 million parallel search agents, the traversal of the search tree is completed in 13 hours, exploring 339 Tera-nodes
Lessons learned from exploring the backtracking paradigm on the GPU
Abstract. We explore the backtracking paradigm with properties seen as sub-optimal for GPU architectures, using as a case study the maximal clique enumeration problem, and find that the presence of these properties limit GPU performance to approximately 1.4–2.25 times a single CPU core. The GPU performance “lessons ” we find critical to providing this performance include a coarse-and-fine-grain parallelization of the search space, a low-overhead load-balanced distribution of work, global memory latency hiding through coalescence, saturation, and shared memory utilization, and the use of GPU output buffering as a solution to irregular workloads and a large solution domain. We also find a strong reliance on an efficient global problem structure representation that bounds any efficiencies gained from these lessons, and discuss the meanings of these results to backtracking problems in general.
Recommended from our members
Lessons Learned from Exploring the Backtracking Paradigm on the GPU
Abstract. We explore the backtracking paradigm with properties seen as sub-optimal for GPU architectures, using as a case study the maximal clique enumeration problem, and find that the presence of these properties limit GPU performance to approximately 1.4--2.25 times a single CPU core. The GPU performance ''lessons'' we find critical to providing this performance include a coarse-and-fine-grain parallelization of the search space, a low-overhead load-balanced distribution of work, global memory latency hiding through coalescence, saturation, and shared memory utilization, and the use of GPU output buffering as a solution to irregular workloads and a large solution domain. We also find a strong reliance on an efficient global problem structure representation that bounds any efficiencies gained from these lessons, and discuss the meanings of these results to backtracking problems in general
Optimisation massivement multi-tâche sur grappes de calcul hétérogènes – Application aux problèmes de permutation
Branch-and-Bound (B&B) is a frequently used tree-search exploratory method for the exact resolution of combinatorial optimization problems (COPs). However, in practice, only small problem instances can be solved on a sequential computer, as B&B generates often generates a huge amount of subproblems to be evaluated. In order to solve large COPs, we revisit the design and implementation of massively parallel B&B on top of large heterogeneous clusters, integrating multi-core CPUs, many-core processors and GPUs.For the efficient storage and management of subproblems an original data structure (IVM) dedicated to permutation problems is used. Because of the highly irregular and unpredictable shape of the B&B tree, dynamic load balancing between parallel exploration processes is one of the main issues addressed in this thesis. Based on a compact encoding of the search space in the form of intervals, work stealing strategies for multi-core and GPU are proposed, as well as hierarchical approaches for load balancing in distributed memory multi-CPU/multi-GPU systems. Three permutation problems, the Flowshop Scheduling Problem (FSP), the Quadratic Assignment Problem (QAP) and the n-Queens puzzle problem are used as test-cases.The resolution, in 9 hours, of a FSP instance with an estimated sequential execution time of 22 years demonstrates the scalability of the proposed algorithms on a cluster composed of 36 GPUs.L'algorithme Branch-and-Bound (B&B) est une méthode de recherche arborescente fréquemment utilisé pour la résolution exacte de problèmes d'optimisation combinatoire (POC). Néanmoins, seules des petites instances peuvent être effectivement résolues sur une machine séquentielle, le nombre de sous-problèmes à évaluer étant souvent très grand. Visant la resolution de POC de grande taille, nous réexaminons la conception et l'implémentation d'algorithmes B&B massivement parallèles sur de larges plateformes hétérogènes de calcul, intégrant des processeurs multi-coeurs, many-cores et et processeurs graphiques (GPUs). Pour une représentation compacte en mémoire des sous-problèmes une structure de données originale (IVM), dédiée aux problèmes de permutation est utilisée. En raison de la forte irrégularité de l'arbre de recherche, l'équilibrage de charge dynamique entre processus d'exploration parallèles occupe une place centrale dans cette thèse. Basés sur un encodage compact de l'espace de recherche sous forme d'intervalles, des stratégies de vol de tâches sont proposées pour processeurs multi-core et GPU, ainsi une approche hiérarchique pour l'équilibrage de charge dans les systèmes multi-GPU et multi-CPU à mémoire distribuée. Trois problèmes d'optimisation définis sur l'ensemble des permutations, le problème d'ordonnancement Flow-Shop (FSP), d'affectation quadratique (QAP) et le problème des n-dames sont utilisés comme cas d'étude. La resolution en 9 heures d'une instance du FSP dont le temps de résolution séquentiel est estimé à 22 ans demontre la capacité de passage à l'échelle des algorithmes proposés sur une grappe de calcul composé de 36 GPUs