
1,480 research outputs found

Recommended from our members

Preparing sparse solvers for exascale computing.

Author: Anzt Hartwig
Boman Erik
Curfman McInnes Lois
Falgout Rob
Ghysels Pieter
Heroux Michael
Li Xiaoye
Meier Yang Ulrike
Rajamanickam Sivasankaran
Rupp Karl
Smith Barry
Tran Mills Richard
Yamazaki Ichitaro
Publication venue: eScholarship, University of California
Publication date: 01/03/2020

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

eScholarship - University of California

Algorithmic patterns for $\mathcal{H}$-matrices on many-core processors

Author: Zaspel Peter
Publication date: 01/01/2017

In this work, we consider the reformulation of hierarchical ($\mathcal{H}$) matrix algorithms for many-core processors with a model implementation on graphics processing units (GPUs). $\mathcal{H}$ matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix-vector products. The parallelization of $\mathcal{H}$ matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing $\mathcal{H}$ matrix CPU implementations by many-core processors, we here aim at relying entirely on that processor type. As the main contribution, we introduce the necessary parallel algorithmic patterns that allow mapping the full $\mathcal{H}$ matrix construction and the fast matrix-vector product to many-core hardware. Here, crucial ingredients are space filling curves, parallel tree traversal and batching of linear algebra operations. The resulting model GPU implementation hmglib is, to the best of the authors' knowledge, the first entirely GPU-based Open Source $\mathcal{H}$ matrix library of this kind. We conclude this work with an in-depth performance analysis and a comparative performance study against a standard $\mathcal{H}$ matrix library, highlighting profound speedups of our many-core parallel approach.

arXiv.org e-Print Archive

edoc
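
The abstract above names space filling curves as one ingredient for building the matrix structure in parallel. A minimal sketch of that idea follows, assuming a Z-order (Morton) curve on 2D points in the unit square; the abstract does not say which curve or dimension is used, and the names `part1by1`, `morton2d` and `sort_by_morton` are hypothetical, not taken from hmglib. Sorting points by their curve index places spatially nearby points next to each other in memory, so a cluster tree can be formed by splitting the sorted array into contiguous ranges.

```python
GRID_BITS = 16  # assumption: quantize each coordinate to a 16-bit grid


def part1by1(x: int) -> int:
    """Spread the lower 16 bits of x so that a zero bit follows each bit."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def morton2d(ix: int, iy: int) -> int:
    """Interleave the bits of two grid indices (x in even bits, y in odd bits)."""
    return part1by1(ix) | (part1by1(iy) << 1)


def sort_by_morton(points):
    """Sort (x, y) points in [0, 1]^2 by their Z-order curve index."""
    scale = (1 << GRID_BITS) - 1

    def key(p):
        # Clamp to the grid so slightly out-of-range inputs do not overflow.
        ix = min(scale, max(0, int(p[0] * scale)))
        iy = min(scale, max(0, int(p[1] * scale)))
        return morton2d(ix, iy)

    return sorted(points, key=key)


if __name__ == "__main__":
    pts = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)]
    for p in sort_by_morton(pts):
        print(p)
```

Each index is computed independently per point, which is why this step maps well to many-core hardware: the per-point work is a handful of bit operations followed by a parallel sort.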

Geometry-Oblivious FMM for Compressing Dense SPD Matrices

Author: Biros George
Levitt James
Reiz Severin
Yu Chenhan D.
Publication date: 01/07/2017

We present GOFMM (geometry-oblivious FMM), a novel method that creates a hierarchical low-rank approximation, "compression," of an arbitrary dense symmetric positive definite (SPD) matrix. For many applications, GOFMM enables an approximate matrix-vector multiplication in $N \log N$ or even $N$ time, where $N$ is the matrix size. Compression requires $N \log N$ storage and work. In general, our scheme belongs to the family of hierarchical matrix approximation methods. In particular, it generalizes the fast multipole method (FMM) to a purely algebraic setting by only requiring the ability to sample matrix entries. Neither geometric information (i.e., point coordinates) nor knowledge of how the matrix entries have been generated is required, thus the term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme for hierarchical matrix computations that reduces synchronization barriers. We present results on the Intel Knights Landing and Haswell architectures, and on the NVIDIA Pascal architecture for a variety of matrices. Comment: 13 pages, accepted by SC'1

arXiv.org e-Print Archive

Crossref

DeepWalk: Online Learning of Social Representations

Author: Al-Rfou R.
Bottou L.
Dean J.
Hinton G. E.
Kondor R. I.
Krizhevsky A.
Macskassy S. A.
Mikolov T.
Morin F.
Neville J.
Recht B.
Vishwanathan S.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 27/06/2014

We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk's latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk's representations can provide $F_1$ scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk's representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection. Comment: 10 pages, 5 figures, 4 table

arXiv.org e-Print Archive

Crossref
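
The DeepWalk abstract above describes treating truncated random walks as the equivalent of sentences. The sketch below illustrates only that corpus-building step, assuming an unweighted graph stored as an adjacency dictionary and uniform neighbor sampling; the function names, the parameters `walks_per_vertex` and `length`, and the toy graph are hypothetical. The resulting walks would then be passed to a word-embedding style language model, which is not shown here.

```python
import random


def truncated_random_walk(adj, start, length, rng):
    """Walk at most `length` vertices from `start`, picking neighbors uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


def build_corpus(adj, walks_per_vertex, length, seed=0):
    """Return a list of walks ("sentences"), several starting at every vertex."""
    rng = random.Random(seed)  # seeded for reproducibility
    vertices = list(adj)
    corpus = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)  # visit start vertices in a fresh random order
        for v in vertices:
            corpus.append(truncated_random_walk(adj, v, length, rng))
    return corpus


if __name__ == "__main__":
    # Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3.
    toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    for sentence in build_corpus(toy_graph, walks_per_vertex=2, length=5):
        print(sentence)
```

Because each walk depends only on its start vertex and the local neighborhood it visits, the walks can be generated independently, which is consistent with the abstract's remark that the method is trivially parallelizable.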

Clustering Study of Vehicle Behaviors Using License Plate Recognition

Author: Bermúdez Edo María del Campo
Bolaños Martinez Daniel
Garrido Bullejos José Luis
Publication venue: Springer International Publishing
Publication date: 21/11/2022

Ubiquitous computing and artificial intelligence contribute to deploying intelligent environments. Sensor networks in cities generate large amounts of data that can be analyzed to provide relevant information in different fields, such as traffic control. We propose an analysis of vehicular behavior based on license plate recognition (LPR) in a rural region of three small villages. The contribution is twofold. First, we extend an existing taxonomy of the most widely used clustering algorithms in machine learning with additional classes. Second, we compare the performance of algorithms from each class of the taxonomy, extracting behavioral patterns. Partitional and hierarchical algorithms obtain the best results, while density-based algorithms have poor results. The results show four differentiated patterns in vehicular behavior, distinguishing different patterns in both residents and tourists. Our work can help policymakers develop strategies to improve services in rural villages, and developers choose the correct algorithm for a similar study. LifeWatch ERI

Repositorio Institucional Universidad de Granada

Program Development Tools and Infrastructures

Author: Schulz M
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 12/03/2012

Exascale class machines will exhibit a new level of complexity: they will feature an unprecedented number of cores and threads, will most likely be heterogeneous and deeply hierarchical, and offer a range of new hardware techniques (such as speculative threading, transactional memory, programmable prefetching, and programmable accelerators), which all have to be utilized for an application to realize the full potential of the machine. Additionally, users will be faced with less memory per core, fixed total power budgets, and sharply reduced MTBFs. At the same time, it is expected that the complexity of applications will rise sharply for exascale systems, both to implement new science possible at exascale and to exploit the new hardware features necessary to achieve exascale performance. This is particularly true for many of the NNSA codes, which are large and often highly complex integrated simulation codes that push the limits of everything in the system including language features. To overcome these limitations and to enable users to reach exascale performance, users will expect a new generation of tools that address the bottlenecks of exascale machines, that work seamlessly with the (set of) programming models on the target machines, that scale with the machine, that provide automatic analysis capabilities, and that are flexible and modular enough to overcome the complexities and changing demands of the exascale architectures. Further, any tool must be robust enough to handle the complexity of large integrated codes while keeping the user's learning curve low. With the ASC program, in particular the CSSE (Computational Systems and Software Engineering) and CCE (Common Compute Environment) projects, we are working towards a new generation of tools that fulfill these requirements and that provide our users as well as the larger HPC community with the necessary tools, techniques, and methodologies required to make exascale performance a reality

Crossref

UNT Digital Library

A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters

Author: Harbrecht Helmut
Zaspel Peter
Publication venue: Universität Basel
Publication date: 01/06/2018

In this work, we consider the solution of boundary integral equations by means of a scalable hierarchical matrix approach on clusters equipped with graphics hardware, i.e. graphics processing units (GPUs). To this end, we extend our existing single-GPU hierarchical matrix library hmglib such that it is able to scale on many GPUs and such that it can be coupled to arbitrary application codes. Using a model GPU implementation of a boundary element method (BEM) solver, we are able to achieve more than 67 percent relative parallel speed-up going from 128 to 1024 GPUs for a model geometry test case with 1.5 million unknowns and a real-world geometry test case with almost 1.2 million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6 minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the setup phase and 20 seconds for the iterative solver. To the best of the authors’ knowledge, we here discuss the first fully GPU-based distributed-memory parallel hierarchical matrix Open Source library using the traditional H-matrix format and adaptive cross approximation with an application to BEM problems

edoc
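
The abstract above reports "more than 67 percent relative parallel speed-up going from 128 to 1024 GPUs". One plausible reading, which is an assumption here and not confirmed by the abstract, is relative parallel efficiency: the measured speed-up divided by the ideal speed-up given by the ratio of GPU counts. The sketch below works through that arithmetic; the function names and the timings used in the example run are hypothetical and are not results from the paper.

```python
def relative_efficiency(t_base, p_base, t_scaled, p_scaled):
    """Measured speed-up divided by the ideal speed-up p_scaled / p_base."""
    measured = t_base / t_scaled
    ideal = p_scaled / p_base
    return measured / ideal


def implied_speedup(efficiency, p_base, p_scaled):
    """Speed-up implied by a given relative efficiency between two GPU counts."""
    return efficiency * (p_scaled / p_base)


if __name__ == "__main__":
    # Going from 128 to 1024 GPUs is an 8x increase in resources, so under
    # this reading 67 percent efficiency corresponds to a speed-up of about 5.36.
    print(implied_speedup(0.67, 128, 1024))

    # Hypothetical timings (seconds), for illustration only.
    print(relative_efficiency(t_base=800.0, p_base=128, t_scaled=125.0, p_scaled=1024))
```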