Search CORE

13,880 research outputs found

Memory-Adjustable Navigation Piles with Applications to Sorting and Convex Hulls

Author: Darwish Omar
Elmasry Amr
Katajainen Jyrki
Publication venue
Publication date: 24/10/2015
Field of study

We consider space-bounded computations on a random-access machine (RAM) where the input is given on a read-only random-access medium, the output is to be produced to a write-only sequential-access medium, and the available workspace allows random reads and writes but is of limited capacity. The length of the input is

N

elements, the length of the output is limited by the computation, and the capacity of the workspace is

O(S)

bits for some predetermined parameter

S

. We present a state-of-the-art priority queue---called an adjustable navigation pile---for this restricted RAM model. Under some reasonable assumptions, our priority queue supports

\mathit{minimum}

and

\mathit{insert}

O(1)

worst-case time and

\mathit{extract}

O(N/S + \lg{} S)

worst-case time for any

S \geq \lg{} N

. We show how to use this data structure to sort

N

elements and to compute the convex hull of

N

points in the two-dimensional Euclidean space in

O(N^2/S + N \lg{} S)

worst-case time for any

S \geq \lg{} N

. Following a known lower bound for the space-time product of any branching program for finding unique elements, both our sorting and convex-hull algorithms are optimal. The adjustable navigation pile has turned out to be useful when designing other space-efficient algorithms, and we expect that it will find its way to yet other applications.Comment: 21 page

arXiv.org e-Print Archive

MPG.PuRe

Gunrock: GPU Graph Analytics

Author: Davidson Andrew
Liu Weitang
Osama Muhammad
Owens John D.
Pan Yuechao
Riffel Andy T.
Wang Leyuan
Wang Yangzihao
Wu Yuduo
Yang Carl
Yuan Chenshan
Publication venue
Publication date: 04/01/2017
Field of study

For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms, to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU

arXiv.org e-Print Archive

eScholarship - University of California

FigShare

GPU LSM: A Dynamic Dictionary Data Structure for the GPU

Author: Amenta Nina
Ashkiani Saman
Farach-Colton Martin
Li Shengren
Owens John D.
Publication venue
Publication date: 01/01/2018
Field of study

We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operations with an average rate of 75 M, 32 M and 23 M queries/s respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.Comment: 11 pages, accepted to appear on the Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'18

arXiv.org e-Print Archive

eScholarship - University of California

Doctor of Philosophy

Author: Fu Zhisong
Publication venue: University of Utah
Publication date: 01/12/2013
Field of study

dissertationPartial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth curves in the speed and greater power efficiency of the SIMD streaming processors, such as GPUs, have gained them an increasingly important role in the high-performance computing area. Combining suitable parallel algorithms and these streaming processors, we can develop very efficient numerical solvers of PDEs. The contributions of this dissertation are twofold: proposal of two general strategies to design efficient PDE solvers on GPUs and the specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies, the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation on fully unstructured meshes efficiently. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving

The University of Utah: J. Willard Marriott Digital Library

TRIQS: A Toolbox for Research on Interacting Quantum Systems

Author: Ayral Thomas
Ferrero Michel
Hafermann Hartmut
Krivenko Igor
Messio Laura
Parcollet Olivier
Seth Priyanka
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015
Field of study

We present the TRIQS library, a Toolbox for Research on Interacting Quantum Systems. It is an open-source, computational physics library providing a framework for the quick development of applications in the field of many-body quantum physics, and in particular, strongly-correlated electronic systems. It supplies components to develop codes in a modern, concise and efficient way: e.g. Green's function containers, a generic Monte Carlo class, and simple interfaces to HDF5. TRIQS is a C++/Python library that can be used from either language. It is distributed under the GNU General Public License (GPLv3). State-of-the-art applications based on the library, such as modern quantum many-body solvers and interfaces between density-functional-theory codes and dynamical mean-field theory (DMFT) codes are distributed along with it.Comment: 27 page

arXiv.org e-Print Archive

HAL-CEA

HAL-Polytechnique

A randomised primal-dual algorithm for distributed radio-interferometric imaging

Author: Carrillo Rafael E.
McEwen Jason D.
Onose Alexandru
Wiaux Yves
Publication venue
Publication date: 30/05/2016
Field of study

Next generation radio telescopes, like the Square Kilometre Array, will acquire an unprecedented amount of data for radio astronomy. The development of fast, parallelisable or distributed algorithms for handling such large-scale data sets is of prime importance. Motivated by this, we investigate herein a convex optimisation algorithmic structure, based on primal-dual forward-backward iterations, for solving the radio interferometric imaging problem. It can encompass any convex prior of interest. It allows for the distributed processing of the measured data and introduces further flexibility by employing a probabilistic approach for the selection of the data blocks used at a given iteration. We study the reconstruction performance with respect to the data distribution and we propose the use of nonuniform probabilities for the randomised updates. Our simulations show the feasibility of the randomisation given a limited computing infrastructure as well as important computational advantages when compared to state-of-the-art algorithmic structures.Comment: 5 pages, 3 figures, Proceedings of the European Signal Processing Conference (EUSIPCO) 2016, Related journal publication available at https://arxiv.org/abs/1601.0402

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Heriot Watt Pure

UCL Discovery

Accelerated hardware video object segmentation: From foreground detection to connected components labelling

Author: Andrew Hunter
Batlle
Boykov
Cucchiara
Haralick
Hongying Meng
Kofi Appiah
Patrick Dickinson
Rosenfeld
Stauffer
Zhou
Zhou
Publication venue: 'Elsevier BV'
Publication date: 01/01/2010
Field of study

This is the preprint version of the Article - Copyright @ 2010 ElsevierThis paper demonstrates the use of a single-chip FPGA for the segmentation of moving objects in a video sequence. The system maintains highly accurate background models, and integrates the detection of foreground pixels with the labelling of objects using a connected components algorithm. The background models are based on 24-bit RGB values and 8-bit gray scale intensity values. A multimodal background differencing algorithm is presented, using a single FPGA chip and four blocks of RAM. The real-time connected component labelling algorithm, also designed for FPGA implementation, run-length encodes the output of the background subtraction, and performs connected component analysis on this representation. The run-length encoding, together with other parts of the algorithm, is performed in parallel; sequential operations are minimized as the number of run-lengths are typically less than the number of pixels. The two algorithms are pipelined together for maximum efficiency

University of Lincoln Institutional Repository

Crossref

Nottingham Trent Institutional Repository (IRep)

Sheffield Hallam University Research Archive

Brunel University Research Archive

Nonnegative approximations of nonnegative tensors

Author: Comon Pierre
Lim Lek-Heng
Publication venue
Publication date: 01/01/2009
Field of study

We study the decomposition of a nonnegative tensor into a minimal sum of outer product of nonnegative vectors and the associated parsimonious naive Bayes probabilistic model. We show that the corresponding approximation problem, which is central to nonnegative PARAFAC, will always have optimal solutions. The result holds for any choice of norms and, under a mild assumption, even Bregman divergences.Comment: 14 page

arXiv.org e-Print Archive

CiteSeerX

HAL-UNICE