Parallel Working-Set Search Structures
In this paper we present two versions of a parallel working-set map on p
processors that supports searches, insertions and deletions. In both versions,
the total work of all operations when the map has size at least p is bounded by
the working-set bound, i.e., the cost of an item depends on how recently it was
accessed (for some linearization): accessing an item in the map with recency r
takes O(1+log r) work. In the simpler version each map operation has O((log
p)^2+log n) span (where n is the maximum size of the map). In the pipelined
version each map operation on an item with recency r has O((log p)^2+log r)
span. (Operations in parallel may have overlapping span; span is additive only
for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading
parallel program that at each step executes a unit-time instruction or makes a
data structure call. To achieve the stated bounds, the pipelined data structure
requires a weak-priority scheduler, which supports a limited form of 2-level
prioritization. At the end we explain how the results translate to practical
implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a
self-adjusting search structure where the cost of an operation adapts to the
access sequence. A corollary of the working-set bound is that it achieves work
static optimality: the total work is bounded by the access costs in an optimal
static search tree.
Comment: Authors' version of a paper accepted to SPAA 201
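To make the working-set bound concrete, here is a minimal sequential Python sketch (hypothetical names, not the paper's parallel data structure) that tracks access recency and charges each access the O(1 + log r) working-set cost:

    import math

    class WorkingSetCostModel:
        # Toy sketch with made-up names: tracks recency and the working-set
        # charge O(1 + log r) per access; not the paper's parallel map.
        def __init__(self):
            self.order = []  # items ordered from least to most recently accessed

        def access(self, item):
            if item in self.order:
                # recency r = number of distinct items accessed since 'item' was last accessed
                r = len(self.order) - self.order.index(item)
                self.order.remove(item)
            else:
                r = len(self.order) + 1  # first access is charged as if least recent
            self.order.append(item)
            return 1 + math.log2(max(r, 1))  # working-set cost of this access

    model = WorkingSetCostModel()
    total_work = sum(model.access(x) for x in ["a", "b", "a", "a", "c", "b"])

Summing these per-access charges over a linearization of the operations gives the working-set bound that the parallel map's total work is compared against.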
Parallel Finger Search Structures
In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is, to our knowledge, the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within an O((log p)^2) factor of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS.
The work done by FS is bounded by the finger bound F_L (for some linearization L of D), i.e., each operation on an item with distance r from a finger takes O(1 + log r) amortized work. Running P using the simpler version takes O((T_1+F_L)/p + T_infty + d * ((log p)^2 + log n)) time on a greedy scheduler, where T_1 and T_infty are the size and span of D respectively, n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to O((T_1+F_L)/p + T_infty + d * (log p)^2 + s_L) time, where s_L is the weighted span of D in which each call to FS is weighted by its cost according to F_L. FS can be extended to a fixed number of movable fingers.
The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.
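As a rough illustration of the finger bound (a toy Python sketch with made-up names, not the FS structure itself), the charge for an access is logarithmic in its distance to the nearer of the two end fingers:

    import bisect, math

    class FingerBoundModel:
        # Toy sketch: charges an access at distance r from the nearer end
        # finger O(1 + log r) work; hypothetical names, not the paper's FS.
        def __init__(self, items):
            self.items = sorted(items)

        def access_cost(self, key):
            i = bisect.bisect_left(self.items, key)   # rank of the key
            r = min(i, len(self.items) - 1 - i)       # distance to the nearer end finger
            return 1 + math.log2(r + 1)

    m = FingerBoundModel(range(1000))
    cheap = m.access_cost(3)     # close to the left finger: small charge
    costly = m.access_cost(512)  # near the middle: roughly log n charge

The finger bound F_L is the sum of these charges over a linearization L of the calls, which is what the work of FS is measured against.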
Efficient transfer entropy analysis of non-stationary neural time series
Information theory allows us to investigate information processing in neural
systems in terms of information transfer, storage and modification. Especially
the measure of information transfer, transfer entropy, has seen a dramatic
surge of interest in neuroscience. Estimating transfer entropy from two
processes requires the observation of multiple realizations of these processes
to estimate associated probability density functions. To obtain these
observations, available estimators assume stationarity of processes to allow
pooling of observations over time. This assumption, however, is a major obstacle
to the application of these estimators in neuroscience, as observed processes
are often non-stationary. As a solution, Gomez-Herrero and colleagues
theoretically showed that the stationarity assumption may be avoided by
estimating transfer entropy from an ensemble of realizations. Such an ensemble
is often readily available in neuroscience experiments in the form of
experimental trials. Thus, in this work we combine the ensemble method with a
recently proposed transfer entropy estimator to make transfer entropy
estimation applicable to non-stationary time series. We present an efficient
implementation of the approach that deals with the increased computational
demand of the ensemble method's practical application. In particular, we use a
massively parallel implementation on a graphics processing unit to handle the
most computationally demanding parts of the ensemble method. We test the
performance and robustness of our implementation on data from simulated
stochastic processes and demonstrate the method's applicability to
magnetoencephalographic data. While we mainly evaluate the proposed method for
neuroscientific data, we expect it to be applicable in a variety of fields that
are concerned with the analysis of information transfer in complex biological,
social, and artificial systems.
Comment: 27 pages, 7 figures, submitted to PLOS ONE
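To sketch the core idea of ensemble-based estimation (a toy, binned Python example with made-up names; the actual implementation uses a different, more sophisticated estimator on the GPU), transfer entropy at a single time point is estimated by pooling observations across trials rather than over time:

    import numpy as np

    def ensemble_transfer_entropy(x, y, t, bins=4):
        # Toy binned estimate of transfer entropy X -> Y at time t, pooling
        # observations across trials (rows) instead of over time, so no
        # stationarity over time is assumed.  x, y: (n_trials, n_samples);
        # histories are one sample long for simplicity.
        yt, y_past, x_past = y[:, t], y[:, t - 1], x[:, t - 1]
        edges = lambda v: np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
        yt, y_past, x_past = (np.digitize(v, edges(v)) for v in (yt, y_past, x_past))

        def H(*vs):  # joint entropy from empirical frequencies over the trial ensemble
            _, counts = np.unique(np.stack(vs, axis=1), axis=0, return_counts=True)
            p = counts / counts.sum()
            return -(p * np.log2(p)).sum()

        # TE(X -> Y, t) = H(Y_t | Y_past) - H(Y_t | Y_past, X_past)
        return H(yt, y_past) - H(y_past) - H(yt, y_past, x_past) + H(y_past, x_past)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 100))
    y = np.roll(x, 1, axis=1) + 0.5 * rng.normal(size=(500, 100))  # y driven by x's past
    te_at_t = ensemble_transfer_entropy(x, y, t=50)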
Multi-GPU maximum entropy image synthesis for radio astronomy
The maximum entropy method (MEM) is a well known deconvolution technique in
radio-interferometry. This method solves a non-linear optimization problem with
an entropy regularization term. Other heuristics such as CLEAN are faster but
highly user dependent. Nevertheless, MEM has the following advantages: it is
unsupervised, it has a statistical basis, and it yields better resolution and
image quality under certain conditions. This work presents a high-performance
GPU version of non-gridding MEM, which is tested using real and simulated data.
We propose a single-GPU and a multi-GPU implementation for single and
multi-spectral data, respectively. We also make use of the Peer-to-Peer and
Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to
be exploited transparently and efficiently. Several ALMA data sets are used to
demonstrate the effectiveness in imaging and to evaluate GPU performance. The
results show a speedup of 1000 to 5000 times over a sequential version,
depending on data and image size. This makes it possible to reconstruct the
HD142527 CO(6-5) short-baseline data set in 2.1 minutes, instead of the 2.5 days
taken by a sequential CPU version.
Comment: 11 pages, 13 figures
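As a rough sketch of the MEM optimization itself (a toy image-domain Python example with made-up names; the paper's solver works on ungridded visibilities and runs on one or more GPUs), the reconstruction minimizes a chi-squared data term plus an entropy penalty by gradient descent:

    import numpy as np

    def mem_deconvolve(dirty_map, psf_fft, noise_var, lam=0.01, iters=500, lr=0.1):
        # Toy MEM sketch: minimize the chi^2 data misfit plus the entropy
        # penalty sum(I * log(I / M)), keeping the image positive throughout.
        # Hypothetical names; not the non-gridding GPU solver described above.
        prior = np.full_like(dirty_map, max(float(dirty_map.mean()), 1e-6))
        image = prior.copy()
        for _ in range(iters):
            residual = np.fft.ifft2(np.fft.fft2(image) * psf_fft).real - dirty_map
            grad_chi2 = 2 * np.fft.ifft2(np.fft.fft2(residual) * np.conj(psf_fft)).real / noise_var
            grad_entropy = np.log(image / prior) + 1
            image = np.clip(image - lr * (grad_chi2 + lam * grad_entropy), 1e-8, None)
        return image

    psf = np.zeros((64, 64)); psf[31:34, 31:34] = 1 / 9   # toy 3x3 beam
    restored = mem_deconvolve(np.random.rand(64, 64), np.fft.fft2(np.fft.ifftshift(psf)), 1.0)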