2,103 research outputs found
High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm
We implement a master-slave parallel genetic algorithm (PGA) with a bespoke
log-likelihood fitness function to identify emergent clusters within price
evolutions. We use graphics processing units (GPUs) to implement a PGA and
visualise the results using disjoint minimal spanning trees (MSTs). We
demonstrate that our GPU PGA, implemented on a commercially available general
purpose GPU, is able to recover stock clusters in sub-second speed, based on a
subset of stocks in the South African market. This represents a pragmatic
choice for low-cost, scalable parallel computing and is significantly faster
than a prototype serial implementation in an optimised C-based
fourth-generation programming language, although the results are not directly
comparable due to compiler differences. Combined with fast online intraday
correlation matrix estimation from high frequency data for cluster
identification, the proposed implementation offers cost-effective,
near-real-time risk assessment for financial practitioners.Comment: 10 pages, 5 figures, 4 tables, More thorough discussion of
implementatio
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate a possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well-known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform
SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors
The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its
wide use in biological sequence database search. Unfortunately, the high
sensitivity comes at the expense of quadratic time complexity, which makes the
algorithm computationally demanding for big databases. In this paper, we
present SWAPHI, the first parallelized algorithm employing Xeon Phi
coprocessors to accelerate SW protein database search. SWAPHI is designed based
on the scale-and-vectorize approach, i.e. it boosts alignment speed by
effectively utilizing both the coarse-grained parallelism from the many
co-processing cores (scale) and the fine-grained parallelism from the 512-bit
wide single instruction, multiple data (SIMD) vectors within each core
(vectorize). By searching against the large UniProtKB/TrEMBL protein database,
SWAPHI achieves a performance of up to 58.8 billion cell updates per second
(GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors.
Furthermore, it demonstrates good parallel scalability on varying number of
coprocessors, and is also superior to both SWIPE on 16 high-end CPU cores and
BLAST+ on 8 cores when using four coprocessors, with the maximum speedup of
1.52 and 1.86, respectively. SWAPHI is written in C++ language (with a set of
SIMD intrinsics), and is freely available at http://swaphi.sourceforge.net.Comment: A short version of this paper has been accepted by the IEEE ASAP 2014
conferenc
UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models
The complexity of heterogeneous computing architectures, as well as the
demand for productive and portable parallel application development, have
driven the evolution of parallel programming models to become more
comprehensive and complex than before. Enhancing the conventional compilation
technologies and software infrastructure to be parallelism-aware has become one
of the main goals of recent compiler development. In this paper, we propose the
design of unified parallel intermediate representation (UPIR) for multiple
parallel programming models and for enabling unified compiler transformation
for the models. UPIR specifies three commonly used parallelism patterns (SPMD,
data and task parallelism), data attributes and explicit data movement and
memory management, and synchronization operations used in parallel programming.
We demonstrate UPIR via a prototype implementation in the ROSE compiler for
unifying IR for both OpenMP and OpenACC and in both C/C++ and Fortran, for
unifying the transformation that lowers both OpenMP and OpenACC code to LLVM
runtime, and for exporting UPIR to LLVM MLIR dialect.Comment: Typos corrected. Format update
- …