1,041 research outputs found
On Designing Multicore-aware Simulators for Biological Systems
The stochastic simulation of biological systems is an increasingly popular
technique in bioinformatics. It often is an enlightening technique, which may
however result in being computational expensive. We discuss the main
opportunities to speed it up on multi-core platforms, which pose new challenges
for parallelisation techniques. These opportunities are developed in two
general families of solutions involving both the single simulation and a bulk
of independent simulations (either replicas of derived from parameter sweep).
Proposed solutions are tested on the parallelisation of the CWC simulator
(Calculus of Wrapped Compartments) that is carried out according to proposed
solutions by way of the FastFlow programming framework making possible fast
development and efficient execution on multi-cores.Comment: 19 pages + cover pag
Breadth First Search Vectorization on the Intel Xeon Phi
Breadth First Search (BFS) is a building block for graph algorithms and has
recently been used for large scale analysis of information in a variety of
applications including social networks, graph databases and web searching. Due
to its importance, a number of different parallel programming models and
architectures have been exploited to optimize the BFS. However, due to the
irregular memory access patterns and the unstructured nature of the large
graphs, its efficient parallelization is a challenge. The Xeon Phi is a
massively parallel architecture available as an off-the-shelf accelerator,
which includes a powerful 512 bit vector unit with optimized scatter and gather
functions. Given its potential benefits, work related to graph traversing on
this architecture is an active area of research.
We present a set of experiments in which we explore architectural features of
the Xeon Phi and how best to exploit them in a top-down BFS algorithm but the
techniques can be applied to the current state-of-the-art hybrid, top-down plus
bottom-up, algorithms.
We focus on the exploitation of the vector unit by developing an improved
highly vectorized OpenMP parallel algorithm, using vector intrinsics, and
understanding the use of data alignment and prefetching. In addition, we
investigate the impact of hyperthreading and thread affinity on performance, a
topic that appears under researched in the literature. As a result, we achieve
what we believe is the fastest published top-down BFS algorithm on the version
of Xeon Phi used in our experiments. The vectorized BFS top-down source code
presented in this paper can be available on request as free-to-use software
Design of testbed and emulation tools
The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems
Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity
Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads.
This dissertation will present strategies for addressing the performance challenges of wavefront- irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs ac- cording to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
FPGA-Based Processor Acceleration for Image Processing Applications
FPGA-based embedded image processing systems offer considerable computing resources but present programming challenges when compared to software systems. The paper describes an approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based programming environment. The approach is demonstrated for a k-means clustering operation and a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using 16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient (fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU respectively
Massively-parallel marker-passing in semantic networks
AbstractOne approach to using the information available in a semantic network is the use of marker-passing algorithms, which propagate information through the network to determine relationships between objects. One of the primary arguments in favor of these algorithms are their ability to be implemented in parallel. Despite this, most implementations have been serial and have only sometimes gone so far as to simulate parallelism. In this paper the marker-passing approach is presented. An actual parallel implementation which shows that such programs can be written on commercially available massively parallel machines is also presented
- …