GPU Accelerated Similarity Self-Join for Multi-Dimensional Data
The self-join finds all objects in a dataset that are within a search
distance, epsilon, of each other; therefore, the self-join is a building block
of many algorithms. We advance a GPU-accelerated self-join algorithm targeted
at high-dimensional data. The massive parallelism and high aggregate memory
bandwidth afforded by the GPU make the architecture well-suited for
data-intensive workloads. We leverage a grid-based, GPU-tailored index to
perform range queries. We propose the following optimizations: (i) a trade-off
between candidate set filtering and index search overhead by exploiting
properties of the index; (ii) reordering the data based on variance in each
dimension to improve the filtering power of the index; and (iii) a pruning
method for reducing the number of expensive distance calculations. Across most
scenarios on real-world and synthetic datasets, our algorithm outperforms the
parallel state-of-the-art approach. Exascale systems are converging on
heterogeneous distributed-memory architectures. We show that an entity
partitioning method can be utilized to achieve a balanced workload, and thus
good scalability for multi-GPU or distributed-memory self-joins.
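
To make the grid-based indexing idea concrete, here is a minimal CPU-side sketch of an epsilon self-join over such an index; the paper's GPU kernels, variance-based reordering, and pruning optimizations are not reproduced, and all names are illustrative.

    # Minimal sketch of a grid-indexed epsilon self-join (CPU-side, for
    # exposition only); the GPU version parallelizes the per-point searches.
    from collections import defaultdict
    from itertools import product
    import math

    def grid_self_join(points, eps):
        """Return all pairs (i, j), i < j, within Euclidean distance eps."""
        dims = len(points[0])
        cells = defaultdict(list)
        for i, p in enumerate(points):
            # Index each point by the grid cell (side length eps) it falls in.
            cells[tuple(int(math.floor(x / eps)) for x in p)].append(i)
        pairs = []
        for cell, members in cells.items():
            # Candidates can only lie in the 3^dims adjacent cells.
            for offset in product((-1, 0, 1), repeat=dims):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                for i in members:
                    for j in cells.get(neighbor, ()):
                        if i < j and math.dist(points[i], points[j]) <= eps:
                            pairs.append((i, j))
        return pairs

Optimization (i) in the abstract can be read against this sketch: coarser cells reduce the number of cell lookups, but enlarge the candidate set that each lookup returns and must filter.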
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel. These methods can be scaled to
handle big data using the distributed and parallel computing technologies.
Usually, big data tools perform computation in batch mode and are not
optimized for iterative processing or for high data dependency among
operations. In recent years, parallel, incremental, and multi-view machine
learning
algorithms have been proposed. Similarly, graph-based architectures and
in-memory big data tools have been developed to minimize I/O cost and optimize
iterative processing.
However, standard big data architectures and tools are lacking for many
important bioinformatics problems, such as fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and of future research opportunities.
Comment: 20-page survey paper on big data analytics in bioinformatics.
GPU Accelerated Self-join for the Distance Similarity Metric
The self-join finds all objects in a dataset within a threshold of each other
defined by a similarity metric. As such, the self-join is a building block for
the field of databases and data mining, and is employed in Big Data
applications. In this paper, we advance a GPU-efficient algorithm for the
similarity self-join that uses the Euclidean distance metric. The
search-and-refine strategy is an efficient approach for low-dimensional
datasets, as index searches degrade with increasing dimensionality (i.e., the
curse of dimensionality). Thus, we target the low-dimensionality problem, and
compare our GPU self-join to a search-and-refine implementation and a
state-of-the-art parallel algorithm. In low dimensionality, there are several
unique challenges associated with efficiently solving the self-join problem on
the GPU. Low-dimensional data often result in higher data densities, causing a
significant
number of distance calculations and a large result set. As dimensionality
increases, index searches become increasingly exhaustive, forming a performance
bottleneck. We advance several techniques to overcome these challenges using
the GPU. The techniques we propose include a GPU-efficient index that employs a
bounded search, a batching scheme to accommodate large result set sizes, and a
reduction in distance calculations through duplicate search removal. Our GPU
self-join outperforms both the search-and-refine and the state-of-the-art
parallel algorithms.
Comment: Accepted for publication in the 4th IEEE International Workshop on
High-Performance Big Data, Deep Learning, and Cloud Computing; to appear in
the Proceedings of the 32nd IEEE International Parallel and Distributed
Processing Symposium Workshops.
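
As a rough illustration of the batching scheme described above, the host-side loop below issues queries in chunks sized so that each chunk's result set is expected to fit in a fixed result buffer; run_range_query_batch stands in for the GPU kernel launch and is hypothetical, as is the per-point result-size estimate.

    # Hypothetical host-side batching loop: GPU result buffers are fixed-size,
    # so queries are issued in chunks that are expected to fit within them.
    def batched_self_join(points, eps, run_range_query_batch,
                          buffer_capacity, est_pairs_per_point):
        results = []
        batch = max(1, buffer_capacity // max(1, est_pairs_per_point))
        for start in range(0, len(points), batch):
            queries = range(start, min(start + batch, len(points)))
            # Each launch writes at most buffer_capacity pairs before transfer.
            results.extend(run_range_query_batch(points, queries, eps))
        return results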
Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU
We investigate and characterize the performance of an important class of
operations on GPUs and Many Integrated Core (MIC) architectures. Our work is
motivated by applications that analyze low-dimensional spatial datasets
captured by high resolution sensors, such as image datasets obtained from whole
slide tissue specimens using microscopy image scanners. We identify the data
access and computation patterns of operations in object segmentation and
feature computation categories. We systematically implement and evaluate the
performance of these core operations on modern CPUs, GPUs, and MIC systems for
a microscopy image analysis application. Our results show that (1) the data
access pattern and parallelization strategy employed by the operations
strongly affect their performance: the performance on a MIC of operations that
perform regular data access is comparable to, and sometimes better than, that
on a GPU; (2) GPUs are significantly more efficient than MICs for operations
and algorithms that access data irregularly, a result of the MIC's low
performance on random data access; and (3) coordinated execution on MICs and
CPUs using a performance-aware task scheduling strategy improves performance
by about 1.29x over a first-come-first-served strategy. The example
application attained an efficiency of 84% in an execution with 192 nodes (3072
CPU cores and 192 MICs).
Comment: 11 pages, 2 figures.
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
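
As a toy illustration of the indexing-then-matching workflow the survey covers, the sketch below performs token blocking followed by Jaccard matching; real ER systems layer schema-agnostic blocking, meta-blocking, and learned matchers on top of this, and the threshold here is arbitrary.

    # Toy end-to-end ER pipeline: token blocking (entity indexing) followed
    # by pairwise Jaccard matching over the candidate pairs.
    from collections import defaultdict
    from itertools import combinations

    def tokens(description):
        toks = set()
        for value in description.values():
            toks.update(str(value).lower().split())
        return toks

    def resolve(entities, threshold=0.5):
        """entities: dict of id -> attribute dict; returns matching id pairs."""
        blocks = defaultdict(set)
        for eid, description in entities.items():
            for tok in tokens(description):
                blocks[tok].add(eid)   # entities sharing a token co-occur
        candidates = set()
        for block in blocks.values():
            candidates.update(combinations(sorted(block), 2))
        matches = []
        for a, b in candidates:
            ta, tb = tokens(entities[a]), tokens(entities[b])
            if len(ta & tb) / len(ta | tb) >= threshold:
                matches.append((a, b))
        return matches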
Technical Report: KNN Joins Using a Hybrid Approach: Exploiting CPU/GPU Workload Characteristics
This paper studies finding the K nearest neighbors (KNN) of all points in a
dataset. Typical solutions to KNN searches use indexing to prune the search,
which reduces the number of candidate points that may be within the set of the
nearest points of each query point. In high dimensionality, index searches
degrade, making the KNN self-join a prohibitively expensive operation in some
scenarios. Furthermore, there are a significant number of distance calculations
needed to determine which points are nearest to each query point. To address
these challenges, we propose a hybrid CPU/GPU approach. Since the CPU and GPU
are considerably different architectures that are best exploited using
different algorithms, we advocate for splitting the work between both
architectures based on the characteristic workloads defined by the query points
in the dataset. As such, we assign dense regions to the GPU, and sparse regions
to the CPU to most efficiently exploit the relative strengths of each
architecture. Critically, we find that the relative performance gains over the
reference implementation across four real-world datasets are a function of the
data properties (size, dimensionality, distribution), and number of neighbors,
K.
Comment: 30 pages, 10 figures, 6 tables.
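
The density-based work split can be sketched as follows; cell_width and density_cutoff are illustrative knobs, and the KNN kernels themselves (GPU search over dense regions, CPU search over sparse ones) are out of scope here.

    # Sketch of the CPU/GPU work split: query points in crowded grid cells go
    # to the GPU queue, points in sparse cells go to the CPU queue.
    from collections import Counter

    def split_workload(points, cell_width, density_cutoff):
        cell_of = [tuple(int(x // cell_width) for x in p) for p in points]
        density = Counter(cell_of)
        gpu_queue, cpu_queue = [], []
        for i, cell in enumerate(cell_of):
            (gpu_queue if density[cell] >= density_cutoff
             else cpu_queue).append(i)
        return gpu_queue, cpu_queue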
COSINE: Compressive Network Embedding on Large-scale Information Networks
There has recently been a surge of approaches that learn low-dimensional
embeddings of nodes in networks. Since many real-world networks are
large-scale, it is inefficient for existing approaches to store large numbers
of parameters in memory and to update them edge by edge. Based on the
observation that nodes with similar neighborhoods are close to each other in
the embedding space, we propose the COSINE (COmpresSIve NE) algorithm, which
reduces the memory footprint and accelerates the training process by sharing
parameters among similar nodes. COSINE applies graph partitioning algorithms
to networks and builds parameter-sharing dependencies among nodes based on the
partitioning result. With parameter sharing among similar nodes, COSINE
injects prior knowledge about higher-order structural information into the
training process, which makes network embedding more efficient and effective.
COSINE can be applied to any embedding lookup method and learns high-quality
embeddings with limited memory and shorter training time. We conduct
experiments on multi-label classification and link prediction, in which the
baselines and our model have the same memory usage. Experimental results show
that COSINE improves the baselines by up to 23% on classification and up to
25% on link prediction. Moreover, the training time of all representation
learning methods decreases by 30% to 70% when using COSINE.
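
A minimal sketch of the parameter-sharing idea, assuming a precomputed node-to-group mapping produced by a graph partitioner (e.g., METIS); COSINE's actual grouping and training procedure is more involved than this.

    # Nodes in the same partition group share one embedding row, shrinking
    # the table from |V| rows to |groups| rows; an update through any node
    # in a group moves the whole group, so similar nodes stay close.
    import numpy as np

    class SharedEmbedding:
        def __init__(self, node_to_group, num_groups, dim, seed=0):
            self.node_to_group = node_to_group   # node id -> group id
            rng = np.random.default_rng(seed)
            self.table = rng.standard_normal((num_groups, dim)) * 0.01

        def lookup(self, node):
            return self.table[self.node_to_group[node]]

        def sgd_update(self, node, grad, lr=0.025):
            self.table[self.node_to_group[node]] -= lr * grad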
Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs
We carry out a comparative performance study of multi-core CPUs, GPUs and
Intel Xeon Phi (Many Integrated Core - MIC) with a microscopy image analysis
application. We experimentally evaluate the performance of computing devices on
core operations of the application. We correlate the observed performance with
the characteristics of computing devices and data access patterns, computation
complexities, and parallelization forms of the operations. The results show a
significant variability in the performance of operations with respect to the
device used. Operations with regular data access perform comparably, and
sometimes better, on a MIC than on a GPU. GPUs are more efficient than MICs
for operations that access data irregularly, because of the MIC's lower
bandwidth for random data accesses. We propose new performance-aware
scheduling strategies that consider the variability in operation speedups. Our
scheduling strategies significantly improve application performance compared
to classic strategies in hybrid configurations.
Comment: 22 pages, 12 figures, 6 tables.
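
A greedy sketch of performance-aware scheduling in the spirit of this work: each operation goes to the device with the earliest estimated finish time, given per-operation runtime estimates. The names and the estimation function are illustrative, not taken from the paper.

    # Greedy performance-aware assignment across heterogeneous devices.
    def schedule(operations, devices, est_time):
        """operations: iterable of op ids; devices: list of device names;
        est_time(op, dev): estimated runtime of op on dev."""
        ready_at = {dev: 0.0 for dev in devices}  # when each device frees up
        assignment = {}
        for op in operations:
            dev = min(devices, key=lambda d: ready_at[d] + est_time(op, d))
            assignment[op] = dev
            ready_at[dev] += est_time(op, dev)
        return assignment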
GPIC - GPU Power Iteration Cluster
This work presents a new clustering algorithm, GPIC, a Graphics Processing
Unit (GPU) accelerated algorithm for Power Iteration Clustering (PIC). Our
algorithm is based on the original PIC proposal, adapted to take advantage of
the GPU architecture while maintaining the original algorithm's properties.
The proposed method was compared against the serial and parallel Spark
implementations, achieving a considerable speed-up on the test problems.
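
For reference, a compact NumPy version of the sequential Power Iteration Clustering that GPIC accelerates (after Lin and Cohen's PIC): a truncated power iteration on the row-normalized affinity matrix yields a one-dimensional embedding, which is then clustered. The quantile-based final step below is a simple stand-in for the k-means step used in PIC, and the tolerances are illustrative.

    # Sequential PIC sketch; GPIC maps the matrix-vector products to the GPU.
    import numpy as np

    def pic(affinity, k, iters=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        W = affinity / affinity.sum(axis=1, keepdims=True)  # row-normalize
        v = rng.random(W.shape[0])
        v /= np.abs(v).sum()
        prev_delta = None
        for _ in range(iters):
            v_next = W @ v
            v_next /= np.abs(v_next).sum()
            delta = np.abs(v_next - v).max()
            v = v_next
            # Stop when successive deltas flatten out (near-convergence).
            if prev_delta is not None and abs(prev_delta - delta) < tol:
                break
            prev_delta = delta
        # Cluster the 1-D embedding (quantile cut as a k-means stand-in).
        edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
        return np.digitize(v, edges)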
Workflow Design Analysis for High Resolution Satellite Image Analysis
Ecological sciences are using imagery from a variety of sources to monitor
and survey populations and ecosystems. Very High Resolution (VHR) satellite
imagery provides an effective dataset for large-scale surveys. Convolutional
Neural Networks have successfully been employed to analyze such imagery and
detect large animals. As the datasets increase in volume, O(TB), and number of
images, O(1k), utilizing High Performance Computing (HPC) resources becomes
necessary. In this paper, we investigate task-parallel, data-driven workflow
designs to support imagery analysis pipelines with heterogeneous tasks on HPC.
We analyze the capabilities of each design when processing a dataset of 3,000
VHR satellite images, for a total of 4 TB. We experimentally model the execution
time of the tasks of the image processing pipeline. We perform experiments to
characterize the resource utilization, total time to completion, and overheads
of each design. Based on the model and on the overhead and utilization
analysis, we show which design approach is best suited to scientific pipelines
with similar characteristics.
Comment: 10 pages.
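
As a toy example of a task-parallel, data-driven design of this kind, the sketch below pushes each image through a pipeline of heterogeneous stages using a process pool; the stage functions are placeholders for steps such as tiling, CNN inference, and geo-referencing.

    # Toy task-parallel workflow: heterogeneous stages chained over a pool.
    from concurrent.futures import ProcessPoolExecutor

    def preprocess(path):        # placeholder: tiling, normalization
        return path

    def detect(tile):            # placeholder: CNN inference
        return tile

    def postprocess(detection):  # placeholder: geo-referencing, counting
        return detection

    def run_pipeline(image_paths, max_workers=8):
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            tiles = pool.map(preprocess, image_paths)
            detections = pool.map(detect, tiles)
            return list(pool.map(postprocess, detections))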