1,203 research outputs found
Computing Double Precision Euclidean Distances using GPU Tensor Cores
Tensor cores (TCs) are a type of Application-Specific Integrated Circuit
(ASIC) and are a recent addition to Graphics Processing Unit (GPU)
architectures. As such, TCs are purposefully designed to greatly improve the
performance of Matrix Multiply-Accumulate (MMA) operations. While TCs are
heavily studied for machine learning and closely related fields, where their
high efficiency is undeniable, MMA operations are not unique to these fields.
More generally, any computation that can be expressed as MMA operations can
leverage TCs, and potentially benefit from their higher computational
throughput compared to other general-purpose cores, such as CUDA cores on
Nvidia GPUs. In this paper, we propose the first double precision (FP64)
Euclidean distance calculation algorithm, which is expressed as MMA operations
to leverage TCs on Nvidia GPUs, rather than the more commonly used CUDA cores.
To show that the Euclidean distance can be accelerated in a real-world
application, we evaluate our proposed TC algorithm on the distance similarity
self-join problem, as the most computationally intensive part of the algorithm
consists of computing distances in a multi-dimensional space. We find that the
performance gain from using the tensor core algorithm over the CUDA core
algorithm depends weakly on the dataset size and distribution, but is strongly
dependent on data dimensionality. Overall, TCs are a compelling alternative to
CUDA cores, particularly when the data dimensionality is low (), as we
achieve an average speedup of and up to against a
state-of-the-art GPU distance similarity self-join algorithm. Furthermore,
because this paper is among the first to explore the use of TCs for FP64
general-purpose computation, future research is promising. Comment: Accepted for publicatio
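As an illustration of how a Euclidean distance can be phrased in terms of matrix multiplication, the sketch below uses the textbook identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, where the cross terms for all pairs form the product A·B^T. This is only one common formulation and is not claimed to be the algorithm proposed in the paper; the function names `matmul_transposed` and `pairwise_distances` are hypothetical, and plain Python lists stand in for the tiles a tensor core would process.

```python
import math


def matmul_transposed(a, b):
    """Return the matrix product a @ b^T for lists of row vectors.

    This is the multiply-accumulate shaped step; on a GPU it is the part
    that could be mapped to tensor cores (assumption for illustration).
    """
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def pairwise_distances(points_a, points_b):
    """Euclidean distances between every point in points_a and points_b."""
    # Squared norms of each point: ||a||^2 and ||b||^2.
    norms_a = [sum(x * x for x in p) for p in points_a]
    norms_b = [sum(x * x for x in p) for p in points_b]
    # Cross terms a . b for all pairs, computed as one matrix product.
    cross = matmul_transposed(points_a, points_b)
    # Clamp at zero: rounding can make the squared distance slightly negative.
    return [[math.sqrt(max(norms_a[i] + norms_b[j] - 2.0 * cross[i][j], 0.0))
             for j in range(len(points_b))]
            for i in range(len(points_a))]
```

For the two points (0, 0) and (3, 4), `pairwise_distances` applied to the set against itself gives zeros on the diagonal and 5 off the diagonal.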
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the amount of data parallelism available is significantly
higher than the small number of processing units enabled by commodity systems
(e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces the data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM
architectures, respectively.
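To make the quadratic cost of sDTW concrete, here is a minimal sketch of the textbook subsequence DTW recurrence. It is not MATSA's in-memory implementation; the function name `sdtw_distance` and the absolute-difference cost are assumptions chosen for illustration. The nested loops over the query and the reference are what make the work grow with the product of the two lengths.

```python
def sdtw_distance(query, reference):
    """Cost of the best alignment of `query` to any subsequence of `reference`.

    Both arguments are non-empty sequences of numbers. The local cost is the
    absolute difference (an assumption; other costs are possible).
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    # Row 0 is all zeros: the match may start anywhere in the reference.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0 is infinite: a query prefix cannot align to nothing.
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may also end anywhere in the reference.
    return min(prev[1:])
```

Only two rows of the dynamic programming table are kept, which reduces the memory footprint to linear in the reference length, but every cell is still computed, so the arithmetic remains quadratic.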
Fast Knowledge Graph Completion using Graphics Processing Units
Knowledge graphs can be used in many areas related to data semantics, such as
question-answering systems and knowledge-based systems. However, currently
constructed knowledge graphs need to be complemented with additional
relations. This task is called knowledge graph completion. To add new
relations to the existing knowledge graph by using knowledge graph embedding
models, we have to evaluate vector operations, where
is the number of entities and is the number of relation types. It is very
costly.
In this paper, we provide an efficient knowledge graph completion framework
on GPUs to get new relations using knowledge graph embedding vectors. In the
proposed framework, we first define "transformable to a metric space" and then
provide a method to transform the knowledge graph completion problem into the
similarity join problem for a model which is "transformable to a metric space".
After that, to efficiently process the similarity join problem, we derive
formulas using the properties of a metric space. Based on the formulas, we
develop a fast knowledge graph completion algorithm. Finally, we experimentally
show that our framework can efficiently process the knowledge graph completion
problem.
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
Serving Deep Learning Model in Relational Databases
Serving deep learning (DL) models on relational data has become a critical
requirement across diverse commercial and scientific domains, sparking growing
interest recently. In this visionary paper, we embark on a comprehensive
exploration of representative architectures to address the requirement. We
highlight three pivotal paradigms: The state-of-the-art DL-Centric architecture
offloads DL computations to dedicated DL frameworks. The potential UDF-Centric
architecture encapsulates one or more tensor computations into User Defined
Functions (UDFs) within the database system. The
potential Relation-Centric architecture aims to represent a large-scale tensor
computation through relational operators. While each of these architectures
demonstrates promise in specific use scenarios, we identify urgent requirements
for seamless integration of these architectures and the middle ground between
these architectures. We delve into the gaps that impede the integration and
explore innovative strategies to close them. We present a pathway to establish
a novel database system for enabling a broad class of data-intensive DL
inference applications. Comment: Authors are ordered alphabetically; Jia Zou is the corresponding
autho
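One way to picture a tensor computation expressed through relational operators is matrix multiplication written as a join followed by a grouped sum. The sketch below is an illustrative assumption, not the system described in the paper: each matrix is stored as a relation of (row, column, value) tuples, and the hypothetical function `relational_matmul` mimics a join on the shared index with a GROUP BY and SUM over the output coordinates.

```python
from collections import defaultdict


def relational_matmul(a_rows, b_rows):
    """Multiply two matrices stored as relations.

    a_rows holds (i, k, value) tuples and b_rows holds (k, j, value) tuples.
    The result maps (i, j) to the sum of a.value * b.value over matching k,
    i.e. a join on k followed by GROUP BY (i, j) with SUM.
    """
    # Build a hash index on the join key k, as a hash join would.
    b_by_k = defaultdict(list)
    for k, j, value in b_rows:
        b_by_k[k].append((j, value))
    # Probe the index and aggregate the products per output cell.
    out = defaultdict(float)
    for i, k, a_value in a_rows:
        for j, b_value in b_by_k.get(k, ()):
            out[(i, j)] += a_value * b_value
    return dict(out)
```

Zero entries can simply be left out of the input relations, which is one reason this representation is attractive for sparse data.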
Improving the performance of similarity joins using graphics processing unit
Ankara: The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2012. Thesis (Master's) -- Bilkent University, 2012. Includes bibliographical references.
The similarity join is an important operation in data mining and is used in
many applications from varying domains. A similarity join operator takes one or
two sets of data points and outputs pairs of points whose distance in the data
space is within a certain threshold value, ε. The baseline nested loop approach
computes the distances between all pairs of objects. For large sets of objects,
where the nested loop paradigm yields excessive query times, accelerating
this operator becomes more important. The computing capability of recent
GPUs, together with a general-purpose parallel computing architecture (CUDA),
has attracted much research. With this motivation, we propose two similarity
join algorithms for the Graphics Processing Unit (GPU). To exploit the advantages of
general-purpose GPU computing, we first propose an improved nested loop join
algorithm (GPU-INLJ) for the specific environment of the GPU. We also present a
partitioning-based join algorithm (KMEANS-JOIN) that guarantees each partition
can be joined independently without missing any join pair. Our experiments
demonstrate massive performance gains and the suitability of our algorithms for
large datasets. Korkmaz, Zeynep M.S
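The baseline nested loop approach mentioned in the abstract can be sketched as follows for a self-join on a single set of points. This is a plain CPU illustration, not the thesis's GPU-INLJ or KMEANS-JOIN algorithms; the function name `nested_loop_self_join` and the parameter name `epsilon` for the distance threshold are hypothetical.

```python
import math


def nested_loop_self_join(points, epsilon):
    """Return index pairs (i, j), i < j, whose points lie within epsilon.

    Every pair is compared once, so the number of distance computations
    grows quadratically with the number of points.
    """
    result = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # math.dist computes the Euclidean distance (Python 3.8+).
            if math.dist(points[i], points[j]) <= epsilon:
                result.append((i, j))
    return result
```

Each distance test is independent of the others, which is why this baseline is a natural candidate for parallel execution.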