Computing Double Precision Euclidean Distances using GPU Tensor Cores
Tensor cores (TCs) are a type of Application-Specific Integrated Circuit
(ASIC) and are a recent addition to Graphics Processing Unit (GPU)
architectures. As such, TCs are purposefully designed to greatly improve the
performance of Matrix Multiply-Accumulate (MMA) operations. While TCs are
heavily studied for machine learning and closely related fields, where their
high efficiency is undeniable, MMA operations are not unique to these fields.
More generally, any computation that can be expressed as MMA operations can
leverage TCs, and potentially benefit from their higher computational
throughput compared to other general-purpose cores, such as CUDA cores on
Nvidia GPUs. In this paper, we propose the first double precision (FP64)
Euclidean distance calculation algorithm, which is expressed as MMA operations
to leverage TCs on Nvidia GPUs, rather than the more commonly used CUDA cores.
To show that the Euclidean distance can be accelerated in a real-world
application, we evaluate our proposed TC algorithm on the distance similarity
self-join problem, as the most computationally intensive part of the algorithm
consists of computing distances in a multi-dimensional space. We find that the
performance gain from using the tensor core algorithm over the CUDA core
algorithm depends weakly on the dataset size and distribution, but is strongly
dependent on data dimensionality. Overall, TCs are a compelling alternative to
CUDA cores, particularly when the data dimensionality is low, as we
achieve notable average and maximum speedups over a
state-of-the-art GPU distance similarity self-join algorithm. Furthermore,
because this paper is among the first to explore the use of TCs for FP64
general-purpose computation, future research is promising.
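For context, the usual way to cast pairwise squared Euclidean distances as matrix multiply-accumulate work is to expand the square so that the dominant term becomes a matrix product; the following is a minimal sketch of that idea, not necessarily the exact formulation used in the paper:
\[
d^2(a_i, b_j) = \lVert a_i \rVert^2 - 2\, a_i \cdot b_j + \lVert b_j \rVert^2,
\qquad
D^{(2)} = \mathbf{n}_A \mathbf{1}^{\top} + \mathbf{1}\, \mathbf{n}_B^{\top} - 2\, A B^{\top},
\]
where $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{m \times d}$ store the points as rows and $\mathbf{n}_A$, $\mathbf{n}_B$ are the vectors of their squared row norms. The $-2AB^{\top}$ term is the part that maps naturally onto tensor-core MMA instructions, while the norm corrections are comparatively cheap rank-one updates.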
Fast Knowledge Graph Completion using Graphics Processing Units
Knowledge graphs can be used in many areas related to data semantics, such as
question-answering systems and knowledge-based systems. However, currently
constructed knowledge graphs need to be complemented with additional relations
to improve their knowledge coverage; this task is called knowledge graph
completion. To add new relations to an existing knowledge graph using knowledge
graph embedding models, the number of vector operations that must be evaluated
grows with both the number of entities and the number of relation types, which
is very costly.
In this paper, we provide an efficient knowledge graph completion framework
on GPUs to get new relations using knowledge graph embedding vectors. In the
proposed framework, we first define "transformable to a metric space" and then
provide a method to transform the knowledge graph completion problem into the
similarity join problem for a model which is "transformable to a metric space".
After that, to efficiently process the similarity join problem, we derive
formulas using the properties of a metric space. Based on the formulas, we
develop a fast knowledge graph completion algorithm. Finally, we experimentally
show that our framework can efficiently process the knowledge graph completion
problem.
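One concrete example of the kind of metric-space property such formulas typically rely on (an illustration on our part, since the abstract does not spell the formulas out) is the triangle inequality, which yields a distance bound that lets a similarity join skip candidate pairs:
\[
\lvert d(q, p) - d(p, x) \rvert \;\le\; d(q, x) \;\le\; d(q, p) + d(p, x),
\]
so if the distances of a query point $q$ and a candidate $x$ to a shared pivot $p$ are precomputed, and the lower bound $\lvert d(q,p) - d(p,x)\rvert$ already exceeds the join threshold, the pair can be pruned without ever evaluating $d(q, x)$.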
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Due to its usefulness in data enrichment for data analysis tasks, joinable
table discovery has become an important operation in data lake management.
Existing approaches target equi-joins, the most common way of combining tables
for creating a unified view, or semantic joins, which tolerate misspellings and
different formats to deliver more join results. They are either exact solutions
whose running time is linear in the sizes of query column and target table
repository or approximate solutions lacking precision. In this paper, we
propose Deepjoin, a deep learning model for accurate and efficient joinable
table discovery. Our solution is an embedding-based retrieval, which employs a
pre-trained language model (PLM) and is designed as one framework serving both
equi- and semantic joins. We propose a set of contextualization options to
transform column contents to a text sequence. The PLM reads the sequence and is
fine-tuned to embed columns to vectors such that columns are expected to be
joinable if they are close to each other in the vector space. Since the output
of the PLM is fixed in length, the subsequent search procedure becomes
independent of the column size. With a state-of-the-art approximate nearest
neighbor search algorithm, the search time is logarithmic in the repository
size. To train the model, we devise the techniques for preparing training data
as well as data augmentation. The experiments on real datasets demonstrate that
by training on a small subset of a corpus, Deepjoin generalizes to large
datasets and its precision consistently outperforms other approximate
solutions. Deepjoin is even more accurate than an exact solution to semantic
joins when evaluated with labels from experts. Moreover, when equipped with a
GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
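As a rough sketch of the embedding-and-retrieval pattern described above (not DeepJoin's actual code: the encoder model, the column-to-text contextualization, the ANN library, and the toy data are all illustrative assumptions), one could embed column texts with a pre-trained encoder and query an approximate nearest neighbor index:

import numpy as np
import faiss                                            # ANN index (illustrative choice)
from sentence_transformers import SentenceTransformer  # stand-in for the fine-tuned PLM

def column_to_text(name, cells):
    # One possible contextualization: column name followed by sampled cell values.
    return name + ": " + ", ".join(str(c) for c in cells[:50])

# Toy repository of (column name, cell values) pairs; in practice this is the data lake.
repository = [
    ("country", ["France", "Japan", "Brazil", "Canada"]),
    ("city", ["Paris", "Tokyo", "Rio de Janeiro", "Toronto"]),
    ("product", ["laptop", "phone", "tablet"]),
]

# Embed every column once; the embeddings have a fixed length, so the
# subsequent search no longer depends on column size.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model, not DeepJoin's
repo_vecs = model.encode([column_to_text(n, c) for n, c in repository],
                         normalize_embeddings=True)

index = faiss.IndexHNSWFlat(repo_vecs.shape[1], 32)   # HNSW: roughly logarithmic search
index.add(np.ascontiguousarray(repo_vecs, dtype="float32"))

# Query: embed the query column and retrieve its nearest columns as join candidates.
query_vec = model.encode([column_to_text("nation", ["Germany", "Italy", "Spain"])],
                         normalize_embeddings=True)
_, nbrs = index.search(np.ascontiguousarray(query_vec, dtype="float32"), 2)
print([repository[i][0] for i in nbrs[0]])             # country-like columns rank first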
Keyframe-based monocular SLAM: design, survey, and future directions
Extensive research in the field of monocular SLAM for the past fifteen years
has yielded workable systems that found their way into various applications in
robotics and augmented reality. Although filter-based monocular SLAM systems
were common for a time, the more efficient keyframe-based solutions are
becoming the de facto methodology for building a monocular SLAM system. The
objective of this paper is threefold: first, the paper serves as a guideline
for people seeking to design their own monocular SLAM according to specific
environmental constraints. Second, it presents a survey that covers the various
keyframe-based monocular SLAM systems in the literature, detailing the
components of their implementation, and critically assessing the specific
strategies adopted in each proposed solution. Third, the paper provides insight
into the direction of future research in this field, to address the major
limitations still facing monocular SLAM, namely illumination changes,
initialization, highly dynamic motion, poorly textured scenes, repetitive
textures, map maintenance, and failure recovery.
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the length of the time series, resulting in two performance
implications. First, the amount of data parallelism available is significantly
higher than the small number of processing units enabled by commodity systems
(e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM
architectures, respectively.
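For reference, the quadratic cost mentioned above comes from the dynamic-programming recurrence behind subsequence DTW; the following is a minimal sketch of the standard formulation, not of MATSA's in-memory column-wise mapping:

import numpy as np

def sdtw(query, series):
    # Subsequence DTW: the match may start anywhere in the series, so the
    # first row is initialized to zero instead of accumulating along it.
    n, m = len(query), len(series)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (query[i - 1] - series[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    # The best-matching subsequence ends at the minimum of the last row.
    return D[n, 1:].min()

print(sdtw(np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 2.0, 3.0, 5.0])))

Every one of the (n+1) x (m+1) cells is computed exactly once, so both the work and the memory footprint grow quadratically as the query and series lengths grow together, which is what makes the algorithm highly parallel yet memory-bound.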