The Use of MPI and OpenMP Technologies for Subsequence Similarity Search in Very Large Time Series on Computer Cluster System with Nodes Based on the Intel Xeon Phi Knights Landing Many-core Processor
Nowadays, subsequence similarity search is required in a wide range of time
series mining applications: climate modeling, financial forecasts, medical
research, etc. In most of these applications, the Dynamic Time Warping (DTW)
similarity measure is used, since DTW is empirically confirmed as one of the
best similarity measures for most subject domains. Since the DTW measure has
quadratic computational complexity w.r.t. the length of the query subsequence, a
number of parallel algorithms for various many-core architectures have been
developed, namely FPGA, GPU, and Intel MIC. In this article, we propose a new
parallel algorithm for subsequence similarity search in very large time series
on computer cluster systems with nodes based on Intel Xeon Phi Knights Landing
(KNL) many-core processors. Computations are parallelized on two levels as
follows: through MPI at the level of all cluster nodes, and through OpenMP
within one cluster node. The algorithm involves additional data structures and
redundant computations, which make it possible to effectively use the
capabilities of vector computations on Phi KNL. Experimental evaluation of the
algorithm on real-world and synthetic datasets shows that it is highly
scalable.
Comment: Accepted for publication in the "Numerical Methods and Programming" journal (http://num-meth.srcc.msu.ru/english/; in Russian, "Vychislitelnye Metody i Programmirovanie"), Russia.
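The quadratic complexity mentioned above comes from the DTW dynamic-programming recurrence itself. A minimal sketch (illustrative only, not the paper's MPI/OpenMP implementation; the function name `dtw` is ours) makes the n-by-m cost matrix explicit:

```python
# Minimal DTW sketch: filling an (n+1) x (m+1) cost matrix is what
# makes the complexity quadratic in the query length.
def dtw(query, subseq):
    n, m = len(query), len(subseq)
    INF = float("inf")
    # d[i][j] = cost of aligning query[:i] with subseq[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (query[i - 1] - subseq[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

A production version would keep only two rows of the matrix and vectorize the inner loop, which is exactly the kind of structure the paper exploits on Phi KNL.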
WLCSSCuda: a CUDA accelerated template matching method for gesture recognition
Template matching methods can benefit from multi-core architectures to parallelise and accelerate the matching of multiple templates. We present WLCSSCuda: a GPU-accelerated implementation of the Warping Longest Common Subsequence (WLCSS) pattern recognition algorithm. We evaluate our method on 4 NVIDIA GPUs and 4 multi-core CPUs. In the best case, we observe a 67-times speedup of the GPU implementation over the multithreaded CPU implementation.
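Matching many templates against one stream is embarrassingly parallel, which is what makes it a good fit for multi-core and GPU execution. A hedged CPU-side sketch (not the WLCSSCuda code; `score` is a placeholder distance, not the WLCSS measure):

```python
# One worker per template: the per-template scores are independent,
# so they can be computed concurrently.
from concurrent.futures import ThreadPoolExecutor

def score(template, stream):
    # Placeholder similarity: sum of absolute differences over the
    # overlapping prefix; a real system would use WLCSS/DTW here.
    return sum(abs(a - b) for a, b in zip(template, stream))

def match_all(templates, stream, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda t: score(t, stream), templates))
```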
Accelerating Time Series Analysis via Processing using Non-Volatile Memories
Time Series Analysis (TSA) is a critical workload for consumer-facing
devices. Accelerating TSA is vital for many domains as it enables the
extraction of valuable information and the prediction of future events. The
state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping
(sDTW) algorithm. However, sDTW's computational complexity increases
quadratically with the time series' length, resulting in two performance
implications. First, the amount of data parallelism available is significantly
higher than the small number of processing units enabled by commodity systems
(e.g., CPUs). Second, sDTW is bottlenecked by memory because it 1) has low
arithmetic intensity and 2) incurs a large memory footprint. To tackle these
two challenges, we leverage Processing-using-Memory (PuM) by performing in-situ
computation where data resides, using the memory cells. PuM provides a
promising solution to alleviate data movement bottlenecks and exposes immense
parallelism.
In this work, we present MATSA, the first MRAM-based Accelerator for Time
Series Analysis. The key idea is to exploit magneto-resistive memory crossbars
to enable energy-efficient and fast time series computation in memory. MATSA
provides the following key benefits: 1) it leverages high levels of parallelism
in the memory substrate by exploiting column-wise arithmetic operations, and 2)
it significantly reduces data movement costs by performing computation using
the memory cells. We evaluate three versions of MATSA to match the requirements
of different environments (e.g., embedded, desktop, or HPC computing) based on
MRAM technology trends. We perform a design space exploration and demonstrate
that our HPC version of MATSA can improve performance by 7.35x/6.15x/6.31x and
energy efficiency by 11.29x/4.21x/2.65x over server CPU, GPU and PNM
architectures, respectively.
High-Throughput DTW accelerator with minimum area in AMD FPGA by HLS
Dynamic Time Warping (DTW) is a dynamic programming
algorithm that is known to be one of the best methods
to measure the similarity between two signals, even if there are
variations in their speed. It is extensively used in many
machine learning algorithms, especially for pattern recognition
and classification. Unfortunately, it has a quadratic complexity,
which results in very high computational costs. Furthermore,
its data dependencies also make it very difficult to parallelize.
Special attention has been paid to computing DTW on the edge,
as a way to reduce the communication load in Internet-of-Things
applications. In this work, we propose a minimum-area
implementation of the DTW algorithm in AMD FPGAs with
optimal use of the resources. That is achieved by maximizing
the use time of the resources and taking advantage of the inner
structure of the AMD FPGAs. This architecture could be used in
small devices or as the basis for a multi-core implementation with
very high throughput.
Funding: MCIN/AEI/10.13039/501100011033 and European Union Next Generation EU/PRTR under Project TED2021-131527B-I00; the Fondo Europeo de Desarrollo Regional (UMA20-FEDERJA-059); and the AMD (Xilinx) University Program.
Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech
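The data dependency the abstract mentions is usually worked around by sweeping the DP matrix in anti-diagonal (wavefront) order: all cells with i + j = k depend only on the two previous diagonals, so an FPGA can dedicate one processing element per cell of a diagonal. A hedged software sketch of that traversal (function name `dtw_wavefront` is ours):

```python
# Wavefront DTW sketch: within one anti-diagonal k, the inner loop's
# iterations are independent and could run in parallel; the serial
# version shown here produces the same result as row-major DTW.
def dtw_wavefront(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for k in range(2, n + m + 1):                  # anti-diagonal index
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i                              # so i + j == k
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```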
NATSA: A Near-Data Processing Accelerator for Time Series Analysis
Time series analysis is a key technique for extracting and predicting events
in domains as diverse as epidemiology, genomics, neuroscience, environmental
sciences, economics, and more. Matrix profile, the state-of-the-art algorithm
to perform time series analysis, computes the most similar subsequence for a
given query subsequence within a sliced time series. Matrix profile has low
arithmetic intensity, but it typically operates on large amounts of time series
data. In current computing systems, this data needs to be moved between the
off-chip memory units and the on-chip computation units for performing matrix
profile. This causes a major performance bottleneck as data movement is
extremely costly in terms of both execution time and energy.
In this work, we present NATSA, the first Near-Data Processing accelerator
for time series analysis. The key idea is to exploit modern 3D-stacked High
Bandwidth Memory (HBM) to enable efficient and fast specialized matrix profile
computation near memory, where time series data resides. NATSA provides three
key benefits: 1) quickly computing the matrix profile for a wide range of
applications by building specialized energy-efficient floating-point arithmetic
processing units close to HBM, 2) improving the energy efficiency and execution
time by reducing the need for data movement over slow and energy-hungry buses
between the computation units and the memory units, and 3) analyzing time
series data at scale by exploiting low-latency, high-bandwidth, and
energy-efficient memory access provided by HBM. Our experimental evaluation
shows that NATSA improves performance by up to 14.2x (9.9x on average) and
reduces energy by up to 27.2x (19.4x on average), over the state-of-the-art
multi-core implementation. NATSA also improves performance by 6.3x and reduces
energy by 10.2x over a general-purpose NDP platform with 64 in-order cores.
Comment: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020).
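The matrix profile computed by NATSA records, for every subsequence, the z-normalized Euclidean distance to its nearest non-trivial neighbor. A brute-force sketch of that definition (illustrative baseline only, not NATSA's optimized kernel; names are ours):

```python
# Naive matrix profile: O(n^2 * m), the workload that specialized
# near-memory hardware accelerates.
import math

def znorm(w):
    mu = sum(w) / len(w)
    sd = math.sqrt(sum((x - mu) ** 2 for x in w) / len(w)) or 1.0
    return [(x - mu) / sd for x in w]

def matrix_profile(ts, m):
    windows = [znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)]
    profile = []
    for i, wi in enumerate(windows):
        best = math.inf
        for j, wj in enumerate(windows):
            if abs(i - j) < m:      # exclusion zone: skip trivial matches
                continue
            best = min(best, math.dist(wi, wj))
        profile.append(best)
    return profile
```

The low arithmetic intensity is visible here: each distance is a single pass over two length-m windows, so performance is dominated by moving window data, which motivates computing near where it resides.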
GPU Acceleration of Melody Accurate Matching in Query-by-Humming
With the increasing scale of melody databases, query-by-humming systems face a trade-off between response speed and retrieval accuracy. Melody accurate matching is the key factor restricting the response speed. In this paper, we present a GPU acceleration method for melody accurate matching, in order to improve the response speed without reducing retrieval accuracy. The method develops two parallel strategies (intra-task parallelism and inter-task parallelism) to obtain acceleration. The efficiency of our method is validated through extensive experiments. Evaluation results show that our single-GPU implementation achieves a 20x to 40x speedup compared to the execution time of a typical general-purpose CPU.
Tuning the Computational Effort: An Adaptive Accuracy-aware Approach Across System Layers
This thesis introduces a novel methodology to realize accuracy-aware systems, which will help designers integrate accuracy awareness into their systems. It proposes an adaptive accuracy-aware approach across system layers that addresses current challenges in that domain, combining and tuning accuracy-aware methods on different system layers. To widen the scope of accuracy-aware computing, including approximate computing, to other domains, this thesis presents innovative accuracy-aware methods and techniques for different system layers.
The required tuning of the accuracy-aware methods is handled by a configuration layer that adjusts the available knobs of the accuracy-aware methods built into a system.