Search CORE

6,528 research outputs found

A quick search method for audio signals based on a piecewise linear representation of feature trajectories

Author: Kashino Kunio
Kimura Akisato
Kurozumi Takayuki
Murase Hiroshi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/10/2007
Field of study

This paper presents a new method for a quick similarity-based search through long unlabeled audio streams to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques enable us to obtain a piecewise linear representation: the dynamic segmentation of feature trajectories and the segment-based Karhunen-L\'{o}eve (KL) transform. The proposed search method guarantees the same search results as the search method without the proposed feature-dimension reduction method in principle. Experiment results indicate significant improvements in search speed. For example the proposed method reduced the total search time to approximately 1/12 that of previous methods and detected queries in approximately 0.3 seconds from a 200-hour audio database.Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and Language Processin

arXiv.org e-Print Archive

Crossref

DRSP : Dimension Reduction For Similarity Matching And Pruning Of Time Series Data Streams

Author: Patnaik L M
Samartha T V
Srikantaiah K C
Venugopal K R
Vishwanath R H
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 01/01/2013
Field of study

Similarity matching and join of time series data streams has gained a lot of relevance in today's world that has large streaming data. This process finds wide scale application in the areas of location tracking, sensor networks, object positioning and monitoring to name a few. However, as the size of the data stream increases, the cost involved to retain all the data in order to aid the process of similarity matching also increases. We develop a novel framework to addresses the following objectives. Firstly, Dimension reduction is performed in the preprocessing stage, where large stream data is segmented and reduced into a compact representation such that it retains all the crucial information by a technique called Multi-level Segment Means (MSM). This reduces the space complexity associated with the storage of large time-series data streams. Secondly, it incorporates effective Similarity Matching technique to analyze if the new data objects are symmetric to the existing data stream. And finally, the Pruning Technique that filters out the pseudo data object pairs and join only the relevant pairs. The computational cost for MSM is O(l*ni) and the cost for pruning is O(DRF*wsize*d), where DRF is the Dimension Reduction Factor. We have performed exhaustive experimental trials to show that the proposed framework is both efficient and competent in comparison with earlier works.Comment: 20 pages,8 figures, 6 Table

arXiv.org e-Print Archive

ePrints@Bangalore University

DROP: Dimensionality Reduction Optimization for Time Series

Author: Bailis Peter
Suri Sahaana
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

Dimensionality reduction is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, termination conditions for stochastic PCA either execute for a predetermined number of iterations, or until convergence of the solution, frequently sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after operating over small (e.g., 1%) subsamples of input data, reducing whole workload runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over Singular-Value-Decomposition-based PCA techniques, and exceeds conventional approaches like FFT and PAA by up to 16x in end-to-end workloads

arXiv.org e-Print Archive

Crossref

Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency

Author: A. Guttman
C. Faloutsos
C.S. Perng
D. Novak
E. Keogh
E. Keogh
F. Korn
H. Sakoe
J. Shieh
M. Batko
Publication venue
Publication date: 01/01/2012
Field of study

Subsequence matching has appeared to be an ideal approach for solving many problems related to the fields of data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can be seen as an input for various subsequence matching approaches. The variety of data types, specific tasks and their partial or full solutions is so wide that the choice, implementation and parametrization of a suitable solution for a given task might be complicated and time-consuming; a possibly fruitful combination of fragments from different research areas may not be obvious nor easy to realize. The leading authors of this field also mention the implementation bias that makes difficult a proper comparison of competing approaches. Therefore we present a new generic Subsequence Matching Framework (SMF) that tries to overcome the aforementioned problems by a uniform frame that simplifies and speeds up the design, development and evaluation of subsequence matching related systems. We identify several relatively separate subtasks solved differently over the literature and SMF enables to combine them in straightforward manner achieving new quality and efficiency. This framework can be used in many application domains and its components can be reused effectively. Its strictly modular architecture and openness enables also involvement of efficient solutions from different fields, for instance efficient metric-based indexes. This is an extended version of a paper published on DEXA 2012.Comment: This is an extended version of a paper published on DEXA 201

arXiv.org e-Print Archive

Crossref

Univerzitní repozitář Masarykovy univerzity

Adaptive Representations for Tracking Breaking News on Twitter

Author: Brigadir Igor
Cunningham Pádraig
Greene Derek
Publication venue
Publication date: 28/11/2014
Field of study

Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard retrieval methods often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on ROUGE metrics indicate that an adaptive approaches are best suited for tracking evolving stories on Twitter.Comment: 8 Pag

arXiv.org e-Print Archive

Research Repository UCD

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Author: Andoni A.
Broder A. Z.
Li P.
Lv Q.
Shrivastava A.
Shrivastava A.
Weber R.
Publication venue
Publication date: 03/07/2018
Field of study

We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (

n^2D

), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results

arXiv.org e-Print Archive

Crossref

Using bag-of-concepts to improve the performance of support vector machines in text categorization

Author: Cöster Rickard
Sahlgren Magnus
Publication venue
Publication date: 01/01/2004
Field of study

This paper investigates the use of concept-based representations for text categorization. We introduce a new approach to create concept-based text representations, and apply it to a standard text categorization collection. The representations are used as input to a Support Vector Machine classifier, and the results show that there are certain categories for which concept-based representations constitute a viable supplement to word-based ones. We also demonstrate how the performance of the Support Vector Machine can be improved by combining representations

CiteSeerX

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space

Author: Robertson David L.
Tapinos Avraam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/08/2017
Field of study

Sequencing technologies allow for an in-depth analysis of biological species but the size of the generated datasets introduce a number of analytical challenges. Recently, we demonstrated the application of numerical sequence representations and data transformations for the alignment of short reads to a reference genome. Here, we expand out approach for de novo assembly of short reads. Our results demonstrate that highly compressed data can encapsulate the signal suffi- ciently to accurately assemble reads to big contigs or complete genomes

Crossref

Enlighten