Search CORE

36,711 research outputs found

Convolutional Embedding for Edit Distance

Author: Cheng James
Dai Xinyan
Wang Yuxuan
Yan Xiao
Yang Han
Zhou Kaiwen
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/05/2020
Field of study

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.Comment: Accepted by the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 202

arXiv.org e-Print Archive

Crossref

Melody recognition with learned edit distances

Author: A. Dempster
D. Rizo
G. Aloupis
G. Bouchard
J. Oncina
L. Boyer
M. Bernard
M. Bernard
M. Mongeau
R. Durbin
S. Doraisamy
S.M. Selkow
S.V. Ristad
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008
Field of study

In a music recognition task, the classification of a new melody is often achieved by looking for the closest piece in a set of already known prototypes. The definition of a relevant similarity measure becomes then a crucial point. So far, the edit distance approach with a-priori fixed operation costs has been one of the most used to accomplish the task. In this paper, the application of a probabilistic learning model to both string and tree edit distances is proposed and is compared to a genetic algorithm cost fitting approach. The results show that both learning models outperform fixed-costs systems, and that the probabilistic approach is able to describe consistently the underlying melodic similarity model.This work was funded by the French ANR Marmota project, the Spanish PROSEMUS project (TIN2006-14932-C02), the research programme Consolider Ingenio 2010 (MIPRCV, CSD2007-00018), and the Pascal Network of Excellence

Repositorio Institucional de la Universidad de Alicante

HAL-UJM

Crossref

HAL AMU

Learning Constrained Edit State Machines

Author: Boyer Laurent
Gandrillon Olivier
Habrard Amaury
Pellerin Mathilde
Sebban Marc
Publication venue: HAL CCSD
Publication date: 02/11/2009
Field of study

International audienceLearning the parameters of the edit distance has been increasingly studied during the past few years to improve the assessment of similarities between structured data, such as strings, trees or graphs. Often based on the optimization of the likelihood of pairs of data, the learned models usually take the form of probabilistic state machines, such as pair-Hidden Markov Models (pair-HMM), stochastic transducers, or probabilistic deterministic automata. Although the use of such models has lead to significant improvements of edit distance-based classification tasks, a new challenge has appeared on the horizon: How integrating background knowledge during the learning process? This is the subject matter of this paper in the case of (input,output) pairs of strings. We present a generalization of the pair-HMM in the form of a constrained state machine, where a transition between two states is driven by constraints fulfilled on the input string. Experimental results are provided on a task in molecular biology, aiming to detect transcription factor binding sites

HAL-UJM

Crossref

Hal - Université Grenoble Alpes

HAL AMU

HAL Université de Savoie

A Survey on Metric Learning for Feature Vectors and Structured Data

Author: Bellet Aurélien
Habrard Amaury
Sebban Marc
Publication venue
Publication date: 01/01/2013
Field of study

The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new method

arXiv.org e-Print Archive

HAL-UJM

Space-efficient Feature Maps for String Alignment Kernels

Author: CC Chang
G Cormode
H Lodhi
H Saigo
M Kanehisa
MC Ferris
RE Fan
S Kim
T Gärtner
T Hofmann
TF Smith
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computation complexity in the number of input strings, which limits large-scale applications in practice. We address this need by presenting the first approximation for string alignment kernels, which we call space-efficient feature maps for edit distance with moves (SFMEDM), by leveraging a metric embedding named edit sensitive parsing (ESP) and feature maps (FMs) of random Fourier features (RFFs) for large-scale string analyses. The original FMs for RFFs consume a huge amount of memory proportional to the dimension d of input vectors and the dimension D of output vectors, which prohibits its large-scale applications. We present novel space-efficient feature maps (SFMs) of RFFs for a space reduction from O(dD) of the original FMs to O(d) of SFMs with a theoretical guarantee with respect to concentration bounds. We experimentally test SFMEDM on its ability to learn SVM for large-scale string classifications with various massive string data, and we demonstrate the superior performance of SFMEDM with respect to prediction accuracy, scalability and computation efficiency.Comment: Full version for ICDM'19 pape

arXiv.org e-Print Archive

Crossref

The IT University of Copenhagen's Repository