15,008 research outputs found
Tree Edit Distance Learning via Adaptive Symbol Embeddings
Metric learning has the aim to improve classification accuracy by learning a
distance measure which brings data points from the same class closer together
and pushes data points from different classes further apart. Recent research
has demonstrated that metric learning approaches can also be applied to trees,
such as molecular structures, abstract syntax trees of computer programs, or
syntax trees of natural language, by learning the cost function of an edit
distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree.
However, learning such costs directly may yield an edit distance which violates
metric axioms, is challenging to interpret, and may not generalize well. In
this contribution, we propose a novel metric learning approach for trees which
we call embedding edit distance learning (BEDL) and which learns an edit
distance indirectly by embedding the tree nodes as vectors, such that the
Euclidean distance between those vectors supports class discrimination. We
learn such embeddings by reducing the distance to prototypical trees from the
same class and increasing the distance to prototypical trees from different
classes. In our experiments, we show that BEDL improves upon the
state-of-the-art in metric learning for trees on six benchmark data sets,
ranging from computer science over biomedical data to a natural-language
processing data set containing over 300,000 nodes.Comment: Paper at the International Conference of Machine Learning (2018),
2018-07-10 to 2018-07-15 in Stockholm, Swede
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied on clean texts, but they do not necessarily deliver good performance
against noisy texts. Challenges with paraphrase detection on user generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus
- …