Distributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and
they also ignore semantics of the words. For example, "powerful," "strong" and
"Paris" are equally distant. In this paper, we propose Paragraph Vector, an
unsupervised algorithm that learns fixed-length feature representations from
variable-length pieces of texts, such as sentences, paragraphs, and documents.
Our algorithm represents each document by a dense vector which is trained to
predict words in the document. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-words models. Empirical results
show that Paragraph Vectors outperform bag-of-words models as well as other
techniques for text representations. Finally, we achieve new state-of-the-art
results on several text classification and sentiment analysis tasks.
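The Paragraph Vector algorithm is implemented in gensim as Doc2Vec. Below is a minimal sketch of training and inference; the toy corpus and hyperparameters are illustrative assumptions, not the paper's experimental setup.

    # Minimal Paragraph Vector sketch using gensim's Doc2Vec (an
    # implementation of this paper's algorithm). The corpus and
    # hyperparameters are illustrative assumptions.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        "the movie was powerful and moving",
        "a strong performance by the lead actor",
        "we visited paris in the spring",
    ]
    documents = [TaggedDocument(words=text.split(), tags=[i])
                 for i, text in enumerate(corpus)]

    # dm=1 selects the distributed-memory variant (PV-DM), which trains
    # the document vector to help predict words in the document.
    model = Doc2Vec(documents, vector_size=50, window=2, min_count=1,
                    epochs=40, dm=1)

    # Infer a fixed-length vector for an unseen, variable-length text.
    vec = model.infer_vector("a powerful strong film".split())
    print(vec.shape)  # (50,)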
Learning Simpler Language Models with the Differential State Framework
Learning useful information across long time lags is a critical and difficult
problem for temporal neural models in tasks such as language modeling. Existing
architectures that address the issue are often complex and costly to train. The
Differential State Framework (DSF) is a simple and high-performing design that
unifies previously introduced gated neural models. DSF models maintain
longer-term memory by learning to interpolate between a fast-changing
data-driven representation and a slowly changing, implicitly stable state. This
requires hardly any more parameters than a classical, simple recurrent network.
Within the DSF framework, a new architecture is presented, the Delta-RNN. In
language modeling at the word and character levels, the Delta-RNN outperforms
popular complex architectures, such as the Long Short Term Memory (LSTM) and
the Gated Recurrent Unit (GRU), and, when regularized, performs comparably to
several state-of-the-art baselines. At the subword level, the Delta-RNN's
performance is comparable to that of complex gated architectures.
Comment: Edits/revisions applied throughout document
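As a rough illustration of the core idea, here is a NumPy sketch of one state update: a learned gate blends a fast, data-driven candidate with the slowly changing previous state. The exact Delta-RNN parameterization in the paper differs; the gate form and weight names below are assumptions.

    # Simplified Delta-RNN-style state update (an interpretation of the
    # abstract, not the paper's exact cell). W, U, b_r are assumed names.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def delta_step(x_t, h_prev, W, U, b_r):
        d_t = np.tanh(W @ x_t + U @ h_prev)      # fast data-driven candidate
        r_t = sigmoid(d_t + b_r)                 # interpolation gate
        return r_t * d_t + (1.0 - r_t) * h_prev  # blend with stable state

    rng = np.random.default_rng(0)
    n_in, n_hid = 8, 16
    W = rng.normal(scale=0.1, size=(n_hid, n_in))
    U = rng.normal(scale=0.1, size=(n_hid, n_hid))
    b_r = np.zeros(n_hid)

    h = np.zeros(n_hid)
    for _ in range(5):  # unroll over a toy input sequence
        h = delta_step(rng.normal(size=n_in), h, W, U, b_r)
    print(h.shape)  # (16,)

Note that beyond a classical recurrent cell's W and U, the only extra parameters in this sketch are the gate bias b_r, in line with the abstract's claim of hardly any additional parameters.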
Bag of Tricks for Efficient Text Classification
This paper explores a simple and efficient baseline for text classification.
Our experiments show that our fast text classifier fastText is often on par
with deep learning classifiers in terms of accuracy, and many orders of
magnitude faster for training and evaluation. We can train fastText on more
than one billion words in less than ten minutes using a standard multicore CPU,
and classify half a million sentences among 312K classes in less than a minute.
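The classifier is available as the fasttext library; the following minimal supervised-training sketch uses an invented toy dataset (one "__label__<class> <text>" example per line, which is the library's expected format).

    # Minimal fastText supervised classification sketch. The toy training
    # file and hyperparameters are illustrative assumptions.
    import fasttext

    with open("train.txt", "w") as f:
        f.write("__label__positive great product fast delivery\n")
        f.write("__label__negative arrived broken and late\n")

    model = fasttext.train_supervised(input="train.txt", lr=0.5,
                                      epoch=25, wordNgrams=2)

    labels, probs = model.predict("the product works well")
    print(labels, probs)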
Efficient Estimation of Word Representations in Vector Space
We propose two novel model architectures for computing continuous vector
representations of words from very large data sets. The quality of these
representations is measured in a word similarity task, and the results are
compared to the previously best performing techniques based on different types
of neural networks. We observe large improvements in accuracy at much lower
computational cost, i.e., it takes less than a day to learn high-quality word
vectors from a 1.6-billion-word data set. Furthermore, we show that these
vectors provide state-of-the-art performance on our test set for measuring
syntactic and semantic word similarities.
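Both proposed architectures are available in gensim's Word2Vec; here is a minimal sketch on an invented toy corpus.

    # Minimal word2vec sketch with gensim; sg=1 selects skip-gram and
    # sg=0 selects CBOW, the two architectures proposed in the paper.
    # The corpus and settings here are toy assumptions.
    from gensim.models import Word2Vec

    sentences = [
        "the economy showed strong growth".split(),
        "a powerful storm hit the coast".split(),
        "paris is the capital of france".split(),
    ]

    model = Word2Vec(sentences, vector_size=100, window=5,
                     min_count=1, sg=1)

    # Nearest neighbors in the learned vector space.
    print(model.wv.most_similar("strong", topn=3))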
struc2vec: Learning Node Representations from Structural Identity
Structural identity is a concept of symmetry in which network nodes are
identified according to the network structure and their relationship to other
nodes. Structural identity has been studied in theory and practice over the
past decades, but only recently has it been addressed with representational
learning techniques. This work presents struc2vec, a novel and flexible
framework for learning latent representations for the structural identity of
nodes. struc2vec uses a hierarchy to measure node similarity at different
scales, and constructs a multilayer graph to encode structural similarities and
generate structural context for nodes. Numerical experiments indicate that
state-of-the-art techniques for learning node representations fail to capture
stronger notions of structural identity, whereas struc2vec performs markedly
better on this task, overcoming limitations of prior approaches. As a
consequence, struc2vec also improves performance on classification tasks that
depend more on structural identity.
Comment: 10 pages, KDD 2017, Research Track
Target Type Identification for Entity-Bearing Queries
Identifying the target types of entity-bearing queries can help improve
retrieval performance as well as the overall search experience. In this work,
we address the problem of automatically detecting the target types of a query
with respect to a type taxonomy. We propose a supervised learning approach with
a rich variety of features. Using a purpose-built test collection, we show that
our approach outperforms existing methods by a remarkable margin. This is an
extended version of the article published with the same title in the
Proceedings of SIGIR'17.
Comment: Extended version of SIGIR'17 short paper, 5 pages
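As a rough illustration of the general recipe (supervised classification of queries into target types), here is a toy scikit-learn pipeline; the queries, type labels, and bag-of-words features are invented placeholders, far simpler than the paper's rich feature set and purpose-built taxonomy.

    # Toy query-to-target-type classifier; not the paper's feature set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    queries = ["obama family tree", "einstein birthplace",
               "tallest buildings in the world"]
    types = ["person", "location", "building"]  # toy type taxonomy

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(queries, types)
    print(clf.predict(["newton birthplace"]))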