Distributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and
they also ignore semantics of the words. For example, "powerful," "strong" and
"Paris" are equally distant. In this paper, we propose Paragraph Vector, an
unsupervised algorithm that learns fixed-length feature representations from
variable-length pieces of texts, such as sentences, paragraphs, and documents.
Our algorithm represents each document by a dense vector which is trained to
predict words in the document. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-words models. Empirical results
show that Paragraph Vectors outperform bag-of-words models as well as other
techniques for text representations. Finally, we achieve new state-of-the-art
results on several text classification and sentiment analysis tasks.
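As a concrete illustration of the idea (not code from the paper), the sketch below trains Paragraph Vectors with gensim's Doc2Vec implementation; the toy corpus and the hyperparameters are illustrative assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus for illustration only; real training needs far more text.
corpus = [
    "many machine learning algorithms need fixed length inputs",
    "paragraph vectors learn representations for variable length texts",
    "dense vectors can capture word order and semantics",
]
docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# dm=1 selects the PV-DM variant: the paragraph vector is trained
# jointly with word vectors to predict words in the document.
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

# Infer a fixed-length vector for an unseen, variable-length text.
vec = model.infer_vector("learning representations for documents".split())
print(vec.shape)  # (50,)
```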
IntentsKB: A Knowledge Base of Entity-Oriented Search Intents
We address the problem of constructing a knowledge base of entity-oriented
search intents. Search intents are defined on the level of entity types, each
comprising a high-level intent category (property, website, service, or
other), along with a cluster of query terms used to express that intent. These
machine-readable statements can be leveraged in various applications, e.g., for
generating entity cards or query recommendations. By structuring
service-oriented search intents, we take one step towards making entities
actionable. The main contribution of this paper is a pipeline of components we
develop to construct a knowledge base of entity intents. We evaluate
performance both component-wise and end-to-end, and demonstrate that our
approach is able to generate high-quality data.

Comment: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), 2018. 4 pages, 2 figures.
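To make the shape of such machine-readable statements concrete, here is a hypothetical schema for a single intent entry; the field names and the example values are assumptions for illustration, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EntityIntent:
    """Hypothetical record for one entity-oriented search intent."""
    entity_type: str                    # e.g. "airline"
    category: str                       # property, website, service, or other
    query_terms: List[str] = field(default_factory=list)  # cluster of query terms

# Illustrative entry, not taken from the IntentsKB data.
intent = EntityIntent(
    entity_type="airline",
    category="service",
    query_terms=["check in", "online check-in", "web checkin"],
)
```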
Bag of Tricks for Efficient Text Classification
This paper explores a simple and efficient baseline for text classification.
Our experiments show that our fast text classifier fastText is often on par
with deep learning classifiers in terms of accuracy, and many orders of
magnitude faster for training and evaluation. We can train fastText on more
than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
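As a usage sketch (not the paper's own code), the fasttext Python package exposes this model via train_supervised; the file name and hyperparameters below are illustrative, and train.txt is assumed to hold one example per line in fastText's "__label__<class> <text>" format.

```python
import fasttext

# Train a supervised classifier; wordNgrams=2 adds bigram features,
# one of the efficiency/accuracy tricks discussed in the paper.
model = fasttext.train_supervised(
    input="train.txt",  # assumed to exist, in fastText label format
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)

# Predict the most likely label for a new sentence.
labels, probs = model.predict("this movie was surprisingly good")
print(labels[0], probs[0])
```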
Efficient Estimation of Word Representations in Vector Space
We propose two novel model architectures for computing continuous vector
representations of words from very large data sets. The quality of these
representations is measured in a word similarity task, and the results are
compared to the previously best performing techniques based on different types
of neural networks. We observe large improvements in accuracy at much lower
computational cost, i.e. it takes less than a day to learn high quality word
vectors from a 1.6 billion words data set. Furthermore, we show that these
vectors provide state-of-the-art performance on our test set for measuring
syntactic and semantic word similarities.
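For illustration (toy data, not the paper's setup), the sketch below trains the skip-gram architecture with gensim's Word2Vec and queries word similarity; the sentences and hyperparameters are assumptions.

```python
from gensim.models import Word2Vec

# Tiny corpus for illustration; the paper trains on billions of words.
sentences = [
    "the strong powerful engine started".split(),
    "paris is the capital of france".split(),
]

# sg=1 selects skip-gram; sg=0 would select CBOW, the other
# architecture proposed in the paper.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Cosine similarity between learned word vectors.
print(model.wv.similarity("strong", "powerful"))
```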
struc2vec: Learning Node Representations from Structural Identity
Structural identity is a concept of symmetry in which network nodes are
identified according to the network structure and their relationship to other
nodes. Structural identity has been studied in theory and practice over the
past decades, but only recently has it been addressed with representation
learning techniques. This work presents struc2vec, a novel and flexible
framework for learning latent representations for the structural identity of
nodes. struc2vec uses a hierarchy to measure node similarity at different
scales, and constructs a multilayer graph to encode structural similarities and
generate structural context for nodes. Numerical experiments indicate that
state-of-the-art techniques for learning node representations fail to capture stronger notions of structural identity, while struc2vec exhibits much superior performance on this task, as it overcomes limitations of prior approaches. As a consequence, numerical experiments indicate that struc2vec improves performance on classification tasks that depend more on structural identity.

Comment: 10 pages, KDD 2017, Research Track.
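To illustrate the underlying notion of structural similarity (only a toy stand-in, not struc2vec itself, which compares degree sequences with dynamic time warping and runs random walks over a multilayer graph), the sketch below scores two nodes by the degree sequences of their k-hop neighborhoods using networkx.

```python
import networkx as nx

def ring_degrees(G, node, k):
    """Sorted degree sequence of the nodes exactly k hops from `node`."""
    lengths = nx.single_source_shortest_path_length(G, node, cutoff=k)
    ring = [n for n, d in lengths.items() if d == k]
    return sorted(G.degree(n) for n in ring)

def structural_distance(G, u, v, k_max=2):
    """Toy structural distance: compare the k-hop degree sequences of
    u and v, padding the shorter sequence with zeros. struc2vec uses
    dynamic time warping here; plain padding keeps the sketch short."""
    dist = 0.0
    for k in range(k_max + 1):
        du, dv = ring_degrees(G, u, k), ring_degrees(G, v, k)
        n = max(len(du), len(dv))
        du += [0] * (n - len(du))
        dv += [0] * (n - len(dv))
        dist += sum(abs(a - b) for a, b in zip(du, dv))
    return dist

# Two K5 cliques joined by a path: nodes 0 and 11 occupy symmetric
# positions, so their structural distance is 0 despite being far apart.
G = nx.barbell_graph(5, 2)
print(structural_distance(G, 0, max(G.nodes())))
```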