Distributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and
they also ignore semantics of the words. For example, "powerful," "strong" and
"Paris" are equally distant. In this paper, we propose Paragraph Vector, an
unsupervised algorithm that learns fixed-length feature representations from
variable-length pieces of texts, such as sentences, paragraphs, and documents.
Our algorithm represents each document by a dense vector which is trained to
predict words in the document. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-words models. Empirical results
show that Paragraph Vectors outperform bag-of-words models as well as other
techniques for text representations. Finally, we achieve new state-of-the-art
results on several text classification and sentiment analysis tasks.
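The two bag-of-words weaknesses named in the abstract above can be illustrated with a minimal sketch. It is not code from the paper: the toy vocabulary, the example sentences, and the helper names `bag_of_words` and `euclidean` are assumptions made for illustration only, and it does not implement Paragraph Vector itself.

```python
from collections import Counter
from math import sqrt

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors (hypothetical helper)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy vocabulary, assumed for illustration only.
vocab = ["powerful", "strong", "paris", "dog", "bites", "man"]

# Weakness 1: word order is lost, so two different sentences get the same vector.
a = bag_of_words("dog bites man", vocab)
b = bag_of_words("man bites dog", vocab)
same_vector = (a == b)

# Weakness 2: semantics are ignored. Single words become one-hot vectors,
# so "powerful" is exactly as far from "strong" as it is from "paris".
d_powerful_strong = euclidean(bag_of_words("powerful", vocab),
                              bag_of_words("strong", vocab))
d_powerful_paris = euclidean(bag_of_words("powerful", vocab),
                             bag_of_words("paris", vocab))

print(same_vector)
print(d_powerful_strong == d_powerful_paris)
```

A learned dense representation, such as the one the abstract proposes, is meant to place "powerful" closer to "strong" than to "Paris"; the sketch only shows why count vectors cannot do so.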
Exploring the viability of semi-automated document markup
Digital humanities scholarship has long acknowledged the abundant theoretical advantages of text encoding; more questionable is whether the advantages can, in practice and in general, outweigh the costs of the usually labor-intensive task of encoding. Markup of literary texts has not yet been undertaken on a scale large enough to realize many of its potential applications and benefits. If we can reduce the human labor required to encode texts, libraries and their users can take greater advantage of the hosts of texts being produced by various mass digitization projects, and can focus more attention on implementing tools that use underlying encodings. How far can automation take an encoding effort? And what implications might that have for libraries and their users? Compelled by such questions, this paper explores the viability of semi-automated text encoding.
Unpublished; not peer reviewed.