Distributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a
fixed-length feature vector. When it comes to texts, one of the most common
fixed-length features is bag-of-words. Despite their popularity, bag-of-words
features have two major weaknesses: they lose the ordering of the words and
they also ignore semantics of the words. For example, "powerful," "strong" and
"Paris" are equally distant. In this paper, we propose Paragraph Vector, an
unsupervised algorithm that learns fixed-length feature representations from
variable-length pieces of text, such as sentences, paragraphs, and documents.
Our algorithm represents each document by a dense vector which is trained to
predict words in the document. Its construction gives our algorithm the
potential to overcome the weaknesses of bag-of-words models. Empirical results
show that Paragraph Vectors outperform bag-of-words models as well as other
techniques for text representations. Finally, we achieve new state-of-the-art
results on several text classification and sentiment analysis tasks.
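The two bag-of-words weaknesses named in this abstract can be illustrated with a minimal sketch. It is not the Paragraph Vector algorithm; it only shows the baseline's limitations. The helper names (`one_hot`, `bag_of_words`, `euclidean`) and the three-word vocabulary are assumptions made for illustration.

```python
import math
from collections import Counter

def one_hot(word, vocab):
    """Bag-of-words vector of a single word over a fixed vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def bag_of_words(text):
    """Word counts of a text; word order is discarded by construction."""
    return Counter(text.lower().split())

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary taken from the abstract's example words.
vocab = ["powerful", "strong", "Paris"]
vectors = {w: one_hot(w, vocab) for w in vocab}

# Weakness 1: semantics are ignored, so every pair of distinct words
# is the same distance apart (sqrt(2) for one-hot vectors).
d_strong = euclidean(vectors["powerful"], vectors["strong"])
d_paris = euclidean(vectors["powerful"], vectors["Paris"])
print(d_strong, d_paris)

# Weakness 2: ordering is lost, so these two sentences are identical.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

A learned dense representation is meant to avoid both effects: related words such as "powerful" and "strong" can end up closer to each other than to "Paris".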
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representation has become the most widely used technique
for language representation in various natural language processing tasks. Most
of the natural language processing models that are based on deep learning
techniques use already pre-trained distributed word representations, commonly
called word embeddings. Determining the highest-quality word embeddings is of
crucial importance for such models. However, selecting the appropriate word
embeddings is a perplexing task since the projected embedding space is not
intuitive to humans. In this paper, we explore different approaches for
creating distributed word representations. We perform an intrinsic evaluation
of several state-of-the-art word embedding methods. Their performance on
capturing word similarities is analysed with existing benchmark datasets for
word-pair similarities. The research in this paper conducts a correlation
analysis between ground truth word similarities and similarities obtained by
different word embedding methods.
Comment: Part of the 6th International Conference on Natural Language Processing (NATP 2020)
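An intrinsic evaluation of this kind can be sketched as follows. The abstract does not say which correlation coefficient is used; Spearman rank correlation over cosine similarities is assumed here because it is a common choice for word-similarity benchmarks. The 2-D embeddings and the human scores are hypothetical toy values, not data from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical embeddings and human similarity judgements (assumed values).
embeddings = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2],
    "car": [0.1, 1.0], "truck": [0.3, 0.9],
}
human_scores = {
    ("cat", "dog"): 9.0, ("car", "truck"): 8.5,
    ("cat", "car"): 2.0, ("dog", "truck"): 3.0,
}

pairs = list(human_scores)
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
gold_scores = [human_scores[p] for p in pairs]
rho = spearman(model_scores, gold_scores)
print(rho)
```

A rank-based coefficient only asks whether the embedding orders the word pairs the same way as the human annotators, so the two score scales do not need to match.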
SNU_IDS at SemEval-2018 Task 12: Sentence Encoder with Contextualized Vectors for Argument Reasoning Comprehension
We present a novel neural architecture for the Argument Reasoning
Comprehension task of SemEval 2018. It is a simple neural network consisting of
three parts, collectively judging whether the logic built on a set of given
sentences (a claim, reason, and warrant) is plausible or not. The model
utilizes contextualized word vectors pre-trained on large machine translation
(MT) datasets as a form of transfer learning, which can help to mitigate the
lack of training data. Quantitative analysis shows that simply leveraging LSTMs
trained on MT datasets outperforms several baselines and non-transferred
models, achieving accuracies of about 70% on the development set and about 60%
on the test set.
Comment: SemEval 201