Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
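The core operation behind embedding-based query expansion, ranking candidate terms by vector similarity to a query term, can be sketched as follows. The vocabulary and vector values here are hypothetical toy data; in the paper's setting, the embeddings would be trained locally on a corpus specific to the query rather than hand-specified:

```python
from math import sqrt

# Toy stand-ins for locally trained term embeddings (hypothetical values).
embeddings = {
    "cancer":   [0.9, 0.1, 0.0],
    "tumor":    [0.85, 0.15, 0.05],
    "oncology": [0.8, 0.2, 0.1],
    "weather":  [0.0, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_query(term, k=2):
    """Return the k vocabulary terms most similar to the query term."""
    scores = [(other, cosine(embeddings[term], vec))
              for other, vec in embeddings.items() if other != term]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in scores[:k]]

print(expand_query("cancer"))  # → ['tumor', 'oncology']
```

The expanded query would then be issued with the extra terms ("tumor", "oncology") added, weighted by their similarity scores if the retrieval model supports term weighting.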
A Semantics-Based Measure of Emoji Similarity
Emoji have grown to become one of the most important forms of communication
on the web. With their widespread use, measuring the similarity of emoji has
become an important problem for contemporary text processing since it lies at
the heart of sentiment analysis, search, and interface design tasks. This paper
presents a comprehensive analysis of the semantic similarity of emoji through
embedding models that are learned over machine-readable emoji meanings in the
EmojiNet knowledge base. Using emoji descriptions, emoji sense labels and emoji
sense definitions, and with different training corpora obtained from Twitter
and Google News, we develop and test multiple embedding models to measure emoji
similarity. To evaluate our work, we create a new dataset called EmoSim508,
which assigns human-annotated semantic similarity scores to a set of 508
carefully selected emoji pairs. After validation with EmoSim508, we present a
real-world use-case of our emoji embedding models using a sentiment analysis
task and show that our models outperform the previous best-performing emoji
embedding model on this task. The EmoSim508 dataset and our emoji embedding
models are publicly released with this paper and can be downloaded from
http://emojinet.knoesis.org/.
Comment: This paper is accepted as a full paper at the 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Leipzig, Germany: ACM, 2017.
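The idea of embedding emoji through their machine-readable meanings can be sketched with a toy example: represent each emoji as the average of the word vectors of its description, then compare emoji with cosine similarity. The word vectors and description strings below are hypothetical placeholders, not the paper's Twitter- or Google News-trained vectors or EmojiNet entries:

```python
from math import sqrt

# Hypothetical 2-d word vectors standing in for vectors trained on a large corpus.
word_vecs = {
    "happy": [0.9, 0.1],
    "face":  [0.5, 0.5],
    "smile": [0.85, 0.2],
    "sad":   [0.05, 0.95],
    "tear":  [0.1, 0.9],
}

def embed(description):
    """Embed an emoji as the average of its description's word vectors."""
    vecs = [word_vecs[w] for w in description.split() if w in word_vecs]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Illustrative description strings, not actual EmojiNet definitions.
grinning = embed("happy face smile")
crying   = embed("sad face tear")
smiling  = embed("happy smile")

# A semantically close pair should score higher than a distant one.
print(cosine(grinning, smiling) > cosine(grinning, crying))  # → True
```

An evaluation in the spirit of EmoSim508 would then rank-correlate such model scores against the human-annotated similarity scores for the 508 emoji pairs.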
Discovering missing Wikipedia inter-language links by means of cross-lingual word sense disambiguation
Wikipedia is a very popular online multilingual encyclopedia that contains millions of articles covering most written languages. Wikipedia pages contain monolingual hypertext links to other pages, as well as inter-language links to the corresponding pages in other languages. These inter-language links, however, are not always complete.
We present a prototype for a cross-lingual link discovery tool that discovers missing Wikipedia inter-language links to corresponding pages in other languages for ambiguous nouns. Although the framework of our approach is language-independent, we built a prototype for our application using Dutch as an input language and Spanish, Italian, English, French and German as target languages. The input for our system is a set of Dutch pages for a given ambiguous noun, and the output of the system is a set of links to the corresponding pages in our five target languages.
Our link discovery application contains two submodules. In the first step, all pages are retrieved that contain a translation (in our five target languages) of the ambiguous word in the page title (Greedy crawler module); in the second step, all corresponding pages are linked between the focus language (Dutch in our case) and the five target languages (Cross-lingual web page linker module). We treat this second step as a disambiguation task and apply a cross-lingual Word Sense Disambiguation framework to determine whether two pages refer to the same content or not.
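The second step, deciding which candidate target-language page matches the source page, can be sketched with a much simpler stand-in for the paper's cross-lingual WSD classifier: translate the source page's content words through a bilingual lexicon and pick the candidate page with the highest word overlap. The lexicon entries, page word sets, and page titles below are toy data invented for illustration:

```python
# Toy Dutch -> English translation lexicon (hypothetical entries).
lexicon = {"muis": {"mouse"}, "dier": {"animal"}, "computer": {"computer"},
           "apparaat": {"device"}, "knaagdier": {"rodent"}}

def translate(dutch_words):
    """Map Dutch content words to the union of their candidate translations."""
    out = set()
    for w in dutch_words:
        out |= lexicon.get(w, set())
    return out

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def best_link(dutch_page, candidate_pages):
    """Pick the target-language page whose content best overlaps the translated source."""
    translated = translate(dutch_page)
    return max(candidate_pages,
               key=lambda title: jaccard(translated, candidate_pages[title]))

dutch_page = {"muis", "dier", "knaagdier"}  # Dutch page about the animal sense
candidates = {
    "Mouse_(animal)": {"mouse", "animal", "rodent"},
    "Mouse_(computing)": {"mouse", "computer", "device"},
}
print(best_link(dutch_page, candidates))  # → Mouse_(animal)
```

Bag-of-words overlap is only a crude proxy: the paper's cross-lingual WSD framework makes this same same-content/different-content decision with a trained disambiguation model rather than lexical overlap.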