A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles
Wikipedia has been used as a source of comparable texts
for a range of tasks, such as Statistical Machine Translation and Cross-Language
Information Retrieval. Articles written in different languages
on the same topic are often connected through inter-language-links. However,
the extent to which these articles are similar is highly variable and
this may impact on the use of Wikipedia as a comparable resource. In this
paper we compare various language-independent methods for measuring
cross-lingual similarity: character n-grams, cognateness, word count ratio,
and an approach based on outlinks. These approaches are compared
against a baseline utilising MT resources. Measures are also compared
to human judgements of similarity using a manually created resource
containing 700 pairs of Wikipedia articles (in 7 language pairs). Results
indicate that a combination of language-independent models (char-ngrams,
outlinks and word-count ratio) is highly effective for identifying
cross-lingual similarity and performs comparably to language-dependent
models (translation and monolingual analysis).
The work of the first author was in the framework of the Tacardi research project (TIN2012-38523-C02-00). The work of the fourth author was in the framework of the DIANA-Applications (TIN2012-38603-C02-01) and WIQ-EI IRSES (FP7 Marie Curie No. 269180) research projects.
Barrón Cedeño, LA.; Paramita, ML.; Clough, P.; Rosso, P. (2014). A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles. In Advances in Information Retrieval. Springer Verlag (Germany). 424-429. https://doi.org/10.1007/978-3-319-06028-6_36
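Two of the language-independent measures named in the abstract above, character n-grams and word count ratio, can be illustrated with a minimal sketch. The abstract does not give the exact formulations, so the choices here (trigrams, cosine similarity over raw counts, and a shorter-over-longer length ratio) are assumptions for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lower-cased text.

    Whitespace runs are collapsed to a single space first (an assumption).
    """
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_count_ratio(text_a: str, text_b: str) -> float:
    """Length of the shorter text divided by the longer, in whitespace tokens."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    longest = max(len_a, len_b)
    return min(len_a, len_b) / longest if longest else 0.0


if __name__ == "__main__":
    english = "Barcelona is the capital of Catalonia"
    spanish = "Barcelona es la capital de Cataluña"
    print(cosine(char_ngrams(english), char_ngrams(spanish)))
    print(word_count_ratio(english, spanish))
```

Character n-grams only capture surface overlap, so a sketch like this is mainly meaningful for language pairs that share a script and many cognates or named entities.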
Document expansion for image retrieval
Successful information retrieval requires effective matching
between the user's search request and the contents of relevant
documents. Often the request entered by a user may
not use the same topic-relevant terms as the authors of the
documents. One potential approach to address problems
of query-document term mismatch is document expansion
to include additional topically relevant indexing terms in a
document which may encourage its retrieval when relevant
to queries which do not match its original contents well. We
propose and evaluate a new document expansion method
using external resources. While results of previous research
have been inconclusive in determining the impact of document
expansion on retrieval effectiveness, our method is
shown to work effectively for text-based image retrieval of
short image annotation documents. Our approach uses the
Okapi query expansion algorithm as a method for document
expansion. We further show improved performance can be
achieved by using a "document reduction" approach to include
only the significant terms in a document in the expansion
process. Our experiments on the WikipediaMM task at
ImageCLEF 2008 show an increase of 16.5% in mean average
precision (MAP) compared to a variation of the Okapi BM25 retrieval
model. To compare document expansion with query
expansion, we also test query expansion from an external resource,
which leads to an improvement of 9.84% in MAP over
our baseline. Our conclusion is that document expansion
with document reduction, in combination with query expansion,
produces the best overall retrieval results for short-length
document retrieval. For this image retrieval task, we
also conclude that query expansion from an external resource
does not outperform the document expansion method.
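The idea of reusing a query expansion algorithm for document expansion can be sketched as follows. The abstract does not give the term-selection formula, so this sketch assumes the Robertson offer weight commonly associated with Okapi-style expansion (the number of feedback documents containing a term multiplied by its relevance weight). The function names, the feedback documents, and the document-frequency table are all hypothetical inputs, not the authors' setup.

```python
import math
from collections import Counter


def offer_weights(feedback_docs, doc_freq, n_docs):
    """Score candidate terms from feedback documents (assumed Robertson offer weight).

    feedback_docs: token lists retrieved from an external resource using the
                   short document as the query (retrieval itself not shown).
    doc_freq:      term -> number of collection documents containing the term.
    n_docs:        total number of documents in the external collection.
    """
    R = len(feedback_docs)
    r_counts = Counter(term for doc in feedback_docs for term in set(doc))
    weights = {}
    for term, r in r_counts.items():
        n = doc_freq[term]
        relevance_weight = math.log(
            ((r + 0.5) * (n_docs - n - R + r + 0.5))
            / ((n - r + 0.5) * (R - r + 0.5))
        )
        weights[term] = r * relevance_weight
    return weights


def expand_document(doc_terms, feedback_docs, doc_freq, n_docs, k=2):
    """Append the k highest-scoring new terms to the original document terms."""
    weights = offer_weights(feedback_docs, doc_freq, n_docs)
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    new_terms = [t for t in ranked if t not in doc_terms]
    return list(doc_terms) + new_terms[:k]


if __name__ == "__main__":
    feedback = [
        ["eiffel", "tower", "paris"],
        ["eiffel", "paris", "france"],
        ["paris", "city"],
    ]
    df = {"eiffel": 3, "tower": 5, "paris": 40, "france": 30, "city": 50}
    print(expand_document(["tower", "photo"], feedback, df, n_docs=100))
```

Under the same assumptions, "document reduction" would amount to filtering `doc_terms` down to its most significant terms before the feedback documents are retrieved.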
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.
Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
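The factorisation described in the abstract above can be illustrated with a generic reduced-rank ridge regression in NumPy. This is a textbook-style sketch, not the Cr5 implementation: it fits an ordinary ridge solution, projects it onto the top right singular vectors of the fitted values on ridge-augmented data, and returns the two low-rank factors. The function name and the toy data are assumptions.

```python
import numpy as np


def reduced_rank_ridge(X, Y, rank, lam=1.0):
    """Return factors (A, B) such that A @ B is a rank-constrained ridge solution.

    X: (n_docs, n_features) bag-of-words matrix.
    Y: (n_docs, n_concepts) concept indicator matrix.
    A: (n_features, rank) maps bag-of-words vectors to embeddings.
    B: (rank, n_concepts) maps embeddings to concept scores.
    """
    n_features = X.shape[1]
    # Full-rank ridge solution W = (X^T X + lam I)^{-1} X^T Y.
    W = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)
    # Ridge regression equals least squares on data augmented with sqrt(lam) * I.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    # Keep the top right singular vectors of the augmented fitted values.
    _, _, Vt = np.linalg.svd(X_aug @ W, full_matrices=False)
    V_r = Vt[:rank].T
    return W @ V_r, V_r.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 6))   # toy bag-of-words features
    Y = rng.normal(size=(20, 4))   # toy concept targets
    A, B = reduced_rank_ridge(X, Y, rank=2, lam=0.5)
    embeddings = X @ A             # one 2-dimensional embedding per document
    print(embeddings.shape)
```

In this reading, a document's embedding is its bag-of-words vector multiplied by `A`. How the method ties the per-language mappings into one shared space is not specified in the abstract and is not modelled here.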
Preliminary results in tag disambiguation using DBpedia
The availability of tag-based user-generated content for a variety of Web resources (music, photos, videos, text, etc.) has increased greatly in recent years. Users can assign tags freely and then use them to share and retrieve information. However, tag-based sharing and retrieval is not optimal because tags are plain text labels without an explicit or formal meaning, and hence polysemy and synonymy should be dealt with appropriately. To ameliorate these problems, we propose a context-based tag disambiguation algorithm that selects the meaning of a tag among a set of candidate DBpedia entries, using a common information retrieval similarity measure. The most similar DBpedia entry is selected as the one representing the meaning of the tag. We describe and analyze some preliminary results, and discuss current challenges in this area.
Reading Wikipedia to Answer Open-Domain Questions
This paper proposes to tackle open-domain question answering using Wikipedia
as the unique knowledge source: the answer to any factoid question is a text
span in a Wikipedia article. This task of machine reading at scale combines the
challenges of document retrieval (finding the relevant articles) with that of
machine comprehension of text (identifying the answer spans from those
articles). Our approach combines a search component based on bigram hashing and
TF-IDF matching with a multi-layer recurrent neural network model trained to
detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA
datasets indicate that (1) both modules are highly competitive with respect to
existing counterparts and (2) multitask learning using distant supervision on
their combination is an effective complete system on this challenging task.
Comment: ACL2017, 10 pages
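The search component mentioned in the abstract above, bigram hashing with TF-IDF matching, can be sketched roughly as follows. This is an illustrative approximation only: the hash function (`zlib.crc32`), the bucket count, the tokenisation, and the exact TF-IDF weighting are all assumptions, and the neural reader is not modelled.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20  # hypothetical size of the hashed feature space


def hashed_features(text):
    """Map unigrams and bigrams of a text to counts over hash buckets."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def build_index(docs):
    """Return per-document bucket counts and an IDF value per bucket."""
    doc_tfs = [hashed_features(d) for d in docs]
    doc_freq = Counter(bucket for tf in doc_tfs for bucket in tf)
    idf = {b: math.log(1.0 + len(docs) / df) for b, df in doc_freq.items()}
    return doc_tfs, idf


def rank(query, doc_tfs, idf):
    """Return document indices sorted by TF-IDF dot product with the query."""
    q_tf = hashed_features(query)
    scores = []
    for tf in doc_tfs:
        score = sum(
            math.log1p(q_tf[b]) * math.log1p(tf[b]) * idf[b] ** 2
            for b in q_tf
            if b in tf
        )
        scores.append(score)
    return sorted(range(len(doc_tfs)), key=lambda i: -scores[i])


if __name__ == "__main__":
    docs = [
        "paris is the capital of france",
        "berlin is the capital of germany",
        "the eiffel tower is in paris",
    ]
    doc_tfs, idf = build_index(docs)
    print(rank("capital of germany", doc_tfs, idf))
```

Hashing bigrams into a fixed number of buckets keeps memory bounded while still rewarding local word order, at the cost of occasional collisions between unrelated n-grams.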
Semantic Tagging on Historical Maps
Tags assigned by users to shared content can be ambiguous. As a possible
solution, we propose semantic tagging as a collaborative process in which a
user selects and associates Web resources drawn from a knowledge context. We
applied this general technique in the specific context of online historical
maps and allowed users to annotate and tag them. To study the effects of
semantic tagging on tag production, the types and categories of obtained tags,
and user task load, we conducted an in-lab within-subject experiment with 24
participants who annotated and tagged two distinct maps. We found that the
semantic tagging implementation does not affect these parameters, while
providing tagging relationships to well-defined concept definitions. Compared
to label-based tagging, our technique also gathers positive and negative
tagging relationships. We believe that our findings carry implications for
designers who want to adopt semantic tagging in other contexts and systems on
the Web.
Comment: 10 pages
A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web
Over the past decade, rapid advances in web technologies, coupled with
innovative models of spatial data collection and consumption, have generated a
robust growth in geo-referenced information, resulting in spatial information
overload. Increasing 'geographic intelligence' in traditional text-based
information retrieval has become a prominent approach to respond to this issue
and to fulfill users' spatial information needs. Numerous efforts in the
Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the
Linking Open Data initiative have converged in a constellation of open
knowledge bases, freely available online. In this article, we survey these open
knowledge bases, focusing on their geospatial dimension. Particular attention
is devoted to the crucial issue of the quality of geo-knowledge bases, as well
as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic
Network, is outlined as our contribution to this area. Research directions in
information integration and Geographic Information Retrieval (GIR) are then
reviewed, with a critical discussion of their current limitations and future
prospects.