2,404 research outputs found
Discovering missing Wikipedia inter-language links by means of cross-lingual word sense disambiguation
Wikipedia is a very popular online multilingual encyclopedia that contains millions of articles covering most written languages. Wikipedia pages contain monolingual hypertext links to other pages, as well as inter-language links to the corresponding pages in other languages. These inter-language links, however, are not always complete.
We present a prototype for a cross-lingual link discovery tool that discovers missing Wikipedia inter-language links to corresponding pages in other languages for ambiguous nouns. Although the framework of our approach is language-independent, we built a prototype for our application using Dutch as an input language and Spanish, Italian, English, French and German as target languages. The input for our system is a set of Dutch pages for a given ambiguous noun, and the output of the system is a set of links to the corresponding pages in our five target languages.
Our link discovery application contains two submodules. In a first step all pages are retrieved that contain a translation (in our five target languages) of the ambiguous word in the page title (Greedy crawler module), whereas in a second step all corresponding pages are linked between the focus language (being Dutch in our case) and the five target languages (Cross-lingual web page linker module). We consider this second step as a disambiguation task and apply a cross-lingual Word Sense Disambiguation framework to determine whether two pages refer to the same content or not
Recommended from our members
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied on a collection of millions of documents to improve access to public knowledge
Analysing entity context in multilingual Wikipedia to support entity-centric retrieval applications
Representation of influential entities, such as famous people and multinational corporations, on the Web can vary across languages, reflecting language-specific entity aspects as well as divergent views on these entities in different communities. A systematic analysis of language specific entity contexts can provide a better overview of the existing aspects and support entity-centric retrieval applications over multilingual Web data. An important source of cross-lingual information about influential entities is Wikipedia — an online community-created encyclopaedia — containing more than 280 language editions. In this paper we focus on the extraction and analysis of the language-specific entity contexts from different Wikipedia language editions over multilingual data. We discuss alternative ways such contexts can be built, including graph-based and article-based contexts. Furthermore, we analyse the similarities and the differences in these contexts in a case study including 80 entities and five Wikipedia language editions
- …