140 research outputs found
Recommended from our members
KMI, The Open University at NTCIR-9 CrossLink: Cross-Lingual Link Discovery in Wikipedia using explicit semantic analysis
This paper describes the methods used in the submission of Knowledge Media institute (KMI), The Open University to the NTCIR-9 Cross-Lingual Link Discovery (CLLD)task entitled CrossLink. KMI submitted four runs for link discovery from English to Chinese; however, the developed methods, which utilise Explicit Semantic Analysis (ESA), are applicable also to other language combinations. Three of the runs are based on exploiting the existing cross-lingual mapping between different versions of Wikipedia articles. In the fourth run, we assume information about the mapping is not available. Our methods achieved encouraging results and we describe in detail how their performance can be further improved. Finally, we discuss two important issues in link discovery: the evaluation methodology and the applicability of the developed methods across dfferent textual collections
DCU at NTCIR-10 cross-lingual link discovery (CrossLink-2) task
DCU participated in the English to Chinese (C2E) and Chinese to English (C2E) subtasks of the NTCIR 10 CrossLink-2 Cross-lingual Link Discovery (CLLD) task. Our strategy for each query involved extracting potential link anchors as n-gram strings, cleaning of potential anchor strings, and anchor expansion and ranking to select a set of anchors for the query. Potential anchors were translated using Google Translate, and a standard information retrieval technique used to create links between anchors and the top 5 ranked retrieved items selected as potential links for each anchor. We submitted a total of four runs for E2C CLLD and C2E CLLD. We describe our method and results for file-to-file level and anchor-to-file level evaluation
Recommended from our members
Simple yet effective methods for cross-lingual link discovery (CLLD) - KMI @ NTCIR-10 CrossLink-2
Cross-Lingual Link Discovery (CLLD) aims to automatically find links between documents written in different languages. In this paper, we first present a relatively simple yet effective methods for CLLD in Wiki collections, explaining the fndings that motivated their design. Our methods (team KMI) achieved in the NTCIR-10 CrossLink-2 evaluation the best overall results in the English to Chinese, Japanese and Korean (E2CJK) task and were the top performers in the Chinese, Japanese, Korean to English task (CJK2E)1 [Tang et al.,2013]. Though tested on these language combinations, the methods are language agnostic and can be easily applied to any other language combination with sufficient corpora and available pre-processing tools. In the second part of the paper, we provide an in depth analysis of the nature of the task, the evaluation metrics and the impact of the system components on the overall CLLD performance. We believe a good understanding of these aspects is the key to improving CLLD systems in the future
Recommended from our members
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied on a collection of millions of documents to improve access to public knowledge
Cross-language Information Retrieval
Two key assumptions shape the usual view of ranked retrieval: (1) that the
searcher can choose words for their query that might appear in the documents
that they wish to see, and (2) that ranking retrieved documents will suffice
because the searcher will be able to recognize those which they wished to find.
When the documents to be searched are in a language not known by the searcher,
neither assumption is true. In such cases, Cross-Language Information Retrieval
(CLIR) is needed. This chapter reviews the state of the art for CLIR and
outlines some open research questions.Comment: 49 pages, 0 figure
Evaluating Information Retrieval and Access Tasks
This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one
Using Explicit Semantic Analysis to Link in Multi-Lingual Document Collections
Udržování prolinkování dokumentů v ryhle rostoucích kolekcích je problematické. To je dále zvětšeno vícejazyčností těchto kolekcí. Navrhujeme použít Explicitní Sémantickou Analýzu k identifikaci relevantních dokumentů a linků napříč jazyky, bez použití strojového překladu. Navrhli jsme a implementovali několik přistupů v prototypu linkovacího systému. Evaluace byla provedena na Čínské, České, Anglické a Španělské Wikipedii. Diskutujeme evaluační metodologii pro linkovací systémy, a hodnotíme souhlasnost mezi odkazy v různých jazykoých verzích Wikipedie. Hodnotíme vlastnosti Explicitní Sémantické Analýzy důležité pro její praktické použití.Keeping links in quickly growing document collections up-to-date is problematic, which is exacerbated by their multi-linguality. We utilize Explicit Semantic Analysis to help identify relevant documents and links across languages without machine translation. We designed and implemented several approaches as a part of our link discovery system. Evaluation was conducted on Chinese, Czech, English and Spanish Wikipedia. Also, we discuss the evaluation methodology for such systems and assess the agreement between links on different versions of Wikipedia. In addition, we evaluate properties of Explicit Semantic Analysis which are important for its practical use.
Meeting of the MINDS: an information retrieval research agenda
Since its inception in the late 1950s, the field of Information Retrieval (IR) has developed tools that help people find, organize, and analyze information. The key early influences on the field are well-known. Among them are H. P. Luhn's pioneering work, the development of the vector space retrieval model by Salton and his students, Cleverdon's development of the Cranfield experimental methodology, Spärck Jones' development of idf, and a series of probabilistic retrieval models by Robertson and Croft. Until the development of the WorldWideWeb (Web), IR was of greatest interest to professional information analysts such as librarians, intelligence analysts, the legal community, and the pharmaceutical industry
- …