23 research outputs found

    A model for information retrieval driven by conceptual spaces

    Get PDF
    A retrieval model describes the transformation of a query into a set of documents. The question is: what drives this transformation? For semantic information retrieval type of models this transformation is driven by the content and structure of the semantic models. In this case, Knowledge Organization Systems (KOSs) are the semantic models that encode the meaning employed for monolingual and cross-language retrieval. The focus of this research is the relationship between these meanings’ representations and their role and potential in augmenting existing retrieval models effectiveness. The proposed approach is unique in explicitly interpreting a semantic reference as a pointer to a concept in the semantic model that activates all its linked neighboring concepts. It is in fact the formalization of the information retrieval model and the integration of knowledge resources from the Linguistic Linked Open Data cloud that is distinctive from other approaches. The preprocessing of the semantic model using Formal Concept Analysis enables the extraction of conceptual spaces (formal contexts)that are based on sub-graphs from the original structure of the semantic model. The types of conceptual spaces built in this case are limited by the KOSs structural relations relevant to retrieval: exact match, broader, narrower, and related. They capture the definitional and relational aspects of the concepts in the semantic model. Also, each formal context is assigned an operational role in the flow of processes of the retrieval system enabling a clear path towards the implementations of monolingual and cross-lingual systems. By following this model’s theoretical description in constructing a retrieval system, evaluation results have shown statistically significant results in both monolingual and bilingual settings when no methods for query expansion were used. The test suite was run on the Cross-Language Evaluation Forum Domain Specific 2004-2006 collection with additional extensions to match the specifics of this model

    Cross-language Information Retrieval

    Full text link
    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions.Comment: 49 pages, 0 figure

    Toponym Disambiguation in Information Retrieval

    Full text link
    In recent years, geography has acquired a great importance in the context of Information Retrieval (IR) and, in general, of the automated processing of information in text. Mobile devices that are able to surf the web and at the same time inform about their position are now a common reality, together with applications that can exploit this data to provide users with locally customised information, such as directions or advertisements. Therefore, it is important to deal properly with the geographic information that is included in electronic texts. The majority of such kind of information is contained as place names, or toponyms. Toponym ambiguity represents an important issue in Geographical Information Retrieval (GIR), due to the fact that queries are geographically constrained. There has been a struggle to nd speci c geographical IR methods that actually outperform traditional IR techniques. Toponym ambiguity may constitute a relevant factor in the inability of current GIR systems to take advantage from geographical knowledge. Recently, some Ph.D. theses have dealt with Toponym Disambiguation (TD) from di erent perspectives, from the development of resources for the evaluation of Toponym Disambiguation (Leidner (2007)) to the use of TD to improve geographical scope resolution (Andogah (2010)). The Ph.D. thesis presented here introduces a TD method based on WordNet and carries out a detailed study of the relationship of Toponym Disambiguation to some IR applications, such as GIR, Question Answering (QA) and Web retrieval. The work presented in this thesis starts with an introduction to the applications in which TD may result useful, together with an analysis of the ambiguity of toponyms in news collections. It could not be possible to study the ambiguity of toponyms without studying the resources that are used as placename repositories; these resources are the equivalent to language dictionaries, which provide the di erent meanings of a given word.Buscaldi, D. (2010). Toponym Disambiguation in Information Retrieval [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8912Palanci

    Improved cross-language information retrieval via disambiguation and vocabulary discovery

    Get PDF
    Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from seg mentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval; but also have wider applications than CLIR

    Language technologies for a multilingual Europe

    Get PDF
    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)

    Mixed-Language Arabic- English Information Retrieval

    Get PDF
    Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries

    Evaluating Information Retrieval and Access Tasks

    Get PDF
    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today’s smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book. They show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students—anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one

    Language technologies for a multilingual Europe

    Get PDF
    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)

    TC3 III

    Get PDF
    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)
    corecore