
    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Search technologies therefore need to handle content written in multiple languages, which requires techniques that account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is the special case of IR in which the search takes place in a multilingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluently, while retaining the original meaning), which helps people understand what is written regardless of the source language. Together, search and translation technologies form part of an important user application, calling for a better integration of search (IR) and translation (MT), since the two technologies need to work together to produce high-quality output. The main goal of this dissertation is to build better connections between IR and MT, for which we present solutions to two problems: searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models; translating to search explores the integration of a modern statistical MT system into cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs.
Finally, we propose a general architecture in which various components of IR and MT systems can be connected into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
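A common entry point to CLIR of the kind this dissertation integrates with MT is query translation: each source-language query term is mapped to weighted target-language translations before retrieval. The sketch below is a minimal illustration of that idea, not the dissertation's actual system; the dictionary entries, probabilities, and function names are invented for the example.

```python
# Hypothetical probabilistic bilingual dictionary: source term -> [(translation, prob)].
BILINGUAL_DICT = {
    "casa": [("house", 0.7), ("home", 0.3)],
    "grande": [("big", 0.6), ("large", 0.4)],
}

def translate_query(query, dictionary, top_k=2):
    """Expand each source-language term into its top-k weighted translations."""
    structured = []
    for term in query.lower().split():
        candidates = dictionary.get(term, [(term, 1.0)])  # untranslatable terms pass through
        structured.append(candidates[:top_k])
    return structured

def score_document(structured_query, doc_terms):
    """Score a document by summing, per query term, the probability of its
    best-matching translation found among the document's terms."""
    score = 0.0
    for candidates in structured_query:
        score += max((p for t, p in candidates if t in doc_terms), default=0.0)
    return score
```

A statistical MT system, as used in the dissertation, would replace the static dictionary with context-sensitive translation probabilities, but the retrieval-side combination of weighted alternatives follows the same shape.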

    A qualitative analysis of the Wikipedia N-Substate Algorithm's Enhancement Terms

    Automatic Search Query Enhancement (ASQE) is the process of modifying a user-submitted search query, identifying terms that can be added or removed to enhance the relevance of documents retrieved from a search engine. ASQE differs from other enhancement approaches in that no human interaction is required. ASQE algorithms typically rely on a source of a priori knowledge to aid the process of identifying relevant enhancement terms. This paper describes the results of a qualitative analysis of the enhancement terms generated by the Wikipedia N-Substate Algorithm (WNSSA) for ASQE. The WNSSA utilises Wikipedia as the sole source of a priori knowledge during the query enhancement process. As each Wikipedia article typically represents a single topic, the WNSSA's enhancement process performs a mapping between the user's original search query and Wikipedia articles relevant to the query. If this mapping is performed correctly, a collection of potentially relevant terms and acronyms becomes accessible for ASQE. This paper reviews the results of a qualitative analysis of the individual enhancement terms generated for each of the 50 test topics from the TREC-9 Web Topic collection. The contributions of this paper include: (a) a qualitative analysis of generated WNSSA search query enhancement terms and (b) an analysis of the concepts represented in the TREC-9 Web Topics, detailing interpretation issues during the query-to-Wikipedia-article mapping performed by the WNSSA.

Goslin, K.; Hofmann, M. (2019). A qualitative analysis of the Wikipedia N-Substate Algorithm's Enhancement Terms. Journal of Computer-Assisted Linguistic Research 3(3):67-77. https://doi.org/10.4995/jclr.2019.11159
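The core move the WNSSA abstract describes (map the query to a relevant Wikipedia article, then harvest candidate enhancement terms from it) can be illustrated with a toy sketch. The article texts, stopword list, and scoring below are invented stand-ins, not the actual WNSSA algorithm or corpus.

```python
from collections import Counter

# Toy stand-in for Wikipedia: title -> article text (not the real WNSSA corpus).
ARTICLES = {
    "Information retrieval": "information retrieval search engine query document relevance ranking",
    "Query expansion": "query expansion terms relevance feedback search reformulation",
}

STOPWORDS = {"the", "a", "of", "and"}

def best_article(query):
    """Map the query to the article sharing the most terms with it."""
    q = set(query.lower().split())
    return max(ARTICLES, key=lambda title: len(q & set(ARTICLES[title].split())))

def enhancement_terms(query, n=3):
    """Return the n most frequent article terms not already in the query."""
    q = set(query.lower().split())
    counts = Counter(t for t in ARTICLES[best_article(query)].split()
                     if t not in q and t not in STOPWORDS)
    return [t for t, _ in counts.most_common(n)]
```

The interpretation issues the paper analyses arise exactly when `best_article` picks a plausible but wrong topic for an ambiguous query, so every harvested term drifts off-topic.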

    Integrating musicology's heterogeneous data sources for better exploration

    Musicologists have to consult an extraordinarily heterogeneous body of primary and secondary sources during all stages of their research. Many of these sources are now available online, but the historical dispersal of material across libraries and archives has been replaced by segregation of data and metadata into a plethora of online repositories. This segregation hinders the intelligent manipulation of metadata, and means that extracting large tranches of basic factual information or running multi-part search queries is still enormously and needlessly time consuming. To counter this barrier to research, the "musicSpace" project is experimenting with integrating access to many of musicology's leading data sources via a modern faceted browsing interface that utilises Semantic Web and Web 2.0 technologies such as RDF and AJAX. This will make previously intractable search queries tractable, enable musicologists to use their time more efficiently, and aid the discovery of potentially significant information that users did not think to look for. This paper outlines our work to date.
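Faceted browsing over RDF data, as the musicSpace abstract describes, amounts to offering distinct values per predicate and intersecting the selections. The sketch below illustrates this over RDF-style (subject, predicate, object) triples; the predicates and data are invented for the example and are not musicSpace's actual schema.

```python
# Invented triples standing in for integrated musicological metadata.
TRIPLES = [
    ("work:1", "composer", "Byrd"),
    ("work:1", "genre", "Mass"),
    ("work:2", "composer", "Tallis"),
    ("work:2", "genre", "Motet"),
    ("work:3", "composer", "Byrd"),
    ("work:3", "genre", "Motet"),
]

def facet_values(triples, predicate):
    """All distinct values offered for one facet (e.g. every composer)."""
    return sorted({o for _, p, o in triples if p == predicate})

def filter_by_facets(triples, selections):
    """Subjects matching every selected (facet, value) pair."""
    subjects = {s for s, _, _ in triples}
    for pred, val in selections.items():
        subjects &= {s for s, p, o in triples if p == pred and o == val}
    return sorted(subjects)
```

In a real deployment the intersection would be pushed down to a triple store as a SPARQL query rather than computed in memory; the user-facing facet model is the same.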

    Finding Structured and Unstructured Features to Improve the Search Result of Complex Question

    Search engines have recently faced the challenge of handling natural language questions, some of which are complex. A complex question is one that consists of several clauses, carries several intentions, or requires a long answer. In this work we propose that identifying the structured and unstructured features of questions, and using both structured and unstructured data, can improve the search results for complex questions. Accordingly, we use two approaches: an IR approach with structured retrieval, and a QA-template approach. Our framework consists of three parts: question analysis, resource discovery, and relevant-answer analysis. In question analysis, we apply a few assumptions and try to identify the structured and unstructured features of the question; structured features refer to structured data and unstructured features refer to unstructured data. In resource discovery, we integrate structured data (a relational database) and unstructured data (web pages) to exploit the advantages of both kinds of data, selecting the best top-ranked fragments from the web-page content. In the relevant-answer part, we compute a matching score between the results from the structured and the unstructured data, and finally use a QA template to reformulate the question. The experimental results show that using structured and unstructured features, combining both kinds of data, and applying the IR and QA-template approaches together improves the search results for complex questions.
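The score-matching step the abstract describes (reconciling answer scores from the structured and unstructured sides) is commonly realised as a weighted linear combination. The sketch below illustrates that general technique; the weighting scheme and names are assumptions for the example, not the paper's exact formula.

```python
def combine_scores(structured, unstructured, alpha=0.6):
    """Linearly combine per-answer scores from structured (relational database)
    and unstructured (web-page fragment) retrieval.
    alpha weights the structured side; 1 - alpha weights the unstructured side."""
    answers = set(structured) | set(unstructured)
    combined = {a: alpha * structured.get(a, 0.0)
                   + (1 - alpha) * unstructured.get(a, 0.0)
                for a in answers}
    # Rank candidate answers by combined score, best first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

An answer supported by both sources outranks one supported strongly by only one, which is the intended effect of integrating the two kinds of data.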

    A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web

    Over the past decade, rapid advances in web technologies, coupled with innovative models of spatial data collection and consumption, have generated a robust growth in geo-referenced information, resulting in spatial information overload. Increasing 'geographic intelligence' in traditional text-based information retrieval has become a prominent approach to respond to this issue and to fulfill users' spatial information needs. Numerous efforts in the Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the Linking Open Data initiative have converged in a constellation of open knowledge bases, freely available online. In this article, we survey these open knowledge bases, focusing on their geospatial dimension. Particular attention is devoted to the crucial issue of the quality of geo-knowledge bases, as well as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic Network, is outlined as our contribution to this area. Research directions in information integration and Geographic Information Retrieval (GIR) are then reviewed, with a critical discussion of their current limitations and future prospects.

    Living Knowledge

    Diversity, especially as manifested in language and knowledge, is a function of local goals, needs, competences, beliefs, culture, opinions and personal experience. The Living Knowledge project considers diversity an asset rather than a problem. Within the project, foundational ideas emerged from the synergic contribution of different disciplines, methodologies (with which many partners were previously unfamiliar) and technologies, and flowed into concrete diversity-aware applications such as the Future Predictor and the Media Content Analyser, providing users with better structured information while coping with Web-scale complexities. The key notions of diversity, fact, opinion and bias have been defined in relation to three methodologies: Media Content Analysis (MCA), which operates from a social sciences perspective; Multimodal Genre Analysis (MGA), which operates from a semiotic perspective; and Facet Analysis (FA), which operates from a knowledge representation and organization perspective. A conceptual architecture that pulls all of them together has become the core of the tools for automatic extraction and the way they interact. In particular, the conceptual architecture has been implemented in the Media Content Analyser application. The scientific and technological results obtained are described in what follows.

    Multimedia search without visual analysis: the value of linguistic and contextual information

    This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features.
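Time-alignment of textual resources, one of the techniques the abstract lists, lets a text query resolve to a playback position inside audiovisual content. The minimal sketch below illustrates the idea with an invented transcript; the segment times, text, and function name are assumptions for the example.

```python
# Invented time-aligned transcript: (start_sec, end_sec, text) per segment.
TRANSCRIPT = [
    (0.0, 4.2, "welcome to the evening news"),
    (4.2, 9.8, "the election results were announced today"),
    (9.8, 15.0, "sports highlights follow after the break"),
]

def find_segments(transcript, term):
    """Return (start, end) times of segments whose text mentions the term,
    so a player can jump straight to the matching moment."""
    return [(start, end) for start, end, text in transcript
            if term.lower() in text.split()]
```

Real systems obtain such alignments from subtitles, automatic speech recognition, or forced alignment, but the retrieval step over the aligned text is as simple as this.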

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
