8,300 research outputs found

    Exploiting Query Structure and Document Structure to Improve Document Retrieval Effectiveness

    Get PDF
    In this paper we present a systematic analysis of document retrieval using unstructured and structured queries within the score region algebra (SRA) structured retrieval framework. The behavior of diÂŽerent retrieval models, namely Boolean, tf.idf, GPX, language models, and Okapi, is tested using the transparent SRA framework in our three-level structured retrieval system called TIJAH. The retrieval models are implemented along four elementary retrieval aspects: element and term selection, element score computation, score combination, and score propagation. The analysis is performed on a numerous experiments evaluated on TREC and CLEF collections, using manually generated unstructured and structured queries. Unstructured queries range from the short title queries to long title + description + narrative queries. For generating structured queries we exploit the knowledge of the document structure and the content used to semantically describe or classify documents. We show that such structured information can be utilized in retrieval engines to give more precise answers to user queries then when using unstructured queries

    Generating Synthetic Data for Neural Keyword-to-Question Models

    Full text link
    Search typically relies on keyword queries, but these are often semantically ambiguous. We propose to overcome this by offering users natural language questions, based on their keyword queries, to disambiguate their intent. This keyword-to-question task may be addressed using neural machine translation techniques. Neural translation models, however, require massive amounts of training data (keyword-question pairs), which is unavailable for this task. The main idea of this paper is to generate large amounts of synthetic training data from a small seed set of hand-labeled keyword-question pairs. Since natural language questions are available in large quantities, we develop models to automatically generate the corresponding keyword queries. Further, we introduce various filtering mechanisms to ensure that synthetic training data is of high quality. We demonstrate the feasibility of our approach using both automatic and manual evaluation. This is an extended version of the article published with the same title in the Proceedings of ICTIR'18.Comment: Extended version of ICTIR'18 full paper, 11 page

    Spoken content retrieval: A survey of techniques and technologies

    Get PDF
    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR

    A Comparative analysis: QA evaluation questions versus real-world queries

    Get PDF
    This paper presents a comparative analysis of user queries to a web search engine, questions to a Q&A service (answers.com), and questions employed in question answering (QA) evaluations at TREC and CLEF. The analysis shows that user queries to search engines contain mostly content words (i.e. keywords) but lack structure words (i.e. stopwords) and capitalization. Thus, they resemble natural language input after case folding and stopword removal. In contrast, topics for QA evaluation and questions to answers.com mainly consist of fully capitalized and syntactically well-formed questions. Classification experiments using a na¨Ĺve Bayes classifier show that stopwords play an important role in determining the expected answer type. A classification based on stopwords is considerably more accurate (47.5% accuracy) than a classification based on all query words (40.1% accuracy) or on content words (33.9% accuracy). To simulate user input, questions are preprocessed by case folding and stopword removal. Additional classification experiments aim at reconstructing the syntactic wh-word frame of a question, i.e. the embedding of the interrogative word. Results indicate that this part of questions can be reconstructed with moderate accuracy (25.7%), but for a classification problem with a much larger number of classes compared to classifying queries by expected answer type (2096 classes vs. 130 classes). Furthermore, eliminating stopwords can lead to multiple reconstructed questions with a different or with the opposite meaning (e.g. if negations or temporal restrictions are included). In conclusion, question reconstruction from short user queries can be seen as a new realistic evaluation challenge for QA systems

    Highly focused document retrieval in aerospace engineering : user interaction design and evaluation

    Get PDF
    Purpose – This paper seeks to describe the preliminary studies (on both users and data), the design and evaluation of the K-Search system for searching legacy documents in aerospace engineering. Real-world reports of jet engine maintenance challenge the current indexing practice, while real users’ tasks require retrieving the information in the proper context. K-Search is currently in use in Rolls-Royce plc and has evolved to include other tools for knowledge capture and management. Design/methodology/approach – Semantic Web techniques have been used to automatically extract information from the reports while maintaining the original context, allowing a more focused retrieval than with more traditional techniques. The paper combines semantic search with classical information retrieval to increase search effectiveness. An innovative user interface has been designed to take advantage of this hybrid search technique. The interface is designed to allow a flexible and personal approach to searching legacy data. Findings – The user evaluation showed that the system is effective and well received by users. It also shows that different people look at the same data in different ways and make different use of the same system depending on their individual needs, influenced by their job profile and personal attitude. Research limitations/implications – This study focuses on a specific case of an enterprise working in aerospace engineering. Although the findings are likely to be shared with other engineering domains (e.g. mechanical, electronic), the study does not expand the evaluation to different settings. Originality/value – The study shows how real context of use can provide new and unexpected challenges to researchers and how effective solutions can then be adopted and used in organizations.</p

    Automatic tagging and geotagging in video collections and communities

    Get PDF
    Automatically generated tags and geotags hold great promise to improve access to video collections and online communi- ties. We overview three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features
    • …
    corecore