
    University of Glasgow at WebCLEF 2005: experiments in per-field normalisation and language specific stemming

    We participated in the WebCLEF 2005 monolingual task, in which a search system aims to retrieve relevant documents from a multilingual corpus of Web documents drawn from the Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language-specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely the content, the title, and the anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of the language-specific stemming, where applied, and the per-field normalisation. We use our Terrier platform for all our experiments. The overall performance of our techniques is outstanding: they achieve the top four performing runs overall, as well as the top performing run without metadata, in the monolingual task. The best run uses only per-field normalisation, without applying stemming.
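    The core idea of per-field normalisation is that term frequencies from content, title, and anchor text are length-normalised separately, then combined with per-field weights before term weighting. A minimal sketch of that combination step follows; the field weights and the per-field normalisation parameters `c` are hypothetical placeholders, not the values used in the runs, and the full DFR weighting on top of the combined frequency is omitted.

    ```python
    import math

    # Hypothetical field weights and Normalisation-2 parameters (weight, c);
    # the abstract does not report the values actually used with Terrier.
    FIELDS = {"content": (1.0, 1.0), "title": (5.0, 10.0), "anchor": (3.0, 10.0)}

    def normalised_tf(tf, field_len, avg_field_len, c):
        """DFR Normalisation 2, applied to a single field's term frequency."""
        if tf == 0:
            return 0.0
        return tf * math.log2(1.0 + c * avg_field_len / field_len)

    def combined_tf(term_stats, avg_lens):
        """Weighted sum of per-field normalised frequencies, the quantity a
        per-field DFR model (e.g. PL2F) would then weight and score.

        term_stats: field -> (tf, field_length) for one term in one document.
        avg_lens:   field -> average field length over the collection.
        """
        total = 0.0
        for field, (weight, c) in FIELDS.items():
            tf, field_len = term_stats.get(field, (0, 1))
            total += weight * normalised_tf(tf, field_len, avg_lens[field], c)
        return total
    ```

    A term occurring once in a short title thus contributes far more than the same frequency in a long body field, which is the behaviour the field weighting is meant to capture.
    
    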

    Building simulated queries for known-item topics: an analysis using six European languages

    There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, there are still many unaddressed issues regarding their usage and impact on evaluation, because their quality, in terms of retrieval performance, is unlike that of real queries. In this paper, we focus on methods for building simulated known-item topics and explore their quality against real known-item topics. Using existing generation models as our starting point, we explore factors which may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages), we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries: for evaluation purposes, and because building models of querying behavior provides a deeper insight into the querying process, so that better retrieval mechanisms can be developed to support the user.
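    Generation models of this kind have two stages: select a target document, then sample query terms from it. A minimal sketch under simple assumptions follows; the uniform document choice and frequency-proportional term sampling are baseline choices, not the improved selection properties the paper proposes.

    ```python
    import random

    def generate_known_item_topic(collection, query_len=3, seed=0):
        """Sketch of a known-item topic generator.

        collection: dict mapping doc_id -> {term: term_frequency}.
        Returns the target doc_id and a list of sampled query terms.
        Document selection is uniform and term selection is proportional to
        in-document frequency; both are the naive baselines that the paper's
        improved model refines.
        """
        rng = random.Random(seed)
        # Stage 1: pick the known item (the document the user remembers).
        doc_id = rng.choice(sorted(collection))
        doc = collection[doc_id]
        # Stage 2: sample query terms from that document's term distribution.
        terms = sorted(doc)
        weights = [doc[t] for t in terms]
        query = rng.choices(terms, weights=weights, k=query_len)
        return doc_id, query
    ```

    Because every sampled term comes from the target document, the generated topic is a known-item topic by construction; the quality question the paper studies is how closely such queries behave like real ones in retrieval.
    
    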

    MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of Geographical Information

    This paper describes the participation of the MIRACLE research consortium in the Query Parsing task of GeoCLEF 2007. Our system is composed of three main modules. First, the Named Geo-entity Identifier, whose objective is to perform the geo-entity identification and tagging, i.e., to extract the “where” component of the geographical query, should there be any. This module is based on a gazetteer built up from the Geonames geographical database and carries out a sequential process in three steps that consist of geo-entity recognition, geo-entity selection, and query tagging. Then, the Query Analyzer parses this tagged query to identify the “what” and “geo-relation” components by means of a rule-based grammar. Finally, a two-level multiclassifier first decides whether the query is indeed a geographical query and, should it be positive, then determines the query type according to the type of information that the user is supposed to be looking for: map, yellow page, or information. According to a strict evaluation criterion where a match must have all fields correct, our system reaches a precision of 42.8% and a recall of 56.6%, and our submission is ranked 1st out of 6 participants in the task. A detailed evaluation of the confusion matrices reveals that some extra effort must be invested in “user-oriented” disambiguation techniques to improve the first-level binary classifier for detecting geographical queries, as it is a key component for eliminating many false positives.
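    The pipeline's output, a query split into "what", "geo-relation", and "where", can be illustrated with a toy sketch. The tiny gazetteer and the single pattern below are hypothetical stand-ins: the real system uses the full Geonames-derived gazetteer, a three-step entity selection process, and a rule-based grammar rather than one regular expression.

    ```python
    import re

    # Toy gazetteer standing in for the Geonames-derived one described above.
    GAZETTEER = {"madrid", "paris", "germany"}

    def parse_geo_query(query):
        """Split a query into 'what' / 'geo-relation' / 'where' components
        and decide whether it is geographical (the first-level binary
        classification).  One illustrative rule only."""
        pattern = r"(?P<what>.+?)\s+(?P<rel>in|near|north of|south of)\s+(?P<where>.+)"
        m = re.match(pattern, query, re.IGNORECASE)
        if m and m.group("where").lower() in GAZETTEER:
            return {"what": m.group("what"),
                    "geo-relation": m.group("rel").lower(),
                    "where": m.group("where"),
                    "geographical": True}
        # No gazetteer-confirmed place name: treat as non-geographical.
        return {"what": query, "geo-relation": None,
                "where": None, "geographical": False}
    ```

    Requiring the "where" candidate to appear in the gazetteer is what keeps queries like "cheap flights" out of the geographical class, which is exactly where the abstract reports false positives are hardest to avoid.
    
    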

    Report of MIRACLE team for the Ad-Hoc track in CLEF 2006

    This paper presents the MIRACLE team's approach to the 2006 Ad-Hoc Information Retrieval track. The experiments for this campaign continue to test our IR approach. First, a baseline set of runs is obtained, including standard components: stemming, transforming, filtering, entity detection and extraction, and others. Then, an extended set of runs is obtained using several types of combinations of these baseline runs. Few improvements were introduced for this campaign: we integrated an entity recognition and indexing prototype tool into our tokenizing scheme, and we ran more combining experiments for the robust multilingual case than in previous campaigns. However, no significant improvements were achieved. For this campaign, runs were submitted for the following languages and tracks:
    - Monolingual: Bulgarian, French, Hungarian, and Portuguese.
    - Bilingual: English to Bulgarian, French, Hungarian, and Portuguese; Spanish to French and Portuguese; and French to Portuguese.
    - Robust monolingual: German, English, Spanish, French, Italian, and Dutch.
    - Robust bilingual: English to German, Italian to Spanish, and French to Dutch.
    - Robust multilingual: English to robust monolingual languages.
    We still need to work harder to improve some aspects of our processing scheme, the most important of which, to our knowledge, is entity recognition and normalization.

    MIRACLE at ImageCLEFphoto 2007: Evaluation of Merging Strategies for Multilingual and Multimedia Information Retrieval.

    This paper describes the participation of the MIRACLE research consortium in the Photographic Retrieval task of ImageCLEF 2007. For this campaign, the main purpose of our experiments was to study thoroughly different merging strategies, i.e. methods of combining textual and visual retrieval techniques. While we applied all the well-known techniques already used in previous campaigns, for both the textual and visual components of the system, our research focused primarily on performing all possible combinations of those techniques, in order to evaluate which ones offer the best results and to analyse whether the combined results improve (in terms of MAP) on the individual ones.
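    A representative merging strategy of the kind being compared is a weighted linear combination of the textual and visual result lists after score normalisation. The sketch below uses min-max normalisation and a hypothetical mixing weight `alpha`; it is one point in the space of combinations the paper evaluates, not the paper's chosen method.

    ```python
    def min_max(scores):
        """Min-max normalise one run's scores into [0, 1] so that textual
        and visual scores are comparable before merging."""
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:
            return {d: 1.0 for d in scores}
        return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

    def linear_merge(text_run, visual_run, alpha=0.7):
        """Weighted linear fusion of a textual and a visual run.

        text_run / visual_run: dict doc_id -> retrieval score.
        alpha is a hypothetical mixing weight, not a value from the paper.
        Returns (doc_id, fused_score) pairs, best first.
        """
        t, v = min_max(text_run), min_max(visual_run)
        docs = set(t) | set(v)
        fused = {d: alpha * t.get(d, 0.0) + (1 - alpha) * v.get(d, 0.0)
                 for d in docs}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    ```

    Sweeping `alpha` (and swapping in other normalisations or fusion rules such as CombSUM) generates exactly the kind of grid of combined runs whose MAP can then be compared against the purely textual and purely visual baselines.
    
    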

    An overview of the linguistic resources used in cross-language question answering systems in CLEF Conference

    The development of the Semantic Web requires great economic and human effort. Consequently, it is very useful to create mechanisms and tools that facilitate its expansion. From the standpoint of information retrieval (hereafter IR), access to the contents of the Semantic Web can be favored by the use of natural language, as it is much simpler and faster for users to engage in their habitual form of expression. The growing popularity of the Internet and the wide availability of Web information resources for general audiences are a fairly recent phenomenon, although man's need to hurdle the language barrier and communicate with others is as old as the history of mankind. The World Wide Web (WWW), together with the growing globalization of companies and organizations and the growth of the non-English-speaking audience, creates demand for tools that allow users to obtain information from a wide range of resources. Yet the underlying linguistic restrictions are often overlooked by researchers and designers. Against this background, a key characteristic to be evaluated in terms of the efficiency of an IR system is its capacity to let users search a corpus of documents in different languages, and to deliver the relevant information despite their limited linguistic competence in the target language.

    MIRACLE at Ad-Hoc CLEF 2005: Merging and Combining Without Using a Single Approach

    This paper presents the MIRACLE team's approach to the 2005 Ad-Hoc Information Retrieval tasks. The goal for this year's experiments was twofold: to continue testing the effect of combination approaches on information retrieval tasks, and to improve our basic processing and indexing tools, adapting them to new languages with unusual encoding schemes. The starting point was a set of basic components: stemming, transforming, filtering, proper-noun extraction, paragraph extraction, and pseudo-relevance feedback. Some of these basic components were used in different combinations and orders of application for document indexing and for query processing. Second-order combinations were also tested, by averaging or selectively combining the documents retrieved by different approaches for a particular query. In the multilingual track, we concentrated our work on merging the results of the monolingual runs to obtain the overall multilingual result, relying on available translations. In both cross-lingual tracks, we used available translation resources, and in some cases a combination approach.
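    One simple way to merge monolingual runs into a single multilingual result, useful as a baseline for the merging process described above, is round-robin interleaving: take the top-ranked document from each run in turn. The sketch below shows that baseline only; it is an illustration of the merging problem, not the specific strategy used in the runs.

    ```python
    from itertools import zip_longest

    def round_robin_merge(runs, depth=1000):
        """Interleave ranked monolingual result lists rank by rank.

        runs: list of ranked doc-id lists, one per monolingual run.
        Shorter runs simply stop contributing; duplicates keep their
        earliest merged position.  Score-based merging would instead
        normalise the runs' scores and sort the union.
        """
        merged, seen = [], set()
        for rank_slice in zip_longest(*runs):
            for doc_id in rank_slice:
                if doc_id is not None and doc_id not in seen:
                    seen.add(doc_id)
                    merged.append(doc_id)
        return merged[:depth]
    ```

    Round-robin needs no comparable scores across languages, which is precisely why it is a common fallback when the monolingual runs come from differently scaled retrieval models.
    
    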
