10 research outputs found

    The State-of-the-arts in Focused Search

    Get PDF
    The continuous influx of text data on the Web requires search engines to improve their ability to retrieve more specific information. The need for results relevant to a user's topic of interest has gone beyond searching for domain- or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange, and its markup allows focused search to be carried out at different granularities of a structured document. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking, and concludes with a highlight of open problems

    Efficient creation and incremental maintenance of the hopi index for complex xml document collections

    Get PDF
    The HOPI index, a connection index for XML documents based on the concept of a 2-hop cover, provides space- and time-efficient reachability tests along the ancestor, descendant, and link axes to support path expressions with wildcards in XML search engines. This paper presents enhanced algorithms for building HOPI, shows how to augment the index with distance information, and discusses incremental index maintenance. Our experiments show substantial improvements over the existing divide-and-conquer algorithm for index creation, low space overhead for including distance information in the index, and efficient updates
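The reachability test at the heart of a 2-hop cover can be sketched as follows. The label sets below are hand-built for a three-node toy graph purely for illustration; they are not produced by the HOPI construction algorithm the paper describes:

```python
# Illustrative 2-hop cover reachability test (not the HOPI construction
# algorithm): every node u stores Lout(u), the "centre" nodes it can reach,
# and Lin(u), the centres that can reach it. Then u reaches v exactly when
# Lout(u) and Lin(v) share a centre.

# Hand-built labels for the small DAG a -> b -> c with an extra edge a -> c.
Lout = {"a": {"a", "b", "c"}, "b": {"b", "c"}, "c": {"c"}}
Lin  = {"a": {"a"},           "b": {"b"},      "c": {"b", "c"}}

def reaches(u: str, v: str) -> bool:
    """u can reach v iff some centre lies on a path from u to v."""
    return bool(Lout[u] & Lin[v])

print(reaches("a", "c"))  # True
print(reaches("c", "a"))  # False
```

The point of the representation is that the test is a set intersection over small label sets, rather than a graph traversal at query time.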

    Seventh Biennial Report: June 2003 - March 2005

    No full text

    Searching for Entities: When Retrieval Meets Extraction

    Get PDF
    Retrieving entities from inside documents, instead of searching for the documents or web pages themselves, has become an active topic in both commercial search systems and academic information retrieval research. Given information needs about entities, represented as descriptions with targeted answer entity types, entity search tasks return ranked lists of answer entities drawn from unstructured texts such as news or web pages. Although entity retrieval operates in the same environment as document retrieval, it requires finer-grained answer entities, which demands more syntactic and semantic analysis of germane documents. This work proposes a two-layer probability model for this task, which integrates germane document identification and answer entity extraction. Germane document identification retrieves highly related documents containing answer entities, while answer entity extraction finds answer entities by exploiting syntactic and linguistic information from those documents. This work theoretically demonstrates the integration of the two layers for the entity retrieval task through the probability model. Moreover, the probabilistic approach helps reduce overall retrieval complexity while maintaining high accuracy in locating answer entities. A series of studies is conducted in this dissertation on both germane document identification and answer entity extraction. The learning-to-rank method is investigated for germane document identification: it first constructs a model on the training set using query features, document features, similarity features, and rank features, and the learned model then estimates the probability that documents in the test sets are germane. The experiments indicate that the learning-to-rank method is significantly better than the baseline systems, which treat germane document identification as a conventional document retrieval problem. The answer entity extraction methods aim to correctly extract the answer entities from the germane documents. Methods for answer entity extraction without contexts (such as named entity recognition tools and knowledge-base lookup) and with contexts (such as tables/lists and subject-verb-object structures) are investigated. Individually, however, these methods extract only part of the answer entities. Treating answer entity extraction as a classification problem, with features drawn from the above extraction methods, performs significantly better than any of the individual extraction methods
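The two-layer model described above can be sketched as a marginalisation over germane documents. The document and entity probabilities below are made-up toy values, not numbers learned by the dissertation's actual system:

```python
# Hypothetical sketch of a two-layer entity retrieval model: the score of a
# candidate answer entity e for query q sums over germane documents d,
#     P(e | q) = sum_d P(e | d) * P(d | q)
# All probabilities here are invented toy values for illustration.

# P(d | q): germane-document identification (e.g. from a learning-to-rank model)
p_doc_given_q = {"d1": 0.6, "d2": 0.3, "d3": 0.1}

# P(e | d): answer-entity extraction from each document
p_entity_given_doc = {
    "d1": {"Alan Turing": 0.7, "ACM": 0.3},
    "d2": {"Alan Turing": 0.2, "IBM": 0.8},
    "d3": {"ACM": 1.0},
}

def entity_scores(p_dq, p_ed):
    """Combine the two layers into a ranked list of answer entities."""
    scores = {}
    for d, p_d in p_dq.items():
        for e, p_e in p_ed.get(d, {}).items():
            scores[e] = scores.get(e, 0.0) + p_d * p_e
    return sorted(scores.items(), key=lambda kv: -kv[1])

for entity, score in entity_scores(p_doc_given_q, p_entity_given_doc):
    print(f"{entity}: {score:.2f}")
```

Evidence for an entity accumulates across germane documents, which is why the factored model can outperform extraction from any single document.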

    Information retrieval in XML documents: taking links into account to select relevant elements

    Get PDF
    156 p. : ill. ; 30 cm
    Our work falls within the field of information retrieval (IR), more specifically information retrieval in semi-structured documents of the XML type. Effective exploitation of the available XML documents must take the structural dimension into account, and this dimension has led to new challenges in IR. Unlike classical IR approaches, which focus on retrieving unstructured content, XML IR combines both textual and structural information to carry out various retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents, returning document components (for example, sections, paragraphs, etc.) instead of a whole document in response to a user query. In traditional IR, the similarity measure is generally based on textual information: it ranks documents according to their degree of relevance using measures such as term similarity or term probability. However, other sources of evidence can be considered when searching for relevant information in documents. For example, hyperlinks have been widely exploited in Web IR. Despite their popularity in the Web context, few approaches exploiting this source of evidence have been proposed for XML IR. The goal of our work is to propose approaches that use links as a source of evidence in XML information retrieval. This thesis aims to answer the following research questions:
    1. Can links be considered a source of evidence in the XML IR context?
    2. Does the use of certain link-analysis algorithms in the XML IR context improve the quality of the results, in particular on the Wikipedia collection?
    3. Which types of links can best be used to improve the relevance of search results?
    4. How should the link score of the elements returned as search results be computed? Should we consider document-to-document links or, more precisely, element-to-element links? What weight should navigational links carry compared to hierarchical links?
    5. What is the impact of using links in a global versus a local context?
    6. How should the link score be integrated into the final score of the returned XML elements?
    7. What is the impact of the quality of the initial results on the behaviour of the proposed formulas?
    To answer these questions, we conducted a statistical study on the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX, which clearly showed that links are a sign of element relevance in the XML IR context. We also implemented three link-analysis algorithms (PageRank, HITS, and SALSA), enabling a comparative study showing that query-dependent approaches outperform global-context approaches. In this thesis we proposed three formulas for computing the link score: the first is called "Topical PageRank"; the second is a distance-based formula; and the third is a weighted-links-based formula. We also proposed three combination formulas, namely a linear formula, a Dempster-Shafer formula, and a fuzzy-based formula. Finally, we carried out a series of experiments. All of these experiments showed that the proposed approaches improve the relevance of the results for the various configurations tested; that query-dependent approaches are the best compared to global-context approaches; that approaches exploiting element-to-element links obtain good results; and that combination formulas based on the principle of uncertainty for computing the final scores of XML elements achieve good performance
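One of the ideas above, linearly combining a content score with a link score computed over element-to-element links, can be sketched as follows. The element graph, the content scores, and the lambda weight are made-up toy values, and plain PageRank stands in for the thesis's more elaborate link-score formulas:

```python
# Illustrative linear combination of a content (text) score with a link score
# from PageRank over element-to-element links. All values below are toy data.

def pagerank(links, damping=0.85, iters=50):
    """Plain PageRank over a dict mapping node -> list of outgoing links."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its mass uniformly
                for v in nodes:
                    new[v] += damping * pr[u] / len(nodes)
        pr = new
    return pr

element_links = {"e1": ["e2"], "e2": ["e3"], "e3": ["e1", "e2"]}
content_score = {"e1": 0.9, "e2": 0.4, "e3": 0.2}  # toy retrieval scores

lam = 0.7  # weight of the content score in the linear combination
link_score = pagerank(element_links)
final = {e: lam * content_score[e] + (1 - lam) * link_score[e]
         for e in content_score}
```

Here `lam` plays the role of the tuning parameter in the linear combination formula: at `lam = 1` the ranking is purely content-based, and lowering it lets link evidence re-rank the elements.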

    Effective retrieval to support learning

    Get PDF
    To use digital resources to support learning, we need to be able to retrieve them. This thesis introduces a new area of research within information retrieval, the retrieval of educational resources from the Web. Successful retrieval of educational resources requires an understanding of how the resources being searched are managed, how searchers interact with those resources and the systems that manage them, and the needs of the people searching. As such, we began by investigating how resources are managed and reused in a higher education setting. This investigation involved running four focus groups with 23 participants, 26 interviews and a survey. The second part of this work is motivated by one of our initial findings; when people look for educational resources, they prefer to search the World Wide Web using a public search engine. This finding suggests users searching for educational resources may be more satisfied with search engine results if only those resources likely to support learning are presented. To provide satisfactory result sets, resources that are unlikely to support learning should not be present. A filter to detect material that is likely to support learning would therefore be useful. Information retrieval systems are often evaluated using the Cranfield method, which compares system performance with a ground truth provided by human judgments. We propose a method of evaluating systems that filter educational resources based on this method. By demonstrating that judges can agree on which resources are educational, we establish that a single human judge for each resource provides a sufficient ground truth. Machine learning techniques are commonly used to classify resources. We investigate how machine learning can be used to classify resources retrieved from the Web as likely or unlikely to support learning. 
We found that reasonable classification performance can be achieved using text extracted from resources in conjunction with Naïve Bayes, AdaBoost, and Random Forest classifiers. We also found that attributes derived from structural elements (hyperlinks and headings found in a resource) did not substantially improve classification over simply using the text
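A text-based filter of the kind described above can be sketched with a small multinomial Naive Bayes classifier. The training texts and labels below are invented examples; a real system would train on text extracted from Web resources:

```python
# Toy multinomial Naive Bayes classifier in the spirit of the educational-
# resource filter described above. Training data is invented for illustration.
import math
from collections import Counter

train = [
    ("lesson plan exercises for students learning algebra", "edu"),
    ("tutorial with worked examples and practice questions", "edu"),
    ("buy cheap tickets online best price guaranteed", "other"),
    ("celebrity gossip and latest entertainment news", "other"),
]

def fit(data):
    """Count class priors and per-class word frequencies."""
    priors, words = Counter(), {}
    for text, label in data:
        priors[label] += 1
        words.setdefault(label, Counter()).update(text.split())
    return priors, words

def predict(text, priors, words):
    """Log-space scoring with add-one (Laplace) smoothing."""
    vocab = {w for c in words.values() for w in c}
    best, best_score = None, float("-inf")
    for label, prior in priors.items():
        total = sum(words[label].values())
        score = math.log(prior / sum(priors.values()))
        for w in text.split():
            score += math.log((words[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

priors, words = fit(train)
print(predict("practice exercises for algebra students", priors, words))  # edu
```

The same scaffolding extends to the structural attributes mentioned above (hyperlinks, headings) by appending them as extra features, though the thesis found such attributes added little over the text alone.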

    Evaluation of effective XML information retrieval

    Get PDF
    XML is being adopted as a common storage format in scientific data repositories, digital libraries, and on the World Wide Web. Accordingly, there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Unlike traditional information retrieval systems where whole documents are usually indexed and retrieved as information units, XML retrieval systems typically index and retrieve document components of varying granularity. To evaluate the effectiveness of such systems, test collections where relevance assessments are provided according to an XML-specific definition of relevance are necessary. Such test collections have been built during four rounds of the INitiative for the Evaluation of XML Retrieval (INEX). There are many different approaches to XML retrieval; most approaches either extend full-text information retrieval systems to handle XML retrieval, or use database technologies that incorporate existing XML standards to handle both XML presentation and retrieval. We present a hybrid approach to XML retrieval that combines text information retrieval features with XML-specific features found in a native XML database. Results from our experiments on the INEX 2003 and 2004 test collections demonstrate the usefulness of applying our hybrid approach to different XML retrieval tasks. A realistic definition of relevance is necessary for meaningful comparison of alternative XML retrieval approaches. The three relevance definitions used by INEX since 2002 comprise two relevance dimensions, each based on topical relevance. We perform an extensive analysis of the two INEX 2004 and 2005 relevance definitions, and show that assessors and users find them difficult to understand. We propose a new definition of relevance for XML retrieval, and demonstrate that a relevance scale based on this definition is useful for XML retrieval experiments. 
Finding the appropriate approach to evaluate XML retrieval effectiveness is the subject of ongoing debate within the XML information retrieval research community. We present an overview of the evaluation methodologies implemented in the current INEX metrics, which reveals that the metrics follow different assumptions and measure different XML retrieval behaviours. We propose a new evaluation metric for XML retrieval and conduct an extensive analysis of the retrieval performance of simulated runs to show what is measured. We compare the evaluation behaviour obtained with the new metric to the behaviours obtained with two of the official INEX 2005 metrics, and demonstrate that the new metric can be used to reliably evaluate XML retrieval effectiveness. To analyse the effectiveness of XML retrieval in different application scenarios, we use evaluation measures in our new metric to investigate the behaviour of XML retrieval approaches under the following two scenarios: the ad-hoc retrieval scenario, exploring the activities carried out as part of the INEX 2005 Ad-hoc track; and the multimedia retrieval scenario, exploring the activities carried out as part of the INEX 2005 Multimedia track. For both application scenarios we show that, although different values for retrieval parameters are needed to achieve the optimal performance, the desired textual or multimedia information can be effectively located using a combination of XML retrieval approaches
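The kind of graded-relevance evaluation discussed above can be illustrated with a plain cumulated-gain measure. This is a generic sketch for illustration only, not one of the INEX metrics, which additionally account for element overlap and near-misses:

```python
# Generic graded-relevance evaluation sketch: cumulated gain of a ranked run,
# normalised by the gain of an ideal ranking. Judgments and run are toy data.

def cumulated_gain(run, judgments, rank):
    """Sum of graded relevance scores of the top-`rank` retrieved elements."""
    return sum(judgments.get(e, 0.0) for e in run[:rank])

def normalised_cg(run, judgments, rank):
    """Cumulated gain divided by the gain of an ideal ranking of same depth."""
    ideal = sorted(judgments.values(), reverse=True)
    ideal_gain = sum(ideal[:rank])
    if ideal_gain == 0:
        return 0.0
    return cumulated_gain(run, judgments, rank) / ideal_gain

judgments = {"e1": 2.0, "e2": 1.0, "e5": 2.0}   # toy graded assessments
run = ["e1", "e3", "e5", "e2"]                   # one system's ranked output
print(normalised_cg(run, judgments, 3))          # gain (2+0+2) vs ideal (2+2+1)
```

Normalising by the ideal ranking is what makes scores comparable across topics with different numbers of relevant elements, which is the prerequisite for the cross-run comparisons described above.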