113 research outputs found

    Document Translation for Cross-Language Text Retrieval at the University of Maryland

    Get PDF
    The University of Maryland participated in three TREC-6 tasks: ad hoc retrieval, cross-language retrieval, and spoken document retrieval. The principal focus of the work was evaluation of a cross-language text retrieval technique based on fully automatic machine translation. The results show that approaches based on document translation can be approximately as effective as approaches based on query translation, but that additional work will be needed to develop a solid basis for choosing between the two in specific applications. Ad hoc and spoken document retrieval results are also presented

    Exploiting Query Structure and Document Structure to Improve Document Retrieval Effectiveness

    Get PDF
    In this paper we present a systematic analysis of document retrieval using unstructured and structured queries within the score region algebra (SRA) structured retrieval framework. The behavior of di®erent retrieval models, namely Boolean, tf.idf, GPX, language models, and Okapi, is tested using the transparent SRA framework in our three-level structured retrieval system called TIJAH. The retrieval models are implemented along four elementary retrieval aspects: element and term selection, element score computation, score combination, and score propagation. The analysis is performed on a numerous experiments evaluated on TREC and CLEF collections, using manually generated unstructured and structured queries. Unstructured queries range from the short title queries to long title + description + narrative queries. For generating structured queries we exploit the knowledge of the document structure and the content used to semantically describe or classify documents. We show that such structured information can be utilized in retrieval engines to give more precise answers to user queries then when using unstructured queries

    A multi-collection latent topic model for federated search

    Get PDF
    Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichletallocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information for each collection such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state of the art collection selection algorithm

    A design of Personal Information Push-Delivery System on the Internet

    Get PDF
    The Internet provides a powerful disseminative ability for users to acquire information more efficiently and quickly. However, an increasingly large scale of data induces certain problems as users face a more serious information overload situation. By using an information retrieval technique, information push-delivery provides a good solution for users to acquire rich information from the Internet. In fact, providing personal service for users is one of the mo st important issues in an electronic commerce (EC) environment. In order to increase interaction between themselves and customers, many enterprises provide personal services to improve management performance and competitiveness. However, since the customers have different preferences for information received from the Internet, it seems necessary to design a personal information system to guarantee that the customers can receive the desired information. In this study, the fuzzy retrieval and similarity measurement techniques are applied to design a personal information push-delivery system. The data resulting from testing a group of students at Da-Yeh University, Changhua, Taiwan, shows that the satisfaction degree for the received information for all participants was 70%. These results indicate that the proposed system can effectively provide correct and interesting information to users

    A Taxonomy of Information Retrieval Models and Tools

    Get PDF
    Information retrieval is attracting significant attention due to the exponential growth of the amount of information available in digital format. The proliferation of information retrieval objects, including algorithms, methods, technologies, and tools, makes it difficult to assess their capabilities and features and to understand the relationships that exist among them. In addition, the terminology is often confusing and misleading, as different terms are used to denote the same, or similar, tasks. This paper proposes a taxonomy of information retrieval models and tools and provides precise definitions for the key terms. The taxonomy consists of superimposing two views: a vertical taxonomy, that classifies IR models with respect to a set of basic features, and a horizontal taxonomy, which classifies IR systems and services with respect to the tasks they support. The aim is to provide a framework for classifying existing information retrieval models and tools and a solid point to assess future developments in the field

    On the effect of INQUERY term-weighting scheme on query-sensitive similarity measures

    Get PDF
    Cluster-based information retrieval systems often use a similarity measure to compute the association among text documents. In this thesis, we focus on a class of similarity measures named Query-Sensitive Similarity (QSS) measures. Recent studies have shown QSS measures to positively influence the outcome of a clustering procedure. These studies have used QSS measures in conjunction with the ltc term-weighting scheme. Several term-weighting schemes have superseded the ltc term-weighing scheme and demonstrated better retrieval performance relative to the latter. We test whether introducing one of these schemes, INQUERY, will offer any benefit over the ltc scheme when used in the context of QSS measures. The testing procedure uses the Nearest Neighbor (NN) test to quantify the clustering effectiveness of QSS measures and the corresponding term-weighting scheme. The NN tests are applied on certain standard test document collections and the results are tested for statistical significance. On analyzing results of the NN test relative to those obtained for the ltc scheme, we find several instances where the INQUERY scheme improves the clustering effectiveness of QSS measures. To be able to apply the NN test, we designed a software test framework, Ferret, by complementing the features provided by dtSearch, a search engine. The test framework automates the generation of NN coefficients by processing standard test document collection data. We provide an insight into the construction and working of the Ferret test framework

    Virtual WWW Documents: a Concept to Explicit the Structure of WWW Sites

    Get PDF
    http://www.emse.fr/~beigbeder/PUBLIS/1999-BCS-IRSG-p185-doan-v1.pdfInternational audienceThis paper shows a new concept of a virtual WWW document (VWD), as a set of WWW pages representing a logical information space, generally dealing with one particular domain. The VWD is described using metadata in the XML syntax and will be accessed through a metadata.class file, stored at the root level of WWW sites. We'll suggest how the VWD can improve information retrieval on the WWW and reduce the network load generated by the robots. We describe a prototype implemented in JAVA, within an application in the environmental domain. The exchanges of such metadata lay in a flexible architecture based on two kinds of robots : generalists and specialists that collect and organize this metadata, in order to localize the resources on the WWW. They will contribute to the overall auto-organizing information process by exchanging their indices, therefore forwarding their knowledge each other
    corecore