21 research outputs found

    Experiments in terabyte searching, genomic retrieval and novelty detection for TREC 2004

    In TREC 2004, Dublin City University took part in three tracks: Terabyte (in collaboration with University College Dublin), Genomic and Novelty. In this paper we discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we developed in the last year to support our experiments in large-scale, distributed information retrieval, and which underlies all of the track experiments described in this document.

    Enhancing access to the Bibliome: the TREC 2004 Genomics Track

    BACKGROUND: The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that will allow researchers to improve and better understand failures of their systems. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of the Text Retrieval Conference (TREC) 2004, a forum for evaluation of IR research systems, where retrieval in the genomics domain has recently begun to be assessed. RESULTS: A total of 27 research groups submitted 47 different runs. The most effective runs, as measured by the primary evaluation measure of mean average precision (MAP), used a combination of domain-specific and general techniques. The best MAP obtained by any run was 0.4075. Techniques that expanded queries with gene name lists as well as words from related articles had the best efficacy. However, many runs performed more poorly than a simple baseline run, indicating that careful selection of system features is essential. CONCLUSION: Various approaches to ad hoc retrieval provide a diversity of efficacy. The TREC Genomics Track and its test collection resources provide tools that allow improvement in information retrieval systems
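
The abstract above names mean average precision (MAP) as the primary evaluation measure. As a rough illustration only, the sketch below computes MAP from ranked runs and relevance judgments using the standard textbook definition; the function names and the dictionary-based data layout are assumptions made for this example, not the track's official evaluation tooling.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of per-topic average precision over all judged topics.

    runs:      hypothetical mapping topic -> ranked list of document ids
    judgments: hypothetical mapping topic -> set of relevant document ids
    """
    scores = [
        average_precision(runs.get(topic, []), relevant)
        for topic, relevant in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Topics with no retrieved documents contribute an average precision of zero here, which is one common convention; other conventions exist.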

    A study of selection noise in collaborative web search

    Collaborative Web search uses the past search behaviour (queries and selections) of a community of users to promote search results that are relevant to the community. The extent to which these promotions are likely to be relevant depends on how reliably past search behaviour can be captured. We consider this issue by analysing the results of collaborative Web search in circumstances where the behaviour of searchers is unreliable.

    Detection and management of redundancy for information retrieval

    The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text. We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections. We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done. In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. 
We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies. Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage.
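
To make the idea of document fingerprinting for co-derivative detection more concrete, here is a minimal sketch of a generic approach: hash every overlapping word n-gram and compare the resulting hash sets. This is not the thesis's lossless algorithm and not the grainy hash vector; the n-gram length, the use of truncated SHA-1, and the Jaccard overlap score are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Overlapping word n-grams of the text (n=3 is an arbitrary choice)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(text, n=3):
    """Set of 64-bit hashes, one per word n-gram in the text."""
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, n)
    }


def resemblance(a, b, n=3):
    """Jaccard overlap of two fingerprints; high values suggest shared passages."""
    fa, fb = fingerprint(a, n), fingerprint(b, n)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

A pair of documents whose resemblance exceeds some chosen threshold would be flagged as a candidate co-derivative pair. Practical schemes usually keep only a subset of the hashes to save space, which this sketch does not attempt.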

    TRECVID 2004 experiments in Dublin City University

    In this paper, we describe our experiments for TRECVID 2004 for the Search task. In the interactive search task, we developed two versions of a video search/browse system based on the Físchlár Digital Video System: one with text- and image-based searching (System A); the other with only image (System B). These two systems produced eight interactive runs. In addition we submitted ten fully automatic supplemental runs and two manual runs. A.1, Submitted Runs: • DCUTREC13a_{1,3,5,7} for System A, four interactive runs based on text and image evidence. • DCUTREC13b_{2,4,6,8} for System B, also four interactive runs based on image evidence alone. • DCUTV2004_9, a manual run based on filtering faces from an underlying text search engine for certain queries. • DCUTV2004_10, a manual run based on manually generated queries processed automatically. • DCU_AUTOLM{1,2,3,4,5,6,7}, seven fully automatic runs based on language models operating over ASR text transcripts and visual features. • DCUauto_{01,02,03}, three fully automatic runs based on exploring the benefits of multiple sources of text evidence and automatic query expansion. A.2, In the interactive experiment it was confirmed that text and image based retrieval outperforms an image-only system. In the fully automatic runs, DCUauto_{01,02,03}, it was found that integrating ASR, CC and OCR text into the text ranking outperforms using ASR text alone. Furthermore, applying automatic query expansion to the initial results of ASR, CC, OCR text further increases performance (MAP), though not at high rank positions. For the language model-based fully automatic runs, DCU_AUTOLM{1,2,3,4,5,6,7}, we found that interpolated language models perform marginally better than other tested language models and that combining image and textual (ASR) evidence was found to marginally increase performance (MAP) over textual models alone. 
For our two manual runs we found that employing a face filter reduced MAP compared with employing textual evidence alone, and that manually generated textual queries improved MAP over fully automatic runs, though the improvement was marginal. A.3, Our conclusions from our fully automatic text-based runs suggest that integrating ASR, CC and OCR text into the retrieval mechanism boosts retrieval performance over ASR alone. In addition, a text-only Language Modelling approach such as DCU_AUTOLM1 will outperform our best conventional text search system. From our interactive runs we conclude that textual evidence is an important lever for locating relevant content quickly, but that image evidence, if used by experienced users, can aid retrieval performance. A.4, We learned that incorporating multiple text sources improves over ASR alone and that an LM approach which integrates shot text, neighbouring shots and entire video contents provides even better retrieval performance. These findings will influence how we integrate textual evidence into future Video IR systems. It was also found that a system based on image evidence alone can perform reasonably and, given good query images, can aid retrieval performance.

    Novelty Detection by Latent Semantic Indexing

    As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. Aiming at refining raw search results by filtering out old news and saving only the novel messages, it saves modern people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationship automatically with the help of language resources. To apply LSI which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changing constantly. As an imitation of real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those having appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided into years and types in order to take the differences between them into account. 
Results showed that LSI, though very effective in traditional information retrieval tasks, yielded only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. An examination of the co-occurrence matrix attributed this limited performance to the unique features of microblogs. Their short sentence lengths and restricted dictionary made it very hard to recover and exploit latent semantic information via traditional data structures.

    Index ordering by query-independent measures

    An ever-increasing amount of data is being produced from various data sources, and this data must be organised effectively if we hope to search through it. Traditional information retrieval approaches search through all available data in a particular collection in order to find the most suitable results; however, for particularly large collections this may be extremely time-consuming. Our proposed solution to this problem is to search only a limited amount of the collection at query time, in order to speed up the retrieval process, while limiting the loss in retrieval efficacy (in terms of accuracy of results). We do this by first identifying the most "important" documents within the collection, and then sorting the documents in order of their "importance" in the collection. In this way we can limit the amount of information to search through by eliminating the documents of lesser importance, which should not only make the search more efficient but should also limit any loss in retrieval accuracy. In this thesis we investigate various query-independent methods that may indicate the importance of a document in a collection. The more accurate the measure is at determining an important document, the more effectively we can eliminate documents from the retrieval process, improving the query throughput of the system while providing a high level of accuracy in the returned results. The effectiveness of these approaches is evaluated using the datasets provided by the Terabyte track at the Text REtrieval Conference (TREC).
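
The approach described above (order documents by a query-independent importance score, then search only the leading part of that ordering) can be sketched as follows. This is a toy illustration under stated assumptions: the importance scores are simply given as input, the term-count scoring is a stand-in for a real ranking function, and the names `order_by_importance`, `search_prefix` and `fraction` are hypothetical.

```python
def order_by_importance(docs, importance):
    """Return document ids sorted from most to least important.

    importance: hypothetical mapping doc id -> query-independent score.
    """
    return sorted(docs, key=lambda doc_id: importance[doc_id], reverse=True)


def search_prefix(query, docs, ordered_ids, fraction=0.5):
    """Score only the leading fraction of the importance-ordered collection."""
    cutoff = max(1, int(len(ordered_ids) * fraction))
    terms = query.lower().split()
    results = []
    for doc_id in ordered_ids[:cutoff]:
        words = docs[doc_id].lower().split()
        # Toy relevance score: total occurrences of the query terms.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            results.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in results]
```

Lowering `fraction` reduces the number of documents examined per query, at the risk of missing relevant documents that were given a low importance score.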

    Físchlár-DiamondTouch: collaborative video searching on a table

    Statistical Modeling to Information Retrieval for Searching from Big Text Data and Higher Order Inference for Reliability

    This thesis examines two research projects: probabilistic information retrieval modeling and third-order inference on reliability. In the first part of this dissertation, two research topics in information retrieval are investigated and evaluated on large-scale text data sets. First, we conduct an in-depth study of the relationship between document length and document relevance to the user's need. Two statistical methods are proposed which incorporate document length as a substantial weighting factor to achieve higher retrieval performance. Second, we use the properties of the survival function to propose a cost-based re-ranking method that promotes ranking diversity for biomedical information retrieval, and to model the proximity between query terms to improve retrieval performance. In extensive experiments on standard TREC collections, our proposed models perform significantly better than the classical probabilistic information retrieval models. In the second part of this dissertation, a small-sample asymptotic method is proposed for higher-order inference in the stress-strength reliability model, R=P(Y<X), where X and Y are independently distributed. A penalized likelihood method is proposed to handle the numerical complications of maximizing the constrained likelihood model. Simulation studies are conducted on two distributions: the Burr type X distribution and the exponentiated exponential distribution. Results from the simulation studies show that the proposed method is very accurate even when the sample sizes are small.

    Novelty detection in topic tracking

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010. Thesis (Master's) -- Bilkent University, 2010. Includes bibliographical references (leaves 51-56). News portals provide many services to news consumers, such as information retrieval, personalized information filtering, summarization and news clustering. Additionally, many news portals that use multiple sources enable their users to evaluate developments from different perspectives by enriching the content. However, the increasing number of sources and incoming news items makes it difficult for news consumers to find news of interest in news portals. For this reason, different types of organizational operations are applied to ease browsing over the news. New event detection and tracking (NEDT) is one of these operations; it aims to organize news with respect to the events that they report. NEDT by itself may still not be enough to satisfy news consumers' needs, because information may be repeated in the tracking news of a topic when multiple sources are used. In this thesis, we investigate the use of novelty detection (ND) in tracking news of a topic. To this end, we built a Turkish ND experimental collection, BilNov, consisting of 59 topics with an average of 51 tracking news items. We propose three methods: a cosine similarity-based ND method, a language model-based ND method and a cover coefficient-based ND method. Additionally, we experiment with category-based threshold learning, which has not previously been studied in the ND literature. We also provide some experimental pointers for ND in Turkish, such as restriction of document vector lengths and smoothing methods. Finally, we experiment on the TREC Novelty Track 2004 dataset. 
Experiments conducted using BilNov show that the language model-based ND method significantly outperforms the other two methods, and that category-based threshold learning gives promising results compared with general threshold learning. Aksoy, Cem. M.S.
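
As a rough illustration of what a cosine similarity-based novelty detector might look like, the sketch below processes documents in arrival order and marks a document as novel when its highest cosine similarity to any earlier document falls below a threshold. The raw term-count vectors, the whitespace tokenisation and the threshold value of 0.6 are assumptions for this example; the thesis's actual weighting, preprocessing and learned thresholds are not reproduced here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novel_documents(stream, threshold=0.6):
    """Return one flag per document: True if it is judged novel.

    Each document is compared only with the documents that arrived before it.
    threshold is a hypothetical value, not one taken from the thesis.
    """
    seen = []
    flags = []
    for text in stream:
        vector = Counter(text.lower().split())
        max_sim = max((cosine(vector, old) for old in seen), default=0.0)
        flags.append(max_sim < threshold)
        seen.append(vector)
    return flags
```

The first document in a stream is always flagged as novel under this scheme, since there is nothing earlier to compare it with.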