4 research outputs found

    Overlap-aware global df estimation in distributed information retrieval systems

    No full text
    Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures over the entire document collection, as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably the global document frequencies of individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty addressed is that the local collections of different peers may overlap arbitrarily, as many peers may choose to gather popular documents that fall into their specific interest profiles. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and its benefit for improved search result ranking.
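The key property the abstract relies on is that hash sketches are duplicate-insensitive: two peers' sketches can be merged with a bitwise OR, so a document held by both peers is counted once. A minimal single-sketch illustration of this idea (the paper's actual estimator is more elaborate and averages over many sketches; the function names and peer data here are hypothetical):

```python
import hashlib

def rho(x: int) -> int:
    """1-based position of the least-significant set bit; 0 for x == 0."""
    return (x & -x).bit_length()

def fm_sketch(doc_ids) -> int:
    """Flajolet-Martin-style bitmap over (possibly duplicated) document ids."""
    bitmap = 0
    for doc in doc_ids:
        h = int.from_bytes(hashlib.sha1(doc.encode()).digest()[:8], "big")
        pos = rho(h)
        if pos:                      # h == 0 is astronomically unlikely
            bitmap |= 1 << (pos - 1)
    return bitmap

def estimate(bitmap: int, phi: float = 0.77351) -> int:
    """Estimate the number of distinct items from the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return round(2 ** r / phi)

# Peers build sketches locally; OR-merging counts overlapping documents once.
peer_a = fm_sketch(f"doc{i}" for i in range(0, 600))
peer_b = fm_sketch(f"doc{i}" for i in range(400, 1000))   # overlaps peer_a
merged = peer_a | peer_b
```

Because inserting the same document twice sets the same bit, the merged sketch estimates the size of the union rather than the sum of the (overlapping) local counts, which is exactly what a global document-frequency estimate needs.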

    The application of workflows to digital heritage systems

    Digital heritage systems usually handle a rich and varied mix of digital objects, accompanied by complex and intersecting workflows and processes. However, they usually lack effective workflow management within their components, as evidenced by the absence of integrated solutions that include workflow components. There are several reasons for this limited use of workflow management, including technical challenges, the unique nature of each digital resource, and the demands of the environments and infrastructure in which such systems operate. This thesis investigates the concept of utilizing Workflow Management Systems (WfMS) within Digital Library Systems, and more specifically in online Digital Heritage Resources. The research work involved the design and development of a novel experimental WfMS to test the viability of effective workflow management of the complex processes that exist in digital library and heritage resources. This rarely studied area is covered by analyzing evolving workflow management technologies and paradigms. The different operational and technological aspects of these systems are evaluated, with a focus on the areas that traditional systems often fail to address. A digital heritage resource was created to test a novel concept called DISPLAYS (Digital Library Services for Playing with Antiquity and Shared Heritage), which provides creation, archival, exposition, presentation and interaction services for digital heritage collections. Based on DISPLAYS, a specific digital heritage resource was created to validate the concept and, more importantly, to act as a test bed for validating workflow management in digital heritage resources. This DISPLAYS-type implementation was called the Reanimating Cultural Heritage resource, whose three core components are the archival, retrieval and presentation components. To validate workflow management and its concepts, a limited version of these Reanimating Cultural Heritage components was implemented within a workflow management host to test whether workflow technology is a viable choice for managing control and dataflow within a digital heritage system: this was successfully demonstrated.
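The "control and dataflow" a WfMS host manages for such a pipeline can be sketched as a dependency graph whose tasks run in topological order, each consuming its predecessor's output. The task names and data below are hypothetical stand-ins for the archival, retrieval and presentation components; the thesis's actual host is not reproduced here:

```python
from graphlib import TopologicalSorter

# Hypothetical tasks mirroring an archival -> retrieval -> presentation chain.
def archive(item):    return {"archived": item}
def retrieve(store):  return {"record": store["archived"]}
def present(record):  return f"Displaying {record['record']}"

# Control flow: each key lists the tasks it depends on.
workflow = {"archive": set(), "retrieve": {"archive"}, "present": {"retrieve"}}
tasks = {"archive": archive, "retrieve": retrieve, "present": present}

def run(workflow, tasks, initial):
    """Execute a linear-chain workflow; dataflow passes each result onward."""
    results = {}
    for name in TopologicalSorter(workflow).static_order():
        deps = workflow[name]
        arg = initial if not deps else results[next(iter(deps))]
        results[name] = tasks[name](arg)
    return results

out = run(workflow, tasks, "artefact-042")
```

A real WfMS adds persistence, error handling and multi-input joins on top of this ordering logic, but the topological execution is the core of control-flow management.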

    Scalability of findability: decentralized search and retrieval in large information networks

    Amid the rapid growth of information today is the increasing challenge for people to navigate and make sense of its magnitude. The dynamics and heterogeneity of large information spaces such as the Web challenge information retrieval in these environments. Collecting information in advance and centralizing IR operations are hardly possible, because systems are dynamic and information is distributed. While monolithic search systems continue to struggle with today's scalability problems, the future of search likely requires a decentralized architecture in which many information systems can participate. As individual systems interconnect to form a global structure, finding relevant information in distributed environments becomes a problem concerning not only information retrieval but also complex networks. Understanding network connectivity provides guidance on how decentralized search and retrieval methods can function in these information spaces. This dissertation studies one aspect of the scalability challenges facing classic information retrieval models and presents a decentralized, organic view of information systems pertaining to search in large-scale networks. It focuses on the impact of network structure on search performance and investigates a phenomenon we refer to as the Clustering Paradox, in which the topology of interconnected systems imposes a scalability limit. Experiments involving large-scale benchmark collections provide evidence of the Clustering Paradox in the IR context. In an increasingly large, distributed environment, decentralized search for relevant information can continue to function well only when systems interconnect in certain ways. Relying on partial indexes of distributed systems, a certain level of network clustering enables very efficient and effective discovery of relevant information in large-scale networks; increasing or reducing network clustering degrades search performance. At this level of network clustering, search time is well explained by a poly-logarithmic relation to network size, indicating high scalability potential for search in a continuously growing information space.
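A poly-logarithmic relation means search time follows T(N) = a·(ln N)^b, so log T is linear in log log N and the exponent can be read off as a regression slope. A small sketch under that assumed model (the coefficients a and b here are made up for illustration, not the dissertation's fitted values):

```python
import math

# Hypothetical poly-log model: T(N) = a * (ln N)^b.
def search_time(n, a=2.0, b=1.5):
    return a * math.log(n) ** b

sizes = [10**3, 10**4, 10**5, 10**6]
xs = [math.log(math.log(n)) for n in sizes]      # log log N
ys = [math.log(search_time(n)) for n in sizes]   # log T

# Least-squares slope recovers the exponent b.
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b_hat = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
```

The scalability claim follows from the model's shape: growing the network from 10^3 to 10^6 nodes multiplies (ln N) by only 2, so with b = 1.5 search time rises by a factor of about 2^1.5 ≈ 2.8 while the network grows a thousandfold.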

    Trust-aware information retrieval in peer-to-peer environments

    Information Retrieval in P2P environments (P2PIR) has become an active field of research due to the observation that P2P architectures have the potential to become as appealing as traditional centralised architectures. P2P networks are formed by voluntary peers that exchange information and accomplish various tasks. Some of them may be malicious peers spreading untrustworthy resources. However, existing P2PIR systems focus only on finding relevant documents, while the trustworthiness of documents and document providers has been ignored. Without prior experience of and knowledge about the network, users run the risk of reviewing, downloading and using untrustworthy documents, even if these documents are relevant. The work presented in this dissertation provides the first integrated framework for trust-aware Information Retrieval in P2P environments, which can retrieve not only relevant but also trustworthy documents. The proposed content trust models extend an existing P2P trust management system, PeerTrust, in the context of P2PIR to compute the trust values of documents and document providers for given queries. A method is proposed to estimate global term statistics, which are integrated with existing relevance-based approaches for document ranking and peer selection. Different approaches are explored to find optimal parameter settings in the proposed trust-aware P2PIR systems. Moreover, system architectures and data management protocols are designed to implement the proposed trust-aware P2PIR systems in structured P2P networks. The experimental evaluation demonstrates that P2PIR can benefit significantly from trust-aware systems: the possibility of untrustworthy documents appearing in the top-ranked result list is markedly reduced, and the proposed estimated global term statistics provide acceptable and competitive retrieval accuracy in different P2PIR scenarios.
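One simple way to rank by both relevance and trust, as the framework's goal suggests, is to score each document on a combination of its relevance score, its document trust value and its provider's trust value. The linear weighting below is a hypothetical illustration, not the dissertation's actual scoring function, and the weights and sample documents are invented:

```python
def trust_aware_score(relevance, doc_trust, provider_trust, alpha=0.6, beta=0.3):
    """Hypothetical linear combination of relevance and trust signals.

    alpha weights relevance, beta weights document trust, and the
    remainder weights the provider's trust value.
    """
    gamma = 1.0 - alpha - beta
    return alpha * relevance + beta * doc_trust + gamma * provider_trust

docs = [
    {"id": "d1", "rel": 0.9, "doc_trust": 0.2, "peer_trust": 0.3},  # relevant but dubious
    {"id": "d2", "rel": 0.7, "doc_trust": 0.9, "peer_trust": 0.8},  # slightly less relevant, trusted
]
ranked = sorted(
    docs,
    key=lambda d: trust_aware_score(d["rel"], d["doc_trust"], d["peer_trust"]),
    reverse=True,
)
```

With these weights the trusted document d2 outranks the more relevant but untrustworthy d1, which is precisely the re-ranking effect the evaluation reports: untrustworthy documents are pushed out of the top of the result list.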