15 research outputs found

    Learning to merge search results for efficient Distributed Information Retrieval

    Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM to learn a function that merges results based only on information that is readily available in the result pages: the ranks, titles, summaries and URLs. By not downloading additional information, such as the full documents, we reduce bandwidth usage. CORI and Round Robin merging served as our baselines; surprisingly, our results show that the SVM methods do not improve over those baselines.
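    The Round Robin baseline named in this abstract can be illustrated with a minimal sketch. It assumes each server returns a ranked list of URLs and that duplicates are dropped on first sight; the function name `round_robin_merge` and the sample lists are hypothetical and not taken from the paper.

```python
from itertools import zip_longest


def round_robin_merge(result_lists, limit=None):
    """Interleave ranked lists: rank 1 of every server, then rank 2, and so on."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are still drained.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None or url in seen:
                continue  # skip padding and results already merged
            seen.add(url)
            merged.append(url)
    return merged if limit is None else merged[:limit]


# Hypothetical result lists from three servers.
server_a = ["a1", "a2", "a3"]
server_b = ["b1", "a2"]
server_c = ["c1"]
merged = round_robin_merge([server_a, server_b, server_c])
print(merged)
```

    Round Robin uses only the rank positions, which is why it costs no extra bandwidth; a learned merger would replace the fixed interleaving with a score predicted from ranks, titles, summaries and URLs.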

    Updating collection representations for federated search

    To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection centrally by a set of vocabularies or sampled documents. Retrieval accuracy therefore depends on how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, no solution has yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing the necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.

    Overlap-aware global df estimation in distributed information retrieval systems

    No full text
    Peer-to-Peer (P2P) search engines and other forms of distributed information retrieval (IR) are gaining momentum. Unlike in centralized IR, it is difficult and expensive to compute statistical measures about the entire document collection, as it is widely distributed across many computers in a highly dynamic network. On the other hand, such network-wide statistics, most notably the global document frequencies of individual terms, would be highly beneficial for ranking global search results that are compiled from different peers. This paper develops an efficient and scalable method for estimating global document frequencies in a large-scale, highly dynamic P2P network with autonomous peers. The main difficulty addressed in this paper is that the local collections of different peers may overlap arbitrarily, as many peers may choose to gather popular documents that fall into their specific interest profile. Our method is based on hash sketches as an underlying technique for compact data synopses, and exploits specific properties of hash sketches for duplicate elimination in the counting process. We report on experiments with real Web data that demonstrate the accuracy of our estimation method and its benefit for search result ranking.
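    To make the duplicate-elimination property concrete, here is a minimal sketch of a Flajolet-Martin style hash sketch. It is an illustration under assumptions, not the paper's implementation: the class `HashSketch`, the helper `sketch_of`, the choice of SHA-1, the 64 bitmaps and the document identifiers are all hypothetical. Each peer would keep one such sketch per term, filled with the identifiers of its local documents containing that term.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias-correction constant


def lowest_set_bit(value: int) -> int:
    """Position of the least-significant 1-bit; 63 for the (unlikely) value 0."""
    return (value & -value).bit_length() - 1 if value else 63


class HashSketch:
    def __init__(self, num_bitmaps: int = 64):
        self.bitmaps = [0] * num_bitmaps

    def add(self, doc_id: str) -> None:
        digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Low bits choose a bitmap, the remaining bits choose the bit to set.
        bucket, rest = h % len(self.bitmaps), h // len(self.bitmaps)
        self.bitmaps[bucket] |= 1 << lowest_set_bit(rest)

    def union(self, other: "HashSketch") -> "HashSketch":
        # Bitwise OR: a document held by both peers sets the same bit twice,
        # so it is counted only once in the merged sketch.
        merged = HashSketch(len(self.bitmaps))
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self) -> float:
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # index of the lowest zero bit
                r += 1
            total += r
        m = len(self.bitmaps)
        return m / PHI * 2 ** (total / m)


def sketch_of(doc_ids) -> HashSketch:
    sketch = HashSketch()
    for doc_id in doc_ids:
        sketch.add(doc_id)
    return sketch


# Two peers whose local collections overlap in 1000 of 3000 distinct documents.
peer_a = sketch_of(f"doc{i}" for i in range(0, 2000))
peer_b = sketch_of(f"doc{i}" for i in range(1000, 3000))
combined = peer_a.union(peer_b)
print(round(combined.estimate()))  # approximates 3000, not the naive sum 4000
```

    Because the union is a bitwise OR, merging sketches from any number of peers gives exactly the sketch of the combined document set, whatever the overlap. The estimate itself is approximate, with an error that shrinks as the number of bitmaps grows.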

    Investigation on Applying Modular Ontology to Statistical Language Model for Information Retrieval

    The objective of this research is to provide a novel approach to improving retrieval performance by combining ontologies with the statistical language model (SLM). The proposed method consists of two major processes, namely ontology-based query expansion (OQE) and ontology-based document classification (ODC). The experiments required the development of an independent search tool that combines OQE and ODC in a traditional SLM-based information retrieval (IR) process using a Web document collection. This research considers the ongoing challenges of modular-ontology-enhanced SLM-based search and makes three contributions. The first concerns how to apply modular ontology to query expansion in a bespoke language model search tool (LMST). The second considers how to incorporate OQE into the language model to improve search performance. The third examines how to use such semantic-based document classification to improve smoothing accuracy. The role of the ontology in this research is to provide formally described domains of interest that serve as context, to enhance system query effectiveness.

    Decentralized link analysis in peer-to-peer web search networks

    Analyzing the authority or reputation of entities that are connected by a graph structure and ranking these entities is an important issue that arises in the Web, in Web 2.0 communities, and in other applications. The problem is typically addressed by computing the dominant eigenvector of a matrix that is suitably derived from the underlying graph, or by performing a full spectral decomposition of the matrix. Although such analyses could be performed by a centralized server, there are good reasons for running these computations in a decentralized manner across many peers, such as scalability, privacy, and censorship. There exist a number of approaches for speeding up the analysis by partitioning the graph into disjoint fragments. However, such methods are not suitable for a peer-to-peer network, where overlap among the fragments might occur. In addition, peer-to-peer approaches need to consider network characteristics, such as peers being unaware of other peers' contents, susceptibility to malicious attacks, and network dynamics (so-called churn). In this thesis we make the following major contributions. We present JXP, a decentralized algorithm for computing authority scores of entities distributed in a peer-to-peer (P2P) network that allows peers to have overlapping content and requires no a priori knowledge of other peers' content. We also show the benefits of JXP in the Minerva distributed Web search engine. We present an extension of JXP, coined TrustJXP, that contains a reputation model in order to deal with misbehaving peers. We present another extension of JXP that handles dynamics in peer-to-peer networks, as well as an algorithm for estimating the current number of entities in the network. This thesis also presents novel methods for embedding JXP in peer-to-peer networks and applications.
    We present an approach for creating links among peers, forming semantic overlay networks, where peers are free to decide which connections they create and which they want to avoid based on various usefulness estimators. We show how peer-to-peer applications, like the JXP algorithm, can greatly benefit from these additional semantic relations.
    Computing authority or reputation values for the nodes of a graph that links different entities is of great interest in Web applications, e.g. in the analysis of hyperlink graphs, Web 2.0 portals, social networks, and other applications. At its core, the solution often consists of computing the dominant eigenvector of a matrix derived from the underlying graph. Although these analyses can be computed in a centralized fashion, there are good reasons to distribute the computations across several nodes of a network, in particular with regard to scalability, privacy, and censorship. The literature contains several methods that speed up the computation by decomposing the underlying graph into non-overlapping subgraphs. This assumption is not realistic in peer-to-peer systems, however, because the individual peers build their graphs in an unsynchronized manner, which inherently leads to greater or lesser overlaps between the graphs. Moreover, peer-to-peer systems are by definition a loosely coupled association of different users (peers) distributed across the entire Internet, so network characteristics, network dynamics, and possible attacks by malicious users must be taken into account. In this thesis we make the following fundamental contributions. We present JXP, a distributed algorithm for computing authority measures over entities in a peer-to-peer network. We present Trust-JXP, an extension of JXP equipped with a model for computing reputation values that are used to identify maliciously acting users. We consider how JXP can be made robust against changes in the network and how the number of distinct entities in the network can be estimated efficiently. In addition, we describe novel approaches for integrating JXP into existing peer-to-peer networks. We present a method that lets peers decide which connections to other peers are useful and which should be avoided. This method is based on various quality indicators, and we show how peer-to-peer applications, for example JXP, can benefit from these additional relations.
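    The "dominant eigenvector" computation that this abstract starts from can be sketched with centralized power iteration in the style of PageRank. This is not JXP, which runs decentralized across peers with overlapping graph fragments. The function name `pagerank`, the damping factor 0.85, the iteration count and the toy graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Authority scores by power iteration on a dict {node: [outgoing targets]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * score[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its score uniformly over all nodes.
                share = damping * score[v] / n
                for t in nodes:
                    nxt[t] += share
        score = nxt
    return score


# Hypothetical four-page link graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "c" collects links from a, b and d
```

    The sketch needs the whole graph in one place. The difficulty the thesis addresses is obtaining comparable scores when each peer sees only a fragment of the graph and the fragments overlap.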

    Advanced methods for query routing in peer-to-peer information retrieval

    One of the most challenging problems in peer-to-peer networks is query routing: effectively and efficiently identifying peers that can return high-quality local results for a given query. Existing methods from the areas of distributed information retrieval and metasearch engines do not adequately address the peculiarities of a peer-to-peer network. The main contributions of this thesis are as follows: 1. Methods for query routing that take into account the mutual overlap of different peers' collections, 2. Methods for query routing that take into account the correlations between multiple terms, 3. Comparative evaluation of different query routing methods. Our experiments confirm the superiority of our novel query routing methods over the prior state-of-the-art, in particular in the context of peer-to-peer Web search.
    One of the most pressing problems in peer-to-peer networks is query routing: effectively and efficiently identifying those peers that can deliver high-quality local results for a given query. The previously known methods from distributed information retrieval and metasearch engines do not do justice to the particularities of peer-to-peer networks. The main contributions of this thesis fall into the following areas: 1. Query routing that takes into account the mutual overlap of different peers' collections, 2. Query routing that takes into account the correlations between different terms, 3. Comparative evaluation of different query routing methods. Our experiments confirm the superiority of the methods developed in this thesis over the previously known methods, in particular in the context of peer-to-peer Web search.
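    The idea behind contribution 1, routing that accounts for overlap between peers' collections, can be illustrated with a greedy sketch. It is a simplification under stated assumptions rather than the thesis's method: the function `route_query`, the per-peer quality scores and the exact document-id sets are hypothetical, and a real system would work with compact synopses instead of exact sets.

```python
def route_query(peers, max_peers):
    """Greedily pick peers by quality times the number of not-yet-covered documents.

    peers maps a peer name to (quality, set of matching document ids).
    """
    covered = set()
    chosen = []
    candidates = dict(peers)

    def benefit(name):
        quality, docs = candidates[name]
        return quality * len(docs - covered)  # only novel documents count

    while candidates and len(chosen) < max_peers:
        best = max(sorted(candidates), key=benefit)  # sorted() makes ties deterministic
        if benefit(best) == 0:
            break  # no remaining peer would contribute anything new
        chosen.append(best)
        covered |= candidates.pop(best)[1]
    return chosen


# Hypothetical peers: p1 and p2 hold almost the same documents.
peers = {
    "p1": (0.9, {1, 2, 3, 4}),
    "p2": (0.8, {1, 2, 3, 4, 5}),
    "p3": (0.5, {6, 7, 8}),
}
print(route_query(peers, max_peers=2))
```

    Ranking by quality alone would select p1 and p2, which return nearly identical results. Discounting already-covered documents sends the second request to p3 instead.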

    Trust-aware information retrieval in peer-to-peer environments

    Information Retrieval in P2P environments (P2PIR) has become an active field of research due to the observation that P2P architectures have the potential to become as appealing as traditional centralised architectures. P2P networks are formed by voluntary peers that exchange information and accomplish various tasks. Some of them may be malicious peers spreading untrustworthy resources. However, existing P2PIR systems focus only on finding relevant documents, while the trustworthiness of documents and document providers has been ignored. Without prior experience and knowledge about the network, users run the risk of reviewing, downloading and using untrustworthy documents, even if these documents are relevant. The work presented in this dissertation provides the first integrated framework for trust-aware Information Retrieval in P2P environments, which can retrieve documents that are not only relevant but also trustworthy. The proposed content trust models extend an existing P2P trust management system, PeerTrust, in the context of P2PIR to compute the trust values of documents and document providers for given queries. A method is proposed to estimate global term statistics, which are integrated with existing relevance-based approaches for document ranking and peer selection. Different approaches are explored to find optimal parameter settings in the proposed trust-aware P2PIR systems. Moreover, system architectures and data management protocols are designed to implement the proposed trust-aware P2PIR systems in structured P2P networks. The experimental evaluation demonstrates that P2PIR can benefit significantly from trust-aware P2PIR systems, which considerably reduce the likelihood of untrustworthy documents appearing in the top-ranked result list. The proposed estimated global term statistics provide acceptable and competitive retrieval accuracy in different P2PIR scenarios.
    EThOS - Electronic Theses Online Service; ORS; School Scholarship; GB; United Kingdom