
    Peer-to-Peer Information Retrieval: An Overview

    Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

    Fast Succinct Retrieval and Approximate Membership Using Ribbon

    A retrieval data structure for a static function f: S → {0,1}^r supports queries that return f(x) for any x ∈ S. Retrieval data structures can be used to implement a static approximate membership query data structure (AMQ), i.e., a Bloom filter alternative, with false positive rate 2^{-r}. The information-theoretic lower bound for both tasks is r|S| bits. While succinct theoretical constructions using (1+o(1))r|S| bits were known, these could not achieve very small overheads in practice, because they have an unfavorable space-time tradeoff hidden in the asymptotic costs or because small overheads would only be reached for physically impossible input sizes. With bumped ribbon retrieval (BuRR), we present the first practical succinct retrieval data structure. In an extensive experimental evaluation BuRR achieves space overheads well below 1% while being faster than most previously used retrieval data structures (typically with space overheads at least an order of magnitude larger) and faster than classical Bloom filters (with space overhead ≥ 44%). This efficiency, including favorable constants, stems from a combination of simplicity, word parallelism, and high locality. We additionally describe homogeneous ribbon filter AMQs, which are even simpler and faster at the price of slightly larger space overhead.
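
    The following is a minimal Python sketch of the linear-algebra idea underlying retrieval data structures of this kind, not BuRR itself: each key hashes to a small set of table positions, and construction solves a linear system over GF(2) so that the XOR of the r-bit table entries at those positions equals f(x). All names and the hashing scheme are illustrative; real ribbon constructions use contiguous position ranges and bumping to reach their speed and tiny space overheads, which this sketch omits.

        import hashlib

        def positions(key, n, k=3):
            # k pseudo-random table positions per key (illustrative hashing)
            pos = set()
            for i in range(k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                pos.add(int.from_bytes(h[:8], "big") % n)
            return pos

        def build(f, n):
            # Solve, over GF(2), for a table where the XOR of the entries at
            # positions(x) equals f(x); rows are bitmasks stored as ints.
            pivots = {}  # pivot position -> (row mask, r-bit value)
            for key, value in f.items():
                mask = 0
                for p in positions(key, n):
                    mask |= 1 << p
                while mask:
                    p = mask.bit_length() - 1
                    if p not in pivots:
                        pivots[p] = (mask, value)
                        break
                    pm, pv = pivots[p]
                    mask ^= pm
                    value ^= pv
                else:
                    if value != 0:
                        raise ValueError("unsolvable; grow n or rehash")
            table = [0] * n
            # back-substitute; all non-pivot bits of a row lie below its pivot,
            # so increasing pivot order sees only finalized entries
            for p in sorted(pivots):
                mask, value = pivots[p]
                acc, m = value, mask ^ (1 << p)
                while m:
                    q = m.bit_length() - 1
                    acc ^= table[q]
                    m ^= 1 << q
                table[p] = acc
            return table

        def query(table, key):
            acc = 0
            for p in positions(key, len(table)):
                acc ^= table[p]
            return acc  # equals f(key) for stored keys, arbitrary otherwise

    Storing r-bit fingerprints as the values turns this into an AMQ as described above: query(table, y) == fingerprint(y) always holds for y ∈ S and holds with probability 2^{-r} otherwise.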

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
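
    As a concrete instance of the hashing motif named above, the profiling step of many genomics pipelines reduces to k-mer counting with a hash table. The Python sketch below is illustrative, not from the paper; its parallel form is exactly the irregular pattern described above, with local increments becoming asynchronous updates to a distributed hash table.

        from collections import Counter

        def kmer_profile(seq, k=21):
            # count every length-k substring (k-mer) of a DNA sequence
            counts = Counter()
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
            return counts

        # In a distributed setting, each rank hashes its local k-mers and
        # sends increments to the rank owning each key: asynchronous
        # updates to a shared data structure, as discussed above.
        profile = kmer_profile("ACGTACGTGACGTT", k=4)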

    Discreet - Pub/Sub for Edge Systems

    The number of devices connected to the Internet has been growing exponentially over the last few years. Today, the amount of information available to users has reached a point that makes it impossible to consume it all, showing that we need better ways to filter what kind of information is sent our way. At the same time, while users are online and access all this information, their actions are also being collected, scrutinized and commercialized with little regard for privacy. This thesis addresses those issues in the context of a decentralized Publish/Subscribe solution for edge systems. Working at the edge of the Internet aims to prevent centralized control by a single entity and lessen the chance of abuse. Our goal was to devise a solution that achieves efficient message delivery, with good load-balancing properties, without revealing its participants' subscription interests, so as to preserve user privacy. Our solution uses cryptography and probabilistic data structures as a way to obfuscate event topics and user subscriptions. We modeled a cooperative solution, where publisher and subscriber nodes work in concert to route events among themselves, by leveraging a one-hop structured overlay. Through an experimental evaluation, we assess the scalability and general performance of the proposed algorithms, including latency, false negative and false positive rates, and other useful metrics.
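
    The abstract names cryptography and probabilistic data structures as the obfuscation tools but does not spell out the construction; the Python sketch below is a generic illustration of one such technique (a Bloom filter over hashed topics), not the thesis's actual protocol. A subscriber advertises only the filter, so a routing peer can match an event's topic against it without ever seeing the subscription list, and a hit is indistinguishable from a false positive.

        import hashlib

        class BloomFilter:
            # tiny Bloom filter stored as an m-bit integer
            def __init__(self, m=1024, k=4):
                self.m, self.k, self.bits = m, k, 0

            def _positions(self, item):
                for i in range(self.k):
                    h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                    yield int.from_bytes(h[:8], "big") % self.m

            def add(self, item):
                for p in self._positions(item):
                    self.bits |= 1 << p

            def might_contain(self, item):
                return all(self.bits >> p & 1 for p in self._positions(item))

        # the subscriber advertises only the filter, never its topic names
        subscriptions = BloomFilter()
        subscriptions.add("sports/football")
        # a routing peer tests an event topic against the advertised filter;
        # it learns at most "possibly subscribed", never the full list
        assert subscriptions.might_contain("sports/football")

    A real deployment would likely also key the hashes so that peers cannot enumerate candidate topics offline, which is where the cryptographic half of such a solution comes in.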

    Random hypergraphs for hashing-based data structures

    This thesis concerns dictionaries and related data structures that rely on providing several random possibilities for storing each key. Imagine information on a set S of m = |S| keys should be stored in n memory locations, indexed by [n] = {1,…,n}. Each object x ∈ S is assigned a small set e(x) ⊆ [n] of locations by a random hash function, independent of other objects. Information on x must then be stored in the locations from e(x) only. It is possible that too many objects compete for the same locations, in particular if the load c = m/n is high. Successfully storing all information may then be impossible. For most distributions of e(x), however, success or failure can be predicted very reliably, since the success probability is close to 1 for loads c less than a certain load threshold c^* and close to 0 for loads greater than this threshold. We mainly consider two types of data structures:
    • A cuckoo hash table is a dictionary data structure where each key x ∈ S is stored together with an associated value f(x) in one of the memory locations with an index from e(x). The distribution of e(x) is controlled by the hashing scheme. We analyse three known hashing schemes and determine their exact load thresholds. The schemes are unaligned blocks, double hashing and a scheme for dynamically growing key sets. (A minimal sketch of the basic cuckoo insertion idea follows this abstract.)
    • A retrieval data structure also stores a value f(x) for each x ∈ S. This time, the values stored in the memory locations from e(x) must satisfy a linear equation that characterises the value f(x). The resulting data structure is extremely compact, but unusual. It cannot answer questions of the form “is y ∈ S?”. Given a key y it returns a value z. If y ∈ S, then z = f(y) is guaranteed; otherwise z may be an arbitrary value. We consider two new hashing schemes, where the elements of e(x) are contained in one or two contiguous blocks. This yields good access times on a word RAM and high cache efficiency.
    An important question is whether these types of data structures can be constructed in linear time. The success probability of a natural linear-time greedy algorithm exhibits, once again, threshold behaviour with respect to the load c. We identify a hashing scheme that leads to a particularly high threshold value in this regard. In the mathematical model, the memory locations [n] correspond to vertices, and the sets e(x) for x ∈ S correspond to hyperedges. Three properties of the resulting hypergraphs turn out to be important: peelability, solvability and orientability. Therefore, large parts of this thesis examine how hyperedge distribution and load affect the probabilities with which these properties hold, and derive corresponding thresholds. Translated back into the world of data structures, we achieve low access times, high memory efficiency and low construction times. We complement and support the theoretical results by experiments.
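
    The following is a hedged Python sketch of the classic two-choice special case of cuckoo hashing from the first bullet; every name is illustrative, and the thesis analyses more general distributions of e(x). Each key has two hashed candidate slots; an insertion that finds both occupied evicts a resident and relocates it to its alternative slot, repeating until a free slot is found or the eviction chain grows too long.

        class CuckooHashTable:
            # minimal two-choice cuckoo hash table (omits update-in-place)
            def __init__(self, n=16):
                self.slots = [None] * n

            def _h(self, key, i):
                # i-th of two hash functions; illustrative choice
                return hash((i, key)) % len(self.slots)

            def get(self, key):
                for i in (0, 1):
                    slot = self.slots[self._h(key, i)]
                    if slot is not None and slot[0] == key:
                        return slot[1]
                return None

            def put(self, key, value, max_kicks=32):
                entry, i = (key, value), 0
                for _ in range(max_kicks):
                    pos = self._h(entry[0], i)
                    if self.slots[pos] is None:
                        self.slots[pos] = entry
                        return
                    # evict the resident; it moves to its other choice next
                    self.slots[pos], entry = entry, self.slots[pos]
                    i = 1 if self._h(entry[0], 0) == pos else 0
                raise RuntimeError("eviction chain too long: rehash or grow")

    For this plain two-choice scheme the load threshold is known to be c^* = 1/2; the thesis determines exact thresholds for richer schemes such as unaligned blocks and double hashing.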

    Nearest Neighbors Using Compact Sparse Codes

    In this paper, we propose a novel scheme for approximate nearest neighbor (ANN) retrieval based on dictionary learning and sparse coding. Our key innovation is to build compact codes, dubbed SpANN codes, using the active set of sparse coded data. These codes are then used to index an inverted file table for fast retrieval. The active sets are often found to be sensitive to small differences among data points, resulting in only near-duplicate retrieval. We show that this sensitivity is related to the coherence of the dictionary; smaller coherence results in better retrieval. To this end, we propose a novel dictionary learning formulation with incoherence constraints and an efficient method to solve it. Experiments are conducted on two state-of-the-art computer vision datasets with 1M data points and show an order of magnitude improvement in retrieval accuracy without sacrificing memory and query time compared to state-of-the-art methods.
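
    As a rough illustration of the indexing idea, not the paper's algorithm, the NumPy sketch below approximates a point's active set by the k dictionary atoms most correlated with it (over a random unit-norm dictionary; the paper instead learns an incoherent dictionary and uses a proper sparse coder, which is what makes the active sets stable). The active atom indices key an inverted file, and a query is scored only against points sharing an atom with it.

        import numpy as np
        from collections import defaultdict

        def active_set(D, x, k=4):
            # indices of the k atoms most correlated with x
            # (crude stand-in for a real sparse coder such as OMP)
            return np.argsort(-np.abs(D.T @ x))[:k]

        def build_index(D, data, k=4):
            index = defaultdict(list)  # atom id -> list of point ids
            for i, x in enumerate(data):
                for a in active_set(D, x, k):
                    index[int(a)].append(i)
            return index

        def search(D, index, data, q, k=4, topn=5):
            # candidates share at least one active atom with the query
            cands = {i for a in active_set(D, q, k) for i in index.get(int(a), [])}
            return sorted(cands, key=lambda i: np.linalg.norm(data[i] - q))[:topn]

        rng = np.random.default_rng(0)
        D = rng.standard_normal((64, 256))   # 256 atoms in R^64 (illustrative)
        D /= np.linalg.norm(D, axis=0)       # unit-norm atoms
        data = rng.standard_normal((1000, 64))
        index = build_index(D, data)
        nearest = search(D, index, data, data[0])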

    Coupling of heterotrophic bacteria to phytoplankton bloom development at different pCO2 levels: a mesocosm study

    The predicted rise in anthropogenic CO2 emissions will increase CO2 concentrations and decrease seawater pH in the upper ocean. Recent studies have revealed effects of pCO2-induced changes in seawater chemistry on a variety of marine life forms, in particular calcifying organisms. To test whether the predicted increase in pCO2 will directly or indirectly (via changes in phytoplankton dynamics) affect the abundance, activities, and community composition of heterotrophic bacteria during phytoplankton bloom development, we aerated mesocosms with CO2 to obtain triplicates at three different partial pressures of CO2 (pCO2): 350 μatm (1×CO2), 700 μatm (2×CO2) and 1050 μatm (3×CO2). The development of a phytoplankton bloom was initiated by the addition of nitrate and phosphate. In accordance with an elevated carbon-to-nitrogen drawdown at increasing pCO2, bacterial production (BPP) of free-living and attached bacteria as well as cell-specific BPP (csBPP) of attached bacteria were related to the C:N ratio of suspended matter. These relationships differed significantly among treatments. However, bacterial abundance and activities were not statistically different among treatments. Only the community structure of free-living bacteria changed with pCO2, whereas that of attached bacteria seemed to be independent of pCO2 but tightly coupled to phytoplankton bloom development. Our findings imply that changes in pCO2, although reflected in the community structure of free-living bacteria, do not directly affect bacterial activity. Furthermore, the activity and dynamics of heterotrophic bacteria, especially attached bacteria, were tightly correlated with phytoplankton development and, hence, may also potentially depend on changes in pCO2.