    Non-Mergeable Sketching for Cardinality Estimation

    Cardinality estimation is perhaps the simplest non-trivial statistical problem that can be solved via sketching. Industrially deployed sketches like HyperLogLog, MinHash, and PCSA are mergeable, which means that large data sets can be sketched in a distributed environment and then merged into a single sketch of the whole data set. In the last decade a variety of sketches have been developed that are non-mergeable but attractive for other reasons: they are simpler, their cardinality estimates are strictly unbiased, and they have substantially lower variance. We evaluate sketching schemes on a reasonably level playing field, in terms of their memory-variance product (MVP), i.e., the sketch size in bits multiplied by the relative variance of the estimate. E.g., a sketch that occupies 5m bits and whose relative variance is 2/m (standard error √(2/m)) has an MVP of 10. Our contributions are as follows.
    - Cohen [Edith Cohen, 2015] and Ting [Daniel Ting, 2014] independently discovered what we call the Martingale transform for converting a mergeable sketch into a non-mergeable sketch. We present a simpler way to analyze the limiting MVP of Martingale-type sketches.
    - Pettie and Wang proved that the Fishmonger sketch [Seth Pettie and Dingyu Wang, 2021] has the best MVP, H_0/I_0 ≈ 1.98, among a class of mergeable sketches called "linearizable" sketches. (H_0 and I_0 are precisely defined constants.) We prove that the Martingale transform is optimal in the non-mergeable world, and that Martingale Fishmonger in particular is optimal among linearizable sketches, with an MVP of H_0/2 ≈ 1.63. E.g., this is circumstantial evidence that to achieve 1% standard error, we cannot do better than a 2 kilobyte sketch.
    - Martingale Fishmonger is neither simple nor practical. We develop a new mergeable sketch called Curtain that strikes a nice balance between simplicity and efficiency, and prove that Martingale Curtain has limiting MVP ≈ 2.31. It can be updated with O(1) memory accesses and it has lower empirical variance than Martingale LogLog, a practical non-mergeable version of HyperLogLog.
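
The idea behind a Martingale-type sketch can be illustrated with a small example. The following is a minimal sketch, assuming LogLog-style registers and the commonly described update rule in which the running estimate grows by 1/P whenever the registers change, where P is the probability (taken before the change) that a previously unseen element would change them. The class name `MartingaleLogLog`, the use of SHA-256 as the hash, and the default of 1024 registers are assumptions for illustration and are not taken from the paper.

```python
import hashlib


class MartingaleLogLog:
    """LogLog-style registers with a running martingale estimate (illustrative)."""

    def __init__(self, num_registers: int = 1024):
        # num_registers is assumed to be a power of two, so that the index bits
        # and the remaining hash bits are independent.
        self.m = num_registers
        self.registers = [0] * num_registers
        self.estimate = 0.0
        # Probability that a previously unseen element changes the sketch:
        # (1/m) * sum_j 2^(-r_j); it equals 1 while all registers are 0.
        self.change_probability = 1.0

    @staticmethod
    def _hash64(item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        index = h % self.m
        rest = h // self.m
        # rho is 1 + the number of trailing zero bits, so Pr[rho > r] = 2^(-r).
        rho = 1
        while (rest & 1) == 0 and rho < 50:
            rest >>= 1
            rho += 1
        old = self.registers[index]
        if rho > old:
            # The sketch changes: credit 1/P using P from before the change.
            self.estimate += 1.0 / self.change_probability
            self.registers[index] = rho
            self.change_probability += (2.0 ** -rho - 2.0 ** -old) / self.m


sketch = MartingaleLogLog()
for i in range(20000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")  # duplicates never change the registers or the estimate
print(round(sketch.estimate))
```

Because the estimate depends on the order in which the registers changed, two such sketches cannot be combined into a sketch of the union, which is the sense in which the transform gives up mergeability. The MVP arithmetic quoted above can be checked the same way: at an MVP of 1.63, a 1% standard error corresponds to a relative variance of 0.01² = 0.0001, so the sketch needs about 1.63 / 0.0001 = 16,300 bits, roughly 2 kilobytes.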

    Preserving user privacy in social media data processing

    Social media data is used for analytics, e.g., in science, by public authorities, or in industry. Privacy is often treated as a secondary problem, yet protecting the privacy of social media users is demanded by both law and ethics. To prevent subsequent abuse, theft, or public exposure of collected datasets, privacy-aware data processing is crucial. This dissertation presents a concept for processing social media data with users' privacy in mind. It features a data storage concept based on the cardinality estimator HyperLogLog: social media data is stored so that individual items cannot be extracted from it, and only the cardinality of items within a certain set can be estimated, with set operations over multiple sets extending the analytical range. Applying this method requires defining the scope of the result before any data is gathered. This prevents the data from being misused for other purposes at a later point in time and thus follows the principles of privacy by design. This work further shows methods to increase privacy through the implementation of abstraction layers. An included case study demonstrates that the presented methods are suitable for application in the field.

    Contents:
    1 Introduction: 1.1 Problem; 1.2 Research objectives; 1.3 Document structure
    2 Related work: 2.1 The notion of privacy; 2.2 Privacy by design; 2.3 Differential privacy; 2.4 Geoprivacy; 2.5 Probabilistic Data Structures
    3 Concept and methods: 3.1 Collateral data; 3.2 Disposable data; 3.3 Cardinality estimation; 3.4 Data precision; 3.5 Extendability; 3.6 Abstraction; 3.7 Time consideration
    4 Summary of publications: 4.1 HyperLogLog Introduction; 4.2 VOST Case Study; 4.3 Real-time Streaming; 4.4 Abstraction Layers; 4.5 VGIscience Book Chapter; 4.6 Supplementary Software Materials
    5 Discussion: 5.1 Prevent accidental data disclosure; 5.2 Feasibility in the field; 5.3 Adjustability for different use cases; 5.4 Limitations of HLL; 5.5 Security; 5.6 Outlook and further research
    6 Conclusion
    Appendix
    References
    Publication
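
The storage idea described in the abstract above, keeping only a HyperLogLog sketch per set and combining sets by union, can be illustrated as follows. This is a minimal sketch under stated assumptions, not the dissertation's software: the class `HyperLogLog`, the SHA-256 hash, the precision of 12 bits, and the example user IDs are all hypothetical, and the estimator uses the standard raw HyperLogLog formula with linear counting for small sets.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: stores register maxima, never the items themselves."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        # rank = position of the leftmost 1-bit within the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other: "HyperLogLog") -> "HyperLogLog":
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)            # constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)     # linear counting for small sets
        return raw


# Hypothetical example: users seen at two locations, with 5,000 users in common.
place_a, place_b = HyperLogLog(), HyperLogLog()
for i in range(10000):
    place_a.add(f"user-{i}")
for i in range(5000, 15000):
    place_b.add(f"user-{i}")
print(round(place_a.count()), round(place_a.union(place_b).count()))
```

The registers hold only small integers derived from hashes, so the stored structure answers "how many distinct users" for a set or a union of sets without listing the users. The choice of what each sketch counts has to be made before data is added, which matches the requirement to fix the scope of the result in advance.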

    Indexing methods for web archives

    There have been numerous efforts recently to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, containing versions of documents that span long time periods. Web archives present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools which can efficiently access and search them. In this work, we are interested in indexing methods for supporting text-search workloads over web archives, such as time-travel queries and phrase queries. To this end we make the following contributions:
    • Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.
    • We develop query-optimization techniques for time-travel queries, called partition selection, which maximize recall at any given query-execution stage.
    • We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
    We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
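
The semantics of a time-travel query, as described in the abstract above, can be made concrete with a small example. This is a minimal sketch, assuming each document version carries a validity interval and that a query returns the versions valid at the query time that contain all keywords. The names `Posting` and `TimeTravelIndex`, the YYYYMM time encoding, and the sample documents are hypothetical. The lookup is a plain scan of each posting list; it does not implement index sharding or partition selection.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    begin: int  # first month in which the version is valid, encoded as YYYYMM
    end: int    # first month in which it is no longer valid (exclusive)


class TimeTravelIndex:
    """Inverted index whose postings carry the validity interval of a version."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of Posting

    def add_version(self, doc_id: str, text: str, begin: int, end: int) -> None:
        for term in set(text.lower().split()):
            self.postings[term].append(Posting(doc_id, begin, end))

    def query(self, keywords: str, time: int) -> list:
        """Return (doc_id, begin) of versions valid at `time` containing all keywords."""
        result = None
        for term in keywords.lower().split():
            alive = {
                (p.doc_id, p.begin)
                for p in self.postings.get(term, [])
                if p.begin <= time < p.end
            }
            result = alive if result is None else result & alive
        return sorted(result or [])


index = TimeTravelIndex()
index.add_version("mpi-home", "mpii saarland research", 200801, 200909)
index.add_version("mpi-home", "mpii saarbruecken informatics", 200909, 201201)
index.add_version("uni-home", "saarland university", 200701, 201001)

print(index.query("mpii saarland", 200906))  # the first mpi-home version matches
print(index.query("mpii saarland", 201001))  # no version contains both terms then
```

In this naive layout every posting of a term is read and then filtered by time, so the cost of a query grows with the full history of the term rather than with the number of versions valid at the query time. That wasted work is what an organization tailored to time-travel queries aims to reduce.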

    Advanced methods for query routing in peer-to-peer information retrieval

    One of the most challenging problems in peer-to-peer networks is query routing: effectively and efficiently identifying peers that can return high-quality local results for a given query. Existing methods from the areas of distributed information retrieval and metasearch engines do not adequately address the peculiarities of a peer-to-peer network. The main contributions of this thesis are as follows: 1. Methods for query routing that take into account the mutual overlap of different peers' collections, 2. Methods for query routing that take into account the correlations between multiple terms, 3. Comparative evaluation of different query routing methods. Our experiments confirm the superiority of our novel query routing methods over the prior state of the art, in particular in the context of peer-to-peer Web search.
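
Why mutual overlap matters for query routing can be shown with a toy example. This is a minimal sketch, assuming each peer's matching documents are known as explicit sets of document IDs and that peers are chosen greedily by how many not-yet-covered documents they add. The function `route_query` and the sample peers are hypothetical, and the greedy rule is a generic illustration rather than the method developed in the thesis. A real peer-to-peer system would work from compact summaries of the collections instead of full ID sets.

```python
def route_query(peer_results: dict[str, set[str]], k: int) -> list[str]:
    """Greedily pick up to k peers, each maximizing the number of new documents."""
    covered: set[str] = set()
    chosen: list[str] = []
    candidates = dict(peer_results)
    while candidates and len(chosen) < k:
        # sorted() makes ties deterministic: the alphabetically first peer wins.
        best = max(sorted(candidates), key=lambda peer: len(candidates[peer] - covered))
        if not candidates[best] - covered:
            break  # no remaining peer contributes anything new
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen


# Peer B is large but mirrors peer A; peer C is small but disjoint from both.
peers = {
    "A": {"d1", "d2", "d3", "d4"},
    "B": {"d1", "d2", "d3"},
    "C": {"d5", "d6"},
}
print(route_query(peers, 2))
```

Ranking peers by result count alone would select A and B and return only four distinct documents. Accounting for overlap selects A and C and returns six.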
