151 research outputs found

    Modernizing Password Usage In Computing

    A study of password usage and cryptography in computing culminates in the development of a password manager that improves users' password security. PassMan offers two-factor encrypted storage of user passwords and account information via the YubiKey, a common hardware authentication device, as well as login auto-typing, password strength calculation, and customizable password generation. *Includes CD

    Resiliency Mechanisms for In-Memory Column Stores

    The key objective of database systems is to manage data reliably, while high query throughput and low query latency are core requirements. To date, database research has mostly concentrated on the latter. However, due to the constant shrinking of transistor feature sizes, integrated circuits are becoming increasingly unreliable, and transient hardware errors in the form of multi-bit flips are becoming more prominent. A 2013 study of a large high-performance cluster with around 8500 nodes measured a failure rate of 40 FIT per DRAM device. For that system, this means that a single- or multi-bit flip occurs every 10 hours, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being actively exploited by the RowHammer attack. It has been shown that memory cells are more prone to bit flips than logic gates, and several surveys found multi-bit flip events in main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business-related data and query intermediate results are kept solely in fast main memory, such systems are in great danger of delivering corrupt results to their users. Hardware techniques cannot be scaled to compensate for the exponentially increasing error rates. In other domains, there is increasing interest in software-based solutions to this problem, but the proposed methods come with huge runtime and/or storage overheads, which are unacceptable for in-memory data management systems. In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, that suit the requirements of in-memory data management systems.
The most important requirement is the effectiveness of the codes in detecting bit flips. We meet this goal through AN codes, which exhibit better and more adaptable error detection capabilities than those found in today's hardware. The second most important goal is efficiency in terms of coding latency. We meet this by introducing fundamental performance improvements to AN codes and by vectorizing the operations of both chosen codes. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the rest of the data management system and the user can stay oblivious of any error detection. This includes both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing, with very little impact on query latency. AN coding allows intermediate results to be recoded with virtually no performance penalty. We support our claims with exhaustive runtime and throughput measurements throughout the thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as in-memory data management systems.
Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.

    Table of contents:
    1 INTRODUCTION: 1.1 Contributions of this Thesis; 1.2 Outline
    2 PROBLEM DESCRIPTION AND RELATED WORK: 2.1 Reliable Data Management on Reliable Hardware; 2.2 The Shift Towards Unreliable Hardware; 2.3 Hardware-Based Mitigation of Bit Flips; 2.4 Data Management System Requirements; 2.5 Software-Based Techniques For Handling Bit Flips; 2.5.1 Operating System-Level Techniques; 2.5.2 Compiler-Level Techniques; 2.5.3 Application-Level Techniques; 2.6 Summary and Conclusions
    3 ANALYSIS OF CODING TECHNIQUES: 3.1 Selection of Error Codes; 3.1.1 Hamming Coding; 3.1.2 XOR Checksums; 3.1.3 AN Coding; 3.1.4 Summary and Conclusions; 3.2 Probabilities of Silent Data Corruption; 3.2.1 Probabilities of Hamming Codes; 3.2.2 Probabilities of XOR Checksums; 3.2.3 Probabilities of AN Codes; 3.2.4 Concrete Error Models; 3.2.5 Summary and Conclusions; 3.3 Throughput Considerations; 3.3.1 Test Systems Descriptions; 3.3.2 Vectorizing Hamming Coding; 3.3.3 Vectorizing XOR Checksums; 3.3.4 Vectorizing AN Coding; 3.3.5 Summary and Conclusions; 3.4 Comparison of Error Codes; 3.4.1 Effectiveness; 3.4.2 Efficiency; 3.4.3 Runtime Adaptability; 3.5 Performance Optimizations for AN Coding; 3.5.1 The Modular Multiplicative Inverse; 3.5.2 Faster Softening; 3.5.3 Faster Error Detection; 3.5.4 Comparison to Original AN Coding; 3.5.5 The Multiplicative Inverse Anomaly; 3.6 Summary
    4 BIT FLIP DETECTING STORAGE: 4.1 Column Store Architecture; 4.1.1 Logical Data Types; 4.1.2 Storage Model; 4.1.3 Data Representation; 4.1.4 Data Layout; 4.1.5 Tree Index Structures; 4.1.6 Summary; 4.2 Hardened Data Storage; 4.2.1 Hardened Physical Data Types; 4.2.2 Hardened Lightweight Compression; 4.2.3 Hardened Data Layout; 4.2.4 UDI Operations; 4.2.5 Summary and Conclusions; 4.3 Hardened Tree Index Structures; 4.3.1 B-Tree Verification Techniques; 4.3.2 Justification For Further Techniques; 4.3.3 The Error Detecting B-Tree; 4.4 Summary
    5 BIT FLIP DETECTING QUERY PROCESSING: 5.1 Column Store Query Processing; 5.2 Bit Flip Detection Opportunities; 5.2.1 Early Onetime Detection; 5.2.2 Late Onetime Detection; 5.2.3 Continuous Detection; 5.2.4 Miscellaneous Processing Aspects; 5.2.5 Summary and Conclusions; 5.3 Hardened Intermediate Results; 5.3.1 Materialization of Hardened Intermediates; 5.3.2 Hardened Bitmaps; 5.4 Summary
    6 END-TO-END EVALUATION: 6.1 Prototype Implementation; 6.1.1 AHEAD Architecture; 6.1.2 Diversity of Physical Operators; 6.1.3 One Concrete Operator Realization; 6.1.4 Summary and Conclusions; 6.2 Performance of Individual Operators; 6.2.1 Selection on One Predicate; 6.2.2 Selection on Two Predicates; 6.2.3 Join Operators; 6.2.4 Grouping and Aggregation; 6.2.5 Delta Operator; 6.2.6 Summary and Conclusions; 6.3 Star Schema Benchmark Queries; 6.3.1 Query Runtimes; 6.3.2 Improvements Through Vectorization; 6.3.3 Storage Overhead; 6.3.4 Summary and Conclusions; 6.4 Error Detecting B-Tree; 6.4.1 Single Key Lookup; 6.4.2 Key Value-Pair Insertion; 6.5 Summary
    7 SUMMARY AND CONCLUSIONS: 7.1 Future Work
    A APPENDIX: A.1 List of Golden As; A.2 More on Hamming Coding; A.2.1 Code examples; A.2.2 Vectorization
    BIBLIOGRAPHY; LIST OF FIGURES; LIST OF TABLES; LIST OF LISTINGS; LIST OF ACRONYMS; LIST OF SYMBOLS; LIST OF DEFINITION
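
The AN coding named in this abstract can be illustrated with a minimal sketch. It shows the general idea only: a value is hardened by multiplying it with a constant A, a bit flip is detected because the result is no longer a multiple of A, and decoding can use the modular multiplicative inverse of A instead of a division. The constant A = 641, the 32-bit code width, and all function names are illustrative assumptions and are not taken from the thesis.

```python
# Minimal AN-coding sketch. A = 641 and the 32-bit code width are
# illustrative assumptions, not parameters taken from the thesis.
A = 641                                # odd, hence invertible modulo 2**32
CODE_BITS = 32
MASK = (1 << CODE_BITS) - 1
A_INV = pow(A, -1, 1 << CODE_BITS)     # modular multiplicative inverse of A
DATA_MAX = MASK // A                   # largest value whose code word fits


def encode(value: int) -> int:
    """Harden a value by multiplying it with A."""
    if not 0 <= value <= DATA_MAX:
        raise ValueError("value does not fit into the code width")
    return value * A


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Recover the value with one multiplication instead of a division."""
    return (code_word * A_INV) & MASK


def is_valid_via_inverse(code_word: int) -> bool:
    """Detect errors without a modulo: corrupted words decode out of range."""
    return decode(code_word) <= DATA_MAX
```

For example, `encode(12345)` yields `7913145`; flipping any single bit of that word makes both validity checks fail, because a power of two is never a multiple of an odd A greater than 1.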

    Random hypergraphs for hashing-based data structures

    This thesis concerns dictionaries and related data structures that rely on providing several random possibilities for storing each key. Imagine that information on a set S of m = |S| keys is to be stored in n memory locations, indexed by [n] = {1,…,n}. Each key x ∈ S is assigned a small set e(x) ⊆ [n] of locations by a random hash function, independently of the other keys. Information on x must then be stored in the locations from e(x) only. It is possible that too many keys compete for the same locations, in particular if the load c = m/n is high. Successfully storing all information may then be impossible. For most distributions of e(x), however, success or failure can be predicted very reliably, since the success probability is close to 1 for loads c below a certain load threshold c* and close to 0 for loads above this threshold. We mainly consider two types of data structures:
    • A cuckoo hash table is a dictionary data structure where each key x ∈ S is stored together with an associated value f(x) in one of the memory locations with an index from e(x). The distribution of e(x) is controlled by the hashing scheme. We analyse three known hashing schemes and determine, for the first time, their exact load thresholds. The schemes are unaligned blocks, double hashing and a scheme for dynamically growing key sets.
    • A retrieval data structure also stores a value f(x) for each x ∈ S. This time, the values stored in the memory locations from e(x) must satisfy a linear equation that characterises the value f(x). The resulting data structure is extremely compact, but unusual: it cannot answer questions of the form “is y ∈ S?”. Given a key y, it returns a value z. If y ∈ S, then z = f(y) is guaranteed; otherwise z may be an arbitrary value. We consider two new hashing schemes, where the elements of e(x) are contained in one or two contiguous blocks. This yields good access times on a word RAM and high cache efficiency.
    An important question is whether these types of data structures can be constructed in linear time. The success probability of a natural linear-time greedy algorithm exhibits, once again, threshold behaviour with respect to the load c. We identify a hashing scheme that leads to a particularly high threshold value in this regard. In the mathematical model, the memory locations [n] correspond to vertices, and the sets e(x) for x ∈ S correspond to hyperedges. Three properties of the resulting hypergraphs turn out to be important: peelability, solvability and orientability. Therefore, large parts of this thesis examine how hyperedge distribution and load affect the probabilities with which these properties hold, and derive corresponding thresholds. Translated back into the world of data structures, we achieve low access times, high memory efficiency and low construction times. We complement and support the theoretical results with experiments.
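
Peelability, one of the three hypergraph properties named in this abstract, can be illustrated with a short sketch of the standard greedy peeling procedure: repeatedly pick a vertex that lies in exactly one remaining hyperedge and remove that hyperedge. This is a generic textbook formulation, not code from the thesis; the function name `peel` and the 0-based vertex numbering (the abstract uses 1,…,n) are assumptions made for illustration.

```python
from collections import defaultdict


def peel(n, edges):
    """Return a peeling order [(edge_index, free_vertex), ...] or None.

    n      -- number of memory locations (vertices 0 .. n-1)
    edges  -- list of vertex collections, edges[i] playing the role of e(x_i)
    """
    incident = defaultdict(set)          # vertex -> indices of remaining edges
    for i, e in enumerate(edges):
        for v in set(e):
            incident[v].add(i)

    # Vertices that currently belong to exactly one remaining edge.
    stack = [v for v in range(n) if len(incident[v]) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:        # degree changed since v was pushed
            continue
        (i,) = incident[v]
        order.append((i, v))             # edge i can claim location v
        for u in set(edges[i]):          # remove edge i from the hypergraph
            incident[u].discard(i)
            if len(incident[u]) == 1:
                stack.append(u)
    # The hypergraph is peelable only if every edge was removed.
    return order if len(order) == len(edges) else None
```

If peeling succeeds, each key is paired with a location that no other key claims, so the order also yields a valid placement for a hash table with one key per location. A path such as `peel(4, [(0, 1), (1, 2), (2, 3)])` is peelable, whereas the triangle `peel(3, [(0, 1), (1, 2), (0, 2)])` returns `None` because no vertex has degree 1.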

    New data structures and algorithms for the efficient management of large spatial datasets

    In this thesis we study the efficient representation of multidimensional grids, presenting new compact data structures to store and query grids in different application domains. We propose several static and dynamic data structures for the representation of binary grids and grids of integers, and study applications to the representation of raster data in Geographic Information Systems, RDF databases, etc. We first propose a collection of static data structures for the representation of binary grids and grids of integers: 1) a new representation of two-dimensional binary grids with large clusters of uniform values, with applications to the representation of binary raster data; 2) a new data structure to represent multidimensional binary grids; 3) a new data structure to represent grids of integers with support for top-k range queries. We also propose a new dynamic representation of binary grids, a data structure that provides the same functionality as our static representations of binary grids but also supports changes in the grid. Our data structures can be used in several application domains. We propose specific variants and combinations of our generic proposals to represent temporal graphs, RDF databases, OLAP databases, binary or general raster data, and temporal raster data. We also propose a new algorithm to jointly query a raster dataset (stored using our representations) and a vectorial dataset stored in a classic data structure, showing that our proposal can be faster and require less space than the usual alternatives. Our representations provide interesting trade-offs and are competitive in terms of space and query times with the usual representations in the different domains.
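
To illustrate why binary grids with large clusters of uniform values compress well, here is a sketch of a plain region quadtree that stores each uniform block as a single leaf. It is a generic textbook structure chosen for illustration and is not one of the compact data structures proposed in the thesis; the function names and the nested-tuple layout are assumptions.

```python
def build(grid, row=0, col=0, size=None):
    """Build a region quadtree over a square binary grid (side a power of two).

    A uniform block becomes a single leaf (0 or 1); a mixed block becomes a
    tuple of four children in the order NW, NE, SW, SE.
    """
    if size is None:
        size = len(grid)
    first = grid[row][col]
    if all(grid[r][c] == first
           for r in range(row, row + size)
           for c in range(col, col + size)):
        return first
    half = size // 2
    return (build(grid, row, col, half),
            build(grid, row, col + half, half),
            build(grid, row + half, col, half),
            build(grid, row + half, col + half, half))


def get(node, size, r, c):
    """Return the cell value at (r, c) by descending from the root."""
    while isinstance(node, tuple):
        size //= 2
        node = node[2 * (r >= size) + (c >= size)]   # pick the quadrant
        r %= size
        c %= size
    return node


def count_nodes(node):
    """Number of tree nodes, a rough proxy for the space used."""
    if not isinstance(node, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in node)
```

A grid that is entirely 0 collapses to a single leaf regardless of its size, while a grid with few uniform regions approaches one leaf per cell; the space used therefore tracks the amount of clustering rather than the grid dimensions.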

    Understanding and advancing PDE-based image compression

    This thesis is dedicated to image compression with partial differential equations (PDEs). PDE-based codecs store only a small number of image points and propagate their information into the unknown image areas during the decompression step. For certain classes of images, PDE-based compression can already outperform the current quasi-standard, JPEG2000. However, the reasons for this success are not yet fully understood, and PDE-based compression is still at a proof-of-concept stage. With a probabilistic justification for anisotropic diffusion, we contribute to a deeper insight into design principles for PDE-based codecs. Moreover, by analysing the interaction between efficient storage methods and image reconstruction with diffusion, we can rank PDEs according to their practical value in compression. Based on these observations, we advance PDE-based compression towards practical viability: First, we present a new hybrid codec that combines PDE-based and patch-based interpolation to deal with highly textured images. Furthermore, a new video player demonstrates the real-time capabilities of PDE-based image interpolation, and a new region-of-interest coding algorithm represents important image areas with high accuracy. Finally, we propose a new framework for diffusion-based image colourisation that we use to build an efficient codec for colour images. Experiments on real-world image databases show that our new method is competitive in quality with current state-of-the-art codecs.
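
The decompression step described in this abstract, where stored pixels propagate their information into unknown areas, can be sketched with the simplest possible PDE: homogeneous (linear) diffusion. This is a deliberately simplified stand-in, since the thesis itself centres on anisotropic diffusion; the function name `inpaint`, the dictionary input format, and the fixed iteration count are assumptions made for illustration.

```python
def inpaint(known, height, width, iterations=2000):
    """Reconstruct an image from sparse pixels by homogeneous diffusion.

    known -- dict mapping (row, col) to the stored grey value
    Unknown pixels are repeatedly replaced by the mean of their neighbours
    (Gauss-Seidel sweeps for the Laplace equation); stored pixels stay fixed.
    """
    if not known:
        raise ValueError("at least one stored pixel is required")
    mean = sum(known.values()) / len(known)
    # Start from the stored values, with the mean as an initial guess elsewhere.
    img = [[known.get((r, c), mean) for c in range(width)]
           for r in range(height)]
    for _ in range(iterations):
        for r in range(height):
            for c in range(width):
                if (r, c) in known:
                    continue                     # stored pixels are never changed
                nbrs = [img[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c),
                                       (r, c - 1), (r, c + 1))
                        if 0 <= rr < height and 0 <= cc < width]
                if nbrs:
                    # Averaging only existing neighbours acts like a
                    # reflecting image boundary.
                    img[r][c] = sum(nbrs) / len(nbrs)
    return img
```

On a 1 × 5 strip with stored values 0 and 100 at the two ends, the reconstruction converges to a linear ramp, which is the steady state of homogeneous diffusion between two fixed points.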