
    Distinct Sector Hashes for Target File Detection

    Using an alternative approach to traditional file hashing, digital forensic investigators can sample subject drives on sector boundaries, hash the individual sectors, and check these hashes against a prebuilt database, making it possible to process raw media without reference to the underlying file system.
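
    A rough sketch of this workflow might look as follows; the 512-byte sector size, the use of SHA-256, and all names here are assumptions made for illustration, not details taken from the paper.

        import hashlib

        SECTOR_SIZE = 512  # assumed sector size; real deployments may use 512 or 4096 bytes

        def sector_hashes(image_path, sector_size=SECTOR_SIZE):
            # Yield (offset, hash) for every sector of a raw drive image,
            # without consulting any file system metadata.
            with open(image_path, "rb") as f:
                offset = 0
                while True:
                    sector = f.read(sector_size)
                    if not sector:
                        break
                    yield offset, hashlib.sha256(sector).hexdigest()
                    offset += sector_size

        def find_target_sectors(image_path, known_hashes):
            # Report offsets whose sector hash appears in a prebuilt database
            # of hashes computed from sectors of target files.
            return [off for off, h in sector_hashes(image_path) if h in known_hashes]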

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that this optimization comes almost for free: we introduce an optimal partitioning algorithm whose linear-time complexity leaves indexing time unaffected, and we show, through an extensive experimental analysis and comparison with several other state-of-the-art encoders, that the query processing speed of Variable-Byte is preserved.
    Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
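
    For readers unfamiliar with the baseline encoder, a minimal Variable-Byte sketch over gap-encoded postings is shown below. It illustrates only the plain encoding, not the partitioned representation or the optimal partitioning algorithm described in the paper.

        def vbyte_encode(values):
            # Encode an increasing integer sequence as gaps; each gap is split
            # into 7-bit groups, and the high bit marks the last byte of a gap.
            out = bytearray()
            prev = 0
            for v in values:
                gap = v - prev
                prev = v
                while gap >= 128:
                    out.append(gap & 0x7F)
                    gap >>= 7
                out.append(gap | 0x80)  # stop bit set on the final byte
            return bytes(out)

        def vbyte_decode(data):
            values, current, shift, prev = [], 0, 0, 0
            for b in data:
                if b < 128:
                    current |= b << shift
                    shift += 7
                else:
                    current |= (b & 0x7F) << shift
                    prev += current
                    values.append(prev)
                    current, shift = 0, 0
            return values

        assert vbyte_decode(vbyte_encode([3, 7, 300, 100000])) == [3, 7, 300, 100000]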

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies, and we compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
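
    As a tiny illustration of the hashing motif as it appears in profiling and assembly, the sketch below counts k-mers with a hash table; the choice of k and the sequential Counter are assumptions made for the example, not details from the paper.

        from collections import Counter

        def kmer_counts(reads, k=21):
            # Count k-mers across a collection of reads using a hash table.
            counts = Counter()
            for read in reads:
                for i in range(len(read) - k + 1):
                    counts[read[i:i + k]] += 1
            return counts

        # In a distributed setting this motif becomes irregular communication:
        # each k-mer is hashed to an owner process, and updates to the shared
        # count table arrive asynchronously.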

    Caching, crashing & concurrency - verification under adverse conditions

    The formal development of large-scale software systems is a complex and time-consuming effort. Generally, its main goal is to prove the functional correctness of the resulting system. This goal becomes significantly harder to reach when the verification must be performed under adverse conditions. When aiming for a realistic system, the implementation must be compatible with the “real world”: it must work with existing system interfaces, cope with uncontrollable events such as power cuts, and offer competitive performance by using mechanisms like caching or concurrency. The Flashix project is an example of such a development, in which a fully verified file system for flash memory has been developed. The project is a long-term team effort and resulted in a sequential, functionally correct and crash-safe implementation after its first project phase. This thesis continues the work by performing modular extensions to the file system with performance-oriented mechanisms that mainly involve caching and concurrency, always considering crash-safety. As a first contribution, this thesis presents a modular verification methodology for destructive heap algorithms. The approach simplifies the verification by separating reasoning about specifics of heap implementations, like pointer aliasing, from the reasoning about conceptual correctness arguments. The second contribution of this thesis is a novel correctness criterion for crash-safe, cached, and concurrent file systems. A natural criterion for crash-safety is defined in terms of system histories, matching the behavior of fine-grained caches using complex synchronization mechanisms that reorder operations. The third contribution comprises methods for verifying functional correctness and crash-safety of caching mechanisms and concurrency in file systems. A reference implementation for crash-safe caches of high-level data structures is given, and a strategy for proving crash-safety is demonstrated and applied. A compatible concurrent implementation of the top layer of file systems is presented, using a mechanism for the efficient management of fine-grained file locking, and a concurrent version of garbage collection is realized. Both concurrency extensions are proven to be correct by applying atomicity refinement, a methodology for proving linearizability. Finally, this thesis contributes a new iteration of executable code for the Flashix file system. With the efficiency extensions introduced in this thesis, Flashix covers all performance-oriented concepts of realistic file system implementations and achieves competitiveness with state-of-the-art flash file systems.
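
    As a purely illustrative sketch of the write-back caching behavior that such a crash-safety criterion has to account for (this is not Flashix code; the dict-backed store and the sync() interface are assumptions), consider a toy cache whose persisted state after a crash always corresponds to a prefix of the issued writes:

        class WriteBackCache:
            # Toy write-back cache over a persistent store (here just a dict).
            # Writes are buffered and reach the store only on sync(); a crash
            # before sync() loses buffered writes, but the store still reflects
            # the state produced by a prefix of the issued operations.

            def __init__(self, store):
                self.store = store    # persistent state
                self.pending = []     # ordered, not-yet-persisted writes

            def write(self, key, value):
                self.pending.append((key, value))

            def read(self, key):
                # Serve the newest buffered value first, fall back to the store.
                for k, v in reversed(self.pending):
                    if k == key:
                        return v
                return self.store.get(key)

            def sync(self):
                # Persist buffered writes in issue order, then clear the buffer.
                for k, v in self.pending:
                    self.store[k] = v
                self.pending.clear()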

    Timing Sensitive Dependency Analysis and its Application to Software Security

    I present new methods for the static analysis of timing-sensitive information flow control in software systems. I apply these methods to the analysis of concurrent Java programs and to the analysis of timing side channels in implementations of cryptographic primitives. Information flow control aims to restrict the flow of information (e.g., between the different external interfaces of a software component) according to explicit policies; such methods can therefore be used to enforce both confidentiality and integrity. The goal of sound static program analyses in this setting is to prove that all executions of a given program comply with the corresponding policies. Such a proof requires a security criterion that formalizes under which conditions this is the case. Every formal security criterion implicitly corresponds to a program and attacker model. The simplest noninterference criteria, for example, describe only non-interactive programs, i.e., programs that accept inputs and produce outputs only at the beginning and at the end of an execution. In the corresponding attacker model, the attacker knows the program but only observes or provides certain (public) inputs and outputs. A program is noninterferent if, from these observations, the attacker cannot draw any conclusions about secret inputs and outputs of terminating executions. From non-terminating executions, however, the attacker is allowed to infer secret inputs in this model. Side channels arise when an attacker can draw conclusions about confidential information from observations of a real system that would be impossible in the formal model. Typical side channels, i.e., ones left unmodeled by many formal security criteria, include, besides non-termination, the power consumption and the execution time of programs. If the execution time depends on secret inputs, an attacker can infer the input (e.g., the value of individual secret parameters) from the observed execution time. In this dissertation I present new dependency analyses that also account for non-termination and timing channels. With regard to non-termination channels, I present new methods for computing program dependencies. To this end, I develop a unifying framework in which both nontermination-sensitive and nontermination-insensitive dependencies result from mutually dual notions of postdominance. For timing channels, I develop new notions of dependency and corresponding methods for computing them. In two applications I substantiate the thesis that timing-sensitive dependencies enable sound static information flow analysis that takes timing channels into account. Based on timing-sensitive dependencies, I design new analyses for concurrent programs. There, timing-sensitive dependencies are relevant even for timing-insensitive attacker models, since internal timing channels between different threads can be externally observable. My implementation for concurrent Java programs builds on the program analysis system JOANA.
    I also present new analyses for timing channels caused by micro-architectural dependencies. As a case study, I examine implementations of AES256 block encryption. For some implementations, data caches cause the execution time to depend on the key and the ciphertext, making both inferable from the execution time. For other implementations, my automatic static analysis proves, assuming a simple concrete cache micro-architecture, the absence of such channels.
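
    A generic example of the kind of execution-time channel such an analysis targets (this example is not taken from the dissertation): an early-exit comparison whose running time depends on a secret, next to a constant-time variant.

        import hmac

        def leaky_equals(secret: bytes, guess: bytes) -> bool:
            # The loop exits at the first mismatching byte, so the running time
            # reveals how long a prefix of the guess matches the secret.
            if len(secret) != len(guess):
                return False
            for s, g in zip(secret, guess):
                if s != g:
                    return False
            return True

        def constant_time_equals(secret: bytes, guess: bytes) -> bool:
            # hmac.compare_digest inspects every byte regardless of where the
            # first mismatch occurs, removing the timing dependency on the secret.
            return hmac.compare_digest(secret, guess)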

    Distributed pattern mining and data publication in life sciences using big data technologies


    The application of the in-tree knapsack problem to routing prefix caches

    Modern routers use specialized hardware, such as Ternary Content Addressable Memory (TCAM), to solve the Longest Prefix Matching Problem (LPMP) quickly. Because TCAM is a non-standard and inherently parallel type of memory, there are concerns about its cost and power consumption. This problem is exacerbated by the growth in routing tables, which demands ever larger TCAMs. To reduce the size of the TCAMs in a distributed forwarding environment, a batch caching model is proposed and analyzed. In this model, the problem of determining which routing prefixes to store in the TCAMs reduces to the In-tree Knapsack Problem (ITKP) for unit-weight vertices. Several algorithms are analyzed for solving the ITKP, both in the general case and when the problem is restricted to unit-weight vertices. Additionally, a variant problem is proposed and analyzed, which exploits the caching model to provide better solutions. This thesis concludes with a discussion of open problems and future experimental work.
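
    For context, the matching problem that the TCAM accelerates can be sketched with a minimal binary trie; this is a generic longest-prefix-match illustration and does not touch the thesis's caching model or the in-tree knapsack formulation.

        class PrefixTrieNode:
            __slots__ = ("children", "next_hop")
            def __init__(self):
                self.children = {}
                self.next_hop = None

        def insert(root, prefix_bits, next_hop):
            node = root
            for bit in prefix_bits:
                node = node.children.setdefault(bit, PrefixTrieNode())
            node.next_hop = next_hop

        def longest_prefix_match(root, address_bits):
            # Walk the trie along the address and remember the last node that
            # carried a next hop: that node is the longest matching prefix.
            node, best = root, root.next_hop
            for bit in address_bits:
                node = node.children.get(bit)
                if node is None:
                    break
                if node.next_hop is not None:
                    best = node.next_hop
            return best

        root = PrefixTrieNode()
        insert(root, "10", "A")      # prefix 10/2   -> next hop A
        insert(root, "1011", "B")    # prefix 1011/4 -> next hop B
        assert longest_prefix_match(root, "10110000") == "B"
        assert longest_prefix_match(root, "10010000") == "A"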

    A Plain-text Compression Technique with Fast Lookup Ability

    Data compression has always been an essential aspect of computing, and with the increasing popularity of remote and cloud-based computation it is becoming more important. Reducing the size of a data object in this context reduces both the transfer time and the amount of data transferred. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression, and lookup speeds. Traditional compression techniques achieve high compression ratios, but they require decompression before a lookup can be performed, which increases the lookup time. In this thesis, we propose a compression technique for plain-text data objects that uses variable-length encoding to compress data. The dictionary of possible words is sorted by the statistical frequency of word use, and the words are encoded with variable-length code-words; words that are not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without the need to decompress the compressed data object. Our approach also facilitates string operations (such as concatenation, insertion, deletion, and search-and-replace) on compressed text without the need for decompression. We implement our technique in C++ and compare our approach with industry-standard tools like gzip and bzip2 in terms of compression ratio, lookup speed, search-and-replace time, and peak memory usage. Our compression scheme is about 81x faster than gzip and about 165x faster than bzip2 when data is searched and the result is restored into the compressed format. In conclusion, our approach facilitates string operations like concatenation, insertion, deletion, and search-and-replace on the compressed file itself, without the need for decompression.
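
    The core lookup idea, encoding the query once and scanning for its codeword directly in the compressed representation, can be sketched as follows. This is a simplified illustration rather than the thesis's actual format: out-of-dictionary words and the byte-level variable-length serialization of code-words are omitted.

        from collections import Counter

        def build_dictionary(text):
            # Words sorted by descending frequency get the smallest codes, so the
            # most frequent words would receive the shortest code-words.
            freq = Counter(text.split())
            return {w: i for i, (w, _) in enumerate(freq.most_common())}

        def compress(text, dictionary):
            # Here a "codeword" is simply the word's rank; a real encoder would
            # serialize these ranks with a variable-length byte code.
            return [dictionary[w] for w in text.split()]

        def count_occurrences(compressed, dictionary, word):
            # Lookup on the compressed form: encode the query once and scan for
            # its codeword, with no decompression of the text.
            code = dictionary.get(word)
            return 0 if code is None else compressed.count(code)

        text = "to be or not to be"
        d = build_dictionary(text)
        assert count_occurrences(compress(text, d), d, "be") == 2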

    Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

    This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.
    Comment: 396 pages, dissertation, Karlsruher Institut für Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by another author
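
    As a small point of reference for the suffix sorting problem itself, a textbook prefix-doubling construction is sketched below; it is a simple O(n log^2 n) baseline and not one of the scalable algorithms developed in the dissertation.

        def suffix_array(s):
            # Sort suffixes by their first 2^k characters, doubling k each round.
            n = len(s)
            sa = list(range(n))
            rank = [ord(c) for c in s]
            k = 1
            while True:
                key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
                sa.sort(key=key)
                new_rank = [0] * n
                for a, b in zip(sa, sa[1:]):
                    new_rank[b] = new_rank[a] + (key(a) < key(b))
                rank = new_rank
                if n == 0 or rank[sa[-1]] == n - 1:  # all ranks distinct: done
                    break
                k *= 2
            return sa

        assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]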