Distinct Sector Hashes for Target File Detection
Using an alternative approach to traditional file hashing, digital forensic investigators can hash individually sampled subject drives on sector boundaries and then check these hashes against a prebuilt database, making it possible to process raw media without reference to the underlying file system.
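A minimal sketch of this workflow, assuming a 512-byte sector size and a plain set of SHA-1 digests standing in for the prebuilt database (file names are placeholders, not the authors' tooling):

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector size; the technique works on any fixed boundary

def sector_hashes(image_path, sector_size=SECTOR_SIZE):
    """Yield (offset, hash) for every full sector of a raw drive image."""
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            sector = img.read(sector_size)
            if len(sector) < sector_size:
                break
            yield offset, hashlib.sha1(sector).hexdigest()
            offset += sector_size

def scan_for_targets(image_path, known_sector_hashes):
    """Report sectors whose hash appears in the database of target-file sector hashes."""
    return [(off, h) for off, h in sector_hashes(image_path)
            if h in known_sector_hashes]

# Hypothetical usage (paths and database file are placeholders):
# known = {line.strip() for line in open("target_sector_hashes.txt")}
# hits = scan_for_targets("subject_drive.raw", known)
```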
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free, given that: (1) we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; (2) we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
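For orientation, here is a minimal sketch of the plain Variable-Byte baseline (delta gaps, 7 payload bits per byte, stop bit on the last byte); the partitioned representation and the optimal partitioning algorithm from the paper are not shown:

```python
def vbyte_encode(numbers):
    """Delta-encode a sorted sequence, then write each gap as Variable-Byte:
    7 payload bits per byte, high bit set on the last byte of each value."""
    out = bytearray()
    prev = 0
    for n in numbers:
        gap = n - prev
        prev = n
        while gap >= 128:
            out.append(gap & 0x7F)       # continuation byte, high bit clear
            gap >>= 7
        out.append(gap | 0x80)           # terminating byte, high bit set
    return bytes(out)

def vbyte_decode(data):
    numbers, value, shift, prev = [], 0, 0, 0
    for b in data:
        if b & 0x80:                     # last byte of this value
            prev += value | ((b & 0x7F) << shift)
            numbers.append(prev)
            value, shift = 0, 0
        else:
            value |= b << shift
            shift += 7
    return numbers

assert vbyte_decode(vbyte_encode([3, 7, 300, 100000])) == [3, 7, 300, 100000]
```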
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
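As a concrete instance of the hashing motif (an illustrative example, not code from the paper), k-mer counting turns into exactly the kind of irregular, asynchronous update pattern described above once the hash table is distributed across nodes:

```python
from collections import defaultdict

def count_kmers(reads, k=21):
    """Count k-mers across sequencing reads with a hash table.
    In a distributed setting each increment becomes a remote, asynchronous
    update to whichever node owns hash(kmer) -- an irregular communication
    pattern rather than a regular, simulation-style exchange."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy usage with a made-up read:
# print(count_kmers(["ACGTACGTACGTACGTACGTACGT"], k=21))
```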
Caching, crashing & concurrency - verification under adverse conditions
The formal development of large-scale software systems is a complex and time-consuming effort. Generally, its main goal is to prove the functional correctness of the resulting system. This goal becomes significantly harder to reach when the verification must be performed under adverse conditions. When aiming for a realistic system, the implementation must be compatible with the “real world”: it must work with existing system interfaces, cope with uncontrollable events such as power cuts, and offer competitive performance by using mechanisms like caching or concurrency.
The Flashix project is an example of such a development, in which a fully verified file system for flash memory has been developed. The project is a long-term team effort and resulted in a sequential, functionally correct and crash-safe implementation after its first project phase. This thesis continues the work by performing modular extensions to the file system with performance-oriented mechanisms that mainly involve caching and concurrency, always considering crash-safety.
As a first contribution, this thesis presents a modular verification methodology for destructive heap algorithms. The approach simplifies the verification by separating reasoning about specifics of heap implementations, like pointer aliasing, from the reasoning about conceptual correctness arguments.
The second contribution of this thesis is a novel correctness criterion for crash-safe, cached, and concurrent file systems. A natural criterion for crash-safety is defined in terms of system histories, matching the behavior of fine-grained caches using complex synchronization mechanisms that reorder operations.
The third contribution comprises methods for verifying functional correctness and crash-safety of caching mechanisms and concurrency in file systems. A reference implementation for crash-safe caches of high-level data structures is given, and a strategy for proving crash-safety is demonstrated and applied. A compatible concurrent implementation of the top layer of file systems is presented, using a mechanism for the efficient management of fine-grained file locking, and a concurrent version of garbage collection is realized. Both concurrency extensions are proven to be correct by applying atomicity refinement, a methodology for proving linearizability.
Finally, this thesis contributes a new iteration of executable code for the Flashix file system. With the efficiency extensions introduced in this thesis, Flashix covers all performance-oriented concepts of realistic file system implementations and achieves competitiveness with state-of-the-art flash file systems.
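Purely to illustrate the kind of mechanism whose correctness and crash-safety are at stake (a toy sketch, not the verified Flashix code), a write-back cache makes explicit which buffered updates survive a power cut only after an ordered flush:

```python
class WriteBackCache:
    """Toy write-back cache over a dict-backed persistent store.
    Crash-safety arguments hinge on which buffered writes have already
    been flushed when a power cut occurs."""

    def __init__(self, store):
        self.store = store          # persistent state (survives a crash)
        self.dirty = {}             # volatile state (lost on a crash)

    def write(self, key, value):
        self.dirty[key] = value     # buffered only; not yet persistent

    def read(self, key):
        return self.dirty.get(key, self.store.get(key))

    def flush(self):
        # After flush() returns, the buffered updates belong to the
        # crash-safe state; before that, a crash silently drops them.
        for key, value in self.dirty.items():
            self.store[key] = value
        self.dirty.clear()

    def crash(self):
        self.dirty.clear()          # simulate losing all volatile state
```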
Timing Sensitive Dependency Analysis and its Application to Software Security
I present new methods for the static analysis of timing-sensitive information flow control in software systems. I apply these methods to the analysis of concurrent Java programs and to the analysis of timing side channels in implementations of cryptographic primitives.
Information flow control methods aim to restrict the flow of information (e.g., between different external interfaces of a software component) according to explicit policies. Such methods can therefore be used to enforce both confidentiality and integrity. The goal of sound static program analyses in this setting is to prove that all executions of a given program obey the associated policies. Such a proof requires a security criterion that formalizes under which conditions this is the case.
Every formal security criterion implicitly corresponds to a program and attacker model. The simplest noninterference criteria, for example, describe only non-interactive programs, i.e., programs that accept inputs and produce outputs only at the beginning and end of their execution. In the corresponding attacker model, the attacker knows the program but only observes or provides certain (public) inputs and outputs. A program is noninterferent if the attacker cannot draw any conclusions from these observations about the secret inputs and outputs of terminating executions. For non-terminating executions, however, this model does allow the attacker to draw conclusions about secret inputs.
Side channels arise when an attacker can draw conclusions about confidential information from observations of real systems even though such conclusions are impossible in the formal model. Typical side channels (i.e., ones left unmodeled by many formal security criteria) include, besides non-termination, the energy consumption and the execution time of programs. If the execution time depends on secret inputs, an attacker can infer those inputs (e.g., the values of individual secret parameters) from the observed execution time.
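A textbook illustration of such a timing channel (an example added here, not taken from the dissertation) is a comparison that exits early at the first mismatching byte, so the measured running time reveals how long a prefix of the secret the attacker has already guessed:

```python
import hmac

def leaky_equals(secret: bytes, guess: bytes) -> bool:
    """Execution time grows with the length of the matching prefix,
    so an attacker measuring response times can recover the secret
    byte by byte."""
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False            # early exit: timing depends on the secret
    return True

def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """A comparison whose running time does not depend on the secret."""
    return hmac.compare_digest(secret, guess)
```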
In my dissertation, I present new dependency analyses that also account for non-termination and timing channels. With regard to non-termination channels, I introduce new methods for computing program dependencies. To this end, I develop a unifying framework in which both non-termination-sensitive and non-termination-insensitive dependencies arise from mutually dual notions of postdominance. For timing channels, I develop new notions of dependency together with methods for computing them. In two applications, I substantiate the thesis that timing-sensitive dependencies enable sound static information flow analysis in the presence of timing channels.
Building on timing-sensitive dependencies, I design new analyses for concurrent programs. There, timing-sensitive dependencies are relevant even for timing-insensitive attacker models, because internal timing channels between different threads may be externally observable. My implementation for concurrent Java programs is based on the program analysis system JOANA.
I also present new analyses for timing channels caused by microarchitectural dependencies. As a case study, I examine implementations of AES-256 block encryption. In some implementations, data caches cause the execution time to depend on the key and the ciphertext, so that both can be inferred from the execution time. For other implementations, my automatic static analysis (assuming a simple, concrete cache microarchitecture) proves the absence of such channels.
The application of the in-tree knapsack problem to routing prefix caches
Modern routers use specialized hardware, such as Ternary Content Addressable Memory (TCAM), to solve the Longest Prefix Matching Problem (LPMP) quickly. Because TCAM is a non-standard type of memory and inherently parallel, there are concerns about its cost and power consumption. This problem is exacerbated by the growth of routing tables, which demands ever larger TCAMs.
To reduce the size of the TCAMs in a distributed forwarding environment, a batch caching model is proposed and analyzed. The problem of determining which routing prefixes to store in the TCAMs reduces to the In-tree Knapsack Problem (ITKP) for unit weight vertices in this model.
Several algorithms are analyzed for solving the ITKP, both in the general case and when the problem is restricted to unit weight vertices. Additionally, a variant problem is proposed and analyzed, which exploits the caching model to provide better solutions. This thesis concludes with a discussion of open problems and future experimental work.
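For context, the following sketch (illustrative only, not an algorithm from this thesis) shows the lookup a TCAM answers in hardware, i.e., longest prefix matching over a routing table; the caching question studied here is which of these prefixes to keep in the small, fast memory:

```python
def longest_prefix_match(routing_table, address, bits=32):
    """routing_table maps (prefix_value, prefix_length) -> next hop.
    Returns the next hop of the longest prefix covering the address."""
    for length in range(bits, -1, -1):               # try longest prefixes first
        prefix = address >> (bits - length) if length else 0
        hop = routing_table.get((prefix, length))
        if hop is not None:
            return hop
    return None                                      # no default route

# Toy usage: 10.0.0.0/8 -> "A", 10.1.0.0/16 -> "B"
table = {(10, 8): "A", ((10 << 8) | 1, 16): "B"}
addr = (10 << 24) | (1 << 16) | (2 << 8) | 3         # 10.1.2.3
assert longest_prefix_match(table, addr) == "B"      # /16 beats /8
```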
A Plain-text Compression Technique with Fast Lookup Ability
Data compression has always been an essential aspect of computing. In recent times, with the increasing popularity of remote and cloud-based computation, compression is becoming more important. Reducing the size of a data object in this context would not only reduce the transfer time, but also the amount of data transferred. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression, and lookup speeds. Traditional compression techniques achieve high compression ratios, but require decompression before a lookup can be performed, which increases the lookup time. In this thesis, we propose a compression technique for plain-text data objects that uses variable-length encoding to compress data. The dictionary of possible words is sorted by the statistical frequency of word use, and the words are encoded using variable-length code-words. Words that are not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without the need to decompress the compressed data object. Our approach also facilitates string operations (such as concatenation, insertion, deletion, and search-and-replace) on compressed text without the need for decompression. We implement our technique in C++ and compare our approach with industry-standard tools like gzip and bzip2 in terms of compression ratio, lookup speed, search-and-replace time, and peak memory usage. Our compression scheme is about 81x faster than gzip and about 165x faster than bzip2 when the data is searched and restored in a compressed format. In conclusion, our approach facilitates string operations like concatenation, insertion, deletion, and search-and-replace on the compressed file itself without the need for decompression.
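The following is a minimal sketch of the general idea, assuming a frequency-ranked dictionary and searches performed directly on the encoded token stream; the thesis's actual variable-length code-words, its escape mechanism for out-of-dictionary words, and its C++ implementation are not reproduced here:

```python
from collections import Counter

def build_dictionary(text):
    """Rank words by frequency; more frequent words get smaller ids,
    which a variable-length code would turn into shorter code-words."""
    ranked = [w for w, _ in Counter(text.split()).most_common()]
    return {w: i for i, w in enumerate(ranked)}, ranked

def compress(text, word_to_id):
    """Encode the text as a sequence of dictionary ids (token stream)."""
    return [word_to_id[w] for w in text.split()]

def search(tokens, word_to_id, query):
    """Count occurrences of a word directly on the compressed token
    stream -- no decompression of the text is needed."""
    code = word_to_id.get(query)
    return sum(1 for t in tokens if t == code) if code is not None else 0

text = "the cat sat on the mat the cat"
word_to_id, id_to_word = build_dictionary(text)
tokens = compress(text, word_to_id)
assert search(tokens, word_to_id, "cat") == 2
assert " ".join(id_to_word[t] for t in tokens) == text   # lossless round trip
```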
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.
Comment: 396 pages, dissertation, Karlsruher Institut für Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by other authors
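For orientation, a textbook construction of the object at the heart of both suffix-sorting parts, the suffix array, using a simple comparison sort; the dissertation's induced-sorting and distributed external-memory algorithms are far more involved:

```python
def suffix_array(text):
    """Indices of all suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def string_sort(strings):
    """The other problem studied: sorting a set of strings."""
    return sorted(strings)

sa = suffix_array("banana")
assert sa == [5, 3, 1, 0, 4, 2]   # a, ana, anana, banana, na, nana
```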