11,344 research outputs found

    New Algorithms and Lower Bounds for Sequential-Access Data Compression

    This thesis concerns sequential-access data compression, i.e., compression by algorithms that read the input one or more times from beginning to end. In one chapter we consider adaptive prefix coding, for which we must read the input character by character, outputting each character's self-delimiting codeword before reading the next one. We show how to encode and decode each character in constant worst-case time while producing an encoding whose length is worst-case optimal. In another chapter we consider one-pass compression with memory bounded in terms of the alphabet size and context length, and prove a nearly tight tradeoff between the amount of memory we can use and the quality of the compression we can achieve. In a third chapter we consider compression in the read/write streams model, which allows us passes and memory both polylogarithmic in the size of the input. We first show how to achieve universal compression using only one pass over one stream. We then show that one stream is not sufficient for achieving good grammar-based compression. Finally, we show that two streams are necessary and sufficient for achieving entropy-only bounds. Comment: draft of PhD thesis

    Record-Linkage from a Technical Point of View

    Record linkage is used for preparing sampling frames, deduplicating lists, and combining information on the same object from two different databases. If the same objects in two different databases have error-free unique common identifiers such as personal identification numbers (PID), record linkage is a simple file merge operation. If the identifiers contain errors, record linkage is a challenging task. In many applications, the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues and future research topics are discussed. Keywords: Record-Linkage, Data-mining, Privacy preserving protocols
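The contrast drawn in this abstract, between a plain merge on error-free identifiers and the harder case of erroneous ones, can be illustrated with a minimal Python sketch. The record layout, the field names `pid` and `name`, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the paper does not prescribe them.

```python
from difflib import SequenceMatcher


def exact_link(left, right, key="pid"):
    """Link records that share an identical identifier (a plain merge)."""
    index = {rec[key]: rec for rec in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


def fuzzy_link(left, right, field="name", threshold=0.85):
    """Link each left record to its most similar right record, if similar enough.

    This is a brute-force comparison of every pair, so it only suits small files.
    """
    pairs = []
    for rec in left:
        best, best_score = None, 0.0
        for cand in right:
            score = SequenceMatcher(None, rec[field], cand[field]).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            pairs.append((rec, best, best_score))
    return pairs


# Hypothetical toy data: a small survey file and a small administrative file.
survey = [{"pid": "A1", "name": "Jon Smith"}, {"pid": "B2", "name": "Maria Gomez"}]
admin = [{"pid": "A1", "name": "John Smith"}, {"pid": "C3", "name": "Maria Gomes"}]

exact_pairs = exact_link(survey, admin)
fuzzy_pairs = fuzzy_link(survey, admin)
```

With these toy files, the exact merge links only the record whose `pid` agrees in both files, while the name-based comparison also pairs the two "Maria" records whose identifiers differ. The all-pairs loop would not scale to the file sizes the abstract mentions (thousands against millions of records).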

    Finding approximate palindromes in strings

    We introduce a novel definition of approximate palindromes in strings, and provide an algorithm to find all maximal approximate palindromes in a string with up to $k$ errors. Our definition is based on the usual edit operations of approximate pattern matching, and the algorithm we give, for a string of size $n$ on a fixed alphabet, runs in $O(k^2 n)$ time. We also discuss two implementation-related improvements to the algorithm, and demonstrate their efficacy in practice by means of both experiments and an average-case analysis.
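To make the idea of a palindrome "with up to $k$ errors" concrete, here is a deliberately simplified Python sketch. It is not the paper's algorithm: it counts substitutions only (mismatched symmetric pairs) rather than the full set of edit operations, and it expands around every center in quadratic worst-case time rather than $O(k^2 n)$. The function name is hypothetical.

```python
def longest_palindrome_with_mismatches(s, k):
    """Return (start, end) of the longest substring s[start:end] that reads as a
    palindrome after at most k substitutions. Simplified, substitution-only model."""
    n = len(s)
    best = (0, 0)
    for center in range(2 * n - 1):
        # Even center index: a single character; odd: the gap between two characters.
        left = center // 2
        right = left + center % 2
        errors = 0
        while left >= 0 and right < n:
            if s[left] != s[right]:
                if errors == k:
                    break
                errors += 1
            left -= 1
            right += 1
        # The accepted window is s[left + 1 : right].
        if right - (left + 1) > best[1] - best[0]:
            best = (left + 1, right)
    return best


start, end = longest_palindrome_with_mismatches("xabcdaz", 1)
```

In the example, `"abcda"` is accepted with one error because the pair `b`/`d` is the only mismatch; extending further would need a second error for `x`/`z`.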

    Graph isomorphism and genotypical houses

    This paper will introduce a new method, known as small graph matching, and demonstrate how it may be used to determine the genotype signature of a sample of buildings. First, the origins of the method and its relationship to other "similarity" testing techniques will be discussed. Then the range of possible actions and transformations will be established through the creation of a set of rules. Next, in order to fully explain this method, a technique of normalizing the similarity measure is presented in order to permit the comparison of graphs of differing magnitude. The last stage of this method is presented, this being the comparison of all possible graph-pairs within a given sample and the mean-distance calculated for all individual graphs. This results in the identification of a genotype signature. Finally, this paper presents an empirical application of this method and shows how effective it is, not only for the identification of a building genotype, but also for assessing the homogeneity of a sample or sub-samples.

    A Survey of Software-based String Matching Algorithms for Forensic Analysis

    Employing a fast string matching algorithm is essential for minimizing the overhead of extracting structured files from a raw disk image. In this paper, we summarize the concept, implementation, and main features of ten software-based string matching algorithms, and evaluate their applicability for forensic analysis. We compare the selected algorithms from the perspective of forensic analysis by evaluating their performance for file carving. According to the experimental results, the Shift-Or algorithm (R. Baeza-Yates & Gonnet, 1992) and the Karp-Rabin algorithm (Karp & Rabin, 1987) have the lowest search times for identifying the locations of specified headers and footers in the target disk. Keywords: string matching algorithm, forensic analysis, file carving, Scalpel, data recovery
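As a rough illustration of how one of the named algorithms locates headers and footers, the following is a minimal Python sketch of the textbook Shift-Or (bit-parallel) search over raw bytes. It is not the implementation evaluated in the survey; the function name and the toy "disk image" are assumptions, and the JPEG start and end markers are used only as a familiar example of a header/footer pair.

```python
def shift_or_search(text: bytes, pattern: bytes) -> list[int]:
    """Return the start offsets of every occurrence of pattern in text (Shift-Or)."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[b] has bit i cleared exactly when pattern[i] == b.
    masks = [all_ones] * 256
    for i, byte in enumerate(pattern):
        masks[byte] &= ~(1 << i)
    state = all_ones  # bit i is 0 when pattern[: i + 1] matches the text ending here
    hits = []
    for pos, byte in enumerate(text):
        state = ((state << 1) | masks[byte]) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits


# Hypothetical toy "disk image" containing JPEG-style markers.
image = b"\x00\x00\xff\xd8\xff\xe0JFIF\x00junk\xff\xd9\x00\xff\xd8\xff\xe1"
header_offsets = shift_or_search(image, b"\xff\xd8\xff")
footer_offsets = shift_or_search(image, b"\xff\xd9")
```

Each input byte costs one shift, one OR, and one mask, which is why the method suits short signatures such as file headers. In C the state would normally fit a single machine word; Python's unbounded integers need the explicit `& all_ones` mask.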