141 research outputs found

    A practical index for approximate dictionary matching with few mismatches

    Approximate dictionary matching is a classic string matching problem (checking whether a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searches. We present a surprisingly simple solution called a split index, based on the Dirichlet (pigeonhole) principle, for matching a keyword with few mismatches, and we experimentally show that it offers competitive space-time tradeoffs. Our C++ implementation focuses mostly on data compaction, which benefits search speed (e.g., by being cache friendly). We compare our solution with other algorithms and show that it performs better for the Hamming distance. Query times on the order of 1 microsecond were reported for one mismatch on dictionaries of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting of q-gram substitution can significantly reduce the index size (up to 50% of the input text size for DNA) while keeping the query time relatively low.
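
    The split-index idea rests on the pigeonhole principle: with one allowed mismatch, a word split into two halves must match the query exactly in at least one half. Below is a minimal C++ sketch of that principle; names and structure are illustrative, not the paper's actual implementation, which additionally compacts its data for cache friendliness.

        #include <string>
        #include <unordered_map>
        #include <unordered_set>
        #include <vector>

        // Split index sketch for k = 1 mismatch under Hamming distance.
        // Each word is split into two halves; by the pigeonhole principle,
        // a match with <= 1 mismatch agrees exactly with at least one half.
        class SplitIndex {
            std::unordered_map<std::string, std::vector<size_t>> prefixMap_, suffixMap_;
            std::vector<std::string> words_;

            static bool withinOneMismatch(const std::string& a, const std::string& b) {
                if (a.size() != b.size()) return false;
                int diff = 0;
                for (size_t i = 0; i < a.size(); ++i)
                    if (a[i] != b[i] && ++diff > 1) return false;
                return true;
            }

        public:
            void add(const std::string& w) {
                const size_t id = words_.size();
                words_.push_back(w);
                const size_t mid = w.size() / 2;
                prefixMap_[w.substr(0, mid)].push_back(id);   // first half, exact
                suffixMap_[w.substr(mid)].push_back(id);      // second half, exact
            }

            std::vector<std::string> query(const std::string& q) const {
                std::vector<std::string> hits;
                std::unordered_set<size_t> seen;              // avoid double-reporting
                const size_t mid = q.size() / 2;
                auto verify = [&](const std::vector<size_t>& cand) {
                    for (size_t id : cand)
                        if (seen.insert(id).second && withinOneMismatch(words_[id], q))
                            hits.push_back(words_[id]);
                };
                if (auto it = prefixMap_.find(q.substr(0, mid)); it != prefixMap_.end())
                    verify(it->second);
                if (auto it = suffixMap_.find(q.substr(mid)); it != suffixMap_.end())
                    verify(it->second);
                return hits;
            }
        };

    For k mismatches the same argument applies with k + 1 pieces: at least one piece must match the query exactly, so exact lookups on the pieces yield a small candidate set to verify.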

    A Plain-text Compression Technique with Fast Lookup Ability

    Data compression has always been an essential aspect of computing, and with the increasing popularity of remote and cloud-based computation it is becoming more important still. Reducing the size of a data object in this context reduces not only the transfer time but also the amount of data transferred. The key figures of merit of a data compression scheme are its compression ratio and its compression, decompression, and lookup speeds. Traditional compression techniques achieve high compression ratios but require decompression before a lookup can be performed, which increases lookup time. In this thesis, we propose a compression technique for plain-text data objects that uses variable-length encoding to compress data. The dictionary of possible words is sorted by the statistical frequency of word use, and the words are encoded with variable-length code-words; words not in the dictionary are handled as well. The driving motivation of our technique is to perform significantly faster lookups without decompressing the compressed data object. Our approach also supports string operations (such as concatenation, insertion, deletion, and search-and-replace) on compressed text without decompression. We implement our technique in C++ and compare it with industry-standard tools such as gzip and bzip2 in terms of compression ratio, lookup speed, search-and-replace time, and peak memory usage. When data is searched and restored in compressed format, our scheme is about 81x faster than gzip and about 165x faster than bzip2. In conclusion, our approach enables these string operations on the compressed file itself, with no decompression step.
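
    The core mechanism can be illustrated with a short C++ sketch: dictionary words ranked by descending frequency receive variable-length code-words (here a hypothetical varint-style byte code, not necessarily the thesis's actual encoding), and a lookup only needs the query word's code, never a decompression pass. Escaping of out-of-dictionary words is omitted.

        #include <algorithm>
        #include <cstdint>
        #include <string>
        #include <unordered_map>
        #include <vector>

        // Illustrative variable-length code: ranks 0..127 (the most frequent
        // words) take one byte with the high bit clear; ranks up to 32767
        // take two bytes with the high bit of the first byte set.
        class WordCoder {
            std::unordered_map<std::string, uint32_t> rank_;
        public:
            explicit WordCoder(const std::vector<std::string>& byFrequency) {
                for (uint32_t r = 0; r < byFrequency.size(); ++r)
                    rank_[byFrequency[r]] = r;
            }
            std::vector<uint8_t> encode(const std::string& word) const {
                const uint32_t r = rank_.at(word);   // assumes word is in the dictionary
                if (r < 128) return { uint8_t(r) };  // short code for frequent words
                return { uint8_t(0x80 | (r >> 8)), uint8_t(r & 0xFF) };
            }
        };

        // Lookup on the compressed stream: search for the word's code directly.
        // A real implementation must also respect code-word boundaries to rule
        // out false matches on misaligned bytes.
        bool contains(const std::vector<uint8_t>& stream,
                      const std::vector<uint8_t>& code) {
            return std::search(stream.begin(), stream.end(),
                               code.begin(), code.end()) != stream.end();
        }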

    A New Approach in Expanding the Hash Size of MD5

    The enhanced MD5 algorithm expands the hash value from the original 128 bits to 1280 bits using XOR and AND operators. Findings revealed that the hash value of the modified algorithm was not cracked during experiments with powerful brute-force, dictionary, and rainbow-table cracking tools available online, such as CrackingStation, Hash Cracker, Cain and Abel, and Rainbow Crack, thus improving its security level compared to the original MD5. Furthermore, the proposed method outputs a 1280-bit hash value with only 10.9 ms of additional execution time over MD5.
    Keywords: MD5 algorithm, hashing, client-server communication, modified MD5, hacking, brute force, rainbow table
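
    The abstract does not spell out how the digest is expanded, so the following C++ sketch is purely hypothetical: it derives a 1280-bit value as ten chained 128-bit MD5 digests mixed with XOR and AND, matching the stated operators and output size but not necessarily the paper's construction. It uses OpenSSL's MD5 (link with -lcrypto).

        #include <openssl/md5.h>   // MD5() is deprecated in OpenSSL 3 but still available
        #include <array>
        #include <string>

        // Hypothetical illustration only: 10 x 128 bits = 1280 bits, mixing
        // each round's digest with the previous one via XOR and AND.
        std::array<unsigned char, 160> expandedMd5(const std::string& msg) {
            std::array<unsigned char, 160> out{};
            unsigned char prev[MD5_DIGEST_LENGTH] = {0};
            for (int round = 0; round < 10; ++round) {
                const std::string block = msg + char('0' + round);  // vary input per round
                unsigned char d[MD5_DIGEST_LENGTH];
                MD5(reinterpret_cast<const unsigned char*>(block.data()),
                    block.size(), d);
                for (int i = 0; i < MD5_DIGEST_LENGTH; ++i) {
                    out[round * MD5_DIGEST_LENGTH + i] =
                        d[i] ^ (prev[i] & d[(i + 1) % MD5_DIGEST_LENGTH]);  // XOR/AND mix
                    prev[i] = d[i];
                }
            }
            return out;
        }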

    x264 Video Encoding Frontend

    x264 is a free encoder for compressing video streams into the H.264/MPEG-4 AVC format. It has become the new standard for video encoding, providing higher quality and better compression than XviD. x264 provides a command line interface as well as an API and is used in popular applications such as HandBrake and FFmpeg. Advanced Audio Coding (AAC) is a very popular standard for lossy digital audio compression, providing higher sound quality than MP3 at similar bitrates. This senior project describes the design and implementation of an x264 video encoding frontend that uses these codecs to encode videos. The frontend provides a simple, easy-to-use graphical user interface. Subtitles are preserved across encodes, and the resulting encoded file is stored in a Matroska container.
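
    Stripped of its GUI, a frontend like this ultimately assembles an encoder command line and runs it. A minimal C++ sketch is shown below, shelling out to FFmpeg's libx264 and AAC encoders and writing a Matroska file; the options are illustrative assumptions, since the abstract does not state the project's actual pipeline.

        #include <cstdlib>
        #include <string>

        // Build and run an encode command: H.264 video via libx264, AAC audio,
        // subtitles copied through, Matroska (.mkv) output.
        int encodeToMatroska(const std::string& in, const std::string& out) {
            const std::string cmd =
                "ffmpeg -i \"" + in + "\""
                " -c:v libx264 -crf 23 -preset medium"  // quality-targeted x264 encode
                " -c:a aac -b:a 160k"                   // AAC audio track
                " -c:s copy"                            // preserve subtitle streams
                " \"" + out + "\"";                     // e.g. "output.mkv"
            return std::system(cmd.c_str());
        }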

    Text Files Compression using Combination of two Dictionary methods (Specific dictionary for specific language and LZ77 Approach)

    In this paper we suggest a combination of two dictionary methods: a dictionary specific to a given language, and LZ77. The dictionary is used to replace any word it contains with a two-byte index. A word that does not exist in the dictionary is written unchanged, preceded by four bits giving its length. A modification was made to this approach to reduce the file to its minimum size. Because each word is replaced by two bytes (substituted at every appearance of that word in the text), LZ77 can then be applied efficiently; a sketch of this substitution stage is given after this paragraph. Before the LZ77 pass, the file is specially arranged so that LZ77 can be used optimally to minimize the data. The approach is tested on real text files, and the tests verify its effectiveness.
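
    The C++ sketch below covers only the substitution stage; for simplicity it uses a one-byte length and a 0xFF escape for out-of-dictionary words (the paper specifies a four-bit length field, and its exact byte layout may differ), and the subsequent LZ77 pass is omitted.

        #include <cstdint>
        #include <sstream>
        #include <string>
        #include <unordered_map>
        #include <vector>

        // Replace each in-dictionary word with its two-byte index; emit
        // out-of-dictionary words verbatim behind an escape byte and a length.
        // Assumes indices below 0xFF00 (so the first byte never equals the
        // 0xFF escape) and words shorter than 256 characters.
        std::vector<uint8_t> substitute(
                const std::string& text,
                const std::unordered_map<std::string, uint16_t>& dict) {
            std::vector<uint8_t> out;
            std::istringstream in(text);
            std::string word;
            while (in >> word) {                           // whitespace-delimited words
                auto it = dict.find(word);
                if (it != dict.end()) {                    // known word -> two-byte index
                    out.push_back(uint8_t(it->second >> 8));
                    out.push_back(uint8_t(it->second & 0xFF));
                } else {                                   // unknown word -> escaped literal
                    out.push_back(0xFF);
                    out.push_back(uint8_t(word.size()));
                    out.insert(out.end(), word.begin(), word.end());
                }
            }
            return out;   // this byte stream is then fed to the LZ77 pass
        }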

    Progress Report: 1991–1994

    • …