
    Approximate Matching in ACSM Dissimilarity Measure

    Abstract: The paper introduces a new patch-based dissimilarity measure for image comparison that employs an approximation strategy. It extends the Average Common Sub-matrix (ACSM) measure, which computes the exact dissimilarity between images. In the exact method, the dissimilarity between two images is obtained from the average area of the largest square sub-matrices the two images have in common, where the extracted sub-matrices are matched exactly, pixel by pixel. As an extension, the proposed dissimilarity measure computes an approximate match between the sub-matrices, obtained by omitting a controlled number of pixels at a given column offset inside the sub-matrices. The proposed dissimilarity measure is extensively compared with other well-known approximate methods for image comparison from the state of the art. Experiments demonstrate the superiority of the proposed approximate measure in terms of execution time with respect to the exact method, and in terms of retrieval precision with respect to the other state-of-the-art methods.
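
    To make the idea concrete, the following is a minimal Python sketch of an ACSM-style dissimilarity as described in the abstract: for each position of the first image, the largest square sub-matrix also occurring exactly in the second image is found, and the average of those areas is turned into a dissimilarity. The function names, the normalization, and the brute-force search are illustrative assumptions, not the authors' implementation (which also includes the approximate, pixel-omitting variant).

        import numpy as np

        def largest_common_square(a, b, i, j):
            """Side of the largest square of `a` anchored at (i, j) that occurs exactly in `b`."""
            max_side = min(a.shape[0] - i, a.shape[1] - j)
            best = 0
            for side in range(1, max_side + 1):
                patch = a[i:i + side, j:j + side]
                found = any(np.array_equal(patch, b[r:r + side, c:c + side])
                            for r in range(b.shape[0] - side + 1)
                            for c in range(b.shape[1] - side + 1))
                if not found:
                    break
                best = side
            return best

        def acsm_dissimilarity(a, b):
            """Average common sub-matrix area, mapped to a dissimilarity in [0, 1]."""
            areas = [largest_common_square(a, b, i, j) ** 2
                     for i in range(a.shape[0]) for j in range(a.shape[1])]
            max_area = min(a.shape) ** 2              # largest possible square area
            return 1.0 - (np.mean(areas) / max_area)  # 0 = identical, 1 = nothing in common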

    Compressed Pattern Matching

    This Bachelor's thesis deals with pattern matching. Its main objective is to describe selected algorithms and data structures that are used in practice for pattern matching on both uncompressed and compressed data. An integral part of the thesis is the implementation of a selected data structure. Compression algorithms based on the Burrows-Wheeler transform are widely used today, and the FM-Index data structure, which we implement, depends on this transform. The implementation is written in the programming language C# and subjected to experiments focusing on the speed of pattern matching and on space requirements. The search speed is compared against classical algorithms, and the space requirements are compared across input data formats and across different configurations of the FM-Index. Finally, the results and findings from the experiments with the implemented data structure are presented.
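
    For context, the count query at the heart of an FM-Index is the Burrows-Wheeler backward search. The sketch below, in Python with a naive O(i) rank instead of the sampled occurrence tables a real FM-Index configuration would use, is a generic illustration of that search, not the C# implementation from the thesis.

        def bwt_from_suffix_array(text):
            """Burrows-Wheeler transform via a (naive) suffix array."""
            text += "\0"                                   # unique, smallest terminator
            sa = sorted(range(len(text)), key=lambda i: text[i:])
            return "".join(text[i - 1] for i in sa)

        def build_c_table(bwt):
            """C[ch] = number of characters in the text strictly smaller than ch."""
            counts = {}
            for ch in bwt:
                counts[ch] = counts.get(ch, 0) + 1
            c, total = {}, 0
            for ch in sorted(counts):
                c[ch] = total
                total += counts[ch]
            return c

        def occ(bwt, ch, i):
            """Occurrences of ch in bwt[:i] (naive rank; an FM-Index samples this)."""
            return bwt[:i].count(ch)

        def fm_count(bwt, c, pattern):
            """Number of occurrences of `pattern` in the original text (backward search)."""
            sp, ep = 0, len(bwt)                           # current suffix-array interval
            for ch in reversed(pattern):                   # extend the match backwards
                if ch not in c:
                    return 0
                sp = c[ch] + occ(bwt, ch, sp)
                ep = c[ch] + occ(bwt, ch, ep)
                if sp >= ep:
                    return 0
            return ep - sp

        bwt = bwt_from_suffix_array("abracadabra")
        print(fm_count(bwt, build_c_table(bwt), "abra"))   # -> 2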

    An Investigation of GeoBase Mission Data Set Design, Implementation, and Usage within Air Force Civil Engineer Electrical and Utilities Work Centers

    In 2001, the Office of the Civil Engineer, Installation and Logistics, Headquarters, United States Air Force (ILE) identified Civil Engineer Squadrons as the central point of contact for all base-level mapping requirements and activities. In order to update mapping methods and procedures, ILE has put in place a program called GeoBase, which uses private-sector Geographic Information Systems (GIS) technology as a foundation. In its current state, GeoBase uses the concept of a Common Installation Picture (CIP) to describe the goal of a consolidated visual that integrates the many layers of mapping information. The CIP visual is formed from a collection of data elements termed Mission Data Sets (MDS). There are varieties of MDS, each of which contains data specific to a particular geospatial domain. The research uses a case study methodology to investigate how the MDS are designed, implemented, and used within four USAF Civil Engineer Squadron Electrical and Utilities Work Centers. The research findings indicate that MDS design and implementation processes vary across organizations; however, fundamental similarities do exist. At the same time, an evolution and maturation of these processes is evident. As for MDS usage within the Electrical and Utilities Work Centers, it was found that usage is increasing; however, data quality is a limiting factor. Based on the research findings, recommendations are put forward for improving wing/base-level GeoBase program design, implementation, and usage.

    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the Suffix Array Construction.
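
    As a rough illustration of the referential idea (store each read as a reference position plus its differences, rather than as raw bases), the Python sketch below uses a brute-force placement and a plain tuple encoding. It is only a simplified picture of the general approach; UdeACompress itself relies on proper read-to-reference alignment and a specialized binary encoding.

        def encode_read(read, reference, max_mismatches=3):
            """Return (position, [(offset, base), ...]) for the best placement,
            or None if no placement has at most `max_mismatches` substitutions."""
            best = None
            for pos in range(len(reference) - len(read) + 1):
                mismatches = [(i, b) for i, b in enumerate(read)
                              if reference[pos + i] != b]
                if best is None or len(mismatches) < len(best[1]):
                    best = (pos, mismatches)
            if best is None or len(best[1]) > max_mismatches:
                return None                      # fall back to storing the read verbatim
            return best

        def decode_read(encoded, read_length, reference):
            pos, mismatches = encoded
            bases = list(reference[pos:pos + read_length])
            for offset, base in mismatches:
                bases[offset] = base
            return "".join(bases)

        reference = "ACGTACGTGGATCCACGT"
        read = "CGTGGATCA"                       # one substitution vs. reference[5:14]
        enc = encode_read(read, reference)
        assert enc is not None and decode_read(enc, len(read), reference) == read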

    Optimal Parsing for Dictionary Text Compression

    Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, such a process is usually not unique and, for compression purposes, it makes sense to find one of the possible parsings that minimizes the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy or a parsing algorithm that solves the parsing problem, taking into account all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Compression algorithm constraints are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weighs on the compressed text, i.e. the number of bits of the codeword representing that phrase, also denoted as the encoding cost of a dictionary pointer. In more than thirty years of history of dictionary-based text compression, while plenty of algorithms, variants and extensions have appeared and while the dictionary approach to text compression has become one of the most appreciated and utilized in almost all storage and communication processes, only a few optimal parsing algorithms have been presented. Many compression algorithms still lack optimality of their parsing or, at least, a proof of optimality. This happens because there is no general model of the parsing problem that includes all the dictionary-based algorithms and because the existing optimal parsing algorithms work under overly restrictive hypotheses. This work focuses on the parsing problem and presents both a general model for dictionary-based text compression, called the Dictionary-Symbolwise Text Compression theory, and a general parsing algorithm that is proved to be optimal under some realistic hypotheses. This algorithm is called Dictionary-Symbolwise Flexible Parsing and it covers almost all of the known cases of dictionary-based text compression algorithms, together with the large class of their variants where the text is decomposed into a sequence of symbols and dictionary phrases. In this work we further consider the case of a free mixture of a dictionary compressor and a symbolwise compressor. Our Dictionary-Symbolwise Flexible Parsing covers this case as well. We have indeed an optimal parsing algorithm in the case of dictionary-symbolwise compression where the dictionary is prefix-closed and the cost of encoding a dictionary pointer is variable. The symbolwise compressor is any classical one that works in linear time, as many common variable-length encoders do. Our algorithm works under the assumption that a special graph, which will be described in the following, is well defined. Even if this condition is not satisfied, it is possible to use the same method to obtain almost optimal parses. In detail, when the dictionary is LZ78-like, we show how to implement our algorithm in linear time. When the dictionary is LZ77-like, our algorithm can be implemented in time O(n log n). Both have O(n) space complexity. Even if the main aim of this work is of a theoretical nature, some experimental results are introduced to underline practical effects of parsing optimality in terms of compression performance and to show how to improve the compression ratio by building Dictionary-Symbolwise extensions of known algorithms. Finally, some more detailed experiments are hosted in a dedicated appendix.
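
    The flavour of graph-based optimal parsing can be conveyed with a small sketch: text positions are nodes, every dictionary phrase (or single symbol) available at a position is a weighted edge, and a cheapest path from position 0 to position n is a minimum-cost parse. The Python below is a toy dynamic program over such a graph with fixed costs; it illustrates the general principle only and is not the Dictionary-Symbolwise Flexible Parsing algorithm, whose dictionary model, cost model, and complexity guarantees are far more general.

        def optimal_parse(text, dictionary, phrase_cost, symbol_cost):
            n = len(text)
            best = [float("inf")] * (n + 1)       # best[i] = min bits to encode text[:i]
            back = [None] * (n + 1)               # back[i] = (start, token) of the last step
            best[0] = 0
            for i in range(n):
                if best[i] == float("inf"):
                    continue
                # Edge for a single literal symbol.
                if best[i] + symbol_cost < best[i + 1]:
                    best[i + 1] = best[i] + symbol_cost
                    back[i + 1] = (i, text[i])
                # Edges for dictionary phrases starting at position i.
                for phrase in dictionary:
                    j = i + len(phrase)
                    if text.startswith(phrase, i) and best[i] + phrase_cost < best[j]:
                        best[j] = best[i] + phrase_cost
                        back[j] = (i, phrase)
            # Recover the parse by walking the back-pointers from n to 0.
            parse, i = [], n
            while i > 0:
                start, token = back[i]
                parse.append(token)
                i = start
            return list(reversed(parse)), best[n]

        dictionary = {"ab", "abra", "cad", "ra"}
        print(optimal_parse("abracadabra", dictionary, phrase_cost=12, symbol_cost=9))
        # -> (['abra', 'cad', 'abra'], 36)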

    A comparison of exact string search algorithms for deep packet inspection

    Every day, computer networks throughout the world face a constant onslaught of attacks. To combat these, network administrators are forced to employ a multitude of mitigating measures. Devices such as firewalls and Intrusion Detection Systems are prevalent today and employ extensive Deep Packet Inspection to scrutinise each piece of network traffic. Systems such as these usually require specialised hardware to meet the demand imposed by high-throughput networks. Hardware like this is extremely expensive and singular in its function. It is with this in mind that string search algorithms are introduced. These algorithms have been proven to perform well when searching through large volumes of text and may be able to perform equally well in the context of Deep Packet Inspection. String search algorithms are designed to match a single pattern to a substring of a given piece of text. This is not unlike the heuristics employed by traditional Deep Packet Inspection systems. This research compares the performance of a large number of string search algorithms during packet processing. Deep Packet Inspection places stringent restrictions on the reliability and speed of the algorithms due to increased performance pressures. A test system had to be designed in order to properly test the string search algorithms in the context of Deep Packet Inspection. The system allowed for precise and repeatable tests of each algorithm and for their comparison. Of the algorithms tested, the Horspool and Quick Search algorithms posted the best results for both speed and reliability. The Not So Naive and Rabin-Karp algorithms were slowest overall.
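
    For reference, the sketch below shows the classic Horspool algorithm, one of the two best performers in this comparison: on each attempt the search window is shifted by a precomputed distance derived from the text character aligned with the end of the pattern. It is a textbook illustration in Python, not the test system used in the research.

        def horspool_search(pattern, text):
            """Return the start indices of all occurrences of `pattern` in `text`."""
            m, n = len(pattern), len(text)
            if m == 0 or m > n:
                return []
            # Shift table: distance from each character's last occurrence in
            # pattern[:-1] to the end of the pattern; unseen characters shift by m.
            shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
            matches, pos = [], 0
            while pos <= n - m:
                if text[pos:pos + m] == pattern:
                    matches.append(pos)
                pos += shift.get(text[pos + m - 1], m)
            return matches

        print(horspool_search(b"GET /", b"POST /a HTTP/1.1 GET /index HTTP/1.1"))  # -> [17]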