Approximate Matching in ACSM Dissimilarity Measure
The paper introduces a new patch-based dissimilarity measure for image comparison that employs an approximation strategy. It extends the Average Common Sub-matrix (ACSM) measure, which computes the exact dissimilarity between images. In the exact method, the dissimilarity between two images is obtained from the average area of the largest square sub-matrices common to both images, where the extracted sub-matrices are matched exactly, pixel by pixel. As an extension, the proposed measure computes an approximate match between the sub-matrices, obtained by omitting a controlled number of pixels at a given column offset inside the sub-matrices. The proposed dissimilarity measure is extensively compared with other well-known approximate methods for image comparison from the state of the art. Experiments demonstrate the superiority of the proposed approximate measure in terms of execution time with respect to the exact method, and in terms of retrieval precision with respect to the other state-of-the-art methods.
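To make the approximation concrete, the following is a minimal Python sketch of an ACSM-style comparison with approximate block matching. The images are assumed to be 2-D integer arrays; the naive search, the function names, and the normalization into [0, 1] are illustrative choices, not the paper's exact formulation.

    import numpy as np

    def approx_equal(block_a, block_b, skip_offset):
        # Compare equally sized square blocks, omitting every
        # `skip_offset`-th column; offsets below 2 mean an exact match.
        mask = np.ones(block_a.shape[1], dtype=bool)
        if skip_offset > 1:
            mask[skip_offset - 1::skip_offset] = False
        return np.array_equal(block_a[:, mask], block_b[:, mask])

    def largest_common_square(a, b, i, j, skip_offset):
        # Largest k such that the k x k block of `a` anchored at (i, j)
        # approximately matches some k x k block of `b` (naive scan).
        best, max_k = 0, min(a.shape[0] - i, a.shape[1] - j)
        for k in range(1, max_k + 1):
            block = a[i:i + k, j:j + k]
            found = any(approx_equal(block, b[r:r + k, c:c + k], skip_offset)
                        for r in range(b.shape[0] - k + 1)
                        for c in range(b.shape[1] - k + 1))
            if not found:
                break
            best = k
        return best

    def acsm_dissimilarity(a, b, skip_offset=0):
        # Average area of the largest common square sub-matrices,
        # rescaled into a dissimilarity score in [0, 1].
        areas = [largest_common_square(a, b, i, j, skip_offset) ** 2
                 for i in range(a.shape[0]) for j in range(a.shape[1])]
        return 1.0 - (sum(areas) / len(areas)) / min(a.shape) ** 2

Omitting columns shrinks the number of pixel comparisons per block and lets slightly differing regions still count as common, which is the source of both the speed-up and the robustness reported above.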
Compressed Pattern Matching
This Bachelor's thesis addresses pattern matching. Its main objective is to describe selected algorithms and data structures that are used in practice for pattern matching on both uncompressed and compressed data. An integral part of the thesis is the implementation of a selected data structure. Compression algorithms based on the Burrows-Wheeler transform are currently in wide use, and the FM-Index data structure depends on this transform. The FM-Index is implemented in the C# programming language and subjected to experiments covering the speed of pattern matching and space requirements. Search speed is compared against classical pattern-matching algorithms; space requirements are compared across input data formats and different FM-Index configurations. Finally, the results and findings from the experiments on the implemented data structure are presented.
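As a rough illustration of what the implemented structure does, here is a Python sketch of the FM-Index backward search over a Burrows-Wheeler transformed text. The rank function is a naive O(n) scan kept for brevity; a practical FM-Index, such as the C# implementation in the thesis, uses sampled occurrence tables or wavelet trees instead.

    def bwt(text):
        # Burrows-Wheeler transform via sorted rotations ('$' is the sentinel).
        text += "$"
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def fm_count(bwt_string, pattern):
        # Count pattern occurrences by FM-Index backward search.
        sorted_bwt = sorted(bwt_string)
        # C[c] = number of characters in the text strictly smaller than c.
        C = {c: sorted_bwt.index(c) for c in set(bwt_string)}
        def rank(c, i):  # occurrences of c in bwt_string[:i]
            return bwt_string[:i].count(c)
        lo, hi = 0, len(bwt_string)
        for c in reversed(pattern):  # extend the match one character at a time
            if c not in C:
                return 0
            lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo

    print(fm_count(bwt("abracadabra"), "abra"))  # -> 2

The search touches only the transformed text, never the original, which is why FM-Index-based compressed pattern matching can answer queries directly on compressed data.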
An Investigation of GeoBase Mission Data Set Design, Implementation, and Usage within Air Force Civil Engineer Electrical and Utilities Work Centers
In 2001, the Office of the Civil Engineer, Installation and Logistics, Headquarters, United States Air Force (ILE) identified Civil Engineer Squadrons as the central point of contact for all base-level mapping requirements and activities. To update mapping methods and procedures, ILE has put in place a program called GeoBase, which uses private-sector Geographic Information Systems (GIS) technology as its foundation. In its current state, GeoBase uses the concept of a Common Installation Picture (CIP) to describe the goal of a consolidated visual that integrates the many layers of mapping information. The CIP visual is formed from a collection of data elements termed Mission Data Sets (MDS). There are varieties of MDS, each of which contains data specific to a particular geospatial domain. The research uses a case study methodology to investigate how MDS are designed, implemented, and used within four USAF Civil Engineer Squadron Electrical and Utilities Work Centers. The research findings indicate that MDS design and implementation processes vary across organizations; however, fundamental similarities do exist, and an evolution and maturation of these processes is evident. As for MDS usage within the Electrical and Utilities Work Centers, it was found that usage is increasing; however, data quality is a limiting factor. Based on the research findings, recommendations are put forward for improving wing/base-level GeoBase program design, implementation, and usage.
Efficient Storage of Genomic Sequences in High Performance Computing Systems
In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for raw genomic data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress: suffix array construction.
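The core idea of referential compression can be sketched in a few lines of Python: align each read against the reference and store only the alignment position plus the mismatching bases. This is a deliberately naive, mismatch-only illustration; UdeACompress's aligner and binary-encoding strategy are considerably more elaborate.

    def compress_read(read, reference):
        # Pick the alignment position with the fewest mismatches and keep
        # only (position, mismatch list) instead of the raw bases.
        best_pos, best_mism = 0, None
        for pos in range(len(reference) - len(read) + 1):
            mism = [(i, base) for i, base in enumerate(read)
                    if reference[pos + i] != base]
            if best_mism is None or len(mism) < len(best_mism):
                best_pos, best_mism = pos, mism
        return best_pos, best_mism

    def decompress_read(pos, mismatches, read_len, reference):
        # Rebuild the read from the reference plus its stored edits.
        bases = list(reference[pos:pos + read_len])
        for i, base in mismatches:
            bases[i] = base
        return "".join(bases)

    ref = "ACGTACGTTGCAACGT"
    read = "ACGTTGCC"
    pos, mism = compress_read(read, ref)
    assert decompress_read(pos, mism, len(read), ref) == read

When reads closely resemble the reference, the (position, edits) pair is far smaller than the read itself, which is where the gap between referential and non-referential compression ratios comes from.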
Optimal Parsing for Dictionary Text Compression
Dictionary-based compression algorithms include a parsing strategy to transform the input text into a sequence of dictionary phrases. Given a text, this process is usually not unique and, for compression purposes, it makes sense to find one of the possible parsings that minimizes the final compression ratio. This is the parsing problem. An optimal parsing is a parsing strategy, or a parsing algorithm, that solves the parsing problem taking into account all the constraints of a compression algorithm or of a class of homogeneous compression algorithms. Compression algorithm constraints are, for instance, the dictionary itself, i.e. the dynamic set of available phrases, and how much a phrase weighs on the compressed text, i.e. the number of bits of which the codeword representing that phrase is composed, also denoted as the encoding cost of a dictionary pointer.

In more than 30 years of history of dictionary-based text compression, while plenty of algorithms, variants, and extensions have appeared, and while the dictionary approach to text compression has become one of the most appreciated and widely used in almost all storage and communication processes, only a few optimal parsing algorithms have been presented. Many compression algorithms still lack optimality of their parsing or, at least, a proof of optimality. This happens because there is no general model of the parsing problem that includes all dictionary-based algorithms, and because the existing optimal parsing algorithms work under overly restrictive hypotheses.

This work focuses on the parsing problem and presents both a general model for dictionary-based text compression, called the Dictionary-Symbolwise Text Compression theory, and a general parsing algorithm that is proved to be optimal under some realistic hypotheses. This algorithm is called Dictionary-Symbolwise Flexible Parsing, and it covers almost all known cases of dictionary-based text compression algorithms, together with the large class of their variants where the text is decomposed into a sequence of symbols and dictionary phrases.

In this work we further consider the case of a free mixture of a dictionary compressor and a symbolwise compressor. Our Dictionary-Symbolwise Flexible Parsing covers this case as well. We indeed have an optimal parsing algorithm for dictionary-symbolwise compression where the dictionary is prefix-closed and the cost of encoding a dictionary pointer is variable. The symbolwise compressor is any classical one that works in linear time, as many common variable-length encoders do. Our algorithm works under the assumption that a special graph, described in what follows, is well defined. Even if this condition is not satisfied, the same method can be used to obtain almost-optimal parses. In detail, when the dictionary is LZ78-like, we show how to implement our algorithm in linear time. When the dictionary is LZ77-like, our algorithm can be implemented in O(n log n) time. Both have O(n) space complexity.
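To give a concrete flavor of the graph-based view, the following Python sketch casts parsing as a shortest path in a DAG whose nodes are text positions and whose edges are dictionary phrases or single symbols. It assumes a static dictionary and fixed edge costs; the Dictionary-Symbolwise Flexible Parsing of this work handles dynamic dictionaries and variable pointer costs on top of the same principle.

    def optimal_parse(text, dictionary, phrase_cost, symbol_cost):
        # Node i stands for the prefix text[:i]; an edge i -> j exists for
        # every dictionary phrase (or single symbol) matching text[i:j].
        n, INF = len(text), float("inf")
        cost = [0] + [INF] * n   # cheapest encoding of text[:i]
        back = [None] * (n + 1)  # edge used to reach node i
        for i in range(n):
            if cost[i] == INF:
                continue
            if cost[i] + symbol_cost < cost[i + 1]:     # symbolwise edge
                cost[i + 1], back[i + 1] = cost[i] + symbol_cost, text[i]
            for phrase in dictionary:                   # dictionary edges
                j = i + len(phrase)
                if text[i:j] == phrase and cost[i] + phrase_cost < cost[j]:
                    cost[j], back[j] = cost[i] + phrase_cost, phrase
        parse, i = [], n
        while i > 0:             # walk the cheapest path backwards
            parse.append(back[i])
            i -= len(back[i])
        return parse[::-1], cost[n]

    parse, bits = optimal_parse("ababab", {"ab", "abab"},
                                phrase_cost=12, symbol_cost=9)
    print(parse, bits)  # -> ['ab', 'abab'] 24

A greedy parser that always takes the longest available phrase can miss the cheapest path; the dynamic program above, being a single-source shortest path over a DAG, cannot.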
Although the main aim of this work is theoretical, some experimental results are introduced to underline the practical effects of parsing optimality on compression performance and to show how the compression ratio can be improved by building Dictionary-Symbolwise extensions of known algorithms. Finally, some more detailed experiments are reported in a dedicated appendix.
A comparison of exact string search algorithms for deep packet inspection
Every day, computer networks throughout the world face a constant onslaught of attacks. To combat these, network administrators are forced to employ a multitude of mitigating measures. Devices such as firewalls and Intrusion Detection Systems are prevalent today and employ extensive Deep Packet Inspection to scrutinise each piece of network traffic. Systems such as these usually require specialised hardware to meet the demand imposed by high-throughput networks. Hardware like this is extremely expensive and singular in its function. It is with this in mind that string search algorithms are introduced. These algorithms have been proven to perform well when searching through large volumes of text and may be able to perform equally well in the context of Deep Packet Inspection. String search algorithms are designed to match a single pattern to a substring of a given piece of text, which is not unlike the heuristics employed by traditional Deep Packet Inspection systems. This research compares the performance of a large number of string search algorithms during packet processing. Deep Packet Inspection places stringent restrictions on the reliability and speed of the algorithms due to increased performance pressures. A test system had to be designed in order to test the string search algorithms properly in the context of Deep Packet Inspection. The system allowed for precise and repeatable tests of each algorithm and for their subsequent comparison. Of the algorithms tested, the Horspool and Quick Search algorithms posted the best results for both speed and reliability; the Not So Naive and Rabin-Karp algorithms were slowest overall.
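For reference, the skip heuristic that makes Horspool fast in such scans can be sketched in a few lines of Python; the variable names and the example payload are illustrative.

    def horspool(pattern, text):
        # Boyer-Moore-Horspool: on each step, shift the window by the
        # distance from the rightmost pattern occurrence of the character
        # currently under the window's last position.
        m, n = len(pattern), len(text)
        if m == 0 or n < m:
            return []
        shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
        matches, i = [], 0
        while i <= n - m:
            if text[i:i + m] == pattern:
                matches.append(i)
            i += shift.get(text[i + m - 1], m)  # default: jump a full window
        return matches

    payload = "...GET /index.html HTTP/1.1..."
    print(horspool("GET /", payload))  # -> [3]

The ability to jump up to a full pattern length per step, while inspecting only one character on most mismatches, is a plausible reason for its strong showing in the tests above.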
GPU-Acceleration of In-Memory Data Analytics
Hardware advances strongly influence database system design. The flattening of CPU core speeds makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is the increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU's special memory and threading model. Due to the increasing memory capacity, as well as the user's need for fast interaction with the data, we focus on in-memory analytics.
Our techniques span different steps of the data processing pipeline: (1) data preprocessing, (2) query compilation, and (3) algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout for numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over state-of-the-art multi-core CPU libraries and is suitable for any massively parallel processor.
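As a CPU-side illustration of the layout idea only (not the thesis's actual GPU code; the column names are hypothetical), the following Python sketch converts a row-major array of records into per-column contiguous buffers, the transformation that lets adjacent GPU threads issue coalesced loads.

    import numpy as np

    # Row-major "array of structs": one record per row.
    records = np.array([(1, 9.99, 3), (2, 4.50, 7), (3, 1.25, 2)],
                       dtype=[("key", "i4"), ("price", "f4"), ("qty", "i4")])

    # "Struct of arrays": one contiguous buffer per column, so thread t
    # reading element t sits right next to thread t+1 in memory and the
    # hardware can coalesce the loads into wide transactions.
    columns = {name: np.ascontiguousarray(records[name])
               for name in records.dtype.names}

    # A filter now streams one tight buffer instead of striding over records.
    mask = columns["price"] < 5.0
    print(columns["key"][mask])  # -> [2 3]

String columns need an extra step (e.g., padding or offset arrays) before they stream equally well, which is part of what the preprocessing techniques above address.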