Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. In particular, we introduce a class of methods that rely crucially
on data compression techniques to define a measure of remoteness and distance
between pairs of character sequences (e.g. texts) based on their relative
information content. We also discuss in detail how specific features of data
compression techniques can be used to introduce the notions of the dictionary
of a given sequence and of an Artificial Text, and we show how these new tools
can be used for information extraction. We point out the versatility and
generality of our method, which applies to any corpus of character strings
regardless of the coding behind them. As a case study we consider
linguistically motivated problems and present results for automatic language
recognition, authorship attribution and self-consistent classification.
Comment: Revised version, with major changes, of the previous "Data Compression
Approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
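The "relative information content" idea in this entry can be sketched with a standard compressor: prime DEFLATE with one sequence as a preset dictionary and measure how much that helps when compressing another. The corpora and the `remoteness` ratio below are illustrative assumptions, not the paper's exact measure:

```python
import zlib

def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    # DEFLATE size of `data`; a preset dictionary is a crude stand-in for
    # the paper's "dictionary of a given sequence" (not its exact method)
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())

def remoteness(reference: bytes, sample: bytes) -> float:
    # ratio < 1 when the reference's regularities help compress the sample,
    # i.e. the sample carries little information relative to the reference
    return compressed_size(sample, zdict=reference) / compressed_size(sample)

corpus_en = b"the quick brown fox jumps over the lazy dog " * 40
test_en = b"a quick brown dog jumps over the lazy fox"
test_it = b"il cane pigro salta sopra la volpe veloce"
```

With these toy corpora, the English sample is measurably less remote from the English reference than the Italian one is, which is the effect language recognition by compression exploits.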
Hybrid Technique for Arabic Text Compression
Arabic content on the Internet and other digital media is increasing exponentially, and the number of Arab users of these media has grown more than twentyfold over the past five years. There is a real need to save the space allocated to this content, as well as to allow more efficient searching and retrieval operations on it. Techniques borrowed from other languages, or general-purpose data compression techniques that ignore the specific features of Arabic, have had limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of the Arabic language to improve the compression ratio of Arabic texts. The technique works in two phases. In the first phase, the text file is split into four different files using a multilayer model-based approach. In the second phase, each of these four files is compressed using the Burrows-Wheeler compression algorithm.
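The second phase relies on the Burrows-Wheeler transform, which reorders a text so that characters with similar contexts cluster together and downstream coders compress better. A minimal sketch of the transform itself (quadratic sorted-rotations form, for illustration; production implementations use suffix arrays):

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' is assumed
    absent from the input and marks its end)."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(r: str) -> str:
    """Invert by repeatedly prepending the transform as a column and
    re-sorting (the simple O(n^2 log n) textbook form)."""
    table = [""] * len(r)
    for _ in range(len(r)):
        table = sorted(c + row for c, row in zip(r, table))
    return next(row for row in table if row.endswith("$"))[:-1]

print(bwt("banana"))  # clusters repeated characters: "annb$aa"
```

The transform is lossless, so `inverse_bwt(bwt(text))` recovers the original file exactly.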
Text compression for Chinese documents.
by Chi-kwun Kan. Thesis (M.Phil.), Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 133-137).
Table of contents (abridged):
1. Introduction: importance of text compression; historical background and essences of data compression; data models, entropy, statistical vs dictionary-based compression, static vs adaptive and one-pass vs two-pass modelling; benchmarks, measurements of results and sources of testing data.
2. Literature survey: statistical compression methods; dictionary-based methods (the Ziv-Lempel family); cascading of algorithms; problems of current compression programs on Chinese; previous Chinese data compression literature.
3. Chinese-related issues: a large, non-fixed-size character set; lack of word segmentation; rich semantic meaning of Chinese characters; grammatical variance; coding schemes (Big5, GB (Guo Biao), Unicode, HZ (Hanzi)); entropy of Chinese and other languages.
4. Huffman coding on Chinese text: a Chinese character identification routine; results with justification and time/memory analysis; a heuristic order-n Huffman coding for Chinese text compression.
5. Ziv-Lempel compression on Chinese text: Chinese LZSS and Chinese LZW compression, each with results, resource analysis and the effects of controlling the parameters, and a comparison of the two.
6. Chinese dictionary-based Huffman coding: the algorithm, results, and the effects of changing the size of the dictionary.
7. Cascading of Huffman coding and LZW compression: static and adaptive (dynamic) cascading models, with results and analysis.
8. Concluding remarks: conclusion and future work (efficiency and resource consumption, the compressibility of Chinese and other languages, use of a grammar model, lossy compression).
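Chapters 4, 6 and 7 of this thesis revolve around Huffman coding with multi-byte Chinese characters treated as single symbols. A minimal sketch of building such a code table; here a Python character stands in for the thesis's character identification routine, and the text assumes at least two distinct symbols:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Frequency-based Huffman code table; each Python character is one
    symbol, standing in for a multi-byte Chinese character."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)   # lighter subtree gets bit 0
        fb, _, b = heapq.heappop(heap)   # heavier subtree gets bit 1
        merged = {s: "0" + c for s, c in a.items()}
        merged |= {s: "1" + c for s, c in b.items()}
        heapq.heappush(heap, (fa + fb, next_id, merged))
        next_id += 1
    return heap[0][2]

table = huffman_codes("壓縮壓縮壓文文字")
encoded = "".join(table[ch] for ch in "壓縮壓縮壓文文字")
```

The resulting code is prefix-free, so the encoded bit stream can be decoded unambiguously; the thesis's cascaded schemes feed such Huffman output into LZW and vice versa.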
Review on techniques and file formats of image compression
This paper presents a review of compression techniques in digital image processing, together with a brief description of the main technologies and traditional file formats commonly used in image compression. Image compression can be defined as a set of techniques applied to images so that they can be stored or transferred efficiently. In addition, the paper presents formats used to reduce redundant information in an image: unnecessary pixels and non-visual redundancy. The paper concludes that image compression is a critical issue in digital image processing because it allows image data to be stored and transmitted efficiently.
The similarity metric
A new class of distances appropriate for measuring similarity relations
between sequences, say one type of similarity per distance, is studied. We
propose a new "normalized information distance", based on the noncomputable
notion of Kolmogorov complexity, and show that it is in this class and it
minorizes every computable distance in the class (that is, it is universal in
that it discovers all computable similarities). We demonstrate that it is a
metric and call it the similarity metric. This theory forms the
foundation for a new practical tool. To evidence generality and robustness we
give two distinctive applications in widely divergent areas using standard
compression programs like gzip and GenCompress. First, we compare whole
mitochondrial genomes and infer their evolutionary history. This yields the
first completely automatically computed whole mitochondrial phylogeny tree.
Secondly, we fully automatically compute the language tree of 52 different
languages.
Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th
ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected
version to appear in IEEE Trans Inform. T
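The practical tool built on this theory is the normalized compression distance, which replaces the uncomputable Kolmogorov complexity K(x) with the length C(x) produced by a real compressor such as gzip or GenCompress. A minimal sketch using zlib (the same DEFLATE algorithm as gzip); the quotations used as inputs are illustrative:

```python
import zlib

def C(x: bytes) -> int:
    # compressed length under DEFLATE, standing in for Kolmogorov complexity
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    # normalized compression distance: near 0 for near-identical inputs,
    # near 1 for inputs sharing no compressible structure
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

hamlet = b"to be or not to be that is the question " * 30
jaques = b"all the world is a stage and all the men " * 30
```

Applying `ncd` pairwise to genomes or to translated texts, then clustering the resulting distance matrix, is how the paper's phylogeny and language trees are obtained.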
Lossless Text Compression Technique Using Syllable Based Morphology
In this paper, we present a new lossless text compression technique which utilizes the syllable-based morphology of multi-syllabic languages. The proposed algorithm partitions words into their syllables and then produces shorter bit representations of them for compression. The method has six main components: source file, filtering unit, syllable unit, compression unit, dictionary file and target file. The number of bits used to code a syllable depends on the number of entries in the dictionary file. The proposed algorithm was implemented and tested on 20 texts of different lengths collected from different fields. The results indicate compression of up to 43%.
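The dependence of code length on dictionary size can be sketched as fixed-width index coding: every syllable is replaced by its dictionary index, written in just enough bits to address the whole dictionary. The syllable entries below are hypothetical, and the language-specific splitter (the paper's filtering and syllable units) is not reproduced:

```python
import math

def encode_syllables(syllables: list[str], dictionary: list[str]) -> str:
    """Fixed-width dictionary coding: each syllable becomes an index of
    ceil(log2(len(dictionary))) bits, so the per-syllable code length
    depends only on the number of dictionary entries."""
    width = max(1, math.ceil(math.log2(len(dictionary))))
    index = {s: i for i, s in enumerate(dictionary)}
    return "".join(format(index[s], f"0{width}b") for s in syllables)

syllable_dict = ["ka", "lem", "le", "ri", "miz"]   # hypothetical entries
bits = encode_syllables(["ka", "lem", "le", "ri", "miz"], syllable_dict)
```

With five entries each syllable costs 3 bits, which is where the trade-off between dictionary coverage and code width comes from.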
n-Gram-based text compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significant compression ratio in comparison with state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to 5-grams to obtain the best encoding stream. Each n-gram is encoded with two to four bytes, based on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from Vietnamese news agencies to build n-gram dictionaries from unigrams to 5-grams, achieving dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
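The sliding-window encoding phase can be sketched as greedy longest-match segmentation: at each position, try the longest dictionary n-gram first and fall back to shorter ones. The toy dictionaries below are assumptions; the paper's real ones are built from the 2.5 GB corpus, and a real encoder would then emit the 2-4 byte dictionary ids:

```python
def greedy_ngram_encode(words: list[str], dictionaries: dict) -> list[tuple]:
    """Greedy longest-match segmentation: try 5-grams down to bigrams
    against per-order n-gram dictionaries, with a unigram fallback."""
    out, i = [], 0
    while i < len(words):
        for n in range(5, 1, -1):
            gram = tuple(words[i:i + n])
            if len(gram) == n and gram in dictionaries.get(n, set()):
                out.append(gram)
                i += n
                break
        else:
            out.append((words[i],))   # unigram fallback
            i += 1
    return out

dicts = {2: {("xin", "chào")}, 3: {("cám", "ơn", "bạn")}}   # hypothetical
segments = greedy_ngram_encode("xin chào cám ơn bạn".split(), dicts)
```

Fewer, longer segments mean fewer emitted ids, which is why wider dictionary coverage drives the reported compression ratio.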
Sistema baseado em técnicas de compressão para o reconhecimento de dígitos manuscritos (A compression-based system for handwritten digit recognition)
Master's dissertation in Electronic Engineering and Telecommunications. The recognition of handwritten digits is a human-acquired ability. With little effort, a human can properly recognize, in milliseconds, a sequence of handwritten digits. With the help of a computer, the task of handwriting recognition can be easily automated, improving and speeding up a significant number of processes. Postal mail sorting, bank check verification and handwritten digit data entry operations belong to a wide group of applications that can be performed in a more effective and automated way. In recent years, a number of techniques and methods have been proposed to automate the handwritten digit recognition mechanism. However, solving this challenging image recognition problem typically requires complex and computationally demanding machine learning techniques, as is the case of deep learning. This dissertation introduces a novel solution to the problem of handwritten digit recognition, using metrics of similarity between digit images. The metrics are computed based on data compression, namely through the use of Finite Context Models.
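The finite context models mentioned here can be sketched as adaptive order-k symbol predictors whose ideal code length (-log2 of each predicted probability) serves as a compression-based complexity estimate; similarity between two inputs then compares such code lengths. A minimal sketch over binary strings; the digit-image pipeline and the exact model orders are the dissertation's and are not reproduced:

```python
import math
from collections import Counter, defaultdict

def fcm_bits(data: str, k: int = 2, alpha: float = 1.0) -> float:
    """Ideal code length (bits) of `data` under an adaptive order-k finite
    context model with add-alpha smoothing: each symbol costs -log2 of the
    probability its context predicted, then the counts are updated."""
    counts = defaultdict(Counter)
    alphabet = len(set(data))
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - k):i]
        c = counts[ctx]
        p = (c[sym] + alpha) / (sum(c.values()) + alpha * alphabet)
        bits += -math.log2(p)
        c[sym] += 1
    return bits

periodic = "01" * 100                # highly regular
irregular = bin(3 ** 127)[2:][:200]  # same length and alphabet, less regular
```

A regular string costs far fewer bits than an irregular one of the same length, and that gap is the kind of signal a compression-based classifier uses to decide which digit class an image resembles most.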