57 research outputs found

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. In particular, we introduce a class of methods that use data compression techniques in a crucial way to define a measure of remoteness and distance between pairs of character sequences (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notions of the dictionary of a given sequence and of an Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind it. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of the previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures
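    The core idea of compression-based information extraction can be sketched in a few lines: to decide which reference corpus an unknown text resembles, append the text to each corpus and see which compressor pays the smallest penalty in extra bytes. The sketch below uses zlib and tiny toy corpora as stand-ins (the paper does not prescribe these); it is an illustration of the approach, not the authors' implementation.

    ```python
    # Guess the "closest" reference corpus for an unknown text by measuring
    # how many extra compressed bytes the text costs after each corpus.
    import zlib

    def compressed_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    def cross_cost(corpus: bytes, unknown: bytes) -> int:
        # Extra bytes needed to encode `unknown` given the corpus statistics
        return compressed_size(corpus + unknown) - compressed_size(corpus)

    def guess_language(corpora: dict, unknown: bytes) -> str:
        # The corpus that "already knows" the text's statistics wins
        return min(corpora, key=lambda lang: cross_cost(corpora[lang], unknown))

    corpora = {
        "english": b"the quick brown fox jumps over the lazy dog " * 50,
        "italian": b"la volpe veloce salta sopra il cane pigro " * 50,
    }
    print(guess_language(corpora, b"the dog and the fox"))
    ```

    With realistic corpora the same scheme extends directly to authorship attribution: one reference corpus per candidate author.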

    Hybrid Technique for Arabic Text Compression

    Arabic content on the Internet and other digital media is increasing exponentially, and the number of Arab users of these media has grown more than twentyfold over the past five years. There is a real need to save the space allocated to this content, as well as to allow more efficient searching and information retrieval operations on it. Techniques borrowed from other languages, and general-purpose data compression techniques that ignore the proper features of Arabic, have had limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of the Arabic language to improve the compression ratio of Arabic texts. The technique works in phases. In the first phase, the text file is split into four different files using a multilayer model-based approach. In the second phase, each of these four files is compressed using the Burrows-Wheeler compression algorithm
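    The second phase relies on the Burrows-Wheeler transform, which permutes a text so that similar contexts cluster together and downstream coders compress better. A minimal sketch of the transform and its inverse is shown below; it uses a sentinel byte and naive O(n² log n) rotation sorting for clarity, whereas production BWT compressors use suffix-array construction.

    ```python
    # Toy Burrows-Wheeler transform: sort all rotations, keep the last column.
    def bwt(s: bytes) -> bytes:
        s = s + b"\x00"  # unique end-of-string sentinel (assumed absent in s)
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return bytes(rot[-1] for rot in rotations)

    def ibwt(last: bytes) -> bytes:
        # Rebuild the rotation table by repeatedly prepending and sorting
        table = [b""] * len(last)
        for _ in range(len(last)):
            table = sorted(bytes([c]) + row for c, row in zip(last, table))
        row = next(r for r in table if r.endswith(b"\x00"))
        return row[:-1]

    data = b"banana bandana"
    assert ibwt(bwt(data)) == data
    ```

    The transform itself saves nothing; the gain comes from pairing it with a move-to-front and entropy coding stage, as in bzip2-style compressors.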

    Text compression for Chinese documents.

    by Chi-kwun Kan. Thesis (M.Phil.), Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 133-137). Contents:
    1. Introduction: importance of text compression; historical background of data compression; the essences of data compression; motivation and objectives of the project; definition of important terms (data models, entropy, statistical and dictionary-based compression, static and adaptive modelling, one-pass and two-pass modelling); benchmarks and measurements of results; sources of testing data; outline of the thesis.
    2. Literature Survey: data compression algorithms (statistical compression methods; dictionary-based compression methods, the Ziv-Lempel family); cascading of algorithms; problems of current compression programs on Chinese; previous Chinese data compression literature.
    3. Chinese-related Issues: characteristics in Chinese data compression (large, non-fixed-size character set; lack of word segmentation; rich semantic meaning of Chinese characters; grammatical variance of the Chinese language); definition of different coding schemes (Big5, GB (Guo Biao), Unicode, HZ (Hanzi)); entropy of Chinese and other languages.
    4. Huffman Coding on Chinese Text: use of the Chinese character identification routine; result and its justification; time and memory resource analysis; the heuristic order-n Huffman coding for Chinese text compression (algorithm, result, justification); chapter conclusion.
    5. The Ziv-Lempel Compression on Chinese Text: the Chinese LZSS compression (algorithm, result, justification, time and memory resource analysis, effects of controlling the parameters); the Chinese LZW compression (algorithm, result, justification, time and memory resource analysis, effects of controlling the parameters); a comparison of the performance of LZSS and LZW; chapter conclusion.
    6. Chinese Dictionary-based Huffman Coding: algorithm; result and its justification; effects of changing the size of the dictionary; chapter conclusion.
    7. Cascading of Huffman Coding and LZW Compression: static cascading model (algorithm, result, explanation and analysis); adaptive (dynamic) cascading model (algorithm, result, explanation and analysis); chapter conclusion.
    8. Concluding Remarks: conclusion; future work directions (improvement in efficiency and resource consumption; the compressibility of Chinese and other languages; use of a grammar model; lossy compression); epilogue. Bibliography.
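    The thesis's central observation in the Huffman chapters is that treating each multi-byte Chinese character (e.g. a Big5 code) as one symbol gives a better model than coding unrelated single bytes. A hedged order-0 sketch of that idea follows; the sample text and tie-breaking scheme are illustrative, not the thesis's implementation.

    ```python
    # Order-0 Huffman coding over whole characters rather than raw bytes.
    import heapq
    from collections import Counter

    def huffman_codes(freqs):
        # Heap entries are (weight, unique tiebreak, subtree); a subtree is
        # either a leaf symbol or a (left, right) pair.
        heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
        heapq.heapify(heap)
        n = len(heap)
        if n == 1:                       # degenerate single-symbol alphabet
            return {heap[0][2]: "0"}
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, n, (t1, t2)))
            n += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    text = "壓縮壓縮壓縮資料資料學"      # toy sample; each character = one symbol
    codes = huffman_codes(Counter(text))
    # More frequent characters receive codewords no longer than rarer ones
    assert len(codes["壓"]) <= len(codes["學"])
    ```

    Per-character counts over a large corpus would replace the toy `Counter` here; the thesis further extends this to order-n context models.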

    Review on techniques and file formats of image compression

    This paper presents a review of compression techniques in digital image processing, together with a brief description of the main technologies and traditional formats commonly used in image compression. Image compression can be defined as a set of techniques applied to images so that they can be stored or transferred efficiently. In addition, the paper presents formats used to reduce redundant information in an image: unnecessary pixels and non-visual redundancy. The paper concludes that image compression is a critical issue in digital image processing because it allows image data to be stored and transmitted efficiently

    The similarity metric

    A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and that it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas, using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatically computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages. Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected version to appear in IEEE Trans Inform. T
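    In practice the noncomputable Kolmogorov complexity K is approximated by the output length C of a real compressor, giving the normalized compression distance NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)). The sketch below uses zlib as a stand-in for the gzip/GenCompress programs mentioned in the abstract:

    ```python
    # Normalized compression distance: near 0 for very similar strings,
    # near 1 for unrelated ones.
    import zlib

    def C(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

    a = b"the quick brown fox jumps over the lazy dog"
    b = b"colorless green ideas sleep furiously"
    # A string is closer to itself than to an unrelated string
    assert ncd(a, a) < ncd(a, b)
    ```

    Pairwise NCD values over a set of genomes or texts form the distance matrix from which the phylogeny and language trees in the paper are built.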

    Lossless Text Compression Technique Using Syllable Based Morphology

    In this paper, we present a new lossless text compression technique that exploits the syllable-based morphology of multi-syllabic languages. The proposed algorithm partitions words into their syllables and then produces shorter bit representations of them for compression. The method has six main components, namely the source file, filtering unit, syllable unit, compression unit, dictionary file and target file. The number of bits used to code a syllable depends on the number of entries in the dictionary file. The proposed algorithm was implemented and tested on 20 texts of different lengths collected from different fields. The results indicated compression of up to 43%
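    The dictionary step can be sketched as follows: each distinct syllable gets an index, and the bits needed per syllable grow only logarithmically with the dictionary size. The syllable list below is a toy stand-in; in the paper, syllables come from the morphology of the target language.

    ```python
    # Replace each syllable by a fixed-width dictionary index.
    import math

    def encode(syllables):
        dictionary = sorted(set(syllables))
        index = {s: i for i, s in enumerate(dictionary)}
        # Width in bits is determined by the number of dictionary entries
        bits_per_syllable = max(1, math.ceil(math.log2(len(dictionary))))
        stream = "".join(format(index[s], f"0{bits_per_syllable}b")
                         for s in syllables)
        return dictionary, stream

    syllables = ["ka", "ra", "ma", "ka", "ra"]
    dictionary, stream = encode(syllables)
    # 3 distinct syllables -> 2 bits each, 5 syllables -> 10 bits total
    assert len(stream) == len(syllables) * 2
    ```

    A decoder needs only the dictionary and the bit width to invert the stream, which is why the dictionary file is one of the method's six components.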

    n-Gram-based text compression

    We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significant compression ratio in comparison with state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded by two to four bytes, according to its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a testing set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods. Web of Science art. no. 948364
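    The sliding-window phase can be sketched as a greedy longest-match segmentation: at each position, try the longest n-gram first and fall back toward unigrams. The tiny dictionaries and word list below are assumptions standing in for the paper's 12 GB Vietnamese dictionaries; the actual method also weighs which segmentation yields the shortest byte stream.

    ```python
    # Greedy longest-first n-gram segmentation against per-n dictionaries.
    def split_ngrams(words, dictionaries, max_n=5):
        # dictionaries: dict mapping n -> set of known n-grams (tuples)
        out, i = [], 0
        while i < len(words):
            for n in range(min(max_n, len(words) - i), 0, -1):
                gram = tuple(words[i:i + n])
                if n == 1 or gram in dictionaries.get(n, set()):
                    out.append(gram)      # unigrams are always accepted
                    i += n
                    break
        return out

    dicts = {2: {("xin", "chao"), ("cam", "on")},
             3: {("hoc", "may", "tinh")}}
    tokens = "xin chao hoc may tinh cam on ban".split()
    print(split_ngrams(tokens, dicts))
    ```

    Each emitted n-gram would then be replaced by its two-to-four-byte code from the dictionary for that n.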

    Compression-based system for the recognition of handwritten digits

    Master's in Electronics and Telecommunications Engineering. The recognition of handwritten digits is a human-acquired ability. With little effort, a human can properly recognize, in milliseconds, a sequence of handwritten digits. With the help of a computer, the task of handwriting recognition can be easily automated, improving and speeding up a significant number of processes. Postal mail sorting, bank check verification and handwritten-digit data entry operations belong to a wide group of applications that can be performed in a more effective and automated way. In recent years, a number of techniques and methods have been proposed to automate the handwritten digit recognition mechanism. However, solving this challenging image recognition problem typically requires complex and computationally demanding machine learning techniques, as is the case of deep learning. This dissertation introduces a novel solution to the problem of handwritten digit recognition, using metrics of similarity between digit images. The metrics are computed based on data compression, namely through the use of Finite Context Models
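    The classification idea can be sketched with a finite-context (order-k Markov) model per digit class: a test image's symbol stream is assigned to the class whose model would encode it in the fewest estimated bits. The binary "images", order k = 2, and Laplace smoothing below are illustrative choices, not the dissertation's exact configuration.

    ```python
    # Compression-based classification with per-class finite context models.
    import math
    from collections import defaultdict

    class FiniteContextModel:
        def __init__(self, k=2, alphabet=("0", "1")):
            self.k, self.alphabet = k, alphabet
            self.counts = defaultdict(lambda: defaultdict(int))

        def train(self, seq):
            # Count symbol occurrences after each k-symbol context
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1

        def bits(self, seq):
            # Estimated code length of seq under this model (Laplace smoothing)
            total = 0.0
            for i in range(self.k, len(seq)):
                ctx = self.counts[seq[i - self.k:i]]
                p = (ctx[seq[i]] + 1) / (sum(ctx.values()) + len(self.alphabet))
                total += -math.log2(p)
            return total

    # Toy "images" flattened to bit strings: class A stripey, class B solid
    models = {}
    for label, samples in {"A": ["01010101" * 4], "B": ["11111111" * 4]}.items():
        m = FiniteContextModel()
        for s in samples:
            m.train(s)
        models[label] = m

    test = "0101010101"
    pred = min(models, key=lambda lbl: models[lbl].bits(test))
    assert pred == "A"
    ```

    With real digit images, each class model would be trained on many binarized examples, and the bit estimate plays the role of the compression-based similarity metric.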