
    Better text compression from fewer lexical n-grams

    Word-based context models for text compression have the capacity to outperform simpler character-based models, but are generally unattractive because of inherent problems with exponential model growth and the corresponding data sparseness. These ill effects can be mitigated in an adaptive lossless compression scheme by modelling syntactic and semantic lexical dependencies independently.
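
    As a rough, hypothetical illustration of the trade-off described above (word-level context captures more per symbol, but the model grows with the number of distinct word pairs), the Python sketch below compares the empirical bits per character of character- and word-level bigram models; the file name sample.txt and the crude whitespace tokenizer are placeholders, not anything from the paper.

        # Rough comparison of character- vs word-level bigram models
        # (illustrative only; not the modelling scheme proposed in the paper).
        import math
        from collections import Counter

        def bigram_bits_per_char(tokens, text_len):
            """Empirical cost in bits per character under a self-trained bigram model."""
            pairs = Counter(zip(tokens, tokens[1:]))
            unigrams = Counter(tokens)
            total_bits = sum(-n * math.log2(n / unigrams[a]) for (a, b), n in pairs.items())
            return total_bits / text_len

        text = open("sample.txt", encoding="utf-8").read()   # any plain-text file
        chars = list(text)
        words = text.split()                                 # crude whitespace tokenizer

        print("char bigram:", bigram_bits_per_char(chars, len(text)), "bits/char")
        print("word bigram:", bigram_bits_per_char(words, len(text)), "bits/char")
        print("distinct word bigrams:", len(set(zip(words, words[1:]))))  # model growth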

    Development of Word-Based Text Compression Algorithm For Indonesian Language Document

    ABSTRACT: Information technology is growing very rapidly, particularly for data handling. Data is a valuable asset for everyone, especially for larger companies with branches in several places. Data transmission from headquarters to branch offices requires such companies to provide good tools for the task, including tools that compress data so as to reduce its size. The main idea of the word-based encoding is to extract each word of the source text and check whether it contains capital letters. The word is then checked for symbols or digits, and a stemming algorithm separates the affixes from the root word. Symbols, digits and affixes are assigned indices according to entries previously stored in the basic dictionary. The root word obtained after stemming is also looked up in the basic dictionary; if there is no match, the word becomes a new entry in the supplementary dictionary. The experiments were conducted on original text files with sizes between 10,000 bytes and 500,000 bytes, using a 16-bit code length. The results show that the compression ratio of the proposed method is comparable with the popular RAR application up to 200 kbyte, while its processing time is much better than the Reversed Sequence of Characters variant of the LZW algorithm.
    Keywords: Data Compression, WB-LZW, Word-Based, Stemming, Tree, Basic Dictionary, Main Dictionary
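
    To make the per-word pipeline above concrete, here is a minimal Python sketch of the described steps (capitalization flag, symbol/digit check, affix stripping, basic-dictionary lookup, supplementary-dictionary insertion). The toy stemmer, the affix list and both dictionaries are illustrative placeholders, not the WB-LZW implementation itself.

        # Sketch of the word-based encoding pipeline described in the abstract.
        # The stemmer, affix list and dictionaries below are illustrative stand-ins.
        import re

        BASIC_AFFIXES = {"nya", "kan", "an", "i"}        # toy Indonesian suffixes
        basic_dict = {"data": 0, "kompres": 1}           # pre-stored root words
        supplement_dict = {}                             # new roots discovered at run time

        def toy_stem(word):
            """Strip one known suffix, if any (real stemming is more involved)."""
            for suffix in sorted(BASIC_AFFIXES, key=len, reverse=True):
                if word.endswith(suffix) and len(word) > len(suffix) + 2:
                    return word[:-len(suffix)], suffix
            return word, None

        def encode_token(token):
            has_capital = any(c.isupper() for c in token)
            has_symbol_or_digit = bool(re.search(r"[^A-Za-z]", token))
            root, affix = toy_stem(token.lower())
            if root in basic_dict:
                index = ("basic", basic_dict[root])
            else:                                        # unseen root: new dictionary entry
                index = ("supplement", supplement_dict.setdefault(root, len(supplement_dict)))
            return {"capital": has_capital, "symbol": has_symbol_or_digit,
                    "affix": affix, "root": index}

        print(encode_token("Datanya"))   # capitalized, suffix "nya", root "data" in basic dict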

    Space-Efficient Re-Pair Compression

    Re-Pair is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $\sigma$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and $5n + 4\sigma^2 + 4d + \sqrt{n}$ words of working space on top of the text. In this work, we propose two algorithms improving on the space of their original solution. Our model assumes a memory word of $\lceil\log_2 n\rceil$ bits and a re-writable input text composed of $n$ such words. Our first algorithm runs in expected $\mathcal{O}(n/\epsilon)$ time and uses $(1+\epsilon)n + \sqrt{n}$ words of space on top of the text for any parameter $0 < \epsilon \leq 1$ chosen in advance. Our second algorithm runs in expected $\mathcal{O}(n\log n)$ time and improves the space to $n + \sqrt{n}$ words.
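
    For context, the sketch below shows the basic Re-Pair construction that the paper builds on: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal until no pair occurs twice. This is the naive quadratic version, for illustration only, not the space-efficient algorithms proposed in this work.

        # Naive Re-Pair grammar construction (quadratic, illustration only;
        # the paper's contribution is doing this in little extra working space).
        from collections import Counter

        def repair(text):
            seq = list(text)               # working sequence of terminals/nonterminals
            rules = {}                     # nonterminal id -> (left, right)
            next_id = 0
            while True:
                pairs = Counter(zip(seq, seq[1:]))
                if not pairs:
                    break
                (a, b), freq = pairs.most_common(1)[0]
                if freq < 2:               # stop when no pair repeats
                    break
                nt = ("R", next_id)
                rules[nt] = (a, b)
                next_id += 1
                out, i = [], 0
                while i < len(seq):        # replace non-overlapping occurrences of (a, b)
                    if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                        out.append(nt)
                        i += 2
                    else:
                        out.append(seq[i])
                        i += 1
                seq = out
            return seq, rules

        seq, rules = repair("abracadabra abracadabra")
        print(len(seq), "symbols,", len(rules), "rules")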

    Automatic Correction of Arabic Dyslexic Text

    This paper proposes an automatic correction system that detects and corrects dyslexic errors in Arabic text. The system uses a language model based on the Prediction by Partial Matching (PPM) text compression scheme to generate possible alternatives for each misspelled word. The candidate list is generated from edit operations (insertion, deletion, substitution and transposition), and the correct alternative for each misspelled word is chosen on the basis of the compression codelength of the trigram. The system is compared with widely used Arabic word-processing software and the Farasa tool, and it achieved good results, with a recall of 43%, a precision of 89%, an F1 score of 58% and an accuracy of 81%.
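
    A rough sketch of the selection idea: generate edit-distance-1 candidates for a misspelled word and keep the one with the smallest codelength. A simple add-one-smoothed character trigram model stands in for PPM here, and the training text, alphabet and lexicon are placeholders rather than the paper's resources.

        # Sketch of codelength-based candidate selection (a character trigram model
        # stands in for the PPM model used by the paper; names are illustrative).
        import math
        from collections import Counter

        ALPHABET = "abcdefghijklmnopqrstuvwxyz"

        def edit1_candidates(word):
            """All strings at edit distance 1: deletions, transpositions, substitutions, insertions."""
            splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes = [l + r[1:] for l, r in splits if r]
            transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
            substitutes = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
            inserts = [l + c + r for l, r in splits for c in ALPHABET]
            return set(deletes + transposes + substitutes + inserts)

        def codelength(text, trigram_counts, bigram_counts):
            """Bits needed to encode text under an add-one-smoothed character trigram model."""
            bits = 0.0
            padded = "  " + text
            for i in range(2, len(padded)):
                ctx, ch = padded[i - 2:i], padded[i]
                p = (trigram_counts[ctx + ch] + 1) / (bigram_counts[ctx] + len(ALPHABET) + 1)
                bits += -math.log2(p)
            return bits

        corpus = "correct text used to train the model goes here"   # placeholder training text
        tri = Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
        bi = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

        def correct(word, lexicon):
            candidates = [c for c in edit1_candidates(word) if c in lexicon] or [word]
            return min(candidates, key=lambda c: codelength(c, tri, bi))

        print(correct("taxt", {"text", "test", "taut"}))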

    Sequential Recurrent Neural Networks for Language Modeling

    Feedforward Neural Network (FNN)-based language models estimate the probability of the next word based on the history of the last N words, whereas Recurrent Neural Networks (RNN) perform the same task based only on the last word and some context information that cycles in the network. This paper presents a novel approach which bridges the gap between these two categories of networks. In particular, we propose an architecture which takes advantage of the explicit, sequential enumeration of the word history in the FNN structure while enhancing each word representation at the projection layer through recurrent context information that evolves in the network. The context integration is performed using an additional word-dependent weight matrix that is also learned during training. Extensive experiments conducted on the Penn Treebank (PTB) and the Large Text Compression Benchmark (LTCB) corpus showed a significant reduction of the perplexity when compared to state-of-the-art feedforward as well as recurrent neural network architectures. Comment: published (INTERSPEECH 2016), 5 pages, 3 figures, 4 tables
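
    A minimal numpy sketch of one forward step under one possible reading of this architecture: each history word's embedding is enhanced with a recurrent context vector through a weight matrix selected by that word, and the enhanced N-word history is then fed through an FNN-style hidden layer. All dimensions and the exact combination rule are assumptions made for illustration, not the paper's specification.

        # Illustrative forward pass: an FNN-style N-word history whose projections are
        # enhanced by a recurrent context vector via word-dependent weight matrices.
        # Dimensions and the combination rule are assumptions, not the paper's spec.
        import numpy as np

        V, D, H, N = 1000, 32, 64, 4                 # vocab, embedding, hidden, history size
        rng = np.random.default_rng(0)
        E = rng.normal(scale=0.1, size=(V, D))       # word embeddings
        Wc = rng.normal(scale=0.1, size=(V, D, D))   # word-dependent context matrices
        Wh = rng.normal(scale=0.1, size=(N * D, H))  # projection-to-hidden weights
        Wo = rng.normal(scale=0.1, size=(H, V))      # hidden-to-output weights
        Wr = rng.normal(scale=0.1, size=(D, D))      # recurrence on the context vector

        def forward(history, context):
            """history: last N word ids (oldest first); context: running context vector."""
            enhanced = [E[w] + Wc[w] @ context for w in history]   # per-word enhancement
            x = np.concatenate(enhanced)             # explicit, ordered history as in an FNN
            h = np.tanh(x @ Wh)
            logits = h @ Wo
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                     # softmax over the next word
            new_context = np.tanh(Wr @ context + E[history[-1]])   # context evolves recurrently
            return probs, new_context

        context = np.zeros(D)
        probs, context = forward([3, 17, 42, 7], context)
        print(probs.shape, float(probs.sum()))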

    Universal Compressed Text Indexing

    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon > 0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comment
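
    To make the string attractor definition concrete, the brute-force checker below verifies that every distinct substring of a text has at least one occurrence covering a position of the candidate set. It is far from efficient and only meant to illustrate the property that the proposed index relies on; the example set of positions is one valid attractor for the toy string.

        # Brute-force check of the string attractor property: every distinct substring
        # must have at least one occurrence that contains a position of gamma.
        # Intuition only; nothing like the compressed index in the paper.
        def is_string_attractor(text, gamma):
            n = len(text)
            gamma = set(gamma)
            for length in range(1, n + 1):
                for sub in {text[i:i + length] for i in range(n - length + 1)}:
                    covered = False
                    start = text.find(sub)
                    while start != -1:
                        if any(start <= p < start + length for p in gamma):
                            covered = True
                            break
                        start = text.find(sub, start + 1)
                    if not covered:
                        return False
            return True

        # 0-indexed positions of 'a', 'r', 'c', 'd', 'b' in "abracadabra"; prints True.
        print(is_string_attractor("abracadabra", {0, 2, 4, 6, 8}))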

    Arithmetic coding revisited

    Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available
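
    As a refresher on the core idea (the interval-narrowing view of arithmetic coding, not the shift/add, low-precision implementation the article describes), the sketch below encodes and decodes a short message with a static model using exact fractions, which sidesteps the precision and carry issues that practical coders must handle.

        # Core interval-narrowing idea of arithmetic coding, using exact fractions
        # to stay correct without the renormalization machinery the article covers.
        from fractions import Fraction

        def cumulative(probs):
            """Map each symbol to its (low, high) slice of [0, 1)."""
            cum, lo = {}, Fraction(0)
            for sym, p in probs.items():
                cum[sym] = (lo, lo + p)
                lo += p
            return cum

        def encode(message, probs):
            """Return a single fraction identifying the message under a static model."""
            cum = cumulative(probs)
            low, width = Fraction(0), Fraction(1)
            for sym in message:
                s_lo, s_hi = cum[sym]
                low = low + width * s_lo        # narrow the interval to the symbol's slice
                width = width * (s_hi - s_lo)
            return low + width / 2              # any number inside the final interval works

        def decode(code, probs, length):
            cum = cumulative(probs)
            out = []
            for _ in range(length):
                for sym, (s_lo, s_hi) in cum.items():
                    if s_lo <= code < s_hi:
                        out.append(sym)
                        code = (code - s_lo) / (s_hi - s_lo)   # rescale and continue
                        break
            return "".join(out)

        model = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
        code = encode("abac", model)
        print(code, decode(code, model, 4))      # recovers "abac"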