Better text compression from fewer lexical n-grams
Word-based context models for text compression have the capacity to outperform simpler character-based models, but are generally unattractive because of inherent problems with exponential model growth and corresponding data sparseness. These ill effects can be mitigated in an adaptive lossless compression scheme by modelling syntactic and semantic lexical dependencies independently.
Development of Word-Based Text Compression Algorithm For Indonesian Language Document
ABSTRACT: Information technology is growing very rapidly, in particular for data handling. Data is a valuable asset for everyone, especially for large companies with branches in several places; transmitting data from headquarters to branch offices requires good tools, including tools that compress the data to reduce its size. The main idea of word-based encoding is to extract each word of the source text, check whether it contains capital letters, and then check whether it contains a symbol or number. Affixes are separated from the base word using a stemming algorithm. Symbols, numbers, and affixes are assigned indices previously stored in the basic dictionary, and the base word obtained from stemming is likewise checked against the basic dictionary; if there is no match, the word is stored as a new entry in the supplementary dictionary. The experiments were conducted on text files of about 10,000 bytes up to 500,000 bytes, with a code length of 16 bits. The results show that the compression ratio of the proposed method is comparable with the popular RAR application up to 200 kB, while its processing time is much better than that of the Reversed Sequence of Characters variant of the LZW algorithm. Keywords: Data Compression, WB-LZW, Word-Based, Stemming, Tree, Basic Dictionary, Main Dictionary
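As an illustration of the word-based dictionary idea described above, here is a minimal Python sketch of the encoding loop (tokenize, record a capitalization flag, look words up in a basic dictionary, and add unknown words to a growing supplementary dictionary). The stemming step, which the paper performs with an Indonesian stemmer, is omitted, and all names are illustrative rather than taken from the paper:

```python
import re

def encode_words(text, basic_dict):
    """Word-based encoding sketch: each token is looked up in the basic
    dictionary; unknown words become entries of a growing supplementary
    dictionary. A capitalization flag is kept so case can be restored.
    (The paper additionally strips affixes with a stemmer; omitted here.)"""
    supplement, codes = {}, []
    for token in re.findall(r"\w+|\W", text):
        capitalized = token[:1].isupper()
        word = token.lower()
        if word in basic_dict:
            codes.append(("B", basic_dict[word], capitalized))
        else:
            if word not in supplement:
                supplement[word] = len(supplement)  # new dictionary entry
            codes.append(("S", supplement[word], capitalized))
    return codes, supplement
```

Each emitted code would then be packed into the fixed-length 16-bit codewords used in the experiments.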
Space-Efficient Re-Pair Compression
Re-Pair is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $\sigma$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and $5n + 4\sigma^2 + 4d + \sqrt{n}$ words of working space on top of the text. In this work, we propose two algorithms improving on the space of their original solution. Our model assumes a memory word of $\lceil \log_2 n \rceil$ bits and a re-writable input text composed of such words. Our first algorithm runs in expected $O(n/\epsilon)$ time and uses $(1+\epsilon)n + \sqrt{n}$ words of space on top of the text for any parameter $0 < \epsilon \leq 1$ chosen in advance. Our second algorithm runs in expected $O(n \log n)$ time and improves the space to $\sqrt{n}$ words.
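The heart of Re-Pair, independent of the space-efficient machinery above, is the replacement loop: repeatedly substitute a fresh nonterminal for the most frequent adjacent pair of symbols. A naive Python sketch of that loop (quadratic-time and purely illustrative; the paper's contribution is achieving the time and space bounds above, not this loop itself):

```python
from collections import Counter

def repair(seq):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair
    with a fresh nonterminal. Returns the reduced sequence and the grammar.
    Quadratic-time illustration, not the paper's algorithms."""
    rules = {}                                   # nonterminal -> (left, right)
    next_sym = max(seq) + 1 if seq else 0        # assumes integer symbols
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:                             # no pair repeats: done
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):                      # left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

text = list(b"abracadabra abracadabra")
compressed, grammar = repair(text)
print(len(text), "->", len(compressed), "symbols,", len(grammar), "rules")
```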
Automatic Correction of Arabic Dyslexic Text
This paper proposes an automatic correction system that detects and corrects dyslexic errors in Arabic text. The system uses a language model based on the Prediction by Partial Matching (PPM) text compression scheme to generate possible alternatives for each misspelled word. The candidate list is generated from edit operations (insertion, deletion, substitution and transposition), and the correct alternative for each misspelled word is chosen on the basis of the compression codelength of the trigram. The system is compared with widely used Arabic word-processing software and the Farasa tool. It achieved good results compared with the other tools, with a recall of 43%, precision of 89%, an F1 score of 58%, and accuracy of 81%.
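The edit-operation candidate generation described above is straightforward to sketch; the PPM trigram codelength that ranks the candidates is the paper's own component, so it appears below only as a placeholder function, and the Latin alphabet stands in for Arabic letters:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All candidates at edit distance 1 via the four operations used in
    the paper: deletion, transposition, substitution, insertion.
    (Latin alphabet here for brevity; the paper works on Arabic letters.)"""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def correct(word, context, codelength):
    """Choose the candidate whose trigram (two context words + candidate)
    has the smallest compression codelength. `codelength` is a placeholder
    for the paper's PPM model, which is not reimplemented here."""
    return min(edits1(word), key=lambda c: codelength(context + (c,)))
```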
Sequential Recurrent Neural Networks for Language Modeling
Feedforward Neural Network (FNN)-based language models estimate the
probability of the next word based on the history of the last N words, whereas
Recurrent Neural Networks (RNN) perform the same task based only on the last
word and some context information that cycles in the network. This paper
presents a novel approach, which bridges the gap between these two categories
of networks. In particular, we propose an architecture which takes advantage of
the explicit, sequential enumeration of the word history in FNN structure while
enhancing each word representation at the projection layer through recurrent
context information that evolves in the network. The context integration is
performed using an additional word-dependent weight matrix that is also learned
during the training. Extensive experiments conducted on the Penn Treebank (PTB)
and the Large Text Compression Benchmark (LTCB) corpus showed a significant
reduction of the perplexity when compared to state-of-the-art feedforward as
well as recurrent neural network architectures. Comment: published at INTERSPEECH 2016; 5 pages, 3 figures, 4 tables.
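The paper's exact parameterization is not reproduced here, but the core idea, concatenating the last N word embeddings FNN-style while enhancing each with an evolving context vector via an additional learned matrix, can be sketched in a few lines of numpy. Position-dependent matrices stand in for the paper's word-dependent ones, and all dimensions are illustrative:

```python
import numpy as np

V, D, N, H = 10000, 64, 4, 128             # vocab, embedding, history, hidden (illustrative)
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, D))    # projection-layer word embeddings
Wc = rng.normal(scale=0.1, size=(N, D, D)) # learned context matrices (position-dependent here)
U  = rng.normal(scale=0.1, size=(N * D, H))
O  = rng.normal(scale=0.1, size=(H, V))

def forward(history, context):
    """One step: each of the last N embeddings is enhanced with the
    recurrent context vector before the usual FNN concatenation; the
    context then evolves, RNN-style, across time steps."""
    enhanced = [np.tanh(E[w] + Wc[i] @ context) for i, w in enumerate(history)]
    h = np.tanh(np.concatenate(enhanced) @ U)
    logits = h @ O
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, enhanced[-1]             # carry the newest representation forward

probs, ctx = forward([12, 7, 99, 3], np.zeros(D))
```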
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma \log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m \log n + occ \log^{\epsilon} n)$ time, for any constant $\epsilon > 0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries. Comment: fixed following the reviewers' comments.
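The string-attractor property the index rests on is easy to state and to check by brute force: every distinct substring must have at least one occurrence that spans an attractor position. A naive Python checker (far slower than the structures in the paper, but useful for pinning down the definition):

```python
def is_string_attractor(text, positions):
    """Brute-force check of the definition: every distinct substring of
    `text` must have an occurrence text[i:j] spanning some attractor
    position p with i <= p < j. For illustrating the definition only."""
    seen = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            start, ok = text.find(sub), False
            while start != -1 and not ok:
                ok = any(start <= p < start + len(sub) for p in positions)
                start = text.find(sub, start + 1)
            if not ok:
                return False
    return True

print(is_string_attractor("abaab", {1, 2, 4}))   # True for this small example
```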
Arithmetic coding revisited
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed,
low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, a greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available.
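The interval-narrowing mechanism at the core of arithmetic coding can be demonstrated with a short floating-point toy; this is a sketch with a static model, whereas the article's implementation uses integer shift/add arithmetic and adaptive probability estimation precisely because floating point fails on longer messages:

```python
def build_intervals(freqs):
    """Assign each symbol a subinterval of [0, 1) with width equal to its
    probability (static model; an adaptive coder re-estimates counts)."""
    total, lo, table = sum(freqs.values()), 0.0, {}
    for s, f in sorted(freqs.items()):
        table[s] = (lo, lo + f / total)
        lo += f / total
    return table

def encode(message, table):
    """Narrow [low, high) by each symbol's subinterval; any number inside
    the final interval identifies the whole message."""
    low, high = 0.0, 1.0
    for s in message:
        a, b = table[s]
        low, high = low + (high - low) * a, low + (high - low) * b
    return (low + high) / 2

def decode(code, table, length):
    out = []
    for _ in range(length):
        for s, (a, b) in table.items():
            if a <= code < b:
                out.append(s)
                code = (code - a) / (b - a)   # rescale back to [0, 1)
                break
    return "".join(out)

table = build_intervals({"a": 3, "b": 1, "c": 1})
assert decode(encode("abacab", table), table, 6) == "abacab"
```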