2,348 research outputs found
Bidirectional Text Compression in External Memory
Bidirectional compression algorithms work by replacing repeated substrings with references that, unlike in the well-known LZ77 scheme, can point in either direction. We present such an algorithm that is particularly suited to an external memory implementation. We evaluate it experimentally on large data sets of up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme.
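Because a bidirectional factor may copy from a position that has not been produced yet, decompression cannot proceed in a single left-to-right pass; one way to resolve forward references is to iterate to a fixpoint. The sketch below illustrates this idea with an invented toy factor format (literals and copies), not the paper's external-memory representation:

```python
def decompress(factors, n):
    """Resolve a bidirectional factorization of a length-n text.

    factors: ('lit', pos, char) literals and ('copy', dest, src, length)
    copies whose source may lie before *or after* the destination.
    """
    out = [None] * n
    for f in factors:
        if f[0] == 'lit':
            out[f[1]] = f[2]
    changed = True
    while changed:                      # fixpoint: forward refs need >1 pass
        changed = False
        for f in factors:
            if f[0] == 'copy':
                _, dest, src, length = f
                for i in range(length):
                    if out[dest + i] is None and out[src + i] is not None:
                        out[dest + i] = out[src + i]
                        changed = True
    return ''.join(out)

# "an" at position 0 is a *forward* reference to "an" at position 2
factors = [('copy', 0, 2, 2), ('lit', 2, 'a'), ('lit', 3, 'n'),
           ('lit', 4, 'a'), ('lit', 5, 's')]
```

A character-wise fixpoint is quadratic in the worst case; it is only meant to show why bidirectional schemes need a different decompression strategy than LZ77's strictly backward copies.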
Better text compression from fewer lexical n-grams
Word-based context models for text compression can outperform simpler character-based models, but are generally unattractive because of inherent problems with exponential model growth and the corresponding data sparseness. These ill effects can be mitigated in an adaptive lossless compression scheme by modelling syntactic and semantic lexical dependencies independently.
Text Compression Using Antidictionaries
We give a new text compression scheme based on forbidden words ("antidictionary"). We prove that our algorithms attain the entropy for balanced binary sources, and that they run in linear time. Moreover, one of the main advantages of this approach is that it produces very fast decompressors. A second advantage is a synchronization property that is helpful for searching compressed data and allows parallel compression. Our algorithms can also be presented as "compilers" that create compressors dedicated to any previously fixed source. The techniques used in this paper are from information theory and finite automata.
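The forbidden-word idea can be sketched as follows: if the current suffix u of the text is such that ub is in the antidictionary, the next bit cannot be b, so it is predictable and the compressor omits it. The minimal sketch below (function names and the parameter passing of the antidictionary and text length are illustrative; the paper's actual encoder must also transmit these) shows both directions:

```python
def compress(text, antidict, k):
    """Drop every bit of a binary string that a forbidden word predicts.

    antidict: set of forbidden binary words of length at most k.
    """
    out = []
    for i, bit in enumerate(text):
        forced = None
        for j in range(max(0, i - k + 1), i + 1):   # suffixes of text[:i]
            suffix = text[j:i]
            if suffix + '0' in antidict:
                forced = '1'                         # next bit cannot be 0
            elif suffix + '1' in antidict:
                forced = '0'                         # next bit cannot be 1
        if forced is None:
            out.append(bit)                          # unpredictable: keep it
    return ''.join(out)

def decompress(code, antidict, n, k):
    """Rebuild the length-n text: predictable bits are re-derived, the
    rest are consumed from the compressed stream."""
    out = ''
    it = iter(code)
    while len(out) < n:
        forced = None
        for j in range(max(0, len(out) - k + 1), len(out) + 1):
            suffix = out[j:]
            if suffix + '0' in antidict:
                forced = '1'
            elif suffix + '1' in antidict:
                forced = '0'
        out += forced if forced is not None else next(it)
    return out
```

With the antidictionary {'11'} (no two consecutive ones), every bit following a 1 is forced to 0 and disappears from the compressed stream; the decompressor re-derives it, which is also what makes decompression so fast.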
n-Gram-based text compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significantly better compression ratio than state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to 5-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, based on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from some Vietnamese news agencies to build n-gram dictionaries from unigrams to 5-grams, yielding dictionaries with a size of 12 GB in total. To evaluate our method, we collected a testing set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
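The sliding-window encoding phase can be illustrated with a toy greedy encoder: at each position, try the longest n-gram first and fall back to shorter ones. Everything below is an invented illustration; the escape-byte fallback and the fixed 2-byte codes are assumptions (the paper assigns two to four bytes per n-gram depending on its dictionary):

```python
import struct

def build_dict(ngrams):
    # toy: assign each n-gram a fixed 2-byte code
    # (the real scheme uses 2-4 bytes depending on the dictionary)
    return {g: struct.pack('>H', i) for i, g in enumerate(ngrams)}

def encode(words, ngram_dict, max_n=5):
    out, i = bytearray(), 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 1, -1):  # longest first
            gram = ' '.join(words[i:i + n])
            if gram in ngram_dict:
                out += ngram_dict[gram]
                i += n
                break
        else:
            # no dictionary hit: emit an escape byte + raw word (toy fallback)
            w = words[i].encode()
            out += b'\xff' + bytes([len(w)]) + w
            i += 1
    return bytes(out)
```

For example, with a dictionary containing the bigrams "the quick" and "brown fox", the four-word phrase "the quick brown fox" is encoded in just two 2-byte codes.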
Semantic Text Compression for Classification
We study semantic compression for text, where meanings contained in the text are conveyed to a source decoder, e.g., for classification. The main motivation for recovering the meaning without requiring exact reconstruction is the potential resource savings, both in storage and in conveying the information to another node. Towards this end, we propose semantic quantization and compression approaches for text in which we utilize sentence embeddings and a semantic distortion metric to preserve the meaning. Our results demonstrate that the proposed semantic approaches yield substantial (orders of magnitude) savings in the number of bits required for message representation, at the expense of very modest accuracy loss compared to a semantics-agnostic baseline. Comparing the proposed approaches, we observe that the resource savings enabled by semantic quantization can be further amplified by semantic clustering. Importantly, we observe that the proposed methodology generalizes well, producing excellent results on many benchmark text classification datasets with a diverse array of contexts. (Appeared in the IEEE ICC 2023 2nd International Workshop on Semantic Communication.)
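The quantization idea above can be illustrated with a toy nearest-centroid codebook: instead of transmitting a full floating-point embedding, the sender transmits only the index of the closest centroid. The codebook, dimensions, and values below are invented for illustration; a real system would use learned sentence embeddings (often hundreds of dimensions) and a codebook trained to minimize semantic distortion:

```python
import numpy as np

def quantize(emb, codebook):
    # map an embedding to the index of its nearest codebook centroid
    return int(np.argmin(np.linalg.norm(codebook - emb, axis=1)))

# toy codebook of k=4 "semantic" centroids in a 3-d embedding space
codebook = np.array([[1., 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])

emb = np.array([0.9, 0.1, 0.0])    # embedding of an incoming sentence
idx = quantize(emb, codebook)      # index of the nearest centroid

# bits per message: log2(k) for the index vs. 32 bits per float coordinate
bits_sent = int(np.ceil(np.log2(len(codebook))))
bits_raw = emb.shape[0] * 32
```

Here 2 bits replace 96, which is the "orders of magnitude" trade-off the abstract describes: the decoder can classify from the centroid but cannot reconstruct the exact sentence.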
Towards the text compression based feature extraction in high impedance fault detection
High impedance faults of medium voltage overhead lines with covered conductors can be identified by the presence of partial discharges. Although it has been a subject of research for more than 60 years, online partial discharge detection remains a challenge, especially in environments with heavy background noise. In this paper, a new approach to partial discharge pattern recognition is presented. All results were obtained on data acquired from a real 22 kV medium voltage overhead power line with covered conductors. The proposed method is based on a text compression algorithm and serves as a signal similarity estimator, applied for the first time to partial discharge patterns. Its relevance is examined with three different variants of a classification model. The improvement gained on an already deployed model proves its quality.
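The abstract does not name the exact similarity estimator; a standard compression-based similarity of the kind it describes is the normalized compression distance (NCD), sketched here with zlib as the compressor. Both the choice of NCD and of zlib are assumptions for illustration, not the paper's method:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for very similar inputs,
    near 1 for unrelated ones. Similar signals share structure, so their
    concatenation compresses almost as well as either one alone."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

sig_a = b"0101010101010101" * 20       # toy periodic "signal"
sig_b = b"0101010101010101" * 20       # same pattern: low distance
noise = bytes(range(256)) * 2          # unrelated content: high distance
```

For fault detection, such a distance can feed a nearest-neighbour or threshold classifier that compares a measured signal window against labelled partial discharge patterns.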
- …