    Block-based Classification Method for Computer Screen Image Compression

    In this paper, a high-accuracy, reduced-processing-time, block-based classification method for computer screen images is presented. This method classifies blocks into five types: smooth, sparse, fuzzy, text and picture blocks. In a computer screen compression application, the choice of block compression algorithm is made based on these block types. The classification method presented has four novel features. The first is a combination of Discrete Wavelet Transform (DWT) and colour counting classification methods. Previous publications have used each of these methods in isolation for computer image compression, but this paper shows that combining them gives more accurate results overall. The second is the classification of the image blocks into five block types; the addition of the fuzzy and sparse block types makes it possible to use optimum compression methods for these blocks. The third is block type prediction. The prediction algorithm is applied to a current block when the blocks above and to the left of the current block are text blocks or smooth blocks. This new algorithm is designed to exploit the correlation of adjacent blocks and reduces the overall classification processing time by 33%. The fourth is down-sampling of the pixels in each block, which reduces the classification processing time by 62%. When both block prediction and down-sampling are enabled, the classification time is reduced by 74% overall. The overall classification accuracy is 98.46%.
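    A minimal sketch of the neighbour-based prediction idea, assuming the rule is simply to reuse the neighbours' type when the blocks above and to the left agree and are text or smooth blocks; the exact rule and the full five-way classifier are not spelled out in the abstract, so both are placeholders here:

    def classify_block(block):
        # Placeholder for the full five-way classifier (smooth, sparse,
        # fuzzy, text, picture) built on DWT and colour counting.
        raise NotImplementedError

    def predict_or_classify(block, top_type, left_type):
        # Prediction is attempted only for the block types named in the
        # abstract; otherwise fall back to the full classifier.
        if top_type == left_type and top_type in ("text", "smooth"):
            return top_type               # predicted from neighbours, classifier skipped
        return classify_block(block)      # full classification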

    Textual Case-based Reasoning for Spam Filtering: a Comparison of Feature-based and Feature-free Approaches

    Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
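    A minimal sketch of a feature-free, compression-based distance of this kind, using the standard Normalized Compression Distance with gzip standing in for the Lempel-Ziv compressor; the paper's exact distance measure and case retrieval scheme may differ in detail:

    import gzip

    def compressed_size(data: bytes) -> int:
        # Compressed size in bytes under a Lempel-Ziv compressor (gzip).
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized Compression Distance between two texts.
        cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def classify_email(email: bytes, case_base: list[tuple[bytes, str]]) -> str:
        # 1-NN over the case base: return the label of the nearest case.
        nearest_text, nearest_label = min(case_base, key=lambda case: ncd(email, case[0]))
        return nearest_label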

    Application of compression-based distance measures to protein sequence classification: a methodological study

    Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith–Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaeal, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith–Waterman algorithm and two hidden Markov model-based algorithms. Contact: [email protected]
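    A hedged illustration of the combined-measure idea, in which a compression-based distance and a BLAST bit score are merged into a single dissimilarity for nearest-neighbour classification; the squashing of the BLAST score and the weighting below are assumptions made for illustration, not the combination used in the paper:

    def combined_distance(cbm: float, blast_bitscore: float, alpha: float = 0.5) -> float:
        # Turn the BLAST bit score into a dissimilarity in (0, 1] with a simple
        # squashing function (an assumption), then mix it with the CBM value.
        blast_dissim = 1.0 / (1.0 + blast_bitscore)
        return alpha * cbm + (1.0 - alpha) * blast_dissim

    def nearest_neighbour_label(query_scores, training_labels):
        # query_scores: (cbm, blast_bitscore) pairs of the query against each
        # training sequence; training_labels: the corresponding class labels.
        distances = [combined_distance(c, b) for c, b in query_scores]
        return training_labels[distances.index(min(distances))]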

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which rely in a crucial way on data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques can be used to introduce the notion of the dictionary of a given sequence and of an Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpora of character strings independently of the type of coding behind them. As a case study we consider linguistically motivated problems and present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures
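    A minimal sketch of the compression-based language recognition idea, under the assumption that the relative information content of a fragment with respect to a corpus can be approximated by the extra compressed size the fragment adds once the compressor has adapted to that corpus; zlib stands in here for the dictionary-based construction described in the paper:

    import zlib

    def extra_bytes(corpus: bytes, fragment: bytes) -> int:
        # Additional compressed size needed for `fragment` once the compressor
        # has adapted to `corpus`: a crude proxy for relative information content.
        return len(zlib.compress(corpus + fragment)) - len(zlib.compress(corpus))

    def recognise_language(fragment: bytes, corpora: dict[str, bytes]) -> str:
        # Attribute the fragment to the language whose corpus "explains" it best,
        # i.e. whose reference text yields the smallest additional compressed size.
        return min(corpora, key=lambda lang: extra_bytes(corpora[lang], fragment))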

    Towards the text compression based feature extraction in high impedance fault detection

    High impedance faults of medium voltage overhead lines with covered conductors can be identified by the presence of partial discharges. Although this has been a subject of research for more than 60 years, online partial discharge detection remains a challenge, especially in environments with heavy background noise. In this paper, a new approach for partial discharge pattern recognition is presented. All results were obtained on data acquired from a real 22 kV medium voltage overhead power line with covered conductors. The proposed method is based on a text compression algorithm and serves as a signal similarity estimate, applied for the first time to partial discharge patterns. Its relevance is examined with three different variations of the classification model. The improvement gained over an already deployed model demonstrates its quality.
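    A hedged sketch of how a text compression algorithm might be applied to partial discharge signals, assuming each waveform window is first quantised into a small symbol alphabet so that a compression-based similarity (such as the normalized compression distance sketched earlier in this listing) can be computed between windows; the binning scheme and alphabet size are illustrative assumptions, not the paper's preprocessing:

    import gzip
    import numpy as np

    def signal_to_symbols(signal: np.ndarray, levels: int = 16) -> bytes:
        # Quantise a waveform into `levels` amplitude bins so that a text
        # compressor can operate on it as a byte string.
        edges = np.linspace(signal.min(), signal.max(), levels + 1)[1:-1]
        return bytes(np.digitize(signal, edges).astype(np.uint8))

    def compression_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Normalized compression distance between two quantised signal windows.
        x, y = signal_to_symbols(a), signal_to_symbols(b)
        cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
        return (len(gzip.compress(x + y)) - min(cx, cy)) / max(cx, cy)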

    Learning Low-Rank Representations for Model Compression

    Vector Quantization (VQ) is an appealing model compression method for obtaining a tiny model with little accuracy loss. While methods to obtain better codebooks and codes under a fixed clustering dimensionality have been extensively studied, optimization of the vectors themselves in favour of clustering performance has not been carefully considered, especially via the reduction of vector dimensionality. This paper reports our recent progress on the combination of dimensionality compression and vector quantization, proposing a Low-Rank Representation Vector Quantization (LR²VQ) method that outperforms previous VQ algorithms in various tasks and architectures. LR²VQ joins low-rank representation with subvector clustering to construct a new kind of building block that is directly optimized through end-to-end training over the task loss. Our proposed design pattern introduces three hyper-parameters: the number of clusters k, the size of subvectors m and the clustering dimensionality d̃. In our method, the compression ratio is directly controlled by m, and the final accuracy is solely determined by d̃. We recognize d̃ as a trade-off between low-rank approximation error and clustering error and carry out both theoretical analysis and experimental observations that enable estimation of a proper d̃ before fine-tuning. With a proper d̃, we evaluate LR²VQ with ResNet-18/ResNet-50 on the ImageNet classification dataset, achieving 2.8%/1.0% top-1 accuracy improvements over the current state-of-the-art VQ-based compression algorithms at 43×/31× compression factors.
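    A minimal numpy/scikit-learn sketch of the underlying idea, assuming a truncated SVD as a stand-in for the learned low-rank projection and plain k-means for the subvector clustering; the paper instead optimises the projection end-to-end through the task loss, so the names and structure below are purely illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    def lowrank_subvector_quantize(W: np.ndarray, d_tilde: int, m: int, k: int):
        # W: (n, d) matrix of weight vectors to be compressed.
        assert d_tilde % m == 0, "clustering dimensionality must split into subvectors of size m"
        # Low-rank projection to d_tilde dimensions (stand-in for the learned projection).
        U, S, _ = np.linalg.svd(W, full_matrices=False)
        Z = U[:, :d_tilde] * S[:d_tilde]                   # (n, d_tilde) low-rank representation
        codebooks, codes = [], []
        for g in range(d_tilde // m):
            sub = Z[:, g * m:(g + 1) * m]                  # subvectors of size m
            km = KMeans(n_clusters=k, n_init=10).fit(sub)
            codebooks.append(km.cluster_centers_)          # (k, m) codebook for this group
            codes.append(km.labels_)                       # (n,) code index per vector
        return codebooks, codes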