3 research outputs found

    Reducing the loss of information through annealing text distortion

    Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F. "Reducing the Loss of Information through Annealing Text Distortion". IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.

    Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the nondistorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent using different data sets, and different compression algorithms belonging to the most important compression families: Lempel-Ziv, Statistical and Block-Sorting.

    This work was supported by the Spanish Ministry of Education and Science under the TIN2010-19872 and TIN2010-19607 projects.
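
    The progressive word distortion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice to mask words with asterisks (rather than delete them), the ranking by frequency, and the use of zlib as the compressor are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")

def distort_most_frequent(text: str, fraction: float) -> str:
    """Mask the given fraction of distinct words, most frequent first.

    Assumption: a "removed" word is replaced by asterisks of the same
    length, so the layout of the text is preserved.
    """
    counts = Counter(w.lower() for w in WORD.findall(text))
    ranked = [w for w, _ in counts.most_common()]
    targets = set(ranked[: int(len(ranked) * fraction)])

    def mask(match: re.Match) -> str:
        word = match.group(0)
        return "*" * len(word) if word.lower() in targets else word

    return WORD.sub(mask, text)

def compressed_size(text: str) -> int:
    """Compressed length in bytes, a rough stand-in for complexity."""
    return len(zlib.compress(text.encode("utf-8"), 9))

if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    # Hypothetical schedule: distort a little more at each step.
    for fraction in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        distorted = distort_most_frequent(sample, fraction)
        print(fraction, compressed_size(distorted), distorted)
```

    On a real corpus, the compressed size at each step gives a rough view of how quickly the schedule reduces the complexity of a document.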

    Evaluating the impact of information distortion on normalized compression distance

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-540-87448-5_8

    Proceedings of Second International Castle Meeting, ICMCTA 2008, Castillo de la Mota, Medina del Campo, Spain, September 15-19, 2008.

    In this paper we apply different techniques of information distortion to a set of classical books written in English. We study the impact that these distortions have upon the Kolmogorov complexity and the clustering by compression technique (the latter based on Normalized Compression Distance, NCD). We show how to decrease the complexity of the considered books by introducing several modifications in them. We measure how well the information contained in each book is maintained using a clustering error measure. We find experimentally that the best way to keep the clustering error from growing is to modify the most frequent words. We explain the details of these information distortions and compare them with other kinds of modifications, such as random word distortions and infrequent word distortions. Finally, we present some phenomenological explanations of the empirical results obtained.

    This work was supported by TIN 2004-04363-CO03-03, TIN 2007-65989, CAM S-SEM-0255-2006, TIN2007-64718 and TSI 2005-08255-C07-06. We would also like to thank Francisco Sánchez for his useful comments on this draft.
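
    For readers unfamiliar with NCD, the following is a minimal sketch of the standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the function names are assumptions for illustration; the abstract does not say which compressor was used.

```python
import zlib

def compressed_len(data: bytes) -> int:
    """C(x): length in bytes of the compressed input (zlib is an assumption)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = b"colorless green ideas sleep furiously tonight " * 50
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))
```

    Values close to 0 indicate that the two inputs share most of their information; values close to 1 indicate little shared structure. Real compressors only approximate Kolmogorov complexity, so NCD(x, x) is typically small but not exactly 0.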

    Automatizing chromatic quality assessment for cultural heritage image digitization

    In the context of digitization of photographs and other documents with graphical value, cultural heritage organizations need to guarantee that the stored digital image is a faithful representation of the physical image at both the physical level and the perceptual level. On the physical level, image quality can be measured objectively in a simple way by applying certain physical attributes to the image, as well as by measuring how distorting images affects the performance of the attributes. However, on the perceptual level, image quality should correspond to the perception that a human expert would experience when observing the physical image under specific, controlled conditions. In this paper we address the problem of image quality assessment (IQA) in the context of cultural heritage digitization by applying machine learning (ML). In particular, we explore the possibility of creating a decision tree that mimics the response of a cultural heritage expert when observing cultural heritage images.