Reducing the loss of information through annealing text distortion
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Granados, A.; Cebrian, M.; Camacho, D.; de Borja Rodriguez, F. "Reducing the Loss of Information through Annealing Text Distortion". IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011.
Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the nondistorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent using different data sets, and different compression algorithms belonging to the most important compression families: Lempel-Ziv, Statistical and Block-Sorting.
This work was supported by the Spanish Ministry of Education and Science under TIN2010-19872 and TIN2010-19607 projects.
Analysis and study on text representation to improve the accuracy of the Normalized Compression Distance
The huge amount of information stored in text form makes methods that deal
with texts really interesting. This thesis focuses on dealing with texts using
compression distances. More specifically, the thesis takes a small step towards
understanding both the nature of texts and the nature of compression distances.
Broadly speaking, it does so by exploring the effects that
several distortion techniques have on one of the most successful distances in
the family of compression distances, the Normalized Compression Distance (NCD).
Comment: PhD Thesis; 202 pages
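For readers unfamiliar with the NCD, the following is a minimal sketch of its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes. It is not code from the thesis. The function names `compressed_size` and `ncd` are illustrative, and the choice of `zlib` as the default compressor is an assumption made for convenience.

```python
import bz2
import zlib


def compressed_size(data: bytes, compressor=zlib.compress) -> int:
    """Return the length in bytes of the compressed representation of data."""
    return len(compressor(data))


def ncd(x: bytes, y: bytes, compressor=zlib.compress) -> float:
    """Normalized Compression Distance between two byte strings.

    The compressor argument is any callable mapping bytes to bytes,
    for example zlib.compress (Lempel-Ziv family) or bz2.compress
    (block-sorting family). Both are stand-ins chosen for this sketch.
    """
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Values near 0 suggest that the two inputs share most of their information, and values near 1 suggest that they share little. Because real compressors only approximate Kolmogorov complexity, the result can fall slightly outside the interval [0, 1]. Passing `bz2.compress` as the third argument swaps in a block-sorting compressor.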
Evaluating the impact of information distortion on normalized compression distance
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-540-87448-5_8
Proceedings of Second International Castle Meeting, ICMCTA 2008, Castillo de la Mota, Medina del Campo, Spain, September 15-19, 2008.
In this paper we apply different techniques of information distortion on a set of classical books written in English. We study the impact that these distortions have upon the Kolmogorov complexity and the clustering by compression technique (the latter based on Normalized Compression Distance, NCD). We show how to decrease the complexity of the considered books by introducing several modifications in them. We measure how the information contained in each book is maintained using a clustering error measure. We find experimentally that the best way to keep the clustering error is by means of modifications in the most frequent words. We explain the details of these information distortions and we compare them with other kinds of modifications, such as random word distortions and infrequent word distortions. Finally, we present some phenomenological explanations of the empirical results.
This work was supported by TIN 2004-04363-CO03-03, TIN 2007-65989, CAM S-SEM-0255-2006, TIN2007-64718 and TSI 2005-08255-C07-06. We would also like to thank Franscico Sánchez for his useful comments on this draft.
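The idea of modifying the most frequent words can be illustrated with a small sketch. It is only an approximation of the kind of distortion the abstract describes. Replacing each character of a selected word with an asterisk, the simple word pattern, and the name `distort_most_frequent` are all assumptions made for this example, not details taken from the paper.

```python
import re
from collections import Counter

# Assumed tokenization: runs of ASCII letters and apostrophes count as words.
WORD = re.compile(r"[A-Za-z']+")


def distort_most_frequent(text: str, n_words: int, mask: str = "*") -> str:
    """Mask every occurrence of the n_words most frequent words in text.

    Each masked word is replaced by a run of mask characters of the same
    length, so the text length and word positions stay unchanged.
    """
    counts = Counter(w.lower() for w in WORD.findall(text))
    targets = {w for w, _ in counts.most_common(n_words)}

    def replace(match: re.Match) -> str:
        word = match.group(0)
        return mask * len(word) if word.lower() in targets else word

    return WORD.sub(replace, text)
```

Increasing `n_words` step by step gives a progressively more distorted text, and each version can then be compared with a compression distance.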
Contextual Information Retrieval based on Algorithmic Information Theory and Statistical Outlier Detection
The main contribution of this paper is the design of an Information Retrieval
(IR) technique based on Algorithmic Information Theory (using the Normalized
Compression Distance, NCD), statistical techniques (outliers), and a novel
organization of the database structure. The paper shows how they can be integrated
to retrieve information from generic databases using long (text-based) queries.
Two important problems are analyzed in the paper. On the one hand, how to
detect "false positives" when the distance among the documents is very low and
there is actual similarity. On the other hand, we propose a way to structure a
document database whose similarity distance estimation depends on the length
of the selected text. Finally, the experimental evaluations carried out to
study these problems are shown.
Comment: Submitted to 2008 IEEE Information Theory Workshop (6 pages, 6 figures)
Nonlinear power spectrum in the presence of massive neutrinos: perturbation theory approach, galaxy bias and parameter forecasts
Future or ongoing galaxy redshift surveys can put stringent constraints on
neutrino masses via high-precision measurements of the galaxy power spectrum,
when combined with cosmic microwave background (CMB) information. In this paper
we develop a method to model the galaxy power spectrum in the weakly nonlinear
regime for a mixed dark matter (CDM plus finite-mass neutrinos) model, based on
perturbation theory (PT) whose validity is well tested by simulations for a CDM
model. In doing this we carefully study various aspects of the nonlinear
clustering and then arrive at a useful approximation allowing for a quick
computation of the nonlinear power spectrum as in the CDM case. The nonlinear
galaxy bias is also included in a self-consistent manner within the PT
framework. Thus the use of our PT model can give a more robust understanding of
the measured galaxy power spectrum as well as allow for higher sensitivity to
neutrino masses due to the gain of Fourier modes beyond the linear regime.
Based on the Fisher matrix formalism, we find that a BOSS or Stage-III type
survey, when combined with Planck CMB information, gives a precision of total
neutrino mass constraint, sigma(m_nu,tot) ~ 0.1 eV, while a Stage-IV type survey may
achieve sigma(m_nu,tot) ~ 0.05 eV, i.e. more than a 1-sigma detection of neutrino
masses. We also discuss possible systematic errors on dark energy parameters
caused by the neutrino mass uncertainty. A significant correlation between
neutrino mass and dark energy parameters is found if the information on power
spectrum amplitude is included. More importantly, for a Stage-IV type survey, a
best-fit dark energy model may be biased away from the underlying
true model by more than the 1-sigma statistical errors, if neutrino mass is
ignored in the model fitting.
Comment: 33 pages, 11 figures
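As background for the forecasts quoted above, the textbook Fisher matrix relations can be written as follows. This is a generic sketch and not the paper's specific expressions. The symbols p_i (model parameters), L (likelihood), and the labels "gal" and "CMB" are notation assumed here, and adding the two matrices assumes that the galaxy survey and CMB data sets are statistically independent.

```latex
F_{ij} = -\left\langle \frac{\partial^2 \ln \mathcal{L}}{\partial p_i \, \partial p_j} \right\rangle,
\qquad
F^{\mathrm{tot}} = F^{\mathrm{gal}} + F^{\mathrm{CMB}},
\qquad
\sigma(p_i) = \sqrt{\left[ \left( F^{\mathrm{tot}} \right)^{-1} \right]_{ii}}
```

Here sigma(p_i) is the forecast 1-sigma error on parameter p_i after marginalizing over all the other parameters. A strong correlation between two parameters, such as neutrino mass and a dark energy parameter, appears as a large off-diagonal element of the inverse matrix.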
Towards video streaming in IoT environments: vehicular communication perspective
Multimedia-oriented Internet of Things (IoT) enables pervasive and real-time communication of video, audio and image data among devices in their immediate surroundings. Today's vehicles have the capability of supporting real-time multimedia acquisition. Vehicles with high illuminating infrared cameras and customized sensors can communicate with other on-road devices using dedicated short-range communication (DSRC) and 5G-enabled communication technologies. Real-time incidents in both urban and highway vehicular traffic environments can be captured and transmitted using vehicle-to-vehicle and vehicle-to-infrastructure communication modes. Video streaming in vehicular IoT (VSV-IoT) environments is at a growing stage, with several challenges that need to be addressed, ranging from limited resources in IoT devices, intermittent connection in vehicular networks, heterogeneous devices, dynamism and scalability in video encoding, and bandwidth underutilization in video delivery, to attaining application-precise quality of service in video streaming. In this context, this paper presents a comprehensive review of video streaming in IoT environments, focusing on the vehicular communication perspective. Specifically, the significance of video streaming in vehicular IoT environments is highlighted, focusing on the integration of vehicular communication with 5G-enabled IoT technologies and on smart-city-oriented application areas for VSV-IoT. A taxonomy is presented for the classification of related literature on video streaming in vehicular network environments. Following the taxonomy, a critical review of the literature is performed, focusing on the major functional models, strengths and weaknesses. Metrics for video streaming in vehicular IoT environments are derived and comparatively analyzed in terms of their usage and evaluation capabilities. Open research challenges in VSV-IoT are identified as future directions of research in the area.
The survey would benefit both IoT and vehicle industry practitioners and researchers, in terms of augmenting understanding of vehicular video streaming and its IoT-related trends and issues.