3,592 research outputs found

    Assessing Information Transmission in Data Transformations with the Channel Multivariate Entropy Triangle

    Data transformation, e.g., feature transformation and selection, is an integral part of any machine learning procedure. In this paper, we introduce an information-theoretic model and tools to assess the quality of data transformations in machine learning tasks. In an unsupervised fashion, we analyze the transformation of a discrete, multivariate source of information X̄ into a discrete, multivariate sink of information Ȳ related by a joint distribution P(X̄, Ȳ). The first contribution is a decomposition of the maximal potential entropy of (X̄, Ȳ), which we call a balance equation, into its (a) non-transferable, (b) transferable but not transferred, and (c) transferred parts. Such balance equations can be represented in (de Finetti) entropy diagrams, our second set of contributions. The most important of these, the aggregate channel multivariate entropy triangle, is a visual exploratory tool to assess the effectiveness of multivariate data transformations in transferring information from input to output variables. We also show how these decompositions and balance equations apply to the entropies of X̄ and Ȳ, respectively, and generate entropy triangles for them. As an example, we present the application of these tools to the assessment of information transfer efficiency for Principal Component Analysis and Independent Component Analysis as unsupervised feature transformation and selection procedures in supervised classification tasks. This research was funded by the Spanish Government-MinECo projects TEC2014-53390-P and TEC2017-84395-P.
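    The balance equation can be sketched for a single bivariate joint table (the aggregate multivariate case reduces to this after marginalizing; function and variable names below are ours, not the paper's notation): the sum of the maximal marginal entropies splits into a non-transferable part (divergence from uniformity), the transferred part (twice the mutual information), and the transferable-but-not-transferred part (the two conditional entropies); normalizing gives the entropy-triangle coordinates.

```python
import numpy as np

def channel_balance(pxy):
    """Decompose the maximal joint entropy of a channel P(X, Y) into
    (a) non-transferable, (b) transferred and (c) transferable-but-not-
    transferred parts, normalized so the three coordinates sum to 1.
    All entropies are in bits. Illustrative sketch only."""
    pxy = np.asarray(pxy, dtype=float)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    h_ux = np.log2(pxy.shape[0])            # uniform (maximal) marginal entropies
    h_uy = np.log2(pxy.shape[1])
    h_xy = H(pxy.ravel())
    mi = H(px) + H(py) - h_xy               # mutual information I(X; Y)
    dh = (h_ux - H(px)) + (h_uy - H(py))    # (a) divergence from uniformity
    vi = (h_xy - H(py)) + (h_xy - H(px))    # (c) H(X|Y) + H(Y|X)
    total = h_ux + h_uy                     # balance: total == dh + 2*mi + vi
    return dh / total, 2 * mi / total, vi / total
```

    For a noiseless identity channel all the budget is transferred (the middle coordinate is 1); for an independent uniform pair nothing is, and the mass sits on the conditional-entropy coordinate.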

    The evaluation of data sources using multivariate entropy tools

    We introduce from first principles an analysis of the information content of multivariate distributions as information sources. Specifically, we generalize a balance equation and a visualization device, the Entropy Triangle, for multivariate distributions and find notable differences with similar analyses done on joint distributions as models of information channels. As an example application, we extend a framework for the analysis of classifiers to also encompass the analysis of data sets. With such tools we analyze a handful of UCI machine learning tasks to start addressing the question of how well datasets convey the information they are supposed to capture about the phenomena they stand for.
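    One plausible reading of the source-side balance equation, with the caveat that the component naming below is our assumption rather than the paper's notation: the maximal entropy of n variables splits into a divergence from uniformity, a total correlation (information shared among the variables) and the residual joint entropy.

```python
import numpy as np

def source_balance(pjoint, dims):
    """Sketch of a source-side balance equation for a multivariate
    distribution P(X1..Xn): sum(log2 |Xi|) splits into divergence from
    uniformity, total correlation, and residual joint entropy, returned
    as normalized triangle coordinates. Component naming is our reading
    of the abstract, not the paper's exact formulation."""
    p = np.asarray(pjoint, dtype=float).reshape(dims)
    p = p / p.sum()

    def H(q):
        q = q[q > 0]
        return -np.sum(q * np.log2(q))

    marginals = [p.sum(axis=tuple(j for j in range(p.ndim) if j != i))
                 for i in range(p.ndim)]
    h_joint = H(p.ravel())
    h_marg = sum(H(m) for m in marginals)
    h_max = sum(np.log2(d) for d in dims)
    dh = h_max - h_marg        # divergence from uniform marginals
    c = h_marg - h_joint       # total correlation among the variables
    return dh / h_max, c / h_max, h_joint / h_max
```

    Two independent fair bits land entirely on the residual-entropy coordinate; two perfectly correlated bits split evenly between correlation and residual entropy.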

    Two Information-Theoretic Tools to Assess the Performance of Multi-class Classifiers

    We develop two tools to analyze the behavior of multiple-class, or multi-class, classifiers by means of entropic measures on their confusion matrix or contingency table. First, we obtain a balance equation on the entropies that captures interesting properties of the classifier. Second, by normalizing this balance equation we obtain a 2-simplex in a three-dimensional entropy space and then the de Finetti entropy diagram or entropy triangle. We also give examples of the assessment of classifiers with these tools. Spanish Government-Comisión Interministerial de Ciencia y Tecnología projects 2008-06382/TEC and 2008-02473/TEC and the regional projects S-505/TIC/0223 (DGUI-CM) and CCG08-UC3M/TIC-4457 (Comunidad Autónoma de Madrid – UC3M).

    Doctor of Philosophy

    Visualization and exploration of volumetric datasets has been an active area of research for over two decades. During this period, volumetric datasets used by domain users have evolved from univariate to multivariate. The volume datasets are typically explored and classified via transfer function design and visualized using direct volume rendering. To improve classification results and to enable the exploration of multivariate volume datasets, multivariate transfer functions have emerged. In this dissertation, we describe our research on multivariate transfer function design. To improve the classification of univariate volumes, various one-dimensional (1D) or two-dimensional (2D) transfer function spaces have been proposed; however, these methods work on only some datasets. We propose a novel transfer function method that provides better classifications by combining different transfer function spaces. Methods have been proposed for exploring multivariate simulations; however, these approaches are not suitable for complex real-world datasets and may be unintuitive for domain users. To this end, we propose a method based on user-selected samples in the spatial domain to make complex multivariate volume data visualization more accessible for domain users. However, this method still requires users to fine-tune transfer functions in parameter-space transfer function widgets, which may not be familiar to them. We therefore propose GuideME, a novel slice-guided semiautomatic multivariate volume exploration approach. GuideME provides the user with an easy-to-use, slice-based user interface that suggests the feature boundaries and allows the user to select features via click and drag; an optimal transfer function is then automatically generated by optimizing a response function. Throughout the exploration process, the user does not need to interact with the parameter views at all.
    Finally, real-world multivariate volume datasets are also usually of large size, often exceeding the GPU memory and even the main memory of standard workstations. We propose a ray-guided, out-of-core, interactive volume rendering and efficient query method to support large and complex multivariate volumes on standard workstations.
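    As background for the transfer-function pipeline the dissertation builds on, a minimal 1D transfer-function lookup with front-to-back alpha compositing might look as follows (an illustrative sketch only; multivariate transfer functions generalize the lookup domain, and real renderers work on the GPU):

```python
import numpy as np

def apply_tf_and_composite(ray_samples, tf):
    """Classify scalar samples along a ray with a 1D transfer function
    'tf' (an (n, 4) RGBA lookup table over [0, 1]) and accumulate them
    front to back, with early ray termination once nearly opaque."""
    n = tf.shape[0]
    color = np.zeros(3)
    alpha = 0.0
    for s in ray_samples:                        # samples ordered front to back
        r, g, b, a = tf[min(int(s * (n - 1)), n - 1)]
        color += (1.0 - alpha) * a * np.array([r, g, b])
        alpha += (1.0 - alpha) * a
        if alpha > 0.99:                         # early ray termination
            break
    return color, alpha
```

    The early-termination test is also what makes ray-guided out-of-core schemes effective: bricks of the volume that no unterminated ray touches never need to be fetched.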

    Disambiguating the role of blood flow and global signal with partial information decomposition

    The global signal (GS) is a ubiquitous construct in resting-state functional magnetic resonance imaging (rs-fMRI), associated with nuisance but containing, by definition, most of the neuronal signal. Global signal regression (GSR) effectively removes the impact of physiological noise and other artifacts, but at the same time it alters correlational patterns in unpredictable ways. Performing GSR while taking into account the underlying physiology (mainly the blood arrival time) has been proven to be beneficial. From these observations we aimed to: 1) characterize the effect of GSR on network-level functional connectivity in a large dataset; 2) assess the complementary role of the global signal and vessels; and 3) use the framework of partial information decomposition to further look into the joint dynamics of the global signal and vessels, and their respective influence on the dynamics of cortical areas. We observe that GSR affects intrinsic connectivity networks in the connectome in a non-uniform way. Furthermore, by estimating the predictive information of blood flow and the global signal using partial information decomposition, we observe that both signals are present in different amounts across intrinsic connectivity networks. Simulations showed that differences in blood arrival time can largely explain this phenomenon, while using hemodynamic and calcium mouse recordings we were able to confirm the presence of vascular effects, as calcium recordings lack hemodynamic information. With these results we confirm network-specific effects of GSR and the importance of taking blood flow into account for improving denoising methods. Additionally, and beyond the mere issue of data denoising, we quantify the diverse and complementary effect of global and vessel BOLD signals on the dynamics of cortical areas.
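    Partial information decomposition is easiest to see on discrete toy variables. The sketch below implements the original Williams-Beer I_min redundancy measure; the abstract does not say which PID variant the authors use on the fMRI signals, so take this as one common choice, not their estimator. It splits the information two sources A and B carry about a target T into redundant, unique and synergistic parts.

```python
import numpy as np

def pid_williams_beer(p_tab):
    """Williams-Beer partial information decomposition for discrete
    variables. p_tab[t, a, b] is the joint pmf of target T and sources
    A, B. Returns (redundant, unique_A, unique_B, synergistic) in bits."""
    p = np.asarray(p_tab, float)
    p = p / p.sum()
    pt = p.sum(axis=(1, 2))

    def specific_info(pts):            # pts[t, s]: joint of T and one source
        ps = pts.sum(axis=0)
        out = np.zeros(pts.shape[0])
        for t in range(pts.shape[0]):
            for s in range(pts.shape[1]):
                if pts[t, s] > 0:
                    out[t] += (pts[t, s] / pt[t]) * np.log2(
                        (pts[t, s] / ps[s]) / pt[t])
        return out

    def mi(pxy):
        px, py = pxy.sum(1), pxy.sum(0)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))

    i_ta = specific_info(p.sum(axis=2))          # specific info of A about T=t
    i_tb = specific_info(p.sum(axis=1))          # specific info of B about T=t
    redundant = np.sum(pt * np.minimum(i_ta, i_tb))
    unique_a = mi(p.sum(axis=2)) - redundant
    unique_b = mi(p.sum(axis=1)) - redundant
    synergy = mi(p.reshape(p.shape[0], -1)) - redundant - unique_a - unique_b
    return redundant, unique_a, unique_b, synergy
```

    For an XOR-like target the synergy term carries everything, while for two copies of the same signal the redundancy term does, which is the kind of distinction the paper exploits for the global signal versus vessel signals.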

    On Concept Lattices as Information Channels

    Proceedings of: 11th International Conference on Concept Lattices and Their Applications (CLA 2014). Kosice, Slovakia, October 07-10, 2014. This paper explores the idea that a concept lattice is an information channel between objects and attributes. For this purpose we study the behaviour of incidences in L-formal contexts, where L is the range of an information-theoretic entropy function. Examples of such data abound in machine learning and data mining, e.g. confusion matrices of multi-class classifiers or document-term matrices. We use a well-motivated information-theoretic heuristic, the maximization of mutual information, that in our conclusions provides a flavour of feature selection, giving an information-theoretic explanation of an established practice in Data Mining, Natural Language Processing and Information Retrieval applications, viz. stop-wording and frequency thresholding. We also introduce a post-clustering class identification in the presence of confusions and a flavour of term selection for a multi-label document classification task. FJVA and AP are supported by EU FP7 project LiMoSINe (contract 288024) for this work. CPM has been supported by the Spanish Government-Comisión Interministerial de Ciencia y Tecnología project TEC2011-26807.
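    The stop-wording and frequency-thresholding argument can be seen in a toy version of the heuristic: rank terms by their mutual information with the class label. Terms present in (nearly) every document score near zero, mirroring why stop words are safe to drop. The function below is our sketch of that heuristic, not the paper's concept-lattice machinery.

```python
import math
from collections import Counter

def term_class_mi(docs, labels):
    """Rank the terms of a document collection by mutual information
    (bits) between term presence and the class label. 'docs' is a list
    of sets of terms, 'labels' the parallel list of class labels."""
    n = len(docs)
    label_counts = Counter(labels)
    vocab = set(t for d in docs for t in d)
    scores = {}
    for term in vocab:
        mi = 0.0
        for present in (True, False):
            p_t = sum(1 for d in docs if (term in d) == present) / n
            for c in label_counts:
                p_tc = sum(1 for d, y in zip(docs, labels)
                           if (term in d) == present and y == c) / n
                p_c = label_counts[c] / n
                if p_tc > 0:
                    mi += p_tc * math.log2(p_tc / (p_t * p_c))
        scores[term] = mi
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

    A ubiquitous term like "the" contributes no information about the class, so it ranks last, while class-discriminating terms rank first.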

    Sensing Method for Two-Target Detection in Time-Constrained Vector Poisson Channel

    We consider an experimental design problem in which there are two Poisson sources with two possible and known rates, and one counter. Through a switch, the counter can observe the sources individually, or the counts can be combined so that the counter observes the sum of the two. The sensor scheduling problem is to determine an optimal proportion of the available time to be allocated toward individual and joint sensing, under a total time constraint. Two different metrics are used for optimization: mutual information between the sources and the observed counts, and probability of detection for the associated source detection problem. Our results, which are primarily computational, indicate similar but not identical results under the two cost functions.
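    The counting setup lends itself to a small numerical experiment. The sketch below is our simplification (equiprobable binary rate hypotheses and individual sensing only, whereas the problem above also allows observing the summed counts): it computes the mutual information between a source's rate hypothesis and its Poisson count over an observation time, then grid-searches the time split between the two sources.

```python
import math

def poisson_pmf(k, lam):
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    # log-space evaluation avoids overflow for large k
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def mi_binary_poisson(lam0, lam1, t, kmax=200):
    """I(S; N) in bits for an equiprobable binary state S whose rate is
    lam0 or lam1, observed as a Poisson count N over time t."""
    mi = 0.0
    for k in range(kmax):
        p0 = poisson_pmf(k, lam0 * t)
        p1 = poisson_pmf(k, lam1 * t)
        pk = 0.5 * (p0 + p1)
        for p in (p0, p1):
            if p > 0 and pk > 0:
                mi += 0.5 * p * math.log2(p / pk)
    return mi

def best_split(lam0, lam1, total_time, steps=50):
    """Grid-search the fraction of the time budget spent on source 1
    versus source 2 that maximizes the total information gathered."""
    return max(
        (mi_binary_poisson(lam0, lam1, f * total_time)
         + mi_binary_poisson(lam0, lam1, (1 - f) * total_time), f)
        for f in (i / steps for i in range(steps + 1))
    )   # (bits, fraction of time on source 1)
```

    With identical hypothesized rates the counts carry no information, and the information gained grows with observation time, which is what makes the time allocation a genuine trade-off.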

    100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox

    The most widely spread measure of performance, accuracy, suffers from a paradox: predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. Despite optimizing classification error rate, high-accuracy models may fail to capture crucial information transfer in the classification task. We present evidence of this behavior by means of a combinatorial analysis in which every possible contingency matrix of 2-, 3- and 4-class classifiers is depicted on the entropy triangle, a more reliable information-theoretic tool for classification assessment. Motivated by this, we develop from first principles a measure of classification performance that takes into consideration the information learned by classifiers. We are then able to obtain the entropy-modulated accuracy (EMA), a pessimistic estimate of the expected accuracy with the influence of the input distribution factored out, and the normalized information transfer (NIT) factor, a measure of how efficiently information is transmitted from the input to the output set of classes. The EMA is a more natural measure of classification performance than accuracy when the heuristic to maximize is the transfer of information through the classifier instead of the classification error count. The NIT factor measures the effectiveness of the learning process in classifiers and also makes it harder for them to "cheat" using techniques like specialization, while also promoting the interpretability of results. Their use is demonstrated in a mind-reading task competition that aims at decoding the identity of a video stimulus based on magnetoencephalography recordings.
    We show how the EMA and the NIT factor reject rankings based on accuracy, choosing more meaningful and interpretable classifiers. Francisco José Valverde-Albacete has been partially supported by EU FP7 project LiMoSINe (contract 288024): www.limosine-project.eu. Carmen Peláez-Moreno has been partially supported by the Spanish Government-Comisión Interministerial de Ciencia y Tecnología project TEC2011-26807.
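    The accuracy paradox itself is easy to reproduce from a confusion matrix. The toy matrices below are ours, not from the paper: a majority-class classifier attains higher accuracy yet transfers zero information about the classes, which is exactly the failure mode the EMA and NIT factor are built to expose (their exact formulas are given in the paper).

```python
import numpy as np

def acc_and_mi(cm):
    """Accuracy and mutual information (bits) of a confusion matrix,
    with true classes on rows and predicted classes on columns."""
    p = np.asarray(cm, float)
    p = p / p.sum()
    acc = np.trace(p)
    px, py = p.sum(1), p.sum(0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz]))
    return acc, mi

# 'majority' always predicts the dominant class: 80% accuracy, zero information.
majority = [[80, 0, 0], [10, 0, 0], [10, 0, 0]]
# 'informed' is less accurate (70%) but actually transfers class information.
informed = [[50, 15, 15], [0, 10, 0], [0, 0, 10]]
```

    Ranking these two by accuracy prefers the majority classifier; ranking by the information transferred reverses the order.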

    Disentangling the information in species interaction networks

    Shannon’s entropy measure is a popular means for quantifying ecological diversity. We explore how one can use information-theoretic measures (that are often called indices in ecology) on joint ensembles to study the diversity of species interaction networks. We leverage the little-known balance equation to decompose the network information into three components describing the species abundance, specificity, and redundancy. This balance reveals that there exists a fundamental trade-off between these components. The decomposition can be straightforwardly extended to analyse networks through time as well as space, leading to the corresponding notions for alpha, beta, and gamma diversity. Our work aims to provide an accessible introduction for ecologists. To this end, we illustrate the interpretation of the components on numerous real networks. The corresponding code is made available to the community in the specialised Julia package EcologicalNetworks.jl.
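    The entropic ingredients of such a decomposition can be computed directly from an interaction matrix. How the quantities below map onto the abundance/specificity/redundancy components is our reading of the abstract (the reference implementation is the Julia package mentioned above), so treat the naming as indicative only.

```python
import numpy as np

def network_information(interactions):
    """Entropic summary of a bipartite species interaction matrix
    (rows: e.g. plants, columns: e.g. pollinators, entries: interaction
    counts), normalized to a joint distribution. The mutual information
    captures specificity-like structure; the conditional entropies
    capture redundancy-like freedom in partner choice."""
    p = np.asarray(interactions, float)
    p = p / p.sum()
    prow, pcol = p.sum(1), p.sum(0)

    def H(q):
        q = q[q > 0]
        return -np.sum(q * np.log2(q))

    h_joint = H(p.ravel())
    return {
        "H_rows": H(prow), "H_cols": H(pcol), "H_joint": h_joint,
        "I": H(prow) + H(pcol) - h_joint,
        "H_rows_given_cols": h_joint - H(pcol),
        "H_cols_given_rows": h_joint - H(prow),
    }
```

    A perfectly specialized network (one partner per species, an identity matrix) puts all its information into the mutual-information term and leaves the conditional entropies at zero, illustrating the trade-off the abstract describes.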