An Information Theoretic Approach to Quantify the Stability of Feature Selection and Ranking Algorithms
Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many of these applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods becomes an important issue in the previously mentioned situations, but it has long been overlooked in the literature. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, top-k lists (feature subsets), as well as the lesser-studied partial ranked lists that keep the k best-ranked elements. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach and being able to give more importance to disagreements that appear at the top of the list. Moreover, it possesses desirable properties for a stability metric, including correction for chance, upper/lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes respectively.
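The generalized Jensen-Shannon measure described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the hyperbolic top-weighting scheme used to turn a ranking into a probability distribution is an assumption made here for concreteness, and the multi-list divergence is the standard generalized JSD (entropy of the mean distribution minus mean of the entropies).

```python
import numpy as np

def rank_to_distribution(ranking):
    """Map a ranking (best feature first) to a probability distribution over
    features, giving higher mass to top-ranked features (assumed 1/rank weights)."""
    weights = np.zeros(len(ranking))
    for pos, feature in enumerate(ranking):
        weights[feature] = 1.0 / (pos + 1)  # top-of-list disagreements matter more
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def js_stability(rankings):
    """Generalized Jensen-Shannon divergence across a set of equally sized
    rankings: H(mean distribution) - mean of individual entropies.
    Zero means all rankings agree (a deterministic selection)."""
    dists = np.array([rank_to_distribution(r) for r in rankings])
    return entropy(dists.mean(axis=0)) - np.mean([entropy(d) for d in dists])

identical = [[0, 1, 2, 3], [0, 1, 2, 3]]   # perfectly stable selector
different = [[0, 1, 2, 3], [3, 2, 1, 0]]   # reversed rankings
print(js_stability(identical))  # 0.0
print(js_stability(different))  # positive: the rankings disagree
```

A lower value indicates a more stable algorithm; the divergence is bounded, which is what makes corrections and upper/lower bounds tractable.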
What May Visualization Processes Optimize?
In this paper, we present an abstract model of visualization and inference
processes and describe an information-theoretic measure for optimizing such
processes. In order to obtain such an abstraction, we first examined six
classes of workflows in data analysis and visualization, and identified four
levels of typical visualization components, namely disseminative,
observational, analytical and model-developmental visualization. We noticed a
common phenomenon at different levels of visualization, that is, the
transformation of data spaces (referred to as alphabets) usually corresponds to
the reduction of maximal entropy along a workflow. Based on this observation,
we establish an information-theoretic measure of cost-benefit ratio that may be
used as a cost function for optimizing a data visualization process. To
demonstrate the validity of this measure, we examined a number of successful
visualization processes in the literature, and showed that the
information-theoretic measure can mathematically explain the advantages of such
processes over possible alternatives.
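The cost-benefit ratio sketched in this abstract can be illustrated numerically. The exact terms below (benefit as alphabet compression minus a potential-distortion penalty, divided by a process cost) are an assumption in the spirit of the abstract, not the paper's formula; the maximal entropy of an alphabet is simply log2 of its size.

```python
import math

def max_entropy(alphabet_size):
    """Maximal Shannon entropy (bits) of a data space treated as an alphabet."""
    return math.log2(alphabet_size)

def cost_benefit_ratio(input_size, output_size, potential_distortion, cost):
    """Illustrative cost-benefit measure: the benefit of a visualization step is
    the reduction of maximal entropy (alphabet compression) minus the potential
    distortion it introduces, traded off against the cost of performing it."""
    alphabet_compression = max_entropy(input_size) - max_entropy(output_size)
    return (alphabet_compression - potential_distortion) / cost

# A step that maps 2^20 possible data states onto 2^8 visual states,
# with illustrative distortion and cost values
print(cost_benefit_ratio(2**20, 2**8, potential_distortion=2.0, cost=5.0))  # 2.0
```

Comparing this ratio across alternative workflow designs is how such a measure could serve as a cost function for optimizing a visualization process.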
Statistical methods for automated drug susceptibility testing: Bayesian minimum inhibitory concentration prediction from growth curves
Determination of the minimum inhibitory concentration (MIC) of a drug that
prevents microbial growth is an important step for managing patients with
infections. In this paper we present a novel probabilistic approach that
accurately estimates MICs based on a panel of multiple curves reflecting
features of bacterial growth. We develop a probabilistic model for determining
whether a given dilution of an antimicrobial agent is the MIC given features of
the growth curves over time. Because of the potentially large collection of
features, we utilize Bayesian model selection to narrow the collection of
predictors to the most important variables. In addition to point estimates of
MICs, we are able to provide posterior probabilities that each dilution is the
MIC based on the observed growth curves. The methods are easily automated and
have been incorporated into the Becton-Dickinson PHOENIX automated
susceptibility system that rapidly and accurately classifies the resistance of
a large number of microorganisms in clinical samples. Over seventy-five studies
to date have shown this new method provides improved estimation of MICs over
existing approaches.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/08-AOAS217
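The posterior probabilities over dilutions described in this abstract follow from a direct application of Bayes' rule. The sketch below is a toy illustration under stated assumptions: the dilution grid, the likelihood values standing in for a fitted growth-curve model, and the uniform prior are all hypothetical, not from the paper.

```python
import numpy as np

# Candidate two-fold dilutions of the antimicrobial agent (mg/L); illustrative
dilutions = np.array([0.25, 0.5, 1.0, 2.0, 4.0])

# Hypothetical likelihoods p(growth-curve features | MIC = dilution), standing
# in for the output of a fitted probabilistic model over curve features
likelihoods = np.array([0.01, 0.05, 0.60, 0.30, 0.04])

prior = np.full(len(dilutions), 1.0 / len(dilutions))  # uniform prior over dilutions

posterior = likelihoods * prior
posterior /= posterior.sum()  # normalize: posterior probability per dilution

mic_estimate = dilutions[np.argmax(posterior)]  # MAP point estimate of the MIC
print(mic_estimate)
```

Reporting the full posterior vector, rather than only the point estimate, is what allows the method to attach a probability to each candidate dilution being the MIC.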
On association in regression: the coefficient of determination revisited
Universal coefficients of determination are investigated which quantify the strength of the relation between a vector of dependent variables Y and a vector of independent covariates X. They are defined as measures of dependence between Y and X through theta(x), with theta(x) parameterizing the conditional distribution of Y given X=x. If theta(x) involves unknown coefficients gamma, the definition is conditional on gamma, and in practice gamma, and hence the coefficient of determination, has to be estimated. The quantities we propose generalize R^2 in classical linear regression and are also related to other definitions previously suggested. Our definitions apply to generalized regression models with arbitrary link functions as well as to multivariate and nonparametric regression. The definition and use of the proposed coefficients of determination are illustrated for several regression problems with simulated and real data sets.
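The classical R^2 that these universal coefficients generalize can be computed as one minus the ratio of residual to total sum of squares. The sketch below shows only this classical special case, not the paper's generalized definitions for arbitrary link functions; the toy data are illustrative.

```python
import numpy as np

def r_squared(y, y_hat):
    """Classical coefficient of determination for linear regression:
    1 - SS_res / SS_tot, where SS_tot is variation around the sample mean."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data: a perfect fit gives R^2 = 1, predicting the mean gives R^2 = 0
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0
```

The generalized coefficients in the paper reduce to this quantity when the conditional distribution of Y given X=x is Gaussian with an identity link.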