An Information Theoretic Approach to Quantify the Stability of Feature Selection and Ranking Algorithms
Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many of these applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods becomes an important issue in the previously mentioned situations, but it has long been overlooked in the literature. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, top-k lists (feature subsets), as well as the lesser-studied partial ranked lists that keep the k best-ranked elements. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach and being able to give more importance to disagreements that appear at the top of the list. Moreover, it possesses desirable properties for a stability metric, including correction for chance, upper/lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way and compare it with popular metrics, including Spearman's rank correlation and Kuncheva's index, on feature ranking and selection outcomes respectively.
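The generalized Jensen-Shannon measure described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the hyperbolic top-weighting scheme used to turn a ranking into a probability distribution is an assumption made here for concreteness, and the multi-list divergence is the standard generalized JSD (entropy of the mean distribution minus mean of the entropies).

```python
import numpy as np

def rank_to_distribution(ranking):
    """Map a ranking (best feature first) to a probability distribution over
    features, giving higher mass to top-ranked features (assumed 1/rank weights)."""
    weights = np.zeros(len(ranking))
    for pos, feature in enumerate(ranking):
        weights[feature] = 1.0 / (pos + 1)  # top-of-list disagreements matter more
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def js_stability(rankings):
    """Generalized Jensen-Shannon divergence across a set of equally sized
    rankings: H(mean distribution) - mean of individual entropies.
    Zero means all rankings agree (a deterministic selection)."""
    dists = np.array([rank_to_distribution(r) for r in rankings])
    return entropy(dists.mean(axis=0)) - np.mean([entropy(d) for d in dists])

identical = [[0, 1, 2, 3], [0, 1, 2, 3]]   # perfectly stable selector
different = [[0, 1, 2, 3], [3, 2, 1, 0]]   # reversed rankings
print(js_stability(identical))  # 0.0
print(js_stability(different))  # positive: the rankings disagree
```

A lower value indicates a more stable algorithm; the divergence is bounded, which is what makes corrections and upper/lower bounds tractable.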
What May Visualization Processes Optimize?
In this paper, we present an abstract model of visualization and inference
processes and describe an information-theoretic measure for optimizing such
processes. In order to obtain such an abstraction, we first examined six
classes of workflows in data analysis and visualization, and identified four
levels of typical visualization components, namely disseminative,
observational, analytical and model-developmental visualization. We noticed a
common phenomenon at different levels of visualization, that is, the
transformation of data spaces (referred to as alphabets) usually corresponds to
the reduction of maximal entropy along a workflow. Based on this observation,
we establish an information-theoretic measure of cost-benefit ratio that may be
used as a cost function for optimizing a data visualization process. To
demonstrate the validity of this measure, we examined a number of successful
visualization processes in the literature, and showed that the
information-theoretic measure can mathematically explain the advantages of such
processes over possible alternatives.
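The cost-benefit ratio sketched in this abstract can be illustrated numerically. The exact terms below (benefit as alphabet compression minus a potential-distortion penalty, divided by a process cost) are an assumption in the spirit of the abstract, not the paper's formula; the maximal entropy of an alphabet is simply log2 of its size.

```python
import math

def max_entropy(alphabet_size):
    """Maximal Shannon entropy (bits) of a data space treated as an alphabet."""
    return math.log2(alphabet_size)

def cost_benefit_ratio(input_size, output_size, potential_distortion, cost):
    """Illustrative cost-benefit measure: the benefit of a visualization step is
    the reduction of maximal entropy (alphabet compression) minus the potential
    distortion it introduces, traded off against the cost of performing it."""
    alphabet_compression = max_entropy(input_size) - max_entropy(output_size)
    return (alphabet_compression - potential_distortion) / cost

# A step that maps 2^20 possible data states onto 2^8 visual states,
# with illustrative distortion and cost values
print(cost_benefit_ratio(2**20, 2**8, potential_distortion=2.0, cost=5.0))  # 2.0
```

Comparing this ratio across alternative workflow designs is how such a measure could serve as a cost function for optimizing a visualization process.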
Statistical methods for automated drug susceptibility testing: Bayesian minimum inhibitory concentration prediction from growth curves
Determination of the minimum inhibitory concentration (MIC) of a drug that
prevents microbial growth is an important step for managing patients with
infections. In this paper we present a novel probabilistic approach that
accurately estimates MICs based on a panel of multiple curves reflecting
features of bacterial growth. We develop a probabilistic model for determining
whether a given dilution of an antimicrobial agent is the MIC given features of
the growth curves over time. Because of the potentially large collection of
features, we utilize Bayesian model selection to narrow the collection of
predictors to the most important variables. In addition to point estimates of
MICs, we are able to provide posterior probabilities that each dilution is the
MIC based on the observed growth curves. The methods are easily automated and
have been incorporated into the Becton-Dickinson PHOENIX automated
susceptibility system that rapidly and accurately classifies the resistance of
a large number of microorganisms in clinical samples. Over seventy-five studies
to date have shown this new method provides improved estimation of MICs over
existing approaches.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), http://dx.doi.org/10.1214/08-AOAS217
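The posterior probabilities over dilutions described in this abstract follow from a direct application of Bayes' rule. The sketch below is a toy illustration under stated assumptions: the dilution grid, the likelihood values standing in for a fitted growth-curve model, and the uniform prior are all hypothetical, not from the paper.

```python
import numpy as np

# Candidate two-fold dilutions of the antimicrobial agent (mg/L); illustrative
dilutions = np.array([0.25, 0.5, 1.0, 2.0, 4.0])

# Hypothetical likelihoods p(growth-curve features | MIC = dilution), standing
# in for the output of a fitted probabilistic model over curve features
likelihoods = np.array([0.01, 0.05, 0.60, 0.30, 0.04])

prior = np.full(len(dilutions), 1.0 / len(dilutions))  # uniform prior over dilutions

posterior = likelihoods * prior
posterior /= posterior.sum()  # normalize: posterior probability per dilution

mic_estimate = dilutions[np.argmax(posterior)]  # MAP point estimate of the MIC
print(mic_estimate)
```

Reporting the full posterior vector, rather than only the point estimate, is what allows the method to attach a probability to each candidate dilution being the MIC.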
On association in regression: the coefficient of determination revisited
Universal coefficients of determination are investigated which quantify the strength of the relation between a vector of dependent variables Y and a vector of independent covariates X. They are defined as measures of dependence between Y and X through theta(x), with theta(x) parameterizing the conditional distribution of Y given X=x. If theta(x) involves unknown coefficients gamma, the definition is conditional on gamma, and in practice gamma, and hence the coefficient of determination, has to be estimated. The quantities we propose generalize R^2 in classical linear regression and are also related to other definitions previously suggested. Our definitions apply to generalized regression models with arbitrary link functions as well as to multivariate and nonparametric regression. The definition and use of the proposed coefficients of determination are illustrated for several regression problems with simulated and real data sets.
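The classical R^2 that these universal coefficients generalize can be computed as one minus the ratio of residual to total sum of squares. The sketch below shows only this classical special case, not the paper's generalized definitions for arbitrary link functions; the toy data are illustrative.

```python
import numpy as np

def r_squared(y, y_hat):
    """Classical coefficient of determination for linear regression:
    1 - SS_res / SS_tot, where SS_tot is variation around the sample mean."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data: a perfect fit gives R^2 = 1, predicting the mean gives R^2 = 0
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # 0.0
```

The generalized coefficients in the paper reduce to this quantity when the conditional distribution of Y given X=x is Gaussian with an identity link.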