Controlling Costs: Feature Selection on a Budget
The traditional framework for feature selection treats all features as
costing the same amount. However, in reality, a scientist often has
considerable discretion regarding what variables to measure, and the decision
involves a tradeoff between model accuracy and cost (where cost can refer to
money, time, difficulty, or intrusiveness). In particular, unnecessarily
including an expensive feature in a model is worse than unnecessarily including
a cheap feature. We propose a procedure, which we call cheap knockoffs, for
performing feature selection in a cost-conscious manner. The key idea behind
our method is to force higher cost features to compete with more knockoffs than
cheaper features. We derive an upper bound on the weighted false discovery
proportion associated with this procedure, which corresponds to the fraction of
the feature cost that is wasted on unimportant features. We prove that this
bound holds simultaneously with high probability over a path of selected
variable sets of increasing size. A user may thus select a set of features
based, for example, on the overall budget, while knowing that no more than a
particular fraction of feature cost is wasted. We investigate, through
simulation and a biomedical application, the practical importance of
incorporating cost considerations into the feature selection process.
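The core idea of the abstract — forcing expensive features to beat more knockoffs than cheap ones — can be illustrated with a toy sketch. This is not the authors' actual procedure (which uses a proper knockoff construction and comes with a weighted false discovery bound); here permuted copies of each feature stand in for knockoffs, the cost vector and correlation score are illustrative assumptions, and a feature's cost simply sets how many knockoffs it must outperform.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
costs = np.array([1, 1, 1, 5, 5, 5])        # hypothetical per-feature costs
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 3] + rng.normal(size=n)  # features 0 and 3 are truly relevant

selected = []
for j in range(p):
    # score of the real feature: absolute correlation with the response
    real = abs(np.corrcoef(X[:, j], y)[0, 1])
    # a more expensive feature must beat proportionally more knockoffs;
    # permuted columns serve as crude knockoff stand-ins here
    knockoff_scores = [
        abs(np.corrcoef(rng.permutation(X[:, j]), y)[0, 1])
        for _ in range(costs[j])
    ]
    if real > max(knockoff_scores):
        selected.append(j)
```

With this setup, a spurious cheap feature slips through roughly half the time, while a spurious expensive feature must beat five independent null scores, so less of the budget is wasted on costly noise.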
Improving the Efficiency of Genomic Selection
We investigate two approaches to increase the efficiency of phenotypic
prediction from genome-wide markers, which is a key step for genomic selection
(GS) in plant and animal breeding. The first approach is feature selection
based on Markov blankets, which provide a theoretically-sound framework for
identifying non-informative markers. Fitting GS models using only the
informative markers results in simpler models, which may allow cost savings
from reduced genotyping. We show that this is accompanied by no loss, and
possibly a small gain, in predictive power for four GS models: partial least
squares (PLS), ridge regression, LASSO and elastic net. The second approach is
the choice of kinship coefficients for genomic best linear unbiased prediction
(GBLUP). We compare kinships based on different combinations of centring and
scaling of marker genotypes, and a newly proposed kinship measure that adjusts
for linkage disequilibrium (LD).
We illustrate the use of both approaches and examine their performances using
three real-world data sets from plant and animal genetics. We find that elastic
net with feature selection and GBLUP using LD-adjusted kinships performed
similarly well, and were the best-performing methods in our study.
Comment: 17 pages, 5 figures
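The GBLUP comparison above hinges on how the kinship matrix is built from marker genotypes. As a minimal sketch of one of the centring-and-scaling combinations mentioned (VanRaden-style centring by twice the allele frequency and scaling by the total expected heterozygosity — not the abstract's LD-adjusted measure), the genomic relationship matrix can be computed as:

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_markers = 10, 100
# genotypes coded 0/1/2 (copies of one allele); simulated for illustration
M = rng.integers(0, 3, size=(n_ind, n_markers)).astype(float)

p = M.mean(axis=0) / 2.0            # estimated allele frequencies
Z = M - 2 * p                       # centre each marker by 2 * p_j
denom = 2 * np.sum(p * (1 - p))     # VanRaden scaling factor
K = Z @ Z.T / denom                 # genomic relationship (kinship) matrix
```

GBLUP then fits a random effect with covariance proportional to `K`, which is equivalent to ridge regression on the markers; the different centring/scaling choices compared in the abstract change `Z` and `denom` but not this overall recipe.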
Parameterized Complexity of Feature Selection for Categorical Data Clustering
We develop new algorithmic methods with provable guarantees for feature selection in regard to categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers ℓ (the number of irrelevant features) and k (the number of clusters), budget B, and a set of n categorical data points (represented by m-dimensional vectors whose elements belong to a finite set of values Σ), we want to select m-ℓ relevant features such that the cost of any optimal k-clustering on these features does not exceed B. Here the cost of a cluster is the sum of Hamming distances (ℓ₀-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters k, B, and |Σ|. Our main result is an algorithm that solves the Feature Selection problem in time f(k, B, |Σ|) ⋅ m^{g(k, |Σ|)} ⋅ n² for some functions f and g. In other words, the problem is fixed-parameter tractable parameterized by B when |Σ| and k are constants. Our algorithm for Feature Selection is based on a solution to a more general problem, Constrained Clustering with Outliers. In this problem, we want to delete a certain number of outliers such that the remaining points could be clustered around centers satisfying specific constraints.
One interesting fact about Constrained Clustering with Outliers is that besides Feature Selection, it encompasses many other fundamental problems regarding categorical data such as Robust Clustering, Binary and Boolean Low-rank Matrix Approximation with Outliers, and Binary Robust Projective Clustering. Thus, as a byproduct of our theorem, we obtain algorithms for all these problems. We also complement our algorithmic findings with complexity lower bounds.
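The objective in the model above is concrete enough to sketch: for a fixed feature subset and a fixed partition into clusters, the cost is the sum of Hamming distances from each point to its cluster's center, and for categorical data the optimal center is simply the coordinate-wise majority value. The toy data, feature subset, and partition below are illustrative assumptions, not from the paper.

```python
import numpy as np

# toy categorical data: n = 4 points, m = 4 features over the alphabet {0, 1, 2}
X = np.array([
    [0, 1, 2, 0],
    [0, 1, 0, 1],
    [2, 1, 2, 0],
    [2, 1, 0, 1],
])

def clustering_cost(X, features, clusters):
    """Sum of Hamming distances from each point to its cluster's
    coordinate-wise majority center, restricted to `features`."""
    total = 0
    for members in clusters:
        sub = X[np.ix_(members, features)]
        # the optimal center of a cluster is the column-wise mode
        center = np.array([np.bincount(col).argmax() for col in sub.T])
        total += int((sub != center).sum())
    return total

# select features {0, 3}, i.e. treat feature 2 as one of the l irrelevant ones
cost = clustering_cost(X, [0, 3], [[0, 1], [2, 3]])
```

The Feature Selection problem asks whether some choice of m-ℓ features admits a partition whose cost, computed exactly as above, stays within the budget B; the paper's contribution is an FPT algorithm for that search, not this evaluation step.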
Model-based learning of local image features for unsupervised texture segmentation
Features that capture well the textural patterns of a certain class of images
are crucial for the performance of texture segmentation methods. The manual
selection of features or designing new ones can be a tedious task. Therefore,
it is desirable to automatically adapt the features to a certain image or class
of images. Typically, this requires a large set of training images with similar
textures and ground truth segmentation. In this work, we propose a framework to
learn features for texture segmentation when no such training data is
available. The cost function for our learning process is constructed to match a
commonly used segmentation model, the piecewise constant Mumford-Shah model.
This means that the features are learned such that they provide an
approximately piecewise constant feature image with a small jump set. Based on
this idea, we develop a two-stage algorithm which first learns suitable
convolutional features and then performs a segmentation. We note that the
features can be learned from a small set of images, from a single image, or
even from image patches. The proposed method achieves a competitive rank in the
Prague texture segmentation benchmark, and it is effective for segmenting
histological images.
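The learning criterion described — features should yield an approximately piecewise constant feature image with a small jump set — is the piecewise-constant Mumford-Shah energy. A minimal sketch of evaluating that energy for a candidate labelling follows; the toy two-texture image, the labellings, and the penalty weight `gamma` are illustrative assumptions, and the paper's method additionally learns the convolutional features that feed into this cost.

```python
import numpy as np

rng = np.random.default_rng(2)
# toy "feature image": two constant regions side by side, plus noise
img = np.hstack([np.zeros((16, 16)), np.ones((16, 16))])
img += 0.1 * rng.normal(size=(16, 32))

def mumford_shah_cost(feat, labels, gamma=0.5):
    """Piecewise-constant Mumford-Shah energy of a labelling:
    within-region squared deviation from the region mean,
    plus gamma times the length of the jump set (label boundaries)."""
    data_term = sum(
        ((feat[labels == c] - feat[labels == c].mean()) ** 2).sum()
        for c in np.unique(labels)
    )
    jumps = (labels[:, 1:] != labels[:, :-1]).sum()    # horizontal jumps
    jumps += (labels[1:, :] != labels[:-1, :]).sum()   # vertical jumps
    return data_term + gamma * jumps

good = np.zeros((16, 32), dtype=int)
good[:, 16:] = 1                          # splits along the true texture edge
bad = np.zeros((16, 32), dtype=int)       # lumps everything into one region
```

A labelling aligned with the true texture boundary pays a small jump penalty but a small data term, whereas merging both textures into one region inflates the data term; learned features that make the feature image piecewise constant are exactly those for which this trade-off is easy.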