Classification with Asymmetric Label Noise: Consistency and Maximal Denoising
In many real-world classification problems, the labels of training examples
are randomly corrupted. Most previous theoretical work on classification with
label noise assumes that the two classes are separable, that the label noise is
independent of the true class label, or that the noise proportions for each
class are known. In this work, we give conditions that are necessary and
sufficient for the true class-conditional distributions to be identifiable.
These conditions are weaker than those analyzed previously, and allow for the
classes to be nonseparable and the noise levels to be asymmetric and unknown.
The conditions essentially state that a majority of the observed labels are
correct and that the true class-conditional distributions are "mutually
irreducible," a concept we introduce that limits the similarity of the two
distributions. For any label noise problem, there is a unique pair of true
class-conditional distributions satisfying the proposed conditions, and we
argue that this pair corresponds in a certain sense to maximal denoising of the
observed distributions.
Our results are facilitated by a connection to "mixture proportion
estimation," which is the problem of estimating the maximal proportion of one
distribution that is present in another. We establish a novel rate of
convergence result for mixture proportion estimation, and apply this to obtain
consistency of a discrimination rule based on surrogate loss minimization.
Experimental results on benchmark data and a nuclear particle classification
problem demonstrate the efficacy of our approach.
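
To make "the maximal proportion of one distribution that is present in
another" concrete, the sketch below computes it for two fully known discrete
distributions on a shared finite support. This is only an illustration under
those assumptions: the function name max_mixture_proportion is hypothetical,
and this is not the sample-based estimator or the rate-of-convergence analysis
from the paper. It uses the fact that f = alpha*h + (1 - alpha)*g with g a
valid distribution requires f(x) - alpha*h(x) >= 0 for every x.

```python
import numpy as np

def max_mixture_proportion(f, h):
    """Largest alpha in [0, 1] with f = alpha*h + (1 - alpha)*g for some
    probability vector g. Assumes f and h are known discrete distributions
    on the same finite support (an illustrative simplification)."""
    f = np.asarray(f, dtype=float)
    h = np.asarray(h, dtype=float)
    support = h > 0  # only points where h has mass constrain alpha
    # g >= 0 forces alpha <= f(x) / h(x) wherever h(x) > 0
    return float(min(1.0, np.min(f[support] / h[support])))

# Hypothetical example: f is built as a 30/70 mixture of h and g.
h = np.array([0.5, 0.5, 0.0])
g = np.array([0.0, 0.5, 0.5])
f = 0.3 * h + 0.7 * g
alpha = max_mixture_proportion(f, h)
```

In this example g places no mass on the first point, so no more of h can be
extracted from g, and the recovered proportion matches the 0.3 used to build f.
When g itself contains part of h, the maximal proportion exceeds the mixing
weight, which is the kind of ambiguity the irreducibility condition rules out.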
A transfer-learning approach to feature extraction from cancer transcriptomes with deep autoencoders
Published in Lecture Notes in Computer Science.
The diagnosis and prognosis of cancer are among the most challenging tasks in
oncology. To select the most appropriate treatment, current personalized
medicine uses data from heterogeneous sources to estimate how a given disease
will evolve in a particular patient. In recent years, next-generation
sequencing data have boosted cancer prediction by supplying gene-expression
information, which has allowed diverse machine learning algorithms to address
the problem of cancer subtype classification and has contributed to better
estimation of patients' responses to diverse treatments. However, the efficacy
of these models is seriously affected by the imbalance between the high
dimensionality of the gene expression feature sets and the number of samples
available for a particular cancer type. To counteract what is known as the
curse of dimensionality, feature selection and extraction methods have
traditionally been applied to reduce the number of input variables in gene
expression datasets. Although these techniques scale down the input feature
space, the prediction performance of traditional machine learning pipelines
that use these feature reduction strategies remains moderate. In this work, we
propose using the Pan-Cancer dataset to pre-train deep autoencoder
architectures on a subset composed of thousands of gene expression samples
from very diverse tumor types. The resulting architectures are subsequently
fine-tuned on a collection of breast cancer samples. This transfer-learning
approach aims at combining supervised and unsupervised deep learning models
with traditional machine learning classification algorithms to tackle the
problem of breast tumor intrinsic-subtype classification.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
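
The pre-train, fine-tune, then classify pipeline described in the abstract
above can be sketched in a few lines. Everything here is an assumption made
for illustration: a shallow tied-weight linear autoencoder stands in for the
deep architectures, random synthetic matrices stand in for the Pan-Cancer and
breast cancer expression data, a nearest-centroid rule stands in for the
traditional classifiers, and all function names are hypothetical. It is not
the authors' implementation.

```python
import numpy as np

def reconstruction_error(X, W):
    """Mean squared reconstruction error of the tied-weight autoencoder."""
    return float(np.mean((X @ W @ W.T - X) ** 2))

def train_autoencoder(X, W=None, n_hidden=3, lr=0.005, epochs=500, seed=0):
    """Gradient descent on 0.5 * mean_i ||x_i W W^T - x_i||^2.
    Passing an existing W continues training from it (fine-tuning)."""
    rng = np.random.default_rng(seed)
    if W is None:
        W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    for _ in range(epochs):
        Z = X @ W                # encode
        R = Z @ W.T - X          # reconstruction residual
        grad = (X.T @ R @ W + R.T @ Z) / len(X)
        W = W - lr * grad
    return W

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Stand-in 'traditional' classifier on the extracted features."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

# Synthetic stand-ins: a large unlabeled source set and a small labeled target set
rng = np.random.default_rng(42)
A = rng.normal(size=(3, 20)) / np.sqrt(3)   # shared latent-to-feature map
X_source = rng.normal(size=(2000, 3)) @ A + 0.1 * rng.normal(size=(2000, 20))
y_target = np.repeat([0, 1], 30)
latent_target = rng.normal(size=(60, 3)) + 1.0 * y_target[:, None]
X_target = latent_target @ A + 0.1 * rng.normal(size=(60, 20))

# 1) unsupervised pre-training on the large source set
W_pre = train_autoencoder(X_source)
# 2) fine-tuning on the small target set, starting from the pre-trained weights
W_fine = train_autoencoder(X_target, W=W_pre.copy(), lr=0.001, epochs=100)
# 3) encode the target samples and classify in the reduced feature space
Z_target = X_target @ W_fine
y_pred = nearest_centroid_predict(Z_target, y_target, Z_target)
```

The smaller learning rate and shorter schedule in step 2 are a common
fine-tuning choice rather than something the abstract specifies. In a real
evaluation the classifier would be scored on held-out samples, not on the
samples used to fit the centroids.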