2,525 research outputs found

    Classification with Asymmetric Label Noise: Consistency and Maximal Denoising

    Full text link
    In many real-world classification problems, the labels of training examples are randomly corrupted. Most previous theoretical work on classification with label noise assumes that the two classes are separable, that the label noise is independent of the true class label, or that the noise proportions for each class are known. In this work, we give conditions that are necessary and sufficient for the true class-conditional distributions to be identifiable. These conditions are weaker than those analyzed previously, and allow for the classes to be nonseparable and the noise levels to be asymmetric and unknown. The conditions essentially state that a majority of the observed labels are correct and that the true class-conditional distributions are "mutually irreducible," a concept we introduce that limits the similarity of the two distributions. For any label noise problem, there is a unique pair of true class-conditional distributions satisfying the proposed conditions, and we argue that this pair corresponds in a certain sense to maximal denoising of the observed distributions. Our results are facilitated by a connection to "mixture proportion estimation," which is the problem of estimating the maximal proportion of one distribution that is present in another. We establish a novel rate of convergence result for mixture proportion estimation, and apply this to obtain consistency of a discrimination rule based on surrogate loss minimization. Experimental results on benchmark data and a nuclear particle classification problem demonstrate the efficacy of our approach

    A transfer-learning approach to feature extraction from cancer transcriptomes with deep autoencoders

    Get PDF
    Publicado en Lecture Notes in Computer Science.The diagnosis and prognosis of cancer are among the more challenging tasks that oncology medicine deals with. With the main aim of fitting the more appropriate treatments, current personalized medicine focuses on using data from heterogeneous sources to estimate the evolu- tion of a given disease for the particular case of a certain patient. In recent years, next-generation sequencing data have boosted cancer prediction by supplying gene-expression information that has allowed diverse machine learning algorithms to supply valuable solutions to the problem of cancer subtype classification, which has surely contributed to better estimation of patient’s response to diverse treatments. However, the efficacy of these models is seriously affected by the existing imbalance between the high dimensionality of the gene expression feature sets and the number of sam- ples available for a particular cancer type. To counteract what is known as the curse of dimensionality, feature selection and extraction methods have been traditionally applied to reduce the number of input variables present in gene expression datasets. Although these techniques work by scaling down the input feature space, the prediction performance of tradi- tional machine learning pipelines using these feature reduction strategies remains moderate. In this work, we propose the use of the Pan-Cancer dataset to pre-train deep autoencoder architectures on a subset com- posed of thousands of gene expression samples of very diverse tumor types. The resulting architectures are subsequently fine-tuned on a col- lection of specific breast cancer samples. This transfer-learning approach aims at combining supervised and unsupervised deep learning models with traditional machine learning classification algorithms to tackle the problem of breast tumor intrinsic-subtype classification.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
    • …
    corecore