3 research outputs found

    AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications

    © 2018 IEEE. Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled whereas the remaining instances are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utility of the proposed methods using simulation and benchmark data, and compare them to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
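
    The iterative idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid scorer, the function names (`centroid`, `prob_positive`, `ada_sampling_sketch`) and the toy data are all assumptions chosen to keep the example dependency-free. Unlabeled instances start out treated as negatives, and after each round the chance of sampling an instance as a negative is set to the current estimate that it really is one.

```python
import math
import random


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def prob_positive(x, pos_c, neg_c):
    """Toy scorer (an assumption): logistic function of the distance gap."""
    d_pos = math.dist(x, pos_c)
    d_neg = math.dist(x, neg_c)
    return 1.0 / (1.0 + math.exp(d_pos - d_neg))


def ada_sampling_sketch(positives, unlabeled, n_iter=10, seed=0):
    """Return an estimated probability of being positive for each unlabeled row."""
    rng = random.Random(seed)
    pos_c = centroid(positives)
    # Initially, every unlabeled instance is assumed to be a true negative.
    p_neg = [1.0] * len(unlabeled)
    for _ in range(n_iter):
        # Sample the "negative" training set, favouring instances that are
        # currently believed to be negative (small floor avoids zero weights).
        weights = [max(p, 1e-9) for p in p_neg]
        sample = rng.choices(unlabeled, weights=weights, k=len(unlabeled))
        neg_c = centroid(sample)
        # Re-estimate how likely each unlabeled instance is a true negative.
        p_neg = [1.0 - prob_positive(x, pos_c, neg_c) for x in unlabeled]
    return [1.0 - p for p in p_neg]


if __name__ == "__main__":
    # Hypothetical toy data: the first two unlabeled rows are hidden positives.
    labeled_pos = [[2.0, 2.0], [2.2, 1.8], [1.8, 2.1]]
    unlabeled = [[2.1, 2.0], [1.9, 1.9], [-2.0, -2.0],
                 [-2.1, -1.9], [-1.8, -2.2], [-2.2, -2.1]]
    for row, p in zip(unlabeled, ada_sampling_sketch(labeled_pos, unlabeled)):
        print(row, round(p, 3))
```

    In practice the centroid scorer would be replaced by any probabilistic classifier; the sampling loop is the part that mirrors the description in the abstract.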

    Discovery and Interpretation of Subspace Structures in Omics Data by Low-Rank Representation

    Indiana University-Purdue University Indianapolis (IUPUI). Biological functions in cells are highly complicated and heterogeneous, and can be reflected by omics data, such as gene expression levels. Detecting subspace structures in omics data and understanding the diversity of the biological processes is essential to the full comprehension of biological mechanisms and complicated biological systems. In this thesis, we develop novel statistical learning approaches to reveal the subspace structures in omics data. Specifically, we focus on three types of subspace structures: low-rank subspace, sparse subspace and covariate-explainable subspace. For the low-rank subspace, we developed a semi-supervised model, SSMD, to detect cell-type-specific low-rank structures and predict their relative proportions across different tissue samples. SSMD is the first computational tool that uses semi-supervised identification of cell types and their marker genes specific to each mouse tissue transcriptomics dataset, for better understanding of the disease microenvironment and downstream disease mechanism. For the sparsity-driven sparse subspace, we proposed a novel positive and unlabeled learning model, namely PLUS, that can identify cancer metastasis related genes, predict cancer metastasis status and specifically address the under-diagnosis issue in studying metastasis potential. We found that the metastasis potential predicted by PLUS at diagnosis has a significantly strong association with patients' progression-free survival in their follow-up data. Lastly, to discover the covariate-explainable subspace, we proposed an analytical pipeline based on covariance regression, namely scCovReg. We used scCovReg to detect pathway-level second-order variations in scRNA-Seq data in a statistically powerful manner, and to associate the second-order variations with important subject-level characteristics, such as disease status.
    In conclusion, we presented a set of state-of-the-art computational solutions for identifying sparse subspaces in omics data, which promise to provide insights into the mechanisms of complex diseases.

    Automated Machine Learning for Positive-Unlabelled Learning

    Positive-Unlabelled (PU) learning is a field of machine learning that involves learning classifiers from data consisting of positive class and unlabelled instances, that is, instances that may be either positive or negative but whose label is unknown. PU learning differs from standard binary classification due to the absence of negative instances. This difference is non-trivial and requires differing classification frameworks and evaluation metrics. This thesis looks to address gaps in the PU learning literature and make PU learning more accessible to non-experts by introducing Automated Machine Learning (Auto-ML) systems specific to PU learning. Three such systems have been developed: GA-Auto-PU, a Genetic Algorithm (GA)-based Auto-ML system; BO-Auto-PU, a Bayesian Optimisation (BO)-based Auto-ML system; and EBO-Auto-PU, an Evolutionary/Bayesian Optimisation (EBO) hybrid-based Auto-ML system. These three Auto-ML systems are the three primary contributions of this work. EBO, the optimiser component of EBO-Auto-PU, is by itself a novel optimisation method developed in this work that has proved effective for the task of Auto-ML and represents another contribution. EBO was developed with the aim of acting as a trade-off between GA, which achieved high predictive performance but at high computational expense, and BO, which, when utilised by the Auto-PU system, did not perform as well as the GA-based system but did execute much faster. EBO achieved this aim, providing high predictive performance with a computational runtime much faster than the GA-based system, and not substantially slower than the BO-based system. The proposed Auto-ML systems for PU learning were evaluated on three versions of 40 datasets, and thus on 120 learning tasks in total. The 40 datasets consist of 20 real-world biomedical datasets and 20 synthetic datasets. The main evaluation measure was the F-measure, a popular measure in PU learning.
    Based on the F-measure results, the three proposed systems generally outperformed two baseline PU learning methods, usually with statistically significant results. Among the three proposed systems, there was in general no statistically significant difference between their results, whilst a version of the EBO-Auto-PU system performed overall slightly better than the other systems in terms of F-measure. The two other main contributions of this work relate specifically to the field of PU learning. Firstly, in this work we present and utilise a robust evaluation approach. Evaluating PU learning classifiers is non-trivial and little guidance has been provided in the literature on how to do so. In this work, we present a clear framework for evaluation and use this framework to evaluate the proposed systems. Secondly, when evaluating the proposed systems, an analysis of the most frequently selected components of the optimised PU learning algorithm is presented, that is, the components that constitute the PU learning algorithms produced by the optimisers (for example, the choice of classifiers used in the algorithm, the number of iterations, etc.). This analysis is used to provide guidance on the construction of PU learning algorithms for specific dataset characteristics.
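
    For reference, the F-measure named in the abstract above is the harmonic mean of precision and recall on the positive class. The sketch below is a plain illustration, not the thesis's evaluation framework: it assumes that true labels are available for the evaluation instances (the abstract does not say how they are obtained), and the function name `f_measure` is hypothetical.

```python
def f_measure(y_true, y_pred):
    """F-measure (F1) for the positive class, with labels encoded as 1/0.

    Assumption: y_true holds the real labels of the evaluation instances,
    not the positive/unlabelled flags seen during training.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision and recall are zero
        # or undefined, so the score is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
    print(f_measure([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```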