    Bayesian Multiple Instance Learning with Application to Cancer Detection Using TCR Repertoire Sequencing Data

    As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. Each instance is described by a feature vector. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. The first part of this dissertation focuses on a comparative study of MIL methods for a novel biomedical application. To date, the majority of off-the-shelf MIL methods have been developed in the computer science domain and are therefore algorithm-driven. We review and apply a large collection of existing methods to investigate the applicability of MIL to cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject's peripheral blood. Based on our numerical results from extensive simulation and analysis of sequencing data from The Cancer Genome Atlas for ten types of cancer, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further identify a pressing need for new model-based MIL methodologies for accurate modeling of increasingly complex structures of real-world data and more explainable outcomes. The second part of this dissertation proposes a novel Bayesian MIL method for binary classification based on hierarchical probit regression (MICProB), which makes a significant contribution to the suite of statistical methodologies for MIL.
MICProB is composed of two nested probit regression models: the inner model predicts primary instances, which are considered the "important" ones that determine the bag label, and the outer model predicts bag labels based on the features of the primary instances estimated by the inner model. The posterior distribution of MICProB can be conveniently approximated using a Gibbs sampler, and the prediction for new bags can be performed in a fully integrated Bayesian way. We evaluate the performance of MICProB against various benchmark methods and demonstrate its competitiveness in simulation and real data examples. In addition to its capability of identifying primary instances, as compared to existing optimization-based approaches, MICProB also enjoys great advantages in providing a transparent model structure, straightforward statistical inference of quantities related to model parameters, and favorable interpretability of covariate effects on the bag-level response.
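
The nested structure described above can be illustrated with a small forward simulation. This is only a sketch under stated assumptions, not the MICProB model itself: the abstract does not say how the features of primary instances are combined, so averaging them is an assumption here, as is the fallback used when no instance is selected. The names `inner_w`, `outer_w`, `outer_b`, and `bag_probability` are hypothetical, and no Gibbs sampling or posterior inference is attempted.

```python
import math
import random


def probit(x):
    """Standard normal CDF, i.e. the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def bag_probability(bag, inner_w, outer_w, outer_b, rng):
    """Draw primary instances with the inner probit model, then return
    the outer probit probability that the bag label is positive."""
    # Inner model: each instance is flagged as primary with
    # probability probit(inner_w . x).
    primary = [x for x in bag if rng.random() < probit(dot(inner_w, x))]
    if not primary:
        # Assumption: with no primary instance, only the intercept is used.
        return probit(outer_b)
    # Assumption: the outer model sees the mean feature vector of the
    # primary instances.
    dim = len(bag[0])
    mean = [sum(x[j] for x in primary) / len(primary) for j in range(dim)]
    return probit(outer_b + dot(outer_w, mean))


# A toy bag of three instances with two features each (made-up numbers).
toy_bag = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]
p = bag_probability(toy_bag, inner_w=[1.0, 0.0], outer_w=[0.4, 0.8],
                    outer_b=-0.5, rng=random.Random(0))
```

In a Bayesian treatment the weights would carry priors and be sampled rather than fixed; the sketch only shows how the two probit layers feed into each other.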

    A comparative study of multiple instance learning methods for cancer detection using T-cell receptor sequences

    As a branch of machine learning, multiple instance learning (MIL) learns from a collection of labeled bags, each containing a set of instances. The learning process is weakly supervised due to ambiguous instance labels. Since its emergence, MIL has been applied to solve various problems including content-based image retrieval, object tracking/detection, and computer-aided diagnosis. In biomedical research, the use of MIL has been focused on medical image analysis and molecule activity prediction. We review and apply 16 methods to investigate the applicability of MIL to a novel biomedical application, cancer detection using T-cell receptor (TCR) sequences. This important application can be a viable approach for large-scale cancer screening, as TCRs can be easily profiled from a subject’s peripheral blood. We consider two feasible data-generating mechanisms, and for the purpose of performance evaluation, we simulate data under each mechanism, varying potentially important factors to mimic realistic situations. We also apply the methods to sequencing data of ten cancer types from The Cancer Genome Atlas, as an early proof of concept for distinguishing tumor patients from healthy individuals via TCR sequencing of peripheral blood. We find that, provided an appropriate MIL method is used, satisfactory performance with Area Under the Receiver Operating Characteristic Curve above 80% can be achieved for five of the ten cancers. Based on our numerical results, we make suggestions about selection of a proper method and avoidance of any method with poor performance. We further point out directions of future research and identify a pressing need for new MIL methodologies for improved performance (for some cancer types) and more explainable outcomes.
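
To make the bag/instance setup and the AUC criterion concrete, here is a minimal sketch. It assumes each bag is a list of feature vectors, scores a bag by its highest-scoring instance (a common MIL baseline, not necessarily one of the 16 methods compared), and computes AUC as the fraction of positive/negative bag pairs ranked correctly. The toy data and the names `bag_score` and `auc` are hypothetical.

```python
def bag_score(bag, instance_score):
    """Score a bag by its highest-scoring instance (max pooling)."""
    return max(instance_score(x) for x in bag)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison:
    the share of (positive, negative) pairs where the positive
    bag scores higher, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Four toy bags with one feature per instance; labels are per bag,
# never per instance, which is what makes the setting weakly supervised.
bags = [[[0.1], [0.9]], [[0.2], [0.3]], [[0.8], [0.1]], [[0.4]]]
labels = [1, 0, 1, 0]

scores = [bag_score(b, lambda x: x[0]) for b in bags]
toy_auc = auc(labels, scores)
```

On real repertoire data the instance scorer would be learned from the bag labels; here it is simply the raw feature value so the example stays self-contained.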

    Genome-wide association mapping and transcriptomic analysis reveal key drought-responding genes in barley seedlings

    Drought stress is a major abiotic factor restricting crop production. Barley frequently suffers from drought stress as it is mainly planted in harsh environments. Little research on the identification of drought-tolerant loci or genes of barley has been performed to date. Here, we determined the phenotypic variation of drought tolerance in a barley population from the International Barley Core Selected Collection (BCS). Under drought stress, shoot water content showed distinct differences among barley genotypes and the greatest consistency for a given genotype under the two planting conditions. Twenty significant SNPs (P < 10⁻³) and 41 candidate genes were identified by genome-wide association study (GWAS) on the examined barley accessions. Furthermore, transcriptomic analysis (RNA-Seq) identified 2030 and 1947 differentially expressed genes (DEGs) in the leaves of a drought-sensitive genotype BCS8 and a tolerant genotype BS24, respectively, and they are mainly involved in water deficit processes in GO analysis and metabolic pathways in KEGG analysis. Finally, seven DEGs were confirmed by qRT-PCR as drought-responding genes, including WRKY, NPF and FLA.

    Genome-Wide Identification, Expression Pattern and Sequence Variation Analysis of SnRK Family Genes in Barley

    Sucrose non-fermenting 1 (SNF1)-related protein kinase (SnRK) is a large family of protein kinases that play a significant role in plant stress responses. Although intensive studies have been conducted on SnRK members in some crops, little is known about the SnRK family in barley. Using phylogenetic and conserved motif analyses, we discovered 46 SnRK members scattered across barley’s 7 chromosomes and classified them into 3 subfamilies. The gene structures of HvSnRKs diverged among the three subfamilies. Gene duplication and synteny analyses on the genomes of barley and rice revealed the evolutionary features of HvSnRKs. The promoter regions of HvSnRK family genes contained many ABRE, MBS and LTR elements responding to abiotic stresses, and their expression patterns varied with different plant tissues and abiotic stresses. HvSnRKs could interact with the components of the ABA signaling pathway to respond to abiotic stress. Moreover, haplotypes of HvSnRK2.5 closely associated with drought tolerance were detected in a barley core collection. The current results could be helpful for further exploration of the HvSnRK genes involved in abiotic stress tolerance in barley.

    A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns

    In cancer, the primary tumour's organ of origin and histopathology are the strongest determinants of its clinical behaviour, but in 3% of cases a patient presents with a metastatic tumour and no obvious primary. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we train a deep learning classifier to predict cancer type based on patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2606 tumours representing 24 common cancer types produced by the PCAWG Consortium. Our classifier achieves an accuracy of 91% on held-out tumour samples and 88% and 83% respectively on independent primary and metastatic samples, roughly double the accuracy of trained pathologists when presented with a metastatic tumour without knowledge of the primary. Surprisingly, adding information on driver mutations reduced accuracy. Our results have clinical applicability, underscore how patterns of somatic passenger mutations encode the state of the cell of origin, and can inform future strategies to detect the source of circulating tumour DNA.