4 research outputs found

    In silico phenotyping via co-training for improved phenotype prediction from genotype

    Get PDF
    Motivation: Predicting disease phenotypes from genotypes is a key challenge in medical applications in the postgenomic era. Large training datasets of patients that have been both genotyped and phenotyped are the key requisite when aiming for high prediction accuracy. With current genotyping projects producing genetic data for hundreds of thousands of patients, large-scale phenotyping has become the bottleneck in disease phenotype prediction. Results: Here we present an approach for imputing missing disease phenotypes given the genotype of a patient. Our approach is based on co-training, which predicts the phenotype of unlabeled patients based on a second class of information, e.g. clinical health record information. Augmenting training datasets by this type of in silico phenotyping can lead to significant improvements in prediction accuracy. We demonstrate this on a dataset of patients with two diagnostic types of migraine, termed migraine with aura and migraine without aura, from the International Headache Genetics Consortium. Conclusions: Imputing missing disease phenotypes for patients via co-training leads to larger training datasets and improved prediction accuracy in phenotype prediction. Availability and implementation: The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/co-training.html Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics onlin

    Predicting enhancers using a small subset of high confidence examples and co-training

    Get PDF
    ABSTRACT Enhancers are important regulatory regions located throughout the genome, primarily in non-coding regions. Several experimental methods have been developed over the last several years to identify their location, but the search space is large and the overlap between the putative enhancer identified using these methods tends to be very small. Computational methods for enhancer prediction often use one large set of experimentally identified enhancer regions as input, and therefore rely critically on their correctness. We chose to take a different approach, and start with a high confidence set of 21 enhancer that are in the intersection of enhancers identified using three completely unrelated experimental approaches: deepCAGE, HiCap and classical enhancer reporter assays. Because this starting set is so small, we use a semi-supervised approach called co-training rather than a fully supervised approach to progressively predict enhancers from unlabeled regions. Using this approach we are able to outperform supervised learning as well as simpler semi-supervised learning methods and achieve an average area under the ROC curve of 0.84

    Multi-view Co-training for microRNA Prediction

    Get PDF
    MicroRNA (miRNA) are short, non-coding RNAs involved in cell regulation at post-transcriptional and translational levels. Numerous computational predictors of miRNA been developed that generally classify miRNA based on either sequence- or expression-based features. While these methods are highly effective, they require large labelled training data sets, which are often not available for many species. Simultaneously, emerging high-throughput wet-lab experimental procedures are producing large unlabelled data sets of genomic sequence and RNA expression profiles. Existing methods use supervised machine learning and are therefore unable to leverage these unlabelled data. In this paper, we design and develop a multi-view co-training approach for the classification of miRNA to maximize the utility of unlabelled training data by taking advantage of multiple views of the problem. Starting with only 10 labelled training data, co-training is shown to significantly (p < 0.01) increase classification accuracy of both sequence- and expression-based classifiers, without requiring any new labelled training data. After 11 iterations of co-training, the expression-based view of miRNA classification experiences an average increase in AUPRC of 15.81% over six species, compared to 11.90% for self-training and 4.84% for passive learning. Similar results are observed for sequence-based classifiers with increases of 46.47%, 39.53% and 29.43%, for co-training, self-training, and passive learning, respectively. The final co-trained sequence and expression-based classifiers are integrated into a final confidence-based classifier which shows improved performance compared to both the expression (1.5%, p = 0.021) and sequence (3.7%, p = 0.006) views. This study represents the first application of multi-view co-training to miRNA prediction and shows great promise, particularly for understudied species with few available training data

    In silico phenotyping via co-training for improved phenotype prediction from genotype

    No full text
    Functional Genomics of Muscle, Nerve and Brain Disorder
    corecore