365 research outputs found

    Chipper: discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization

    Chromatin immunoprecipitation combined with microarray technology (Chip(2)) allows genome-wide determination of protein-DNA binding sites. The current standard method for analyzing Chip(2) data requires additional control experiments that are subject to systematic error. We developed methods to assess significance using variance stabilization, learning error-model parameters without external control experiments. The method was validated experimentally, shows greater sensitivity than the current standard method, and incorporates false-discovery rate analysis. The corresponding software ('Chipper') is freely available. The method described here should help reveal an organism's transcription-regulatory 'wiring diagram'.

    Predicting co-complexed protein pairs using genomic and proteomic data integration

    BACKGROUND: Identifying all protein-protein interactions in an organism is a major objective of proteomics. A related goal is to know which protein pairs are present in the same protein complex. High-throughput methods such as yeast two-hybrid (Y2H) and affinity purification coupled with mass spectrometry (APMS) have been used to detect interacting proteins on a genomic scale. However, both Y2H and APMS methods have substantial false-positive rates. Aside from high-throughput interaction screens, other gene- or protein-pair characteristics may also be informative of physical interaction. It is therefore desirable to integrate multiple datasets and exploit their differing predictive value to predict co-complexed relationships more accurately. RESULTS: Using a supervised machine learning approach, a probabilistic decision tree, we integrated high-throughput protein interaction datasets and other gene- and protein-pair characteristics to predict co-complexed pairs (CCP) of proteins. Our predictions proved more sensitive and specific than predictions based on Y2H or APMS methods alone or in combination. Among the top predictions not annotated as CCPs in our reference set (obtained from the MIPS complex catalogue), a significant fraction was found to physically interact according to a separate database (YPD, Yeast Proteome Database), and the remaining predictions may represent unknown CCPs. CONCLUSIONS: We demonstrated that the probabilistic decision tree approach can be successfully used to predict co-complexed protein (CCP) pairs from other characteristics. Our top-scoring CCP predictions provide testable hypotheses for experimental validation.

    Genome-wide functional analysis of human 5' untranslated region introns

    Genes with short 5'UTR introns have higher expression than genes with no or long 5'UTR introns. Complex evolutionary forces act on these introns.

    An en masse phenotype and function prediction system for Mus musculus

    Background: Individual researchers are struggling to keep up with the accelerating emergence of high-throughput biological data, and to extract information that relates to their specific questions. Integration of accumulated evidence should permit researchers to form fewer, and more accurate, hypotheses for further study through experimentation. Results: Here a method previously used to predict Gene Ontology (GO) terms for Saccharomyces cerevisiae (Tian et al.: Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biol 2008, 9(Suppl 1):S7) is applied to predict GO terms and phenotypes for 21,603 Mus musculus genes, using a diverse collection of integrated data sources (including expression, interaction, and sequence-based data). This combined 'guilt-by-profiling' and 'guilt-by-association' approach optimizes the combination of two inference methodologies. Predictions at all levels of confidence are evaluated by examining genes not used in training, and top predictions are examined manually using available literature and knowledge base resources. Conclusion: We assigned a confidence score to each gene/term combination. The results provided high prediction performance, with nearly every GO term achieving greater than 40% precision at 1% recall. Among the 36 novel predictions for GO terms and 40 for phenotypes that were studied manually, >80% and >40%, respectively, were identified as accurate. We also illustrate that a combination of 'guilt-by-profiling' and 'guilt-by-association' outperforms either approach alone in their application to M. musculus.
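The "precision at 1% recall" figure quoted above can be made concrete with a small sketch. This is only an illustration of the metric, not the authors' evaluation code: the function name `precision_at_recall` and the toy scores and labels are hypothetical, and tied scores are ranked in arbitrary order.

```python
def precision_at_recall(scores, labels, recall_level):
    """Precision at the first score-ranked cutoff whose recall reaches recall_level.

    scores: confidence scores, higher means more confident (hypothetical input).
    labels: 1 for a true gene/term annotation, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= recall_level:
            return true_pos / k  # precision among the top k predictions
    return true_pos / len(ranked)


# Toy example: four predictions, two of which are correct.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(precision_at_recall(toy_scores, toy_labels, 0.5))
```

At low recall levels such as 1%, the metric looks only at the most confident predictions, which is why it suits a setting where a researcher follows up a handful of top-ranked hypotheses.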

    Characterizing ABC-Transporter Substrate-Likeness Using a Clean-Slate Genetic Background

    Mutations in ATP Binding Cassette (ABC)-transporter genes can have major effects on the bioavailability and toxicity of the drugs that are ABC-transporter substrates. Consequently, methods to predict whether a drug is an ABC-transporter substrate are useful for drug development. Such methods traditionally relied on literature-curated collections of ABC-transporter-dependent membrane transfer assays. Here, we used a single large-scale dataset of 376 drugs with relative efficacy on an engineered yeast strain with all ABC-transporter genes deleted (ABC-16), to explore the relationship between a drug’s chemical structure and ABC-transporter substrate-likeness. We represented a drug’s chemical structure by an array of substructure keys and explored several machine learning methods to predict the drug’s efficacy in an ABC-16 yeast strain. Gradient-Boosted Random Forest models outperformed all other methods with an AUC of 0.723. We prospectively validated the model using new experimental data and found significant agreement with predictions. Our analysis expands the previously reported chemical substructures associated with ABC-transporter substrates and provides an alternative means to investigate ABC-transporter substrate-likeness.
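For readers unfamiliar with the AUC figure reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from model scores, using the pairwise (Mann-Whitney) formulation. It is not the authors' pipeline: the function name `roc_auc` and the toy data are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: predicted substrate-likeness for four hypothetical drugs.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.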

    A Common Class of Transcripts with 5'-Intron Depletion, Distinct Early Coding Sequence Features, and N1-Methyladenosine Modification [preprint]

    Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions (5IM transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the Exon Junction Complex (EJC) at non-canonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ~20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for non-canonical binding by the Exon Junction Complex.

    A common class of transcripts with 5'-intron depletion, distinct early coding sequence features, and N1-methyladenosine modification

    Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions (5IM transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising approximately 20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC.