7,090 research outputs found

    Development of New Bioinformatic Approaches for Human Genetic Studies

    The development of bioinformatics methods for human genetic studies draws on the vast amount of available data to generate new, valuable information. Machine learning and statistical coupling analysis can be used in the study of human diseases. These diseases include intellectual disabilities (ID), which affect 1-3% of the population and are caused primarily by genetic factors. Although many cases of ID are caused by mutations in protein-coding genes, the possible involvement of long non-coding RNAs (lncRNAs) in ID, owing to their role in gene expression regulation, has also been explored. In this study, we used machine learning to develop a new expression-based model trained using ID genes encoded with the developing brain transcriptome. The model was fine-tuned using the class-balancing approach of synthetic over-sampling of the minority class, resulting in improved performance. We used the model to predict candidate ID-associated lncRNAs. Our model identified several candidates that overlapped with previously reported ID-associated lncRNAs, were enriched with neurodevelopmental functions, and were highly expressed in brain tissues. Machine learning was also used to predict protein stability changes caused by missense mutations, which can lead to disease conditions including ID. We tested Random Forests, Support Vector Machines (SVM) and Naïve Bayes to find the best-performing algorithm for a multi-class classifier. We developed an SVM model using relevant physico-chemical features after feature selection. Our work identified new features for predicting the effect of amino acid substitutions on protein stability and a well-performing multi-class classifier based solely on sequence information. Statistical approaches were used to analyze the association between mutations and phenotypes. In this study, we used statistical coupling analysis (SCA) to cluster disease-causing mutations and ID phenotypes. Using SCA, we identified groups of co-evolving residues, known as protein sectors, in ID protein families. Within each distinct sector, we identified mutations associated with different phenotypic manifestations of a syndromic ID. Our results suggest that protein sector analysis can be used to associate mutations with phenotypic manifestations in human diseases. The bioinformatic methods developed in this dissertation can be used in human genetic research to understand the role of new genes and proteins in human disease.
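
The class-balancing step mentioned in the abstract above (synthetic over-sampling of the minority class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, the toy feature vectors, and the parameters `k` and `seed` are assumptions made for illustration only. Each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority-class
    points and their k nearest minority-class neighbours (toy sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first
        # (squared Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature expression vectors for the minority class.
minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_class, n_new=5)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by that class.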

    Accurate Prediction of the Functional Significance of Single Nucleotide Polymorphisms and Mutations in the ABCA1 Gene

    The human genome contains an estimated 100,000 to 300,000 DNA variants that alter an amino acid in an encoded protein. However, our ability to predict which of these variants are functionally significant is limited. We used a bioinformatics approach to define the functional significance of genetic variation in the ABCA1 gene, which encodes a cholesterol transporter crucial for the metabolism of high-density lipoprotein cholesterol. To predict the functional consequence of each coding single nucleotide polymorphism and mutation in this gene, we calculated a substitution position-specific evolutionary conservation score for each variant, which considers site-specific variation among evolutionarily related proteins. To test the bioinformatics predictions experimentally, we evaluated the biochemical consequence of these sequence variants by examining the ability of cell lines stably transfected with the ABCA1 alleles to elicit cholesterol efflux. Our bioinformatics approach correctly predicted the functional impact of more than 94% of the naturally occurring variants we assessed. The bioinformatics predictions were significantly correlated with the degree of functional impairment of ABCA1 mutations (r² = 0.62, p = 0.0008). These results have allowed us to define the impact of genetic variation on ABCA1 function and suggest that the in silico evolutionary approach we used may be a useful general tool for predicting the effects of DNA variation on gene function. In addition, our data suggest that considering patterns of positive selection, along with patterns of negative selection such as evolutionary conservation, may improve our ability to predict the functional effects of amino acid variation.
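
The idea of scoring a substitution against site-specific variation among related proteins can be sketched as follows. The abstract does not give the authors' scoring formula, so this example swaps in a simple entropy-based column conservation measure as a stand-in; the function names and the three-sequence toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column):
    """1 minus the normalised Shannon entropy of an alignment column.
    1.0 means the position is invariant across the aligned proteins."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 amino acids gives the maximum entropy


def score_substitution(alignment, position, new_residue):
    """Return (conservation of the column, whether the substituted residue
    is already observed at that position in a related protein)."""
    column = [seq[position] for seq in alignment]
    return column_conservation(column), new_residue in column


# Hypothetical toy alignment of three related protein fragments.
toy_alignment = ["MKT", "MRT", "MQT"]
invariant_site = score_substitution(toy_alignment, 0, "L")
variable_site = score_substitution(toy_alignment, 1, "R")
```

Under this stand-in score, a substitution at a highly conserved column to a residue never seen in related proteins would be flagged as more likely to be functionally significant than one at a variable column.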

    Phylogenetic influence of complex, evolutionary models: a Bayesian approach

    Molecular evolution recovers the history of living species by comparing genetic information, exploring genome structure and function from an evolutionary perspective. Here we infer substitution rates and ancestral reconstructions to better understand mutation responses to some known biochemical phenomena. Mutation processes are commonly inferred using parsimony, maximum likelihood and Bayesian methods. Parsimony is not explicitly model-based, and is statistically biased due to unrealistic assumptions. The model-based maximum likelihood approaches become computationally inefficient when analyzing large or high-dimensional datasets, leaving little opportunity to incorporate complex evolutionary models. We implemented a posterior probability (Bayesian) approach that evaluates evolutionary models, applying it to primate mitochondrial genomes. The species nucleotide sequence data were augmented with ancestral states at the internal nodes of the phylogeny. We simplified probability calculations for substitution events along the branches by assuming that at most one or two substitution events occurred per branch per site. These conditional pathway calculations introduce very little bias into the inferred reconstructions, while increasing the feasibility of incorporating complex evolutionary models with higher dimensions. Compositional bias tests, including functional predictions of ancestral tRNAs, show that ancestral sequences from the Bayesian approach are more biologically realistic than those reconstructed by maximum likelihood. To explore further model complexity, we allowed substitution rates to vary among sites by having a different model at each site. With a strand-symmetric model as the base model, asymmetric substitution probabilities for specific substitution types were varied among sites. This model would not be feasible with standard matrix exponentiation methods, particularly maximum likelihood. We observed almost linear responses for A→G substitutions and almost asymptotic responses for C→T substitutions (with some regional deviations). Note that the HMM models had no a priori response built into them. The observed responses fitted predictions from earlier gene-by-gene likelihood analyses. For A→G substitutions, deviations from the expected linear response correlated positively with the loop-forming propensity of the corresponding site in the mRNA secondary structure. In the COI region, C→T substitutions show a prominent dip, suggesting protection against mutations. The C→T substitution responses differed significantly between primate sub-groups defined based on their single-genome A→G responses.
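
The effect of limiting each branch to at most one or two substitution events can be illustrated with a small sketch. It assumes the Jukes-Cantor (JC69) model, which is far simpler than the models in the abstract and is chosen here only because its exact transition probabilities are known in closed form. Under JC69 every base has the same total substitution rate, so the number of events on a branch is Poisson distributed and each event moves to one of the three other bases with probability 1/3. The function names and the `rate`, `t` and `max_events` parameters are assumptions for illustration.

```python
import math

BASES = "ACGT"


def matmul(x, y):
    """Plain-Python product of two square matrices."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_transition(rate, t, max_events=2):
    """P(end in base j with at most max_events substitutions | start in base i)
    under JC69, where `rate` is the total substitution rate per site."""
    n = len(BASES)
    # Jump chain: each substitution goes to one of the other three bases.
    jump = [[0.0 if i == j else 1.0 / 3.0 for j in range(n)] for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [[0.0] * n for _ in range(n)]
    lam = rate * t  # expected number of substitution events on the branch
    for k in range(max_events + 1):
        weight = math.exp(-lam) * lam ** k / math.factorial(k)  # Poisson P(N = k)
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = matmul(power, jump)  # jump^(k+1) for the next term
    return total


def exact_jc69(rate, t):
    """Closed-form JC69 transition probabilities for comparison."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    same, diff = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


approx = truncated_transition(rate=1.0, t=0.1)
exact = exact_jc69(rate=1.0, t=0.1)
```

On a short branch the truncated probabilities differ from the exact ones only by the probability of three or more events, which is consistent with the abstract's statement that the simplification introduces very little bias.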

    Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

    Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation in transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combinations of regulatory elements, that define gene expression levels.

    Host sequence motifs shared by HIV predict response to antiretroviral therapy

    Background: The HIV viral genome mutates at a high rate and poses a significant long-term health risk even in the presence of combination antiretroviral therapy. Current methods for predicting a patient's response to therapy rely on site-directed mutagenesis experiments and in vitro resistance assays. In this bioinformatics study we treat response to antiretroviral therapy as a two-body problem: response to therapy is considered to be a function of both the host and pathogen proteomes. We set out to identify potential responders based on the presence or absence of host protein and DNA motifs on the HIV proteome. Results: An alignment of thousands of HIV-1 sequences attested to extensive variation in nucleotide sequence but also showed conservation of eukaryotic short linear motifs on the protein coding regions. The reduction in viral load of patients in the Stanford HIV Drug Resistance Database exhibited a bimodal distribution after 24 weeks of antiretroviral therapy, with a cutoff of 2,000 copies/ml. Similarly, patients allocated into responder/non-responder categories based on consistent viral load reduction during a 24-week period showed clear separation. In both cases of phenotype identification, a set of features composed of short linear motifs in the reverse transcriptase region of the HIV sequence accurately predicted a patient's response to therapy. Motifs that overlap resistance sites were highly predictive of responder identification in single-drug regimens, but these features lost importance in defining responders in multi-drug therapies. Conclusion: The HIV sequence mutates in a way that preferentially preserves peptide sequence motifs that are also found in the human proteome. The presence and absence of such motifs at specific regions of the HIV sequence is highly predictive of response to therapy. Some of these predictive motifs overlap with known HIV-1 resistance sites. These motifs are well established in bioinformatics databases and hence do not require identification via in vitro mutation experiments.
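
Turning the presence or absence of short linear motifs into classifier features can be sketched as below. The three regular-expression patterns and the example peptide are hypothetical toy values, not motifs or sequences taken from the study; in practice, motif definitions would come from a curated motif database, as the abstract indicates.

```python
import re

# Hypothetical toy motif patterns written as regular expressions.
MOTIFS = {
    "PxxP": re.compile(r"P..P"),
    "RxL": re.compile(r"R.L"),
    "NxS/T": re.compile(r"N[^P][ST]"),
}


def motif_features(protein_seq):
    """Map a protein sequence to a binary presence/absence vector,
    one entry per motif."""
    return {name: int(bool(pattern.search(protein_seq)))
            for name, pattern in MOTIFS.items()}


# Hypothetical peptide standing in for a reverse transcriptase fragment.
features = motif_features("MPAAPKRNGS")
```

A vector like this, computed per patient sequence, is the kind of input a standard classifier could use to separate responders from non-responders.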