7,090 research outputs found

Linear predictive coding representation of correlated mutation for protein sequence alignment

Author: A Elofsson
AG Murzin
AS Yang
BC Lee
Chan-seok Jeong
CM Buslje
D Cozzetto
Dongsup Kim
DT Jones
E Neher
ER Tillier
G Shackelford
GJ Bartlett
GM Süel
J Kleinjung
J Kopp
J Söding
JM Chandonia
JP Dekker
LR Rabiner
M Lee
N Siew
O Olmea
S Wu
SD Dunn
SF Altschul
SW Lockless
T Ohlson
T Pham
U Göbel
WR Atchley
Y Qi
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central

Development of New Bioinformatic Approaches for Human Genetic Studies

Author: Coto Jose Andres Guevara
Publication venue: Clemson University Libraries
Publication date: 01/12/2017

The development of bioinformatics methods for human genetic studies uses the vast amount of available data to generate valuable new information. Machine learning and statistical coupling analysis can be used in the study of human diseases. These diseases include intellectual disabilities (ID), prevalent in 1-3% of the population and caused primarily by genetics. Although many cases of ID are caused by mutations in protein-coding genes, the possible involvement of long non-coding RNAs (lncRNAs) in ID, owing to their role in gene expression regulation, has also been explored. In this study, we used machine learning to develop a new expression-based model trained using ID genes encoded with the developing brain transcriptome. The model was fine-tuned using the class-balancing approach of synthetic over-sampling of the minority class, resulting in improved performance. We used the model to predict candidate ID-associated lncRNAs. Our model identified several candidates that overlapped with previously reported ID-associated lncRNAs, were enriched with neurodevelopmental functions, and were highly expressed in brain tissues. Machine learning was also used to predict protein stability changes caused by missense mutations, which can lead to disease conditions including ID. We tested Random Forests, Support Vector Machines (SVM) and Naïve Bayes to find the best-performing algorithm to develop a multi-class classifier. We developed an SVM model using relevant physico-chemical features after feature selection. Our work identified new features for predicting the effect of amino acid substitutions on protein stability and a well-performing multi-class classifier based solely on sequence information. Statistical approaches were used to analyze the association between mutations and phenotypes. In this study, we used statistical coupling analysis (SCA) to cluster disease-causing mutations and ID phenotypes.
Using SCA, we identified groups of co-evolving residues, known as protein sectors, in ID protein families. Within each distinct sector, we identified mutations associated with different phenotypic manifestations of a syndromic ID. Our results suggest that protein sector analysis can be used to associate mutations with phenotypic manifestations in human diseases. The bioinformatic methods developed in this dissertation can be used in human genetic research to understand the role of new genes and proteins in human disease.
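
The abstract above mentions "synthetic over-sampling of the minority class" without saying which variant or library was used. As a rough illustration only, the following is a minimal SMOTE-style sketch in plain Python. The function name `smote_oversample`, its parameters, and the toy data are assumptions made for this example and are not taken from the dissertation.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: a simplified SMOTE-style scheme, not the
    dissertation's actual implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features per sample (illustrative values only)
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without exact duplicates being added.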

Clemson University: TigerPrints

Accurate Prediction of the Functional Significance of Single Nucleotide Polymorphisms and Mutations in the ABCA1 Gene

Author: Anish Kejariwal
David Allison
Liam R Brunham
Michael R Hayden
Paul D Thomas
Roshni R Singaraja
Terry D Pape
Publication venue: Public Library of Science
Publication date: 01/01/2005

The human genome contains an estimated 100,000 to 300,000 DNA variants that alter an amino acid in an encoded protein. However, our ability to predict which of these variants are functionally significant is limited. We used a bioinformatics approach to define the functional significance of genetic variation in the ABCA1 gene, a cholesterol transporter crucial for the metabolism of high density lipoprotein cholesterol. To predict the functional consequence of each coding single nucleotide polymorphism and mutation in this gene, we calculated a substitution position-specific evolutionary conservation score for each variant, which considers site-specific variation among evolutionarily related proteins. To test the bioinformatics predictions experimentally, we evaluated the biochemical consequence of these sequence variants by examining the ability of cell lines stably transfected with the ABCA1 alleles to elicit cholesterol efflux. Our bioinformatics approach correctly predicted the functional impact of greater than 94% of the naturally occurring variants we assessed. The bioinformatics predictions were significantly correlated with the degree of functional impairment of ABCA1 mutations (r² = 0.62, p = 0.0008). These results have allowed us to define the impact of genetic variation on ABCA1 function and to suggest that the in silico evolutionary approach we used may be a useful tool in general for predicting the effects of DNA variation on gene function. In addition, our data suggest that considering patterns of positive selection, along with patterns of negative selection such as evolutionary conservation, may improve our ability to predict the functional effects of amino acid variation.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS

Phylogenetic influence of complex, evolutionary models: a Bayesian approach

Author: Krishnan Neeraja M
Publication venue: LSU Digital Commons
Publication date: 01/01/2004

Molecular evolution recovers the history of living species by comparing genetic information, exploring genome structure and function from an evolutionary perspective. Here we infer substitution rates and ancestral reconstructions, to better understand mutation responses to some known biochemical phenomena. Mutation processes are commonly inferred using parsimony, maximum likelihood and Bayesian methods. Parsimony is not explicitly model-based, and is statistically biased due to unrealistic assumptions. The model-based maximum likelihood approaches become computationally inefficient when analyzing large or high-dimensional datasets, leaving little opportunity to incorporate complex evolutionary models. We implemented a posterior probability (Bayesian) approach that evaluates evolutionary models, applying it to primate mitochondrial genomes. The species nucleotide sequence data were augmented with ancestral states at the internal nodes of the phylogeny. We simplified probability calculations for substitution events along the branches by assuming that only up to one or two substitution events occurred per branch per site. These conditional pathway calculations introduce very little bias into the inferred reconstructions, while increasing the feasibility of incorporating complex evolutionary models with higher dimensions. Compositional bias tests, including functional predictions of ancestral tRNAs, show that ancestral sequences from the Bayesian approach are more biologically realistic than those reconstructed by maximum likelihood. To explore further model complexity, we allowed substitution rates to vary among sites by having a different model at each site. With a strand-symmetric model as the base model, asymmetric substitution probabilities for specific substitution types were varied among sites. This model would not be feasible with standard matrix exponentiation methods, particularly maximum likelihood.
We observed almost linear responses for A→G substitutions and almost asymptotic responses for C→T substitutions (with some regional deviations). Note that the HMM models had no a priori response built into them. The observed responses fitted predictions from earlier gene-by-gene likelihood analyses. For A→G substitutions, deviations from the expected linear response correlated positively with the loop-forming propensity of the corresponding site in the mRNA secondary structure. In the COI region, C→T substitutions show a prominent dip, suggesting protection against mutations. The C→T substitution responses differed significantly between primate sub-groups defined by their single-genome A→G responses.

Louisiana State University

Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

Author: Börlin Christoph Sebastian
Buric Filip
Chen Rongzhen
Nielsen Jens B
Sheikh Muhammad Azam
Siewers Verena
Töpel Mats
Verendel Vilhelm
Zelezniak Aleksej
Zrimec Jan
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combination of regulatory elements that define gene expression levels.

Chalmers Research

Accurate Prediction Methods on Biomolecular Data

Author: Hasan Md Abid
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

With the recent advancements in sequencing technologies, molecular biologists are producing ever-increasing amounts of biomolecular data. Extracting useful information from these massive data sets requires efficient and effective data mining and machine learning methods. In this dissertation, we explore the use of supervised machine learning (ML) to solve some challenging classification problems in molecular biology. First, we devise an ML model for classifying cancer types from very sparse somatic point mutation data. Accumulation of mutations and epigenetic modifications in somatic cells results in various cancers. For this purpose, we propose a method called mClass for efficient feature (gene) ranking that uses clustering, normalized mutual information and logistic regression. We show that somatic mutation data has sufficient discriminative power for cancer type classification. Next, we address the problem of gene essentiality prediction in microbes. Identifying essential genes is important because their function is vital for the survival of the organism. Our proposed deep learning architecture, called DeeplyEssential, exclusively uses features extracted from the primary sequence of genes and their corresponding proteins, to maximize the utility and practicality of the tool. DeeplyEssential achieves state-of-the-art performance over previously proposed methods; we also expose and study a hidden performance bias that affected previous models. Finally, we consider the problem of predicting the enhancer regions in the human genome from chromatin data. Enhancers contribute to the transcription of target genes. We propose a convolutional neural network framework named Epi2En that takes advantage of epigenetic ChIP-seq data. Epi2En's classification performance is very strong not only in cross-validation experiments but also when testing across different cell lines.
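
The abstract above names normalized mutual information as one ingredient of mClass's gene ranking but does not define it. As a hedged illustration, the sketch below scores how informative a discrete feature (for example, mutated or not) is about a class label. The function names, the choice of arithmetic-mean normalization, and the toy labels are assumptions for this example, not details confirmed by the dissertation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(x, y):
    """NMI = 2 * I(X;Y) / (H(X) + H(Y)); this normalization is an assumption."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # a constant variable carries no information
    return 2 * mi / (hx + hy)


# Toy data: mutation status of two hypothetical genes vs. cancer type
cancer_type = ["A", "A", "B", "B"]
gene_informative = [1, 1, 0, 0]    # tracks the class exactly
gene_uninformative = [1, 0, 1, 0]  # independent of the class

nmi_high = normalized_mutual_information(gene_informative, cancer_type)
nmi_low = normalized_mutual_information(gene_uninformative, cancer_type)
```

Ranking genes by such a score puts features that track the class label ahead of features that are independent of it.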

eScholarship - University of California

Host sequence motifs shared by HIV predict response to antiretroviral therapy

Author: A Ertel
A Matsukawa
A Mocroft
A Rambaut
AD Frankel
AE Kel
AL Brass
Aydin Tozeren
B Larder
C Van Lint
D Jacobs
D Li
DM Moore
E De Clercq
F Longo
GM Lucas
H Vermeiren
J Castilla
J Fellay
J Huang
J Mulder
JA Levy
JW Pinney
K Kadaveru
L Nanni
LM Mansky
Lyle Ungar
M Hariharan
M Kanehisa
M Rehmsmeier
M Rosen-Zvi
ME Garber
MH Katz
MK Kuhner
MV Rockman
N Beerenwinkel
NR Draper
P Puntervoll
Perry Evans
R König
RG Ptak
RH Stauber
RI Connor
RM Biondi
RM Grant
RW Shafer
S Scheer
SF Altschul
SG Deeks
SY Rhee
V Matys
V Nair
VA Johnson
William Dampier
WM Kati
Y He
Y Pommier
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The HIV viral genome mutates at a high rate and poses a significant long-term health risk even in the presence of combination antiretroviral therapy. Current methods for predicting a patient's response to therapy rely on site-directed mutagenesis experiments and in vitro resistance assays. In this bioinformatics study we treat response to antiretroviral therapy as a two-body problem: response to therapy is considered to be a function of both the host and pathogen proteomes. We set out to identify potential responders based on the presence or absence of host protein and DNA motifs on the HIV proteome. Results: An alignment of thousands of HIV-1 sequences attested to extensive variation in nucleotide sequence but also showed conservation of eukaryotic short linear motifs on the protein coding regions. The reduction in viral load of patients in the Stanford HIV Drug Resistance Database exhibited a bimodal distribution after 24 weeks of antiretroviral therapy, with a cutoff of 2,000 copies/ml. Similarly, patients allocated into responder/non-responder categories based on consistent viral load reduction during a 24-week period showed clear separation. In both cases of phenotype identification, a set of features composed of short linear motifs in the reverse transcriptase region of the HIV sequence accurately predicted a patient's response to therapy. Motifs that overlap resistance sites were highly predictive of responder identification in single-drug regimens, but these features lost importance in defining responders in multi-drug therapies. Conclusion: HIV sequence mutates in a way that preferentially preserves peptide sequence motifs that are also found in the human proteome. The presence and absence of such motifs at specific regions of the HIV sequence is highly predictive of response to therapy. Some of these predictive motifs overlap with known HIV-1 resistance sites.
These motifs are well established in bioinformatics databases and hence do not require identification via in vitro mutation experiments.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central