
    Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.

    PhD thesis. Understanding the etiology of complex disease remains a challenge in biology. In recent years there has been an explosion in biological data; this study investigates machine learning and network analysis methods as tools to aid candidate disease gene prioritisation, specifically relating to hypertension and cardiovascular disease. This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide polymorphisms (nsSNPs) were analysed in terms of sequence- and structure-based properties using a classifier to provide a model for predicting deleterious nsSNPs. The degree of sequence conservation at the nsSNP position was found to be the single best attribute, but other sequence and structural attributes in combination were also useful. Predictions for nsSNPs within Ensembl have been made publicly available. Secondly, the problem of predicting function for proteins that lack experimental data or clear similarity to a sequence of known function was addressed. Protein domain attributes based on physicochemical and predicted structural characteristics of the sequence were used as input to classifiers for predicting membership of large and diverse protein superfamilies from the SCOP database. An enrichment method was investigated that involved adding domains to the training dataset that are currently absent from SCOP. This analysis resulted in improved classifier accuracy: optimised classifiers achieved 66.3% for single-domain proteins and 55.6% when including domains from multi-domain proteins. The domains from superfamilies with low sequence similarity share global sequence properties, enabling applications to be developed which complement profile methods for detecting distant sequence relationships. Thirdly, a topological analysis of the human protein interactome was performed. The results were combined with functional annotation and sequence-based properties to build models for predicting hypertension-associated proteins. 
The study found that predicted hypertension-related proteins are not generally associated with network hubs and do not exhibit high clustering coefficients. Despite this, they tend to be closer and better connected to other hypertension proteins on the interaction network than would be expected by chance. Classifiers that combined PPI network, amino acid sequence and functional properties produced a range of precision and recall scores according to the applied 3 weights. Finally, interactome properties of proteins implicated in cardiovascular disease and cancer were studied. The analysis quantified the influential (central) nature of each protein and defined characteristics of functional modules and pathways in which the disease proteins reside. Such proteins were found to be enriched 2-fold within proteins that are influential (p<0.05) in the interactome. Additionally, they cluster in large, complex, highly connected communities, acting as interfaces between multiple processes more often than expected. An approach to prioritising disease candidates based on this analysis was proposed. Each analysis can provide some new insights into the effort to identify novel disease-related proteins for cardiovascular disease.
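
The two topological measures named in this abstract, hub status (node degree) and the clustering coefficient, can be illustrated with a minimal sketch. This is not the thesis's own code: the function names, the adjacency-dictionary representation and the toy network with placeholder protein names `P1` to `P4` are all assumptions made for illustration.

```python
from itertools import combinations


def degree(adj, node):
    """Number of interaction partners of a node (high values suggest a hub)."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that also interact with each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        # With fewer than two neighbours there are no pairs to compare.
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical undirected toy network; protein names are placeholders.
toy_network = {
    "P1": {"P2", "P3", "P4"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2"},
    "P4": {"P1"},
}

for protein in sorted(toy_network):
    print(protein, degree(toy_network, protein),
          round(clustering_coefficient(toy_network, protein), 3))
```

In this toy network `P1` has the highest degree but a low clustering coefficient, because only one of its three neighbour pairs interacts.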

    Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases

    The identification of genetic variants that are responsible for human inherited diseases is a fundamental problem in human and medical genetics. As a typical type of genetic variation, nonsynonymous single-nucleotide polymorphisms (nsSNPs) occurring in protein coding regions may alter the encoded amino acid, potentially affect protein structure and function, and further result in human inherited diseases. Therefore, it is of great importance to develop computational approaches to facilitate the discrimination of deleterious nsSNPs from neutral ones. In this paper, we review databases that collect nsSNPs and summarize computational methods for the identification of deleterious nsSNPs. We classify the existing methods for characterizing nsSNPs into three categories (sequence based, structure based, and annotation based), and we introduce machine learning models for the prediction of deleterious nsSNPs. We further discuss methods for identifying deleterious nsSNPs in noncoding variants and those for dealing with rare variants.

    Investigation of the molecular basis of inherited developmental conditions in high risk population isolates

    The Amish communities of Ohio (USA) are a distinct group of endogamous, rural-living Anabaptist Christians. An ancestral bottleneck, caused by migratory events in the 17th century and subsequent rapid population expansion, has led to the enrichment of a number of inherited conditions within these communities. This provides significantly enhanced power to identify genes responsible for rare monogenic disorders, as well as traits with more complex inheritance patterns. The studies detailed in this thesis aim to provide diagnoses to individuals and their families for the underlying genetic causes responsible for the difficulties they experience, and contribute to a long-running, non-profit community clinical-genetic research programme called the Windows of Hope (WoH). Forming part of a wider Amish Hearing Loss Program, the studies described in chapter three document the discovery of the genetic causes of hearing loss for eight Amish families. Through a combination of targeted gene sequencing, genome-wide SNP mapping and exome sequencing, this study identified a variant in the Gap junction beta-2 (GJB2) gene, not previously reported in the Amish, as the cause of non-syndromic hearing loss in six families. Additionally, one family initially thought to be affected by a neurodevelopmental disorder which included syndromic hearing loss was found to possess two distinct genetic disorders: a 16p11.2 microdeletion, responsible for the developmental delay, and a homozygous GJB2 variant, responsible for the hearing loss. Finally, this chapter proposes two novel hearing loss genes and details the functional work undertaken to assess the pathogenicity of one of these genes (SLC15A5). This work provided important diagnoses for many families and acquired significant information regarding the spectrum and frequency of hearing loss-associated gene variants across distinct Amish communities. 
Chapter four details work undertaken to define the clinical phenotype and molecular basis of a novel complex autosomal recessive neurological disorder. Work undertaken by one of our collaborators, Dr Zineb Ammous, was instrumental in precisely defining the clinical phenotype of this disorder. A combination of genome-wide SNP mapping and exome sequencing identified a sequence variant in Smad Nuclear Interacting Protein 1 (SNIP1), which encodes an evolutionarily conserved transcriptional regulator, as the likely underlying genetic cause. Due to its role as a transcription regulator, whole transcriptome sequencing was undertaken to determine the impact of this gene mutation. This work provided important information regarding the specific biological role of SNIP1 and identified gene expression pathways of direct relevance to the clinical phenotype, highlighting therapeutic approaches likely to benefit affected individuals. Additionally, this study determined that SNIP1-associated syndrome is one of the most common conditions across many Amish communities. In recent years the WoH Project has accumulated extensive single nucleotide polymorphism (SNP) and exome sequencing datasets from patients and individuals from the Amish community. Chapter five outlines a pilot, proof-of-principle study undertaken to explore this data with the aim of characterising the architecture of the Amish genome. The interrogation of 26 exomes identified the presence of 12 pathogenic variants known to cause autosomal recessive (AR) diseases that have not yet been reported in the Amish but are likely to be present. Additionally, a PLEXseq sequencing approach was implemented to determine the prevalence of 165 pathogenic variants in 171 unaffected Amish individuals. The findings indicated diverse carrier frequencies within the different Amish communities and contributed to the consolidation of two genes responsible for ultra-rare inherited AR diseases (CEP55, MNS1). 
By developing approaches to improve knowledge of the specific causes of inherited diseases in the community, this work has laid the foundation for the development of a new genetic-based approach to diagnostic testing in the community. This thesis, and the wider programme of work of Windows of Hope, occupies a privileged position at the interface between scientific research and clinical care. The findings described here have made a significant contribution to our understanding of the pathomolecular cause of a number of rare inherited disorders by increasing our knowledge of the nature and spectrum of inherited disease within the Amish, laying the foundations to aid the future discovery of new disease genes, and improving clinical outcomes by enabling focussed clinical diagnostic and management strategies to be implemented.

    Novel Algorithm Development for ‘Next Generation’ Sequencing Data Analysis

    In recent years, the decreasing cost of ‘Next generation’ sequencing has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics – computational prioritisation/identification of disease gene variants and identification of RNA N6-adenosine methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part of chapter investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4). Chapter 5 discusses N6-adenosine methylation, a recently re-discovered posttranscriptional modification of RNA. 
The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyl-adenosine RNA-binding protein and its possible roles in the progression of viral infection.

    In silico analysis of the effects of non-synonymous single nucleotide polymorphisms on the human macrophage migration inhibitory factor gene and their possible role in human African trypanosomiasis susceptibility

    Human African trypanosomiasis (HAT) is a public health problem in sub-Saharan Africa, with approximately 10,000 cases being reported per year. The Macrophage Migration Inhibitory Factor (MIF), which is encoded by a functionally polymorphic gene, is important in both innate and adaptive immune responses, and has been implicated in affecting the outcome and processes of several inflammatory conditions. A recent study in mice showed that MIF-deficient and anti-MIF antibody-treated mice had lower inflammatory responses, liver damage and anaemia than wild-type mice when experimentally challenged with trypanosomes. These findings could mean that the transcript levels and/or polymorphisms in this gene affect individual risk of trypanosomiasis. This is especially of interest because there have been reports of spontaneous recovery, i.e. self-cure/resistance, in some HAT cases in West Africa. Prior to this discovery the general paradigm was that trypanosomiasis is fatal if left untreated. The aim of this study was to gain insights into how human genetic variation in the form of nonsynonymous SNPs affects MIF structure and function and possibly HAT susceptibility. NsSNPs in the mif gene were obtained from dbSNP. Through homology modeling, SNP prediction tools, protein interface analysis, alanine scanning, changes in free energy of folding, protein interactions calculator (PIC), and molecular dynamics simulations, SNP effects on the protein structure and function were studied. The study cohort comprised human genome sequence data from 50 North Western Uganda Lugbara endemic individuals, of whom 20 were cases (previous HAT patients) and 30 were controls (HAT-free individuals). None of the 26 nsSNPs retrieved from dbSNP (July 2015) were present in the mif gene region in the study cohort. Out of the eight variants called in the mif coding region there was only one missense variant, rs36065127, whose clinical significance is unknown. 
It was not possible to test for association of this variant with HAT because its global MAF was low (less than 0.05). Alanine scanning provided a fast and computationally cheap means of assessing nsSNPs of importance. NsSNPs that were interface residues were more likely to be hotspots (important in protein stability). Assessment of possible compensatory mutations using PIC analysis showed that some nsSNP sites were interacting with others, but this requires further experimentation. Analysis of changes in free energy using FOLDX was not enough to predict which nsSNPs would adversely affect protein structure, function and kinetics. The MD simulations were unfortunately too short to glean any meaningful inferences. This was the first genetic study carried out on the people of Lugbara ethnicity from North Western Uganda.
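
The minor allele frequency (MAF) threshold mentioned in this abstract can be illustrated with a short sketch. The function name and the genotype counts are invented for illustration and are not taken from the study; the 0.05 cut-off is the only value drawn from the abstract.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Frequency of the rarer allele at a biallelic site, from genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    # Each heterozygote carries one alternate allele, each alternate homozygote two.
    alt_freq = (n_het + 2 * n_hom_alt) / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


# Hypothetical counts for a 50-person cohort (not data from the study).
maf = minor_allele_frequency(n_hom_ref=48, n_het=2, n_hom_alt=0)
MAF_THRESHOLD = 0.05  # cut-off quoted in the abstract
print(f"MAF = {maf:.3f}; below threshold: {maf < MAF_THRESHOLD}")
```

With so few copies of the minor allele, an association test has very little power, which matches the limitation the abstract reports.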

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.