83 research outputs found

    Applications of Evolutionary Bioinformatics in Basic and Biomedical Research

    With the revolutionary progress in sequencing technologies, computational biology has emerged as a game-changing field that is applied to understanding the molecular events of life, for exploratory as well as complementary purposes. Bioinformatics resources and tools significantly help in data generation, organization and analysis. However, there is still a need for new approaches designed from a biologist’s point of view. In protein bioinformatics, there are several fundamental problems, such as (i) determining protein function; (ii) identifying protein-protein interactions; and (iii) predicting the effect of amino acid variants. Here, I present three chapters addressing these problems from an evolutionary perspective. First, I describe a novel search pipeline for protein domain identification. The algorithm chain provides sensitive domain assignments with the highest possible specificity. Second, I present a tool enabling large-scale visualization of the presence and absence of proteins in hierarchically clustered genomes. This tool visualizes multi-layer information from any kind of genome-linked data, with a special focus on domain architectures, enabling identification of coevolving domains and proteins, which can eventually help in identifying functionally interacting proteins. Finally, I propose an approach for distinguishing between benign and damaging missense mutations in a human disease by establishing the precise evolutionary history of the associated gene. This part introduces new criteria for determining functional orthologs via phylogenetic analysis. All three parts use comparative genomics and/or sequence analyses. Taken together, this study addresses important problems in protein bioinformatics, and as a whole it can be used to describe proteins by their domains, coevolving partners and functionally important residues.

    Bioinformatic approaches to determine pathogenicity and function of clinical genetic variants across ion channels and neurodevelopmental disorder associated genes

    Clinical genetic testing for rare monogenic diseases aims to identify the disease-causing variants. Identifying the molecular etiology of a disease can already improve clinical care today and is essential for administering the precision medicines currently in development for many disorders. However, distinguishing pathogenic from benign genetic variants remains a challenge, in particular for missense variants, in which a single amino acid is substituted. The effect of a pathogenic variant on protein function, for example whether it causes a gain (GoF) or a loss (LoF) of function, is usually not understood, since most genetic variants are ultra-rare and have not been molecularly tested. In particular, for genes associated with severe developmental disorders, first-generation symptomatic treatments often offer only limited relief. Consequently, the development and application of targeted treatments that promise improvement are urgently needed. Identifying the disease-causing pathogenic variants and predicting their function is crucial because, to avoid adverse treatment outcomes, targeted therapies can only be administered to patients with classified pathogenic variants whose functional effects are known. In this dissertation, I present bioinformatic approaches to enhance the assessment of variant pathogenicity and the understanding of the functional effects of genetic variants. The approaches were applied on an exome-wide scale using public datasets, and to selected disorders for which expert-curated clinical-genetic data were available from collaborators. The major focus of this thesis is on genes implicated in neurodevelopmental disorders and on diseases associated with ion channel dysfunction, for which collaboration with other research groups enabled the aggregation of the genetic, clinical, and functional datasets required to develop and test the bioinformatic approaches.
    In the first study (Bruenger and Ivaniuk et al., in preparation for submission to Genetics in Medicine), we developed a novel approach to extend the application of the current variant interpretation guidelines proposed by the American College of Medical Genetics and Genomics (ACMG). A major limitation of interpreting variant pathogenicity with the ACMG guidelines is that some of the proposed evidence criteria are rarely applicable. We evaluated the potential of incorporating individual pathogenic variants observed in paralogous genes to extend the applicability of two criteria of the guidelines. Our results demonstrated that pathogenic variants in evolutionarily conserved paralogous genes can serve as evidence for a variant's pathogenicity and thus extend the applicability of the current criteria more than fourfold. We further explored whether the selection of paralogous pathogenic variants can be improved by incorporating phenotype information. We assembled a clinically well-defined cohort of patients with variants in voltage-gated sodium channels (VGSC) and identified phenotype correlations among paralogous genes based on shared variant properties. By integrating these phenotype correlations into our proposed extension of the ACMG criteria, we demonstrated an enhanced ability to provide evidence for the pathogenicity of genetic variants in VGSC-encoding genes.
    In the second study (Brunklaus, Feng, and Bruenger et al., Brain, 2022), we examined whether experimentally obtained functional effects of variants in one VGSC-encoding gene could predict the function of conserved variants in paralogous genes with high sequence similarity. We aggregated 437 in vitro functionally tested variants from an intensive literature search and found that the functional effect across conserved variants in paralogous genes was conserved in 94% of cases. Our findings represent the first GoF versus LoF topological map of VGSC proteins, which could guide precision therapy, as functionally tested variants are rare across VGSC. We integrated our findings into a publicly accessible web tool (http://SCN-viewer.broadinstitute.org) to facilitate functional variant interpretation across VGSC.
    In the third study (Bruenger et al., Brain, 2022), we systematically identified biological properties associated with variant pathogenicity across all major voltage-gated and ligand-gated ion-channel families. We discovered and independently replicated that several pore residue properties and proximity to the pore axis were significantly enriched for pathogenic variants compared to population variants across all ion channels. Using a newly developed structural framework, we provide quantitative evidence that variants at the pore showed the strongest pathogenic variant enrichment. Moreover, we found that a hydrophobic pore environment was most strongly associated with variant pathogenicity. Finally, we showed that the identified biological properties correlated with in vitro functional readouts from 679 variants and with clinical phenotypes in 1,422 patients with neurodevelopmental disorders, data that were collected through collaboration with other research groups. Our results suggest that clinical decision support algorithms that predict variant pathogenicity and function are feasible in the future.
    In the fourth study (Iqbal and Bruenger et al., Brain, 2022), we developed a novel consensus approach that combines evolutionary and population-based genomic scores to identify 3D essential sites (Essential3D) on protein structures encoded by genes associated with neurodevelopmental disorders (NDDs). NDDs encompass severe clinical conditions caused by pathogenic variants in different genes. However, many of those genes were only recently associated with NDDs and are not well studied. We identified 14,377 Essential3D sites on protein structures encoded by 189 genes and found that these sites were eight-fold enriched for pathogenic variants relative to population controls in an independent cohort of over 360,000 patient and population variants. The Essential3D sites offer insights into molecular mechanisms of protein function, such as key protein-protein interaction sites. The annotations are available at https://es-ndd.broadinstitute.org and will guide clinical variant interpretation.
    In summary, within these major studies of my Ph.D., we aggregated genetic, clinical, and functional datasets and developed bioinformatic approaches to enhance the assessment of variant pathogenicity and improve understanding of the effects of genetic variants on protein function. The advances made during my Ph.D. research demonstrate the power of integrating multiple data sources to study novel genetic variants and their implications for rare monogenic diseases. Our approaches specifically improve variant function and pathogenicity assessment in genes implicated in several severe diseases for which currently applied first-generation therapies cannot adequately lower the disease burden. Thus, our results contribute to a new era in precision medicine, in which personalized treatments and improved clinical care become increasingly accessible to patients. Finally, the annotations developed in these studies can serve as a foundation for further work, including the application of machine learning methods to predict variant pathogenicity and protein functional effects more accurately.
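    The abstract above reports an eight-fold enrichment of pathogenic variants at Essential3D sites but does not say which statistic was used. As a minimal sketch of one common way to express such an enrichment, assuming an odds ratio over a 2x2 table of variant counts, the arithmetic looks like this. The function name and all counts are hypothetical and are not taken from the study.

```python
def odds_ratio(path_in, path_out, pop_in, pop_out):
    """Odds ratio for a 2x2 table of variant counts.

    path_in / path_out: pathogenic variants inside / outside the sites.
    pop_in / pop_out: population variants inside / outside the sites.
    """
    if path_out == 0 or pop_in == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (path_in * pop_out) / (path_out * pop_in)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = odds_ratio(path_in=80, path_out=120, pop_in=100, pop_out=1200)
```

    With these made-up counts, the odds of a pathogenic variant falling inside the sites (80/120) are compared with the same odds for population variants (100/1200).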

    Expansion des familles de gènes impliquées dans des maladies par duplication du génome chez les premiers vertébrés

    The emergence and evolutionary expansion of gene families implicated in cancers and other severe genetic diseases is an evolutionary oddity from a natural selection perspective. In this thesis, we have shown that gene families prone to deleterious mutations in the human genome have been preferentially expanded by the retention of "ohnolog" genes from two rounds of whole-genome duplication (WGD) dating back to the onset of jawed vertebrates. Using advanced inference analysis, we have further demonstrated that the retention of many ohnologs suspected to be dosage balanced is in fact indirectly mediated by their susceptibility to deleterious mutations. This enhanced retention of "dangerous" ohnologs, defined as prone to autosomal-dominant deleterious mutations, is shown to be a consequence of WGD-induced speciation and the ensuing purifying selection in post-WGD species. We have also developed a statistical approach to identify ohnologs in vertebrate genomes with high confidence. These ohnologs can be easily accessed from a web server. Our findings highlight the importance of WGD-induced non-adaptive selection for the emergence of vertebrate complexity, while rationalizing, from an evolutionary perspective, the expansion of gene families frequently implicated in genetic disorders and cancers. The high-confidence ohnologs identified by our approach will also pave the way for novel functional genomic analyses distinguishing gene duplicates according to their origin.

    Characterization of genomic perturbation sensitivity using 1000 genomes population

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Medicine, Department of Biomedical Sciences, February 2018. 김주한.
    Purpose: The transcriptome is perturbed by millions of genomic variants, which can alter the function of cells and the phenotypes of organisms. As recent large next-generation sequencing (NGS) projects have shown, an individual genome carries at least 3 to 4 million variants. Here, we applied the perturbation network approach to human data from the 1000 Genomes Project to interpret genetic perturbation, and we characterized perturbation-sensitive and perturbation-tolerant genes. Methods: We integrated the SIFT scores of non-synonymous variants to calculate a gene deleteriousness score and to determine whether a gene is perturbed. A perturbation network was constructed from the gene deleteriousness scores, and perturbation sensitivity was defined as a gene's in-degree in the perturbation network. We categorized genes by perturbation sensitivity and investigated the evolutionary, regulatory, and clinical properties of perturbation-sensitive and perturbation-tolerant genes. Results: Perturbation-sensitive genes were located in the periphery of the protein interaction network but were evolutionarily conserved. They were regulated by fewer miRNAs and transcription factors and played a key role in cell-cell interaction. The out-degree of the perturbation network did not show any significant biological properties. Lethal genes were in the periphery of the perturbation network and at the hubs of the protein interaction network. In contrast, most disease genes were at the hubs of the perturbation network and showed various trends in the protein interaction network. We drew a joint network map and categorized diseases by their degree in both networks. Conclusions: As in the yeast perturbation network, perturbation-sensitive genes were essential for the survival of the organism, since they were evolutionarily conserved and related to interaction between cells. We confirmed that the in-degree of the perturbation network is more useful than its out-degree for interpreting genetic perturbation. Disease genes can be categorized and visualized using both the protein interaction network and the perturbation network. In conclusion, perturbation sensitivity is a valuable measure for interpreting genetic perturbation and assessing a gene's biological and clinical properties.
    Contents:
    1. Introduction
        1.1. Definition of Genetic Perturbation
        1.2. Interpretation of genetic perturbation causing variants
        1.3. Interpretation of genetic perturbation using biological networks
        1.4. Perturbation Network approach in Yeast
        1.5. Purpose of study
    2. Materials and Methods
        2.1. Genome and transcriptome data from 1000 Genomes populations
        2.2. Calculating gene deleteriousness scores
        2.3. Construction of perturbation network
        2.4. Construction of Protein Interaction Network
        2.5. Retrieving biological information for gene annotation
        2.6. Excess retention
        2.7. Joint network map for visualization of gene sets
        2.8. Clinical annotation of PSN
    3. Results
        3.1. Building Perturbation network
        3.2. Biological properties of perturbation network
            3.2.1. Correlation between perturbation network and PPI network
            3.2.2. Relationship of perturbation network to Evolutionary feature and regulatory feature
        3.3. Clinical implication of perturbation network against PPI network
            3.3.1. Lethal genes versus disease genes
            3.3.2. Disease gene classification using both Kppi and Kin
    4. Discussion
    5. References
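    The abstract above defines perturbation sensitivity as a gene's in-degree in the perturbation network. A minimal sketch of that single step follows, assuming the network is already available as a list of directed (source gene, target gene) edges. The gene names, the edge list, and the reading of an edge as "a damaging variant in the source gene is associated with altered expression of the target gene" are illustrative assumptions; the thesis's own edge-construction procedure is not reproduced here.

```python
from collections import Counter


def perturbation_sensitivity(edges):
    """Return the in-degree of every gene in a directed edge list.

    Duplicate edges are counted once. Genes that appear only as
    sources get an in-degree of 0.
    """
    unique_edges = set(edges)
    in_degree = Counter(target for _source, target in unique_edges)
    genes = {gene for edge in unique_edges for gene in edge}
    return {gene: in_degree.get(gene, 0) for gene in genes}


# Hypothetical toy network; ("A", "C") is listed twice on purpose.
toy_edges = [("A", "C"), ("B", "C"), ("A", "B"), ("A", "C")]
sensitivity = perturbation_sensitivity(toy_edges)
```

    In this toy network gene C receives edges from both A and B, so it has the highest in-degree and would be the most perturbation-sensitive gene under the definition above.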

    OGEE v2: an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines

    OGEE is an Online GEne Essentiality database. To enhance our understanding of the essentiality of genes, in OGEE we collected experimentally tested essential and non-essential genes, as well as associated gene properties known to contribute to gene essentiality. We focus on large-scale experiments, and complement our data with text-mining results. We organized tested genes into data sets according to their sources, and tagged those with variable essentiality statuses across data sets as conditionally essential genes, intending to highlight the complex interplay between gene functions and environments/experimental perturbations. Developments since the last public release include increased numbers of species and gene essentiality data sets, inclusion of non-coding essential sequences and genes with intermediate essentiality statuses. In addition, we included 16 essentiality data sets from cancer cell lines, corresponding to 9 human cancers; with OGEE, users can easily explore the shared and differentially essential genes within and between cancer types. These genes, especially those derived from cell lines that are similar to tumor samples, could reveal the oncogenic drivers, paralogous gene expression pattern and chromosomal structure of the corresponding cancer types, and can be further screened to identify targets for cancer therapy and/or new drug development. OGEE is freely available at http://ogee.medgenius.info
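    The tagging rule described above, where a gene whose essentiality status varies across data sets is marked as conditionally essential, can be sketched as follows. This is an illustration under assumed inputs, not OGEE's actual code or schema: the record layout, the status labels "E" and "NE", the "conditional" tag, and the gene and data set names are all hypothetical.

```python
from collections import defaultdict


def tag_essentiality(records):
    """Summarize per-gene status from (gene, dataset, status) records.

    A gene with the same status in every data set keeps that status.
    A gene with more than one distinct status is tagged "conditional".
    """
    statuses = defaultdict(set)
    for gene, _dataset, status in records:
        statuses[gene].add(status)
    return {
        gene: next(iter(seen)) if len(seen) == 1 else "conditional"
        for gene, seen in statuses.items()
    }


# Hypothetical records: "E" = essential, "NE" = non-essential.
records = [
    ("geneX", "screen1", "E"),
    ("geneX", "screen2", "NE"),
    ("geneY", "screen1", "E"),
    ("geneY", "screen2", "E"),
    ("geneZ", "screen2", "NE"),
]
tags = tag_essentiality(records)
```

    Here geneX is essential in one screen and non-essential in another, so it is the only gene tagged as conditional.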

    Increasing Alternative Promoter Repertories Is Positively Associated with Differential Expression and Disease Susceptibility

    Background: Alternative promoter (AP) usage has been shown to enable diversified transcriptional regulation of individual genes in a context-specific way (e.g., by pathway, cell lineage, tissue type, or developmental stage). Aberrant use of APs has been directly linked to the mechanisms of certain human diseases. However, whether there is a general link between a gene’s AP repertoire and its expression diversity is currently unknown. The general relation between a gene’s AP repertoire and its disease susceptibility also remains largely unexplored. Methodology/Principal Findings: Based on the differential expression ratio inferred from all human microarray data in NCBI GEO and the list of disease genes curated in public repositories, we systematically analyzed the general relation of AP repertoire with expression diversity and disease susceptibility. We found that genes with APs are more likely to be differentially expressed and/or disease associated than those with a single promoter (SP), and that genes with more APs are more likely to be differentially expressed and disease susceptible than those with fewer APs. Further analysis showed that genes with an increased number of APs tend to have increased length in all aspects of gene structure, including the 3′ UTR, to be associated with increased duplicability, and to have increased connectivity in the protein-protein interaction network. Conclusions: Our genome-wide analysis provided evidence that increasing alternative promoter repertories is positivel