15 research outputs found

    Seeking an optimal variant calling pipeline for medical genetics

    Accurate and comprehensive variant discovery is extremely important for rare disease diagnostics using next-generation sequencing (NGS) methods. In recent years, a plethora of methods have been developed for short variant calling from NGS data, and the most recent tools extensively use machine learning algorithms for both variant discovery and filtering. In our study, we set out to systematically evaluate the performance of different pipelines for short variant calling in the human genome. To perform such a systematic comparison, we collected a large dataset of both “gold standard” (provided by the Genome In A Bottle (GIAB) consortium) and in-house whole-exome sequencing (WES) and whole-genome sequencing (WGS) datasets (a total of 20 different datasets were used). We tested all combinations of 4 popular short read aligners (BWA, Bowtie2, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Freebayes, Clair3, DeepVariant, Genome Analysis ToolKit (GATK), Octopus, Strelka2). We also used several different tools for preprocessing of short reads. Our analysis showed negligible effects of adapter trimming on the accuracy of short variant calling. Among read aligners, Bowtie2 performed significantly worse than other tools, suggesting it should not be used for medical variant calling. For pipelines based on BWA, Isaac, and Novoalign, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. DeepVariant consistently showed the best performance and the greatest robustness compared to all other tested variant callers. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With a few important caveats, the best-performing tools have shown little evidence of overfitting. Taken together, our study showed that modern strategies for NGS data analysis allow for high accuracy of genetic variant discovery within coding regions of the human genome. However, there is still a need for development of new library preparation and variant calling methods to enhance variant discovery in the challenging regions of the human genome.
    Book of abstract: 4th Belgrade Bioinformatics Conference, June 19-23, 202

    Differential Interactions of Molecular Chaperones and Yeast Prions

    Baker's yeast Saccharomyces cerevisiae is an important model organism that is used to study various aspects of eukaryotic cell biology. Prions in yeast are self-perpetuating heritable protein aggregates that can be leveraged to study the interaction between the protein quality control (PQC) machinery and misfolded proteins. More than ten prions have been identified in yeast, of which the most studied ones include [PSI+], [URE3], and [PIN+]. While all of the major molecular chaperones have been implicated in propagation of yeast prions, many of these chaperones differentially impact propagation of different prions and/or prion variants. In this review, we summarize the current understanding of the life cycle of yeast prions and systematically review the effects of different chaperone proteins on their propagation. Our analysis clearly shows that Hsp40 proteins play a central role in prion propagation by determining the fate of prion seeds and other amyloids. Moreover, direct prion-chaperone interaction seems to be critically important for proper recruitment of all PQC components to the aggregate. Recent results also suggest that the cell asymmetry apparatus, cytoskeleton, and cell signaling all contribute to the complex network of prion interaction with the yeast cell.
    RSF grant 18-14-00050 (Part «Yeast prions and their life cycle») and the RFBR grant 19-04-00173 (other parts

    Gene Amplification as a Mechanism of Yeast Adaptation to Nonsense Mutations in Release Factor Genes

    Protein synthesis (translation) is one of the fundamental processes occurring in the cells of living organisms. Translation can be divided into three key steps: initiation, elongation, and termination. In the yeast Saccharomyces cerevisiae, there are two translation termination factors, eRF1 and eRF3. These factors are encoded by the SUP45 and SUP35 genes, which are essential; deletion of either of them leads to the death of yeast cells. However, viable strains with nonsense mutations in both the SUP35 and SUP45 genes were previously obtained by several groups. The survival of such mutants clearly involves feedback control of premature stop codon readthrough; however, the exact molecular basis of such feedback control remains unclear. To investigate the genetic factors supporting the viability of these SUP35 and SUP45 nonsense mutants, we performed whole-genome sequencing of strains carrying mutant sup35-n and sup45-n alleles; while no common SNPs or indels were found in these genomes, we discovered a systematic increase in the copy number of the plasmids carrying mutant sup35-n and sup45-n alleles. qPCR confirmed the differences in the relative number of SUP35 and SUP45 gene copies between strains carrying wild-type or mutant alleles of the SUP35 and SUP45 genes. Moreover, we compared the number of copies of the SUP35 and SUP45 genes in strains carrying different nonsense mutant variants of these genes as a single chromosomal copy. The qPCR results indicate that the number of mutant gene copies is increased compared to the wild-type control. In the case of several sup45-n alleles, this was due to a disomy of the entire chromosome II, while for the sup35-218 mutation we observed a local duplication of a segment of chromosome IV containing the SUP35 gene. Taken together, our results indicate that gene amplification is a common mechanism of adaptation to nonsense mutations in release factor genes in yeast.
    RSF grant 18-14-00050; State Research Program (0112-2016-0015

    Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage

    The advantages and diagnostic effectiveness of the two most widely used resequencing approaches, whole-exome (WES) and whole-genome (WGS) sequencing, are often debated. WES has dominated large-scale resequencing projects because of its lower cost and easier data storage and processing. The rapid development of third-generation sequencing methods and novel exome sequencing kits calls for a robust statistical framework allowing informative and easy performance comparison of the emerging methods. In our study, we developed a set of statistical tools to systematically assess coverage of coding regions provided by several modern WES platforms, as well as PCR-free WGS. We identified a substantial problem in most previously published comparisons, which did not account for mappability limitations of short reads. Using regression analysis and simple machine learning, as well as several novel metrics of coverage evenness, we analyzed the contribution from the major determinants of CDS coverage. Contrary to a common view, most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. We also identified a ~500 kb region of the human exome that could not be effectively characterized using short read technology and should receive special attention during variant analysis. Using our novel metrics of sequencing coverage, we identified the main determinants of WES and WGS performance. Overall, our study points out avenues for improvement of enrichment-based methods and development of novel approaches that would maximize variant discovery at optimal cost.

    Identification of Novel Candidate Markers of Type 2 Diabetes and Obesity in Russia by Exome Sequencing with a Limited Sample Size

    Type 2 diabetes (T2D) and obesity are common chronic disorders with multifactorial etiology. In our study, we performed an exome sequencing analysis of 110 patients of Russian ethnicity together with a multi-perspective approach based on biologically meaningful filtering criteria to detect novel candidate variants and loci for T2D and obesity. We have identified several known single nucleotide polymorphisms (SNPs) as markers for obesity (rs11960429), T2D (rs9379084, rs1126930), and body mass index (BMI) (rs11553746, rs1956549 and rs7195386) (p < 0.05). We show that a method based on scoring of case-specific variants together with selection of protein-altering variants can allow for the interrogation of novel and known candidate markers of T2D and obesity in small samples. Using this method, we identified rs328 in LPL (p = 0.023), rs11863726 in HBQ1 (p = 8 × 10⁻⁵), and rs112984085 in VAV3 (p = 4.8 × 10⁻⁴) for T2D and obesity; rs6271 in DBH (p = 0.043), rs62618693 in QSER1 (p = 0.021), rs61758785 in RAD51B (p = 1.7 × 10⁻⁴), rs34042554 in PCDHA1 (p = 1 × 10⁻⁴), and rs144183813 in PLEKHA5 (p = 1.7 × 10⁻⁴) for obesity; and rs9379084 in RREB1 (p = 0.042), rs2233984 in C6orf15 (p = 0.030), rs61737764 in ITGB6 (p = 0.035), rs17801742 in COL2A1 (p = 8.5 × 10⁻⁵), and rs685523 in ADAMTS13 (p = 1 × 10⁻⁶) for T2D as important susceptibility loci in the Russian population. Our results demonstrate the effectiveness of whole-exome sequencing (WES) technologies for searching for novel markers of multifactorial diseases in cohorts of limited size in poorly studied populations.

    Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery

    BACKGROUND: Accurate variant detection in the coding regions of the human genome is a key requirement for molecular diagnostics of Mendelian disorders. Efficiency of variant discovery from next-generation sequencing (NGS) data depends on multiple factors, including reproducible coverage biases of NGS methods and the performance of read alignment and variant calling software. Although variant caller benchmarks are published constantly, no previous publications have leveraged the full extent of available gold standard whole-genome (WGS) and whole-exome (WES) sequencing datasets. RESULTS: In this work, we systematically evaluated the performance of 4 popular short read aligners (Bowtie2, BWA, Isaac, and Novoalign) and 9 novel and well-established variant calling and filtering methods (Clair3, DeepVariant, Octopus, GATK, FreeBayes, and Strelka2) using a set of 14 “gold standard” WES and WGS datasets available from the Genome In A Bottle (GIAB) consortium. Additionally, we have indirectly evaluated each pipeline’s performance using a set of 6 non-GIAB samples of African and Russian ethnicity. In our benchmark, Bowtie2 performed significantly worse than other aligners, suggesting it should not be used for medical variant calling. When other aligners were considered, the accuracy of variant discovery mostly depended on the variant caller and not the read aligner. Among the tested variant callers, DeepVariant consistently showed the best performance and the highest robustness. Other actively developed tools, such as Clair3, Octopus, and Strelka2, also performed well, although their efficiency had greater dependence on the quality and type of the input data. We have also compared the consistency of variant calls in GIAB and non-GIAB samples. With a few important caveats, the best-performing tools have shown little evidence of overfitting. 
CONCLUSIONS: The results show surprisingly large differences in the performance of cutting-edge tools even in high confidence regions of the coding genome. This highlights the importance of regular benchmarking of quickly evolving tools and pipelines. We also discuss the need for a more diverse set of gold standard genomes that would include samples of African, Hispanic, or mixed ancestry. There is also a need for better variant caller assessment in the repetitive regions of the coding genome. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12864-022-08365-3

    Biobanking as a Tool for Genomic Research: From Allele Frequencies to Cross-Ancestry Association Studies

    In recent years, great advances have been made in the field of collection, storage, and analysis of biological samples. Large collections of samples, biobanks, have been established in many countries. Biobanks typically collect large amounts of biological samples and associated clinical information; the largest collections include over a million samples. In this review, we summarize the main directions in which biobanks aid medical genetics and genomic research, from providing reference allele frequency information to allowing large-scale cross-ancestry meta-analyses. The largest biobanks vary greatly in the size of the collection and the amount of available phenotype and genotype data. Nevertheless, all of them are extensively used in genomics, providing a rich resource for genome-wide association analysis, genetic epidemiology, and statistical research into the structure, function, and evolution of the human genome. Recently, multiple research efforts were based on trans-biobank data integration, which increases sample size and allows for the identification of robust genetic associations. We provide prominent examples of such data integration and discuss important caveats which have to be taken into account in trans-biobank research.

    A Data-Driven Review of the Genetic Factors of Pregnancy Complications

    In recent years, many advances have been made in the research of the genetic factors of pregnancy complications. In this work, we use publicly available data repositories, such as the National Human Genome Research Institute GWAS Catalog, HuGE Navigator, and the UK Biobank genetic and phenotypic dataset, to gain insights into molecular pathways and individual genes behind a set of pregnancy-related traits, including the most studied ones: preeclampsia, gestational diabetes, preterm birth, and placental abruption. Using both HuGE and GWAS Catalog data, we confirm that immune system and, in particular, T-cell related pathways are among the most important drivers of pregnancy-related traits. Pathway analysis of the data reveals that cell adhesion and matrisome-related genes are also commonly involved in pregnancy pathologies. We also find a large role of metabolic factors that affect not only gestational diabetes, but also the other traits. These shared metabolic genes include IGF2, PPARG, and NOS3. We further discover that the published genetic associations are poorly replicated in the independent UK Biobank cohort. Nevertheless, we find novel genome-wide associations with pregnancy-related traits for the FBLN7, STK32B, and ACTR3B genes, and replicate the effects of the KAZN and TLE1 genes, with the latter being the only gene identified across all data resources. Overall, our analysis highlights central molecular pathways for pregnancy-related traits and suggests a need to use more accurate and sophisticated association analysis strategies to robustly identify genetic risk factors for pregnancy complications.

    Genetic and Phenotypic Factors Affecting Glycemic Response to Metformin Therapy in Patients with Type 2 Diabetes Mellitus

    Metformin is an oral hypoglycemic agent widely used in clinical practice for treatment of patients with type 2 diabetes mellitus (T2DM). Wide interindividual variability in response to metformin therapy has been shown, and the impact of several genetic variants was recently reported. To assess the independent and combined effect of genetic polymorphisms on glycemic response to metformin, we performed an association analysis of variants in the ATM, SLC22A1, SLC47A1, and SLC2A2 genes with metformin response in 299 patients with T2DM. In addition, the distribution of allele and genotype frequencies of the studied gene variants was analyzed in an extended group of patients with T2DM (n = 464) and a population group (n = 129). According to our results, one variant, rs12208357 in the SLC22A1 gene, had a significant impact on response to metformin in T2DM patients. Carriers of the TT genotype and the T allele had a lower response to metformin compared to carriers of the CC/CT genotypes and the C allele (p-value = 0.0246, p-value = 0.0059, respectively). To identify the parameters that had the greatest importance for the prediction of the therapy response to metformin, we next built a set of machine learning models based on various combinations of genetic and phenotypic characteristics. The model based on a set of four parameters, including gender, rs12208357 genotype, familial T2DM background, and waist–hip ratio (WHR), showed the highest prediction accuracy for the response to metformin therapy in patients with T2DM (AUC = 0.62 in cross-validation). Further pharmacogenetic studies may aid in the discovery of the fundamental mechanisms of type 2 diabetes, the identification of new drug targets, and, ultimately, the development of personalized treatment.