2,717 research outputs found

    An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome

    Background: Small insertions and deletions (indels) have a significant influence on human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. Results: We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state-of-the-art models for assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. Conclusions: FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome.

    Analysis of five deep-sequenced trio-genomes of the Peninsular Malaysia Orang Asli and North Borneo populations

    Background: Recent advances in genomic technologies have facilitated genome-wide investigation of human genetic variations. However, most efforts have focused on the major populations, and trio genomes of indigenous populations from Southeast Asia have been under-investigated. Results: We analyzed the whole-genome deep sequencing data (30x) of five native trios from Peninsular Malaysia and North Borneo, and characterized the genomic variants, including single nucleotide variants (SNVs), small insertions and deletions (indels) and copy number variants (CNVs). We discovered approximately 6.9 million SNVs, 1.2 million indels, and 9000 CNVs in the 15 samples, of which 2.7% of SNVs, 2.3% of indels and 22% of CNVs were novel, implying insufficient coverage of population diversity in existing databases. We identified a higher proportion of novel variants in the Orang Asli (OA) samples, i.e., the indigenous people from Peninsular Malaysia, than in the North Bornean (NB) samples, likely due to the more complex demographic history and long-time isolation of the OA groups. We used the pedigree information to identify de novo variants and estimated the autosomal mutation rates to be 0.81 × 10^-8 to 1.33 × 10^-8, 1.0 × 10^-9 to 2.9 × 10^-9, and 0.001 per site per generation for SNVs, indels, and CNVs, respectively. The trio genomes also allowed for haplotype phasing with high accuracy, which serves as a reference for future genomic studies of OA and NB populations. In addition, high-frequency inherited CNVs specific to OA or NB were identified. One example is a 50-kb duplication in DEFA1B detected only in the Negrito trios, implying plausible effects on host defense against exposure to diverse microbes in the tropical rainforest environment of these hunter-gatherers. The CNVs shared between OA and NB groups were much fewer than those specific to each group. Nevertheless, we identified a 142-kb duplication in AMY1A in all 15 samples, and this gene is associated with a high-starch diet. Moreover, novel insertions shared with archaic hominids were identified in our samples. Conclusion: Our study presents a full catalogue of the genome variants of the native Malaysian populations, which complements existing data on genome diversity in Southeast Asians. It implies a specific population history of the native inhabitants, and demonstrates the necessity of more genome sequencing efforts on the multi-ethnic native groups of Malaysia and Southeast Asia.

    Prediction of driver variants in the cancer genome via machine learning methodologies

    Sequencing technologies have led to the identification of many variants in the human genome which could act as disease-drivers. As a consequence, a variety of bioinformatics tools have been proposed for predicting which variants may drive disease, and which may be causatively neutral. After briefly reviewing generic tools, we focus on a subset of these methods specifically geared toward predicting which variants in the human cancer genome may act as enablers of unregulated cell proliferation. We consider the resultant view of the cancer genome indicated by these predictors and discuss ways in which these types of prediction tools may be progressed by further research

    Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the transcripts of a single gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also revealed the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

    CADD: predicting the deleteriousness of variants throughout the human genome

    Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertions and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu

    Chapter: Functional Annotation of Rare Genetic Variants

    Genome-wide association studies have successfully identified a growing number of common variants that robustly associate with a wide range of complex diseases and phenotypes. In the majority of cases though, the variants are predicted to have small to modest effect sizes, and, due to the technologies used, many of the signals discovered so far may not be the causal loci. As rare variation studies begin to explore the lower ranges of the allele frequency spectrum, using whole genome or whole exome sequencing to capture a larger proportion of variants, we expect to find variants with a more direct causal role in the phenotype(s) of interest. Interpreting possible functional mechanisms linking variants with phenotypes will become increasingly important

    How to identify pathogenic mutations among all those variations: Variant annotation and filtration in the genome sequencing era

    High-throughput sequencing technologies have become fundamental for the identification of disease-causing mutations in human genetic diseases both in research and clinical testing contexts. The cumulative number of genes linked to rare diseases is now close to 3,500 with more than 1,000 genes identified between 2010 and 2014 because of the early adoption of Exome Sequencing technologies. However, despite these encouraging figures, the success rate of clinical exome diagnosis remains low due to several factors including wrong variant annotation and nonoptimal filtration practices, which may lead to misinterpretation of disease-causing mutations. In this review, we describe the critical steps of variant annotation and filtration processes to highlight a handful of potential disease-causing mutations for downstream analysis. We report the key annotation elements to gather at multiple levels for each mutation, and which systems are designed to help in collecting this mandatory information. We describe the filtration options, their efficiency, and limits and provide a generic filtration workflow and highlight potential pitfalls through a use case

    Functional Analysis of Genomic Variation and Impact on Molecular and Higher Order Phenotypes

    Reverse genetics methods, particularly the production of gene knockouts and knockins, have revolutionized the understanding of gene function. High-throughput sequencing now makes it practical to exploit reverse genetics to simultaneously study functions of thousands of normal sequence variants and spontaneous mutations that segregate in intercross and backcross progeny generated by mating completely sequenced parental lines. To evaluate this new reverse genetic method we resequenced the genome of one of the oldest inbred strains of mice, DBA/2J, the father of the large family of BXD recombinant inbred strains. We analyzed ~100X whole-genome sequence data for the DBA/2J strain, relative to C57BL/6J, the reference strain for all mouse genomics and the mother of the BXD family. We generated the most detailed picture of molecular variation between the two mouse strains to date and identified 5.4 million sequence polymorphisms, including 4.46 million single nucleotide polymorphisms (SNPs), 0.94 million insertions/deletions (indels), and 20,000 structural variants. We systematically scanned massive databases of molecular phenotypes and ~4,000 classical phenotypes to detect linked functional consequences of sequence variants. In the majority of cases we successfully recovered known genotype-to-phenotype associations, and in several cases we linked sequence variants to novel phenotypes (Ahr, Fh1, Entpd2, and Col6a5). However, our most striking and consistent finding is that apparently deleterious homozygous SNPs, indels, and structural variants have undetectable or very modest additive effects on phenotypes.