    Genome maps across 26 human populations reveal population-specific patterns of structural variation.

    Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome.

    Analysis of five deep-sequenced trio-genomes of the Peninsular Malaysia Orang Asli and North Borneo populations

    Background: Recent advances in genomic technologies have facilitated genome-wide investigation of human genetic variation. However, most efforts have focused on the major populations, and trio genomes of indigenous populations from Southeast Asia remain under-investigated. Results: We analyzed whole-genome deep sequencing data (30x) of five native trios from Peninsular Malaysia and North Borneo, and characterized the genomic variants, including single nucleotide variants (SNVs), small insertions and deletions (indels) and copy number variants (CNVs). We discovered approximately 6.9 million SNVs, 1.2 million indels, and 9000 CNVs in the 15 samples, of which 2.7% of SNVs, 2.3% of indels and 22% of CNVs were novel, implying insufficient coverage of population diversity in existing databases. We identified a higher proportion of novel variants in the Orang Asli (OA) samples, i.e., the indigenous people from Peninsular Malaysia, than in the North Bornean (NB) samples, likely due to the more complex demographic history and long-term isolation of the OA groups. We used the pedigree information to identify de novo variants and estimated the autosomal mutation rates to be 0.81 × 10^-8 to 1.33 × 10^-8, 1.0 × 10^-9 to 2.9 × 10^-9, and 0.001 per site per generation for SNVs, indels, and CNVs, respectively. The trio genomes also allowed for haplotype phasing with high accuracy, which serves as a reference for future genomic studies of OA and NB populations. In addition, high-frequency inherited CNVs specific to OA or NB were identified. One example is a 50-kb duplication in DEFA1B detected only in the Negrito trios, implying plausible effects on host defense against exposure to diverse microbes in the tropical rainforest environment of these hunter-gatherers. The CNVs shared between OA and NB groups were much fewer than those specific to each group.
Nevertheless, we identified a 142-kb duplication in AMY1A in all 15 samples, and this gene is associated with a high-starch diet. Moreover, novel insertions shared with archaic hominids were identified in our samples. Conclusion: Our study presents a full catalogue of the genome variants of the native Malaysian populations, which complements the known genome diversity of Southeast Asians. It suggests a specific population history of the native inhabitants and demonstrates the need for more genome sequencing efforts on the multi-ethnic native groups of Malaysia and Southeast Asia.
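As a rough illustration of how a per-site, per-generation rate like those quoted above is commonly derived from trio data, the sketch below divides a de novo variant count by twice the number of callable sites (each child receives two haploid genomes). This is a generic textbook estimator, not necessarily the procedure used in the study; the function name `de_novo_rate` and all input numbers are hypothetical.

```python
def de_novo_rate(n_de_novo: int, callable_sites: int, n_trios: int = 1) -> float:
    """Estimate a per-site, per-generation mutation rate from trio data.

    Assumes every de novo event is counted once per child and that each
    child contributes two haploid copies of the callable genome.
    """
    if callable_sites <= 0 or n_trios <= 0:
        raise ValueError("callable_sites and n_trios must be positive")
    return n_de_novo / (2 * callable_sites * n_trios)


# Hypothetical inputs: 60 de novo SNVs in one child, 2.7 Gb of callable autosome.
rate = de_novo_rate(n_de_novo=60, callable_sites=2_700_000_000)
print(f"{rate:.2e} per site per generation")
```

In practice the denominator would also be adjusted for the false negative rate of de novo calling, which this sketch leaves out.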

    Multi-platform discovery of haplotype-resolved structural variation in human genomes

    Comprehensive identification and characterisation of germline structural variation within the Iberian population

    One of the central aims of biology and biomedicine has been to characterise and understand genetic variation across humans, both to answer important evolutionary questions and to explain phenotypic variability in disease. Understanding genetic variability is key to studying this relationship (through imputation and GWAS) and to translating the results into improved clinical protocols. Different initiatives have emerged around the world to systematically characterise the genetic variability of specific human populations from whole-genome sequences, usually by selecting geographical regions. Examples such as 1000 Genomes (1000G) [1], GoNL [2], HRC, UK10K [3] or the Estonian population [4] have already identified and characterised millions of genetic variants across different populations. In combination with imputation analysis, these sequence-based projects increase the statistical power and resolution of Genome-Wide Association Studies (GWAS), enabling the discovery of new disease-associated variants [5]. Additionally, genetic variability among population groups is associated with geographic ancestry and can affect disease risk or treatment efficacy differently [6,7]. For this reason, population-specific reference panels are necessary to characterise genetic diversity and to assess its effect on human phenotypes, improving GWAS as one of the cornerstones of precision medicine [7]. Existing genetic variability panels include Single Nucleotide Variants (SNVs) and indels (<50 bp) but are limited in large Structural Variants (SVs) (≥50 bp). Technical and methodological limitations have hindered the discovery of SVs using Next-Generation Sequencing (NGS) technologies, which yielded false discovery rates of 9-89% and recall of 10-70%, depending on the SV type and size [8]. On average, the genomic variation between two human genomes is around 0.1%, but this difference increases to 1.5% when SVs are included [8].
SVs also affect 3-10 times more nucleotides than SNVs [9] (4M SNVs per genome [10]), showing their potential effect on human phenotypes. For this reason, including a complete catalogue of SVs in reference panels will increase the power of GWAS and provide opportunities to find new disease-associated variants. To overcome these limitations, in this thesis we have generated the first genome-wide Iberian haplotype reference panel, mainly focused on structural variants, using 785 samples whole-genome sequenced (WGS) at high coverage (30X) from the GCAT-Genomics for life project. We designed a complete strategy, including extensive benchmarking of multiple variant calling programs, specific Logistic Regression Models (LRMs) for SV types, and phasing strategies, to produce a high-quality and comprehensive genetic variability panel. This strategy was benchmarked using different controlled sets of variants, showing high precision and recall values across all variant types and sizes. Applying this strategy to our GCAT whole-genome samples identified 35,431,441 genetic variants, classified as 30,325,064 SNPs, 5,017,199 small indels (<50 bp), and 89,178 larger SVs (≥50 bp). The latter group was further subclassified into 33,244 deletions, 6,269 duplications, 12,782 insertions, 10,115 inversions, 18,779 transposons and 7,989 translocations, covering all ranges of frequencies and sizes. In addition, 60% of the discovered SVs were not catalogued in any repository, thus adding to current knowledge of SVs in humans. Additionally, 52.44% of common and 71.63% of low-frequency SVs were not included in any haplotype reference panel. These new SVs could therefore be used in GWAS, adding more value to the Iberian-GCAT catalogue. The prediction of the functional impact of the SVs shows that these variants might have a central role in several diseases.
Of all SVs included in the Iberian-GCAT catalogue, 46% overlapped genes (both protein-coding and non-protein-coding), highlighting their potential impact on human traits. In addition, 92.7% of protein-coding genes were located outside low-complexity (repeated) genomic regions, so short reads from NGS are expected to capture the most interpretable SVs in humans [11]. Moreover, 32.93% of SVs affected protein-coding genes with a predicted loss-of-function intolerance (pLI) effect, further supporting the potential implication of these variants in complex diseases and therefore enabling a better explanation of missing heritability. Importantly, taking advantage of high coverage (30X), we accurately determined the genotypes of SVs, enabling them to be phased together with SNVs and indels and increasing the SV phasing accuracy, in contrast to 1000G and GoNL. High coverage also allowed the use of Phasing Informative Reads (PIRs), increasing the phasing performance. The overall strategy enables the community to expand and improve the imputation possibilities within GWAS. The Iberian-GCAT haplotype reference panel created in this thesis accurately imputes common SVs, with nearly 100% agreement with sequencing results. Although the Iberian-GCAT haplotype reference panel can be used in all populations from different continental groups, the imputation performance is highest in European and Latin American populations because of their closer ancestries, as reflected in the number of low-frequency (1% ≤ MAF MAF) variants imputed at high info scores. These results demonstrate the versatility of our resource, whose performance increases in closer ancestries. In general, we observed that imputation accuracy drops as allele frequency decreases, highlighting the need to include more samples in reference panels in order to efficiently impute low-frequency and rare variants, which are normally expected to have more functional impact on diseases.
Finally, we compared the imputation possibilities of the 1000G and GoNL reference panels with those of our Iberian-GCAT reference panel. We observed that the Iberian-GCAT reference panel outperformed the imputation of high-quality SVs by 2.7- and 1.6-fold compared to 1000G and GoNL, respectively. The overall imputation quality is also higher, showing the value of this new resource in GWAS, as it includes more SVs than previous reference panels. The combination of different reference panels will improve the resolution and statistical power of GWAS, thus increasing the chances of finding more risk variants in complex diseases and, ultimately, translating this insight into precision medicine.

    Identification of genomic indels and structural variations using split reads

    Background: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. Results: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales.
We estimate in particular that an individual genome contains ~670,000 indels/SVs. Conclusions: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
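The calibration idea in this abstract (measure sensitivity and positive predictive value on simulated data, then use them to correct observed counts) can be sketched as follows. This is a minimal illustration under assumed definitions, not SRiC's actual implementation; the helper names and all counts are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of simulated true events that the caller recovered."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of calls on simulated data that were true events."""
    return tp / (tp + fp)


def corrected_count(observed_calls: int, sens: float, ppv: float) -> float:
    """Estimate the true event count: keep the calls expected to be real
    (observed * PPV), then scale up for the events that were missed (/ sensitivity)."""
    return observed_calls * ppv / sens


# Hypothetical simulation results for one event class (e.g. short deletions).
sens = sensitivity(tp=900, fn=300)               # 0.75
ppv = positive_predictive_value(tp=900, fp=100)  # 0.90
estimate = corrected_count(observed_calls=1000, sens=sens, ppv=ppv)
print(round(estimate))
```

Applying a separate correction per event class and length bin, as the abstract describes, would amount to repeating this calculation with class-specific sensitivity and PPV values.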

    Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios

    Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively.

    The detection of high-qualified indels in exomes and their effect on cognition

    Genetic insertions/deletions (indels) have been linked to many neurodevelopmental disorders (NDDs) such as autism spectrum disorder (ASD) and intellectual disability (ID). However, although they are the second most common type of genetic variant, they remain to this day difficult to identify and verify, presenting a high number of false positives.
We sought to find a method that would appropriately identify high-quality indels that are likely to be true positives. We built an indel “truth set” using indels from two diagnosis-based family cohorts that were filtered according to a set of threshold values and called by several variant calling tools in order to train three machine learning models to identify the highest quality indels. The two best performing models were then used to identify high-quality indels in a general population cohort that was called using only one variant calling technology. The machine learning models were able to identify higher quality indels that showed an association with IQ, although the effect size was small. The indels predicted by the models also affected a much smaller number of genes per individual than those selected using minimum thresholds alone. The models tend to show an overall improvement in the quality of the indels but would require further work to determine whether they can predict indels with a noticeable and significant effect on IQ.

    Detection of Genomic Structural Variants from Next-Generation Sequencing Data

    Structural variants are genomic rearrangements larger than 50 bp accounting for around 1% of the variation among human genomes. They impact on phenotypic diversity and play a role in various diseases including neurological/neurocognitive disorders and cancer development and progression. Dissecting structural variants from next-generation sequencing data presents several challenges and a number of approaches have been proposed in the literature. In this mini review, we describe and summarize the latest tools (and their underlying algorithms) designed for the analysis of whole-genome sequencing, whole-exome sequencing, custom captures, and amplicon sequencing data, pointing out the major advantages/drawbacks. We also report a summary of the most recent applications of third-generation sequencing platforms. This assessment provides a guided indication (with particular emphasis on human genetics and copy number variants) for researchers involved in the investigation of these genomic events.