12 research outputs found

    A framework for the detection of de novo mutations in family-based sequencing data

    Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods of the individual family members, the known familial relationships, and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially on low-coverage data and on the X chromosome. We further show that PBT achieves high validation rates on empirical parent-offspring sequencing data: whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between the father's age at conception and the number of DNMs on the X chromosome of female offspring, consistent with previous literature reports.
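
    The core of the described approach is a Bayesian calculation over all possible trio genotype configurations. The Python sketch below is not taken from the PBT source; it only illustrates, under assumed parameter values and invented genotype likelihoods, how such a posterior can be obtained by combining per-sample genotype likelihoods, Hardy-Weinberg priors for the parents, Mendelian transmission, and a small mutation prior.

```python
# Illustrative sketch only: a simplified trio de novo posterior computed from
# per-sample genotype likelihoods, Hardy-Weinberg priors for the parents, and
# a prior mutation rate. This is NOT the PhaseByTransmission implementation;
# all parameter values below are assumptions chosen for the example.
import itertools

GENOTYPES = ("AA", "AB", "BB")   # biallelic site: A = reference, B = alternate
MUTATION_PRIOR = 1e-8            # assumed per-site prior for a de novo event
ALT_FREQ = 0.001                 # assumed population alternate-allele frequency

def genotype_prior(gt, q=ALT_FREQ):
    """Hardy-Weinberg prior probability of a parental genotype."""
    return {"AA": (1 - q) ** 2, "AB": 2 * q * (1 - q), "BB": q ** 2}[gt]

def mendelian_prob(child_gt, mother_gt, father_gt):
    """Probability of the child genotype under Mendelian transmission."""
    maternal = {"A": mother_gt.count("A") / 2.0, "B": mother_gt.count("B") / 2.0}
    paternal = {"A": father_gt.count("A") / 2.0, "B": father_gt.count("B") / 2.0}
    return sum(maternal[a_m] * paternal[a_f]
               for a_m, a_f in itertools.product("AB", repeat=2)
               if "".join(sorted(a_m + a_f)) == "".join(sorted(child_gt)))

def de_novo_posterior(child_lik, mother_lik, father_lik):
    """Posterior probability that the child's genotype requires a de novo mutation.

    Each *_lik maps genotype -> P(read data | genotype), i.e. the genotype
    likelihoods emitted by the variant caller for that sample.
    """
    p_dnm = p_total = 0.0
    for c, m, f in itertools.product(GENOTYPES, repeat=3):
        mend = mendelian_prob(c, m, f)
        # Blend Mendelian transmission with the small mutation prior.
        p_transmit = mend * (1 - MUTATION_PRIOR) + (1 - mend) * MUTATION_PRIOR
        joint = (genotype_prior(m) * genotype_prior(f) * p_transmit
                 * child_lik[c] * mother_lik[m] * father_lik[f])
        p_total += joint
        if mend == 0.0:          # configuration only reachable through mutation
            p_dnm += joint
    return p_dnm / p_total if p_total > 0 else 0.0

# Parents confidently homozygous reference, child confidently heterozygous:
# the posterior is close to 1, flagging a candidate DNM for validation.
print(de_novo_posterior(
    child_lik={"AA": 1e-20, "AB": 0.9, "BB": 1e-12},
    mother_lik={"AA": 0.99, "AB": 1e-9, "BB": 1e-18},
    father_lik={"AA": 0.99, "AB": 1e-9, "BB": 1e-18},
))
```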

    Mapping known and novel genetic variation in the human genome: bioinformatic tool development and applications

    The study of human genetics was greatly facilitated by the sequencing of the first human genome in 2001. A race to develop and perfect DNA sequencing technologies and data analysis followed this milestone project, which has since enabled the sequencing of thousands of human genomes. Based on sequencing data from many human genomes, gathered through consortia such as the 1000 Genomes Project and the Genome of the Netherlands, an average human genome was found to vary at a few million loci compared to the genome of an unrelated individual. Roughly 100 million genetic variants have been found so far, and new variation is discovered with every sequenced genome. Thousands of genetic variants have been associated with common and/or rare disease. The processes through which genetic variation results in disease are sometimes linked directly to altering the content or abundance of the product of one of the ~20,000 known genes, and such findings have even enabled new therapies. In many cases, however, the functional consequences of genetic variation were hard to identify precisely. These functional effects could be further explained by relating the genetic variation to more distal regions that interact with a gene, or to effects on DNA organization and conformation. While information about sequence content, as well as about many other relevant DNA features (such as conformation and regulation), can be retrieved through sequencing, the sequencing technology used can have a significant impact on the results. Current sequencing technologies that produce short but highly accurate readouts of the genome are successfully employed to determine the genetic content of most loci in the genome. Analyzing more complex structural variation within a genome, or reconstructing regions of a genome, however, requires long-range information that is cumbersome to obtain from short read-outs. Alternative technologies have emerged that are able to produce very long read-outs of the genome and can offer the information necessary to reconstruct complex regions. These longer read-outs are currently more error-prone, making the analysis of short genetic variation very hard. My work in this thesis concerns the development of appropriate methodologies to accurately extract and make use of all the information that state-of-the-art sequencing technologies produce, and I show how different sequencing technologies are best suited for interrogating the human genome for different types of variation and information. Overall, this thesis illustrates how using the appropriate methodology and technology is key to reaching accurate and clear conclusions from large amounts of genetic data, useful both in a research and in a diagnostic setting. Short-read, accurate sequencing technologies are the benchmark for small and/or rare genetic variation, whereas emerging long-read technologies are well suited for larger, structural variation. Furthermore, by reading longer stretches of DNA, nanopore sequencing may be instrumental for understanding the functional consequences of genetic variation and may facilitate data integration and a paradigm shift towards analyzing an individual's genome in its entirety.

    Functionally distinct ERAP1 and ERAP2 are a hallmark of HLA-A29-(Birdshot) Uveitis

    Birdshot Uveitis (Birdshot) is a rare eye condition that affects HLA-A29-positive individuals and could be considered a prototypic member of the recently proposed ‘MHC-I (major histocompatibility complex class I)-opathy’ family. Genetic studies have pinpointed the endoplasmic reticulum aminopeptidase genes ERAP1 and ERAP2 as shared associations across MHC-I-opathies, which suggests ERAP dysfunction may be a root cause of MHC-I-opathies. We mapped the ERAP1 and ERAP2 haplotypes in 84 Dutch cases and 890 controls. We identified association at variant rs10044354, which mediated a marked increase in ERAP2 expression. We also identified and cloned an independently associated ERAP1 haplotype (tagged by rs2287987) present in more than half of the cases; this ERAP1 haplotype is also the primary risk and protective haplotype for other MHC-I-opathies. We show that the risk ERAP1 haplotype conferred significantly altered expression of ERAP1 isoforms in transcriptomic data (n = 360), resulting in lowered protein expression and distinct enzymatic activity. Both the association for rs10044354 (meta-analysis: odds ratio (OR) [95% CI] = 2.07 [1.58–2.71], P = 1.24 × 10⁻⁷) and for rs2287987 (OR [95% CI] = 2.01 [1.51–2.67], P = 1.41 × 10⁻⁶) replicated and showed a consistent direction of effect in an independent Spanish cohort of 46 cases and 2,103 controls. In both cohorts, the combined rs2287987-rs10044354 haplotype associated with Birdshot more strongly than either variant alone (meta-analysis: P = 3.9 × 10⁻⁹). Finally, we observed that ERAP2 protein expression is dependent on the ERAP1 background across three European populations (n = 3,353). In conclusion, a functionally distinct combination of ERAP1 and ERAP2 is a hallmark of Birdshot and provides a rationale for strategies designed to correct ERAP function for the treatment of Birdshot and MHC-I-opathies more broadly.
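
    For readers less familiar with the reported statistics, the snippet below is a minimal, generic sketch of how an odds ratio and a Wald 95% confidence interval of the kind quoted above are typically derived from case/control carrier counts; the counts used are hypothetical and are not the study's data.

```python
# Generic sketch: odds ratio and Wald 95% CI from a 2x2 case/control table.
# The counts below are hypothetical and do not come from the Birdshot study.
import math

def odds_ratio_ci(case_carrier, case_noncarrier, ctrl_carrier, ctrl_noncarrier):
    """Return (OR, lower, upper) for carriers of a risk allele or haplotype."""
    or_ = (case_carrier * ctrl_noncarrier) / (case_noncarrier * ctrl_carrier)
    se_log_or = math.sqrt(sum(1.0 / n for n in (case_carrier, case_noncarrier,
                                                ctrl_carrier, ctrl_noncarrier)))
    lower = math.exp(math.log(or_) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_) + 1.96 * se_log_or)
    return or_, lower, upper

# Hypothetical counts for illustration only.
print(odds_ratio_ci(case_carrier=50, case_noncarrier=34,
                    ctrl_carrier=320, ctrl_noncarrier=570))
```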