1,219 research outputs found
A map of human genome variation from population-scale sequencing
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research
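As a rough illustration of what a per-base rate of about 10⁻⁸ implies, the sketch below multiplies it by the size of a diploid genome. The haploid genome length of 3.1 billion base pairs is an assumption introduced here, not a figure from the abstract, and the function name is hypothetical. The result should be read as an order-of-magnitude estimate only.

```python
# Approximate rate quoted in the abstract: substitutions per base pair per generation.
MUTATION_RATE_PER_BP = 1e-8

# ASSUMPTION (not from the abstract): haploid human genome length in base pairs.
HAPLOID_GENOME_BP = 3.1e9


def expected_de_novo_substitutions(rate_per_bp, haploid_bp, ploidy=2):
    """Expected number of new substitutions in one child.

    Multiplies the per-base rate by the number of bases inherited,
    counting both parental copies of the genome (ploidy = 2).
    """
    return rate_per_bp * haploid_bp * ploidy


expected = expected_de_novo_substitutions(MUTATION_RATE_PER_BP, HAPLOID_GENOME_BP)
print(f"Roughly {expected:.0f} de novo substitutions per child per generation")
```

Under these assumptions the estimate is on the order of tens of new substitutions per child. A different genome length or a more precise rate would shift the number proportionally.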
An integrated map of genetic variation from 1,092 human genomes
By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations
A global reference for human genetic variation
The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies
The landscape of human STR variation
Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across more than 1000 individuals in Phase 1 of the 1000 Genomes Project. Extensive quality controls show that reliable allelic spectra can be obtained for close to 90% of the STR loci in the genome. We utilize this call set to analyze determinants of STR variation, assess the human reference genome’s representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations. American Society for Engineering Education. National Defense Science and Engineering Graduate Fellowshi
Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project
We present biallelic SNVs called from 2,548 samples across 26 populations from the 1000 Genomes Project, called directly on GRCh38. We believe this will be a useful reference resource for those using GRCh38, representing an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date and providing a resource necessary for the full adoption of GRCh38 by the community. Here, we describe how the call set was created and provide benchmarking data describing how our call set compares to that produced by the final phase of the 1000 Genomes Project on GRCh37
Investigating genome-wide transcriptional and methylomic consequences of a balanced t(1;11) translocation linked to major mental illness
Schizophrenia, bipolar disorder and major depressive disorder are devastating
psychiatric conditions with a complex, overlapping genetic and environmental
architecture. Previously, a family has been reported where a balanced chromosomal
translocation between chromosomes 1 and 11 [t(1;11)] shows significant linkage to
these disorders. This translocation transects three genes: Disrupted in schizophrenia-1
(DISC1) on chromosome 1, a non-coding RNA, Disrupted in schizophrenia-2
(DISC2) antisense to DISC1, and a non-coding transcript, DISC1 fusion partner-1
(DISC1FP1) on chromosome 11, all of which could result in pathogenic properties in
the context of the translocation. This thesis focuses on the genome-wide effects of the
t(1;11) translocation, primarily examining differences in gene expression and DNA
methylation, using various biological samples from the t(1;11) family.
To assess the genome-wide effects of the t(1;11) translocation on methylation, DNA
methylation was profiled in whole-blood from 41 family members using the Infinium
HumanMethylation450 BeadChip. Significant differential methylation was observed
within the translocation breakpoint regions on chromosomes 1 and 11. Downstream
analysis identified additional regions of differential methylation outwith these
chromosomes, while pathway analysis showed terms related to psychiatric disorders
and neurodevelopment were enriched amongst differentially methylated genes, in
addition to more general terms pertaining to cellular function. Using induced
pluripotent stem cell (iPSC) technology, neuronal samples were developed from
fibroblasts in a subset of individuals profiled for genome-wide methylation in whole
blood (N = 6) with an aim to replicate the significant findings around the breakpoint
regions. Here, methylation was profiled using the Infinium HumanMethylation450
BeadChip’s successor: the Infinium MethylationEPIC BeadChip. The results from the
blood-based study failed to replicate in the neuronal samples, which could be attributed
to low statistical power or tissue-specific factors such as methylation quantitative trait
loci. The differences in methylation in the most significantly differentially methylated
loci were found to be driven by a single individual, rendering further interpretation of
the findings from this analysis difficult without additional samples. Cross-tissue
analyses of DNA methylation were performed on blood and neuronal DNA from these
six individuals, revealing little correlation between cell types.
DISC1 is central to a network of interacting protein partners, including the
transcription factor ATF4 and PDE4, both of which are associated with the cAMP
signalling pathway. Haploinsufficiency of DISC1 due to the translocation may
therefore be disruptive to cAMP-mediated gene expression. In order to identify
transcriptomic effects which may be related to the t(1;11) translocation, genome-wide
expression profiling was performed in lymphoblastoid cell line RNA from 13 family
members. No transcripts were found to be differentially expressed at the genome-wide
significant level. A post-hoc power analysis suggested that more samples would be
required in order to detect genome-wide significant differential expression. However,
applying a fold-change cut-off to the data identified a number of candidate genes for
follow-up analysis, including SORL1: a member of the brain-expressed Sortilin gene
family. Sortilin genes have been linked to multiple psychiatric disorders including
schizophrenia, bipolar disorder and Alzheimer’s disease. Follow-up analyses of
Sortilin family members were performed in a Disc1 mouse model of schizophrenia,
containing an amino acid substitution (L100P). Here, developmental gene expression
profiling was performed with an additional aim to optimise and validate work
performed by others using this mouse model. However, results from these experiments
were variable between the two independent batches of mice tested. Additional investigation
of Sortilin family genes was performed using GWAS data from human samples, using
machine learning techniques to identify epistatic interactions linked to depression and
brain function, revealing no statistically significant interactions.
The results presented in this thesis suggest a potential mechanism for differential DNA
methylation in the context of chromosomal translocations, and suggest mechanisms
whereby increased risk of illness is conferred upon translocation carriers through
dysregulation of transcription and DNA methylation
Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division
In primates and other animals, reverse transcription of mRNA followed by genomic integration creates retroduplications. Expressed retroduplications are either “retrogenes” coding for functioning proteins, or expressed “processed pseudogenes,” which can function as noncoding RNAs. To date, little is known about the variation in retroduplications in terms of their presence or absence across individuals in the human population. We have developed new methodologies that allow us to identify “novel” retroduplications (i.e., those not present in the reference genome), to find their insertion points, and to genotype them. Using these methods, we catalogued and analyzed 174 retroduplication variants in almost one thousand humans, which were sequenced as part of Phase 1 of The 1000 Genomes Project Consortium. The accuracy of our data set was corroborated by (1) multiple lines of sequencing evidence for retroduplication (e.g., depth of coverage in exons vs. introns), (2) experimental validation, and (3) the fact that we can reconstruct a correct phylogenetic tree of human subpopulations based solely on retroduplications. We also show that parent genes of retroduplication variants tend to be expressed at the M-to-G1 transition in the cell cycle and that M-to-G1 expressed genes have more copies of fixed retroduplications than genes expressed at other times. These findings suggest that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it
HapZipper: sharing HapMap populations just got easier
The rapidly growing amount of genomic sequence data being generated and made publicly available necessitates the development of new data storage and archiving methods. The vast amount of data being shared and manipulated also creates new challenges for network resources. Thus, developing advanced data compression techniques is becoming an integral part of data production and analysis. The HapMap project is one of the largest public resources of human single-nucleotide polymorphisms (SNPs), characterizing over 3 million SNPs genotyped in over 1000 individuals. The standard format and biological properties of HapMap data suggest that a dedicated genetic compression method can outperform generic compression tools. We propose a compression methodology for genetic data by introducing HapZipper, a lossless compression tool tailored to compress HapMap data beyond benchmarks defined by generic tools such as gzip, bzip2 and lzma. We demonstrate the usefulness of HapZipper by compressing HapMap 3 populations to <5% of their original sizes. HapZipper is freely downloadable from https://bitbucket.org/pchanda/hapzipper/downloads/HapZipper.tar.bz
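To make the idea of a generic-tool benchmark concrete, the sketch below measures how far gzip, bzip2 and lzma shrink a block of genotype-like text using Python's standard library. It does not implement HapZipper. The input is synthetic 0/1 data generated for illustration, not real HapMap data, and the function name, matrix dimensions and allele-frequency range are all assumptions made here.

```python
import bz2
import gzip
import lzma
import random


def compression_ratios(data: bytes) -> dict:
    """Return compressed size divided by original size for three generic compressors."""
    compressors = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(compress(data)) / len(data) for name, compress in compressors.items()}


# ASSUMPTION: a synthetic stand-in for haplotype data, one SNP per row and one
# haplotype per column, with "1" marking the minor allele. Not real HapMap data.
rng = random.Random(0)  # fixed seed so the input is reproducible
n_snps, n_haplotypes = 2000, 100
rows = []
for _ in range(n_snps):
    minor_allele_freq = rng.uniform(0.0, 0.2)
    rows.append("".join("1" if rng.random() < minor_allele_freq else "0"
                        for _ in range(n_haplotypes)))
data = "\n".join(rows).encode("ascii")

ratios = compression_ratios(data)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of original size")
```

The ratios from synthetic input say nothing about real HapMap files. The sketch only shows the kind of baseline a dedicated tool would be compared against.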