175 research outputs found
Tools for the identification of variable and potentially variable tandem repeats
BACKGROUND: Tandem repeat arrays showing variation between sequences within a population, between strains or across species may have functional effects. The increasing availability of genomic sequence data makes routine description of observed variation possible, creating a need for tools to describe such variability. RESULTS: We present a set of programs that facilitate the identification of tandem repeats showing variation across multiple sequences or genomes, and the prediction of potentially polymorphic tandem repeats. The VNTRfinder (Variable Number of Tandem Repeats finder) program enables the detection of sequence length variation between arrays of inter-specific or intra-specific tandem repeats. In the absence of comparable sequences to explore observed variation, predictions are provided describing which tandem repeats are more likely to be variable, to help guide and focus further experimental evaluation. CONCLUSION: These tools represent a resource for researchers interested in tandem repeats in nucleotide sequences that are most likely to be of clinical and evolutionary interest. The tools are available at . Downloadable versions for UNIX/LINUX and WINDOWS which permit the consideration of longer and more numerous sequences are also available
Genetic Classification of Populations using Supervised Learning
There are many instances in genetics in which we wish to determine whether
two candidate populations are distinguishable on the basis of their genetic
structure. Examples include populations which are geographically separated,
case--control studies and quality control (when participants in a study have
been genotyped at different laboratories). This latter application is of
particular importance in the era of large scale genome wide association
studies, when collections of individuals genotyped at different locations are
being merged to provide increased power. The traditional method for detecting
structure within a population is some form of exploratory technique such as
principal components analysis. Such methods, which do not utilise our prior
knowledge of the membership of the candidate populations. are termed
\emph{unsupervised}. Supervised methods, on the other hand are able to utilise
this prior knowledge when it is available.
In this paper we demonstrate that in such cases modern supervised approaches
are a more appropriate tool for detecting genetic differences between
populations. We apply two such methods, (neural networks and support vector
machines) to the classification of three populations (two from Scotland and one
from Bulgaria). The sensitivity exhibited by both these methods is considerably
higher than that attained by principal components analysis and in fact
comfortably exceeds a recently conjectured theoretical limit on the sensitivity
of unsupervised methods. In particular, our methods can distinguish between the
two Scottish populations, where principal components analysis cannot. We
suggest, on the basis of our results that a supervised learning approach should
be the method of choice when classifying individuals into pre-defined
populations, particularly in quality control for large scale genome wide
association studies.Comment: Accepted PLOS On
Iron Age and Anglo-Saxon genomes from East England reveal British migration history
British population history has been shaped by a series of immigrations, including the early Anglo-Saxon migrations after 400 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences from 10 individuals excavated close to Cambridge in the East of England, ranging from the late Iron Age to the middle Anglo-Saxon period. By analysing shared rare variants with hundreds of modern samples from Britain and Europe, we estimate that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations. We gain further insight with a new method, rarecoal, which infers population history and identifies fine-scale genetic ancestry from rare variants. Using rarecoal we find that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain
Multiplex Target Enrichment Using DNA Indexing for Ultra-High Throughput SNP Detection
Screening large numbers of target regions in multiple DNA samples for sequence variation is an important application of next-generation sequencing but an efficient method to enrich the samples in parallel has yet to be reported. We describe an advanced method that combines DNA samples using indexes or barcodes prior to target enrichment to facilitate this type of experiment. Sequencing libraries for multiple individual DNA samples, each incorporating a unique 6-bp index, are combined in equal quantities, enriched using a single in-solution target enrichment assay and sequenced in a single reaction. Sequence reads are parsed based on the index, allowing sequence analysis of individual samples. We show that the use of indexed samples does not impact on the efficiency of the enrichment reaction. For three- and nine-indexed HapMap DNA samples, the method was found to be highly accurate for SNP identification. Even with sequence coverage as low as 8x, 99% of sequence SNP calls were concordant with known genotypes. Within a single experiment, this method can sequence the exonic regions of hundreds of genes in tens of samples for sequence and structural variation using as little as 1 μg of input DNA per sample
The geography of recent genetic ancestry across Europe
The recent genealogical history of human populations is a complex mosaic
formed by individual migration, large-scale population movements, and other
demographic events. Population genomics datasets can provide a window into this
recent history, as rare traces of recent shared genetic ancestry are detectable
due to long segments of shared genomic material. We make use of genomic data
for 2,257 Europeans (the POPRES dataset) to conduct one of the first surveys of
recent genealogical ancestry over the past three thousand years at a
continental scale. We detected 1.9 million shared genomic segments, and used
the lengths of these to infer the distribution of shared ancestors across time
and geography. We find that a pair of modern Europeans living in neighboring
populations share around 10-50 genetic common ancestors from the last 1500
years, and upwards of 500 genetic ancestors from the previous 1000 years. These
numbers drop off exponentially with geographic distance, but since genetic
ancestry is rare, individuals from opposite ends of Europe are still expected
to share millions of common genealogical ancestors over the last 1000 years.
There is substantial regional variation in the number of shared genetic
ancestors: especially high numbers of common ancestors between many eastern
populations likely date to the Slavic and/or Hunnic expansions, while much
lower levels of common ancestry in the Italian and Iberian peninsulas may
indicate weaker demographic effects of Germanic expansions into these areas
and/or more stably structured populations. Recent shared ancestry in modern
Europeans is ubiquitous, and clearly shows the impact of both small-scale
migration and large historical events. Population genomic datasets have
considerable power to uncover recent demographic history, and will allow a much
fuller picture of the close genealogical kinship of individuals across the
world.Comment: Full size figures available from
http://www.eve.ucdavis.edu/~plralph/research.html; or html version at
http://ralphlab.usc.edu/ibd/ibd-paper/ibd-writeup.xhtm
Evidence that duplications of 22q11.2 protect against schizophrenia.
A number of large, rare copy number variants (CNVs) are deleterious for neurodevelopmental disorders, but large, rare, protective CNVs have not been reported for such phenotypes. Here we show in a CNV analysis of 47 005 individuals, the largest CNV analysis of schizophrenia to date, that large duplications (1.5-3.0 Mb) at 22q11.2--the reciprocal of the well-known, risk-inducing deletion of this locus--are substantially less common in schizophrenia cases than in the general population (0.014% vs 0.085%, OR=0.17, P=0.00086). 22q11.2 duplications represent the first putative protective mutation for schizophrenia
Multi-locus genome-wide association analysis supports the role of glutamatergic synaptic transmission in the etiology of major depressive disorder
Major depressive disorder (MDD) is a common psychiatric illness characterized by low mood and loss of interest in pleasurable activities. Despite years of effort, recent genome-wide association studies (GWAS) have identified few susceptibility variants or genes that are robustly associated with MDD. Standard single-SNP (single nucleotide polymorphism)-based GWAS analysis typically has limited power to deal with the extensive heterogeneity and substantial polygenic contribution of individually weak genetic effects underlying the pathogenesis of MDD. Here, we report an alternative, gene-set-based association analysis of MDD in an effort to identify groups of biologically related genetic variants that are involved in the same molecular function or cellular processes and exhibit a significant level of aggregated association with MDD. In particular, we used a text-mining-based data analysis to prioritize candidate gene sets implicated in MDD and conducted a multi-locus association analysis to look for enriched signals of nominally associated MDD susceptibility loci within each of the gene sets. Our primary analysis is based on the meta-analysis of three large MDD GWAS data sets (total N = 4346 cases and 4430 controls). After correction for multiple testing, we found that genes involved in glutamatergic synaptic neurotransmission were significantly associated with MDD (set-based association P = 6.9 X 10(-4)). This result is consistent with previous studies that support a role of the glutamatergic system in synaptic plasticity and MDD and support the potential utility of targeting glutamatergic neurotransmission in the treatment of MDD
Cryptic Distant Relatives Are Common in Both Isolated and Cosmopolitan Genetic Samples
Although a few hundred single nucleotide polymorphisms (SNPs) suffice to infer close familial relationships, high density genome-wide SNP data make possible the inference of more distant relationships such as 2nd to 9th cousinships. In order to characterize the relationship between genetic similarity and degree of kinship given a timeframe of 100–300 years, we analyzed the sharing of DNA inferred to be identical by descent (IBD) in a subset of individuals from the 23andMe customer database (n = 22,757) and from the Human Genome Diversity Panel (HGDP-CEPH, n = 952). With data from 121 populations, we show that the average amount of DNA shared IBD in most ethnolinguistically-defined populations, for example Native American groups, Finns and Ashkenazi Jews, differs from continentally-defined populations by several orders of magnitude. Via extensive pedigree-based simulations, we determined bounds for predicted degrees of relationship given the amount of genomic IBD sharing in both endogamous and ‘unrelated’ population samples. Using these bounds as a guide, we detected tens of thousands of 2nd to 9th degree cousin pairs within a heterogenous set of 5,000 Europeans. The ubiquity of distant relatives, detected via IBD segments, in both ethnolinguistic populations and in large ‘unrelated’ populations samples has important implications for genetic genealogy, forensics and genotype/phenotype mapping studies
- …