
Public Library of Science (PLOS)

Analysis of concordance of different haplotype block partitioning algorithms

Author: Indap Amit R
Marth Gabor T
Olivier Michael
Struble Craig A
Tonellato Peter
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Different classes of haplotype block algorithms exist and the ideal dataset to assess their performance would be to comprehensively re-sequence a large genomic region in a large population. Such data sets are expensive to collect. Alternatively, we performed coalescent simulations to generate haplotypes with a high marker density and compared block partitioning results from diversity based, LD based, and information theoretic algorithms under different values of SNP density and allele frequency. RESULTS: We simulated 1000 haplotypes using the standard coalescent for three world populations – European, African American, and East Asian – and applied three classes of block partitioning algorithms – diversity based, LD based, and information theoretic. We assessed algorithm differences in number, size, and coverage of blocks inferred under different conditions of SNP density, allele frequency, and sample size. Each algorithm inferred blocks differing in number, size, and coverage under different density and allele frequency conditions. Different partitions had few if any matching block boundaries. However, they still overlapped, and a high percentage of the total chromosomal region was common to all methods. This percentage was generally higher with a higher density of SNPs and when rarer markers were included. CONCLUSION: A gold standard definition of a haplotype block is difficult to achieve, but collecting haplotypes covered with a high density of SNPs, partitioning them with a variety of block algorithms, and identifying regions common to all methods may be the best way to identify genomic regions that harbor SNP variants that cause disease.
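
The idea of "regions common to all methods" in the abstract above can be illustrated as an interval intersection. The following is only a minimal sketch, not the authors' procedure: it assumes each algorithm's partition is given as a sorted list of non-overlapping half-open (start, end) blocks, and the function names and coordinates are hypothetical.

```python
def common_regions(partitions):
    """Intersect several block partitions.

    Each partition is assumed to be a sorted list of non-overlapping,
    half-open (start, end) intervals. Returns the intervals covered by
    a block in every partition.
    """
    common = list(partitions[0])
    for blocks in partitions[1:]:
        shared = []
        i = j = 0
        # Two-pointer sweep over both sorted interval lists.
        while i < len(common) and j < len(blocks):
            start = max(common[i][0], blocks[j][0])
            end = min(common[i][1], blocks[j][1])
            if start < end:
                shared.append((start, end))
            # Advance whichever interval ends first.
            if common[i][1] < blocks[j][1]:
                i += 1
            else:
                j += 1
        common = shared
    return common


def fraction_common(partitions, region_length):
    """Fraction of the region that lies inside a block under every method."""
    covered = sum(end - start for start, end in common_regions(partitions))
    return covered / region_length


# Hypothetical block partitions from three algorithms over a 100-unit region.
diversity_blocks = [(0, 50), (60, 100)]
ld_blocks = [(10, 70), (80, 100)]
information_blocks = [(0, 40), (45, 95)]
example = [diversity_blocks, ld_blocks, information_blocks]
```

Note that no block boundary needs to coincide across the three partitions for the shared fraction to be large, which matches the pattern the abstract reports.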

Directory of Open Access Journals

Whole genome profiling of spontaneous and chemically induced mutations in Toxoplasma gondii

Author: Benenati Brian
Blader Ira J
Brown Kevin M
Coleman Bradley I
Farrell Andrew
Gubbels Marc-Jan
Marth Gabor T
Publication venue: Digital Commons@Becker
Publication date: 01/01/2014

BACKGROUND: Next generation sequencing is helping to overcome limitations in organisms less accessible to classical or reverse genetic methods by facilitating whole genome mutational analysis studies. One traditionally intractable group, the Apicomplexa, contains several important pathogenic protozoan parasites, including the Plasmodium species that cause malaria. Here we apply whole genome analysis methods to the relatively accessible model apicomplexan, Toxoplasma gondii, to optimize forward genetic methods for chemical mutagenesis using N-ethyl-N-nitrosourea (ENU) and ethylmethane sulfonate (EMS) at varying dosages. RESULTS: By comparing three different lab-strains we show that spontaneously generated mutations reflect genome composition, without nucleotide bias. However, the single nucleotide variations (SNVs) are not distributed randomly over the genome; most of these mutations reside either in non-coding sequence or are silent with respect to protein coding. This is in contrast to the random genomic distribution of mutations induced by chemical mutagenesis. Additionally, we report a genome wide transition vs transversion ratio (ti/tv) of 0.91 for spontaneous mutations in Toxoplasma, with a slightly higher rate of 1.20 and 1.06 for variants induced by ENU and EMS respectively. We also show that in the Toxoplasma system, surprisingly, both ENU and EMS have a proclivity for inducing mutations at A/T base pairs (78.6% and 69.6%, respectively). CONCLUSIONS: The number of SNVs between related laboratory strains is relatively low and managed by purifying selection away from changes to amino acid sequence. From an experimental mutagenesis point of view, both ENU (24.7%) and EMS (29.1%) are more likely to generate variation within exons than would naturally accumulate over time in culture (19.1%), demonstrating the utility of these approaches for yielding proportionally greater changes to the amino acid sequence. 
These results will not only direct the methods of future chemical mutagenesis in Toxoplasma, but also aid in designing forward genetic approaches in less accessible pathogenic protozoa. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2164-15-354) contains supplementary material, which is available to authorized users.
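
The transition vs transversion ratio (ti/tv) quoted in the abstract above can be computed from a list of single nucleotide changes. This is a minimal sketch under the standard definition (transitions are purine-to-purine or pyrimidine-to-pyrimidine changes, all other substitutions are transversions); the function names and the sample data are hypothetical and are not taken from the paper.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Return transitions / transversions for (ref, alt) base pairs.

    Entries where ref equals alt are ignored. Returns infinity when
    there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


# Hypothetical SNV calls: two transitions and two transversions.
sample_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
```

Because there are twice as many possible transversions as transitions, a ratio of about 0.5 would be expected if every substitution were equally likely.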

Digital Commons@Becker

The Sequence Alignment/Map format and SAMtools

Author: A. Wysoker
B. Handsaker
G. Abecasis
G. Marth
H. Li
J. Ruan
Langmead
Mardis
N. Homer
R. Durbin
T. Fennell
Publication venue: Oxford University Press
Publication date: 30/01/2013

Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.

Harvard University - DASH

A standard variation file format for human genome sequences

Author: Batchelor Colin
Cunningham Fiona
Eilbeck Karen
Flicek Paul
Marth Gabor T
Moore Barry
Reese Martin G
Salas Fidel
Stein Lincoln
Yandell Mark
Publication venue: BioMed Central
Publication date: 01/01/2010

Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.

Harvard University - DASH

Expression divergence measured by transcriptome sequencing of four yeast species

Author: Barnett Derek
Busby Michele A
Chuang Jeffrey H
Costa Allen M
Gray Jesse M
Marth Gabor T
Springer Michael
Stewart Chip
Stromberg Michael P
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: The evolution of gene expression is a challenging problem in evolutionary biology, for which accurate, well-calibrated measurements and methods are crucial. RESULTS: We quantified gene expression with whole-transcriptome sequencing in four diploid, prototrophic strains of Saccharomyces species grown under the same condition to investigate the evolution of gene expression. We found that variation in expression is gene-dependent, with large variations in each gene's expression between replicates of the same species. This confounds the identification of genes differentially expressed across species. To address this, we developed a statistical approach to establish significance bounds for inter-species differential expression in RNA-Seq data based on the variance measured across biological replicates. This metric estimates the combined effects of technical and environmental variance, as well as Poisson sampling noise, by isolating each component. Despite a paucity of large expression changes, we found a strong correlation between the variance of gene expression change and species divergence (R² = 0.90). CONCLUSION: We provide an improved methodology for measuring gene expression changes in evolutionarily diverged species using RNA-Seq, where experimental artifacts can mimic evolutionary effects. GEO Accession Number: GSE32679

Directory of Open Access Journals

AutoSNPdb: an annotated single nucleotide polymorphism database for crop plants

Author: Altschul
Barker
Batley
C. Duran
D. Edwards
D. Wood
Huang
J. Batley
M. Imelfort
Marth
N. Appleby
Savage
Syvänen
T. Clark
Taillon-Miller
Publication venue: Oxford University Press
Publication date: 01/01/2009

Single nucleotide polymorphisms (SNPs) may be considered the ultimate genetic marker as they represent the finest resolution of a DNA sequence (a single nucleotide), are generally abundant in populations and have a low mutation rate. Analysis of assembled EST sequence data provides a cost-effective means to identify large numbers of SNPs associated with functional genes. We have developed an integrated SNP discovery pipeline, which identifies SNPs from assembled EST sequences. The results are maintained in a custom relational database along with EST source and annotation information. The current database hosts data for the important crops rice, barley and Brassica. Users may rapidly identify polymorphic sequences of interest through BLAST sequence comparison, keyword searches of annotations derived from UniRef90 and GenBank comparisons, GO annotations or in genes corresponding to syntenic regions of reference genomes. In addition, SNPs between specific varieties may be identified for targeted mapping and association studies. SNPs are viewed using a user-friendly graphical interface. The database is freely accessible at http://autosnpdb.qfab.org.au/

University of Queensland eSpace

The variant call format and VCFtools

Author: A. Auton
C. A. Albers
Durbin
E. Banks
G. Abecasis
G. Lunter
G. McVean
G. T. Marth
M. A. DePristo
P. Danecek
R. Durbin
R. E. Handsaker
S. T. Sherry
Publication venue: Oxford University Press
Publication date: 01/01/2011

Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.
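
To make the layout described above concrete, the sketch below splits one VCF data line into its eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and unpacks the semicolon-separated INFO annotations. It is a simplified illustration in Python, not VCFtools or its Perl API: the function name is hypothetical, the sample line is illustrative, and header lines, genotype columns and value typing are ignored.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse the eight fixed columns of a single VCF data line.

    Simplifications (assumptions of this sketch): header lines starting
    with '#' are not handled, per-sample genotype columns are dropped,
    and INFO values are kept as strings.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # Multiple alternate alleles are comma-separated.
    record["ALT"] = record["ALT"].split(",")
    info = {}
    for entry in record["INFO"].split(";"):
        if entry == ".":
            continue  # '.' marks a missing value
        key, sep, value = entry.partition("=")
        # Keys without '=' are flags, stored here as True.
        info[key] = value if sep else True
    record["INFO"] = info
    return record


# Illustrative data line, tab-separated.
sample_line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(sample_line)
```

For real files, an established parser is the safer choice, since compressed input, headers and typed INFO fields all need handling that this sketch leaves out.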

Oxford University Research Archive

Sequence analysis and characterization of active human Alu subfamilies based on the 1000 Genomes pilot project

Author: Ashley B. Hotard
Catherine C. Fontenot
Chip Stewart
Gabor T. Marth
Jerilyn A. Walker
Jessica Storer
Mark A. Batzer
Megan C. Ranck
Miriam K. Konkel
The 1000 Genomes Consortium
Publication venue: LSU Digital Commons
Publication date: 01/01/2015

© The Author(s) 2015. The goal of the 1000 Genomes Consortium is to characterize human genome structural variation (SV), including forms of copy number variations such as deletions, duplications, and insertions. Mobile element insertions, particularly Alu elements, are major contributors to genomic SV among humans. During the pilot phase of the project we experimentally validated 645 (611 intergenic and 34 exon targeted) polymorphic young Alu insertion events, absent from the human reference genome. Here, we report high resolution sequencing of 343 (322 unique) recent Alu insertion events, along with their respective target site duplications, precise genomic breakpoint coordinates, subfamily assignment, percent divergence, and estimated A-rich tail lengths. All the sequenced Alu loci were derived from the Alu Y lineage with no evidence of retrotransposition activity involving older Alu families (e.g., AluJ and AluS). AluYa5 is currently the most active Alu subfamily in the human lineage, followed by AluYb8, and many others including three newly identified subfamilies we have termed AluYb7a3, AluYb8b1, and AluYa4a1. This report provides the structural details of 322 unique Alu variants from individual human genomes collectively adding about 100 kb of genomic variation. Many Alu subfamilies are currently active in human populations, including a surprising level of AluY retrotransposition. Human Alu subfamilies exhibit continuous evolution with potential drivers sprouting new Alu lineages.