74 research outputs found

Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome

Author: Margulies Elliott H.
Pachter Lior
Publication venue: Cold Spring Harbor Laboratory Press
Publication date: 01/06/2007

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization

Caltech Authors

Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler

Author: Birney Ewan
Margulies Elliott H.
McEwen Gayle K.
Zerbino Daniel R.
Publication venue: Public Library of Science

BACKGROUND: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. PRINCIPAL FINDINGS: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. CONCLUSIONS: These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler
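The N50 statistic quoted in this abstract can be illustrated with a small sketch. This is not code from Velvet; the function name `n50` and the convention used (the largest length L such that scaffolds of length at least L cover half or more of the total assembled bases) are assumptions chosen for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Assumed convention: walk the lengths from longest to shortest and
    return the length at which the running total first reaches at least
    half of the summed length of all scaffolds.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical scaffold lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because the statistic is weighted by length, a few long scaffolds dominate it, which is why it is reported instead of a plain median.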

Directory of Open Access Journals

PubMed Central

A Bioinformatics Approach for Determining Sample Identity from Different Lanes of High-Throughput Sequencing Data

Author: Ajay Subramanian S.
Goldfeder Rachel L.
Margulies Elliott H.
Ozel Abaan Hatice
Parker Stephen C. J.
Publication venue: Public Library of Science
Publication date: 01/01/2011

The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian sized genome (∼3Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test if any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the gender of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality
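The idea of comparing genotype concordance between two lanes can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `concordance_rate`, the dictionary layout (site mapped to genotype string), and the example calls are all hypothetical.

```python
def concordance_rate(calls_a, calls_b):
    """Fraction of shared sites at which two lanes report the same genotype.

    calls_a, calls_b: dicts mapping a site key, e.g. (chromosome, position),
    to a genotype string such as "AG". Only sites called in both lanes are
    compared. Returns None when the lanes share no called sites.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return None
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return matches / len(shared)


if __name__ == "__main__":
    # Hypothetical genotype calls from two lanes, for illustration only.
    lane1 = {("chr1", 100): "AG", ("chr1", 200): "CC", ("chr2", 50): "TT"}
    lane2 = {("chr1", 100): "AG", ("chr1", 200): "CT", ("chr3", 5): "GG"}
    print(concordance_rate(lane1, lane2))
```

Under the abstract's premise, rates for lanes from the same sample would cluster well above rates for lanes from different samples; the cutoff separating the two groups has to come from real data and is not assumed here.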

CiteSeerX

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Early History of Mammals Is Elucidated with the ENCODE Multiple Species Sequencing Data

Author: Antonarakis Stylianos E
Margulies Elliott H
Montoya-Burgos Juan I
Nikolaev Sergey
Nyffeler Bruno
NISC Comparative Sequencing Program
Rougemont Jacques
Publication venue: Public Library of Science
Publication date: 01/01/2007

Understanding the early evolution of placental mammals is one of the most challenging issues in mammalian phylogeny. Here, we addressed this question by using the sequence data of the ENCODE consortium, which include 1% of mammalian genomes in 18 species belonging to all main mammalian lineages. Phylogenetic reconstructions based on an unprecedented amount of coding sequences taken from 218 genes resulted in a highly supported tree placing the root of Placentalia between Afrotheria and Exafroplacentalia (Afrotheria hypothesis). This topology was validated by the phylogenetic analysis of a new class of genomic phylogenetic markers, the conserved noncoding sequences. Applying the tests of alternative topologies on the coding sequence dataset resulted in the rejection of the Atlantogenata hypothesis (Xenarthra grouping with Afrotheria), while this test rejected the second alternative scenario, the Epitheria hypothesis (Xenarthra at the base), when using the noncoding sequence dataset. Thus, the two datasets support the Afrotheria hypothesis; however, none can reject both of the remaining topological alternatives

Public Library of Science (PLOS)

CiteSeerX

Directory of Open Access Journals

PubMed Central

Archive ouverte UNIGE

SNPs in Multi-Species Conserved Sequences (MCS) as useful markers in association studies: a practical approach

Author: Gregory Simon G
Haines Jonathan L
Hauser Stephen L
Kenealy Shannon J
Margulies Elliott H
McCauley Jacob L
Mortlock Douglas P
Oksenberg Jorge R
Pericak-Vance Margaret A
Schnetz-Boutaud Nathalie
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Although genes play a key role in many complex diseases, the specific genes involved in most complex diseases remain largely unidentified. Their discovery will hinge on the identification of key sequence variants that are conclusively associated with disease. While much attention has been focused on variants in protein-coding DNA, variants in noncoding regions may also play many important roles in complex disease by altering gene regulation. Since the vast majority of noncoding genomic sequence is of unknown function, this increases the challenge of identifying "functional" variants that cause disease. However, evolutionary conservation can be used as a guide to indicate regions of noncoding or coding DNA that are likely to have biological function, and thus may be more likely to harbor SNP variants with functional consequences. To help bias marker selection in favor of such variants, we devised a process that prioritizes annotated SNPs for genotyping studies based on their location within Multi-species Conserved Sequences (MCSs) and used this process to select SNPs in a region of linkage to a complex disease. This allowed us to evaluate the utility of the chosen SNPs for further association studies. Previously, a region of chromosome 1q43 was linked to Multiple Sclerosis (MS) in a genome-wide screen. We chose annotated SNPs in the region based on location within MCSs (termed MCS-SNPs). We then obtained genotypes for 478 MCS-SNPs in 989 individuals from MS families. RESULTS: Analysis of our MCS-SNP genotypes from the 1q43 region and comparison to HapMap data confirmed that annotated SNPs in MCS regions are frequently polymorphic and show subtle signatures of selective pressure, consistent with previous reports of genome-wide variation in conserved regions. We also present an online tool that allows MCS data to be directly exported to the UCSC genome browser so that MCS-SNPs can be easily identified within genomic regions of interest. CONCLUSION: Our results showed that MCS can easily be used to prioritize markers for follow-up and candidate gene association studies. We believe that this novel approach demonstrates a paradigm for expediting the search for genes contributing to complex diseases.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Miami: Scholarship Miami

A high-resolution map of human evolutionary constraint using 29 mammals.

Author: Alföldi Jessica
Baldwin Jen
Baylor College of Medicine Human Genome Sequencing Center Sequencing Team
Beal Kathryn
Birney Ewan
Bloom Toby
Broad Institute Sequencing Platform and Whole Genome Assembly Team
Chang Jean
Chin Chee Whye
Clamp Michele
Clawson Hiram
Cree Andrew
Cuff James
Delehaunty Kim
Di Palma Federica
Dihn Huyen H
Dooling David
Ernst Jason
Fitzgerald Stephen
Flicek Paul
Fowler Gerald
Fronik Catrina
Fulton Bob
Fulton Lucinda
Garber Manuel
Genome Institute at Washington University
Gibbs Richard A
Gnerre Sante
Goldman Nick
Graves Tina
Green Eric D
Guttman Mitchell
Haussler David
Heiman Dave
Herrero Javier
Holloway Alisha K
Hubisz Melissa J
Jaffe David B
Jhangiani Shalili
Jordan Gregory
Joshi Vandita
Jungreis Irwin
Kellis Manolis
Kent W James
Kheradpour Pouya
Kostka Dennis
Kovar Christie L
Lander Eric S
Lara Marcia
Lee Sandra
Lewis Lora R
Lin Michael F
Lindblad-Toh Kerstin
Lowe Craig B
Mardis Elaine R
Margulies Elliott H
Martins Andre L
Massingham Tim
Mauceli Evan
Minx Patrick
Moltke Ida
Muzny Donna M
Nazareth Lynne V
Nicol Robert
Nusbaum Chad
Okwuonu Geoffrey
Parker Brian J
Pedersen Jakob S
Pollard Katherine S
Raney Brian J
Rasmussen Matthew D
Robinson Jim
Santibanez Jireh
Siepel Adam
Sodergren Erica
Stark Alexander
Vilella Albert J
Ward Lucas D
Warren Wesley C
Washietl Stefan
Weinstock George M
Wen Jiayu
Wilkinson Jane
Wilson Richard K
Worley Kim C
Xie Xiaohui
Young Sarah
Zody Michael C
Zuk Or
Publication venue: eScholarship, University of California
Publication date: 01/10/2011

The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ∼4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ∼60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease

eScholarship - University of California

Extensive Evolutionary Changes in Regulatory Element Activity during Human Origins Are Associated with Altered Gene Expression and Positive Selection

Author: A Siepel
AG Xu
AK Holloway
Alok K. Tewari
AP Boyle
AS Hinrichs
B Paten
Bum-Kyu Lee
CC Babbitt
CE Cain
CL Zhang
Courtney C. Babbitt
CY McLean
D Brawand
D Graur
D Schmidt
Darin London
DT Odom
E Consortium
E Shaulian
Elliott H. Margulies
GA Wray
GE Crawford
GM Cooper
Gregory A. Wray
Gregory E. Crawford
H Li
H Xi
HY Chang
J Ernst
J Zhang
JF Degner
JL Rinn
JM Good
Joshua M. Akey
JT Robinson
KS Pollard
L Song
Lingyun Song
M Caceres
M Robertson
M Somel
Matthew Wortham
MC King
MD Robinson
MJ Buck
MV Karamouzis
MV Olson
Nathan C. Sheffield
ND Heintzman
Olivier Fedrigo
P Khaitovich
PA t Hoen
PJ Bickel
PJ Sabo
R Blekhman
R Eferl
R Haygood
RK Bradley
S Prabhakar
SB Carroll
SC Biddie
SC Parker
SJ Sholtis
SL Pond
Stephen C. J. Parker
Terrence S. Furey
TS Mikkelsen
V Orgogozo
V Papadopoulou
Vishwanath R. Iyer
W Enard
W Gilbert
WS Wong
Y Gilad
Y Shibata
Yoichiro Shibata
Publication venue: Public Library of Science
Publication date: 01/01/2012

Understanding the molecular basis for phenotypic differences between humans and other primates remains an outstanding challenge. Mutations in non-coding regulatory DNA that alter gene expression have been hypothesized as a key driver of these phenotypic differences. This has been supported by differential gene expression analyses in general, but not by the identification of specific regulatory elements responsible for changes in transcription and phenotype. To identify the genetic source of regulatory differences, we mapped DNaseI hypersensitive (DHS) sites, which mark all types of active gene regulatory elements, genome-wide in the same cell type isolated from human, chimpanzee, and macaque. Most DHS sites were conserved among all three species, as expected based on their central role in regulating transcription. However, we found evidence that several hundred DHS sites were gained or lost on the lineages leading to modern human and chimpanzee. Species-specific DHS site gains are enriched near differentially expressed genes, are positively correlated with increased transcription, show evidence of branch-specific positive selection, and overlap with active chromatin marks. Species-specific sequence differences in transcription factor motifs found within these DHS sites are linked with species-specific changes in chromatin accessibility. Together, these indicate that the regulatory elements identified here are genetic contributors to transcriptional and phenotypic differences among primate species

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Carolina Digital Repository

Texas ScholarWorks

FigShare

A High-Resolution Map of Human Evolutionary Constraint Using 29 Mammals

Author: A Keinan
A Siepel
A Stark
Adam Siepel
Albert J. Vilella
Alexander Stark
Alisha K. Holloway
Andre L. Martins
Brian J. Parker
Brian J. Raney
CB Lowe
Christie L. Kovar
Craig B. Lowe
D Altshuler
D Baek
D Pillas
D Schmidt
David B. Jaffe
David Haussler
Dennis Kostka
Donna M. Muzny
Elaine R. Mardis
Elliott H. Margulies
Eric D. Green
Eric S. Lander
ES Lander
ET Wang
EV Davydov
Evan Mauceli
Ewan Birney
F Chiaromonte
Federica Di Palma
G Bejerano
Genome 10K Community Of Scientists
George M. Weinstock
GM Cooper
Gregory Jordan
Hiram Clawson
Ida Moltke
Irwin Jungreis
J Ernst
J Harrow
JA Drake
Jakob S. Pedersen
James Cuff
Jason Ernst
Javier Herrero
Jean Chang
Jessica Alföldi
Jiayu Wen
Jim Robinson
JS Pedersen
JT Lee
JW Thomas
K Lindblad-Toh
Katherine S. Pollard
Kathryn Beal
KD Pruitt
Kerstin Lindblad-Toh
Kim C. Worley
KS Pollard
Lucas D. Ward
M Clamp
M Garber
M Guttman
M Kellis
Manolis Kellis
Manuel Garber
Marcia Lara
Maria L. Martínez-Chantar
Matthew D. Rasmussen
Melissa J. Hubisz
MF Lin
Michael C. Zody
Michael F. Lin
Michele Clamp
Mitchell Guttman
MJ Hubisz
Nick Goldman
Or Zuk
P Kheradpour
Paul Flicek
Pouya Kheradpour
RA Gibbs
RH Waterston
Richard A. Gibbs
Richard K. Wilson
S Gnerre
S Maenner
S Meader
S Prabhakar
S Tumpel
S Washietl
Sante Gnerre
Stefan Washietl
Stephen Fitzgerald
Tim Massingham
TS Mikkelsen
W. James Kent
Wesley C. Warren
X Lampe
X Xie
Xiaohui Xie
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease. National Human Genome Research Institute (U.S.); National Institute of General Medical Sciences (U.S.) (Grant number GM82901); National Science Foundation (U.S.), Postdoctoral Fellowship (Award 0905968); National Science Foundation (U.S.), Career (0644282); National Institutes of Health (U.S.) (R01-HG004037); Alfred P. Sloan Foundation; Austrian Science Fund, Erwin Schrodinger Fellowshi

DSpace@MIT

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Copenhagen University Research Information System

PubMed Central

eScholarship - University of California

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project

We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view about chromatin structure has emerged, including its interrelationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded novel mechanistic and evolutionary insights about the functional landscape of the human genome. Together, these studies are defining a path forward to pursue a more-comprehensive characterisation of human genome function

Carolina Digital Repository

Functional constraint and small insertions and deletions in the ENCODE regions of the human genome.

Author: Andrew Toby
Balding David J
Clark Taane G
Cooper Gregory M
Margulies Elliott H
Mullikin James C
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2007

BACKGROUND: We describe the distribution of indels in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate the potential contributions of small insertion and deletion polymorphisms (indels) to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint. RESULTS: Indel rates are observed to be reduced approximately 20-fold to 60-fold in exonic regions, 5-fold to 10-fold in sequence that exhibits high evolutionary constraint in mammals, and up to 2-fold in some classes of regulatory elements (for instance, formaldehyde assisted isolation of regulatory elements [FAIRE] and hypersensitive sites). In addition, some noncoding transcription and other chromatin mediated regulatory sites also have reduced indel rates. Overall indel rates for these data are estimated to be smaller than single nucleotide polymorphism (SNP) rates by a factor of approximately 2, with both rates measured as base pairs per 100 kilobases to facilitate comparison. CONCLUSION: Indel rates exhibit a broadly similar distribution across genomic features compared with SNP density rates, with a reduction in rates in coding transcription and evolutionarily constrained sequence. However, unlike indels, SNP rates do not appear to be reduced in some noncoding functional sequences, such as pseudo-exons, and FAIRE and hypersensitive sites. We conclude that indel rates are greatly reduced in transcribed and evolutionarily constrained DNA, and discuss why indel (but not SNP) rates appear to be constrained at some regulatory sites

LSHTM Research Online

Springer - Publisher Connector

PubMed Central

University of Melbourne Institutional Repository