25 research outputs found

Comparison of the Equine Reference Sequence with Its Sanger Source Data and New Illumina Reads

Author: Coleman Stephen J.
Hestand Matthew S.
Kalbfleisch Ted
MacLeod James N.
Orlando Ludovic
Rebolledo-Mendez Jovan
Zeng Zheng
Publication venue: UKnowledge
Publication date: 01/01/2015

The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight's half brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference has accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions, has become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects' and Twilight's genome or due to errors in the reference. EquCab2 is regarded as "The Twilight Assembly." The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset comprised of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo's BAC end sequences.
Our analysis identifies 720,843 homozygous discrepancies between new, high-throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low-coverage alignments.
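
The abstract mentions a BED annotation file listing regions whose consensus likely came from low-coverage alignments. As a rough illustration of how such intervals could be derived from per-base read depth, here is a minimal sketch. It is not the authors' pipeline: the function name `low_coverage_intervals`, the `min_depth` threshold, and the toy depth values are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def low_coverage_intervals(
    chrom: str, depths: Iterable[int], min_depth: int = 2
) -> List[Tuple[str, int, int]]:
    """Return BED-style (chrom, start, end) runs where depth < min_depth.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    `min_depth` is a hypothetical threshold, not a value from the study.
    """
    intervals: List[Tuple[str, int, int]] = []
    start = None  # start of the current low-coverage run, if any
    pos = -1      # stays -1 when `depths` is empty
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # open a new run
        elif start is not None:
            intervals.append((chrom, start, pos))  # close the run
            start = None
    if start is not None:
        # The run extends to the end of the sequence.
        intervals.append((chrom, start, pos + 1))
    return intervals


# Toy per-base depths for a hypothetical 10 bp contig.
toy_depths = [5, 4, 1, 0, 1, 6, 7, 0, 0, 1]
for chrom, start, end in low_coverage_intervals("chr1", toy_depths):
    print(f"{chrom}\t{start}\t{end}")
```

In practice the per-base depths would come from a tool that reads the alignment file; the sketch only shows the interval-merging step.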

Crossref

Directory of Open Access Journals

PubMed Central

Copenhagen University Research Information System

University of Kentucky

Comparison of the equine reference sequence with its Sanger source data and new Illumina reads

Author: Coleman Stephen J.
Hestand Matthew S.
Kalbfleisch Ted
MacLeod James N.
Orlando Ludovic Antoine Alexandre
Rebolledo-Mendez Jovan
Zeng Zheng
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2015

The reference assembly for the domestic horse, EquCab2, published in 2009, was built using approximately 30 million Sanger reads from a Thoroughbred mare named Twilight. Contiguity in the assembly was facilitated using nearly 315 thousand BAC end sequences from Twilight's half brother Bravo. Since then, it has served as the foundation for many genome-wide analyses that include not only the modern horse, but ancient horses and other equid species as well. As data mapped to this reference has accumulated, consistent variation between mapped datasets and the reference, in terms of regions with no read coverage, single nucleotide variants, and small insertions/deletions, has become apparent. In many cases, it is not clear whether these differences are the result of true sequence variation between the research subjects' and Twilight's genome or due to errors in the reference. EquCab2 is regarded as "The Twilight Assembly." The objective of this study was to identify inconsistencies between the EquCab2 assembly and the source Twilight Sanger data used to build it. To that end, the original Sanger and BAC end reads have been mapped back to this equine reference and assessed with the addition of approximately 40X coverage of new Illumina Paired-End sequence data. The resulting mapped datasets identify those regions with low Sanger read coverage, as well as variation in genomic content that is not consistent with either the original Twilight Sanger data or the new genomic sequence data generated from Twilight on the Illumina platform. As the haploid EquCab2 reference assembly was created using Sanger reads derived largely from a single individual, the vast majority of variation detected in a mapped dataset comprised of those same Sanger reads should be heterozygous. In contrast, homozygous variations would represent either errors in the reference or contributions from Bravo's BAC end sequences.
Our analysis identifies 720,843 homozygous discrepancies between new, high-throughput genomic sequence data generated for Twilight and the EquCab2 reference assembly. Most of these represent errors in the assembly, while approximately 10,000 are demonstrated to be contributions from another horse. Other results are presented that include the binary alignment map file of the mapped Sanger reads, a list of variants identified as discrepancies between the source data and resulting reference, and a BED annotation file that lists the regions of the genome whose consensus was likely derived from low-coverage alignments.

Directory of Open Access Journals

Copenhagen University Research Information System

A Revamped Rat Reference Genome Improves the Discovery of Genetic Diversity in Laboratory Rats

The seventh iteration of the reference genome assembly for Rattus norvegicus, mRatBN7.2, corrects numerous misplaced segments, reduces base-level errors by approximately 9-fold, and increases contiguity by 290-fold compared with its predecessor. Gene annotations are now more complete, improving the mapping precision of genomic, transcriptomic, and proteomic datasets. We jointly analyzed 163 short-read whole-genome sequencing datasets representing 120 laboratory rat strains and substrains using mRatBN7.2. We defined ∼20.0 million sequence variations, of which 18,700 are predicted to affect the function of 6,677 genes. We also generated a new rat genetic map from 1,893 heterogeneous stock rats and annotated transcription start sites and alternative polyadenylation sites. The mRatBN7.2 assembly, along with the extensive analysis of genomic variations among rat strains, enhances our understanding of the rat genome, providing researchers with an expanded resource for studies involving rats.

DigitalCommons@The Texas Medical Center

Variant Call Format Records.

Author: James N. MacLeod (3286986)
Jovan Rebolledo-Mendez (5664052)
Ludovic Orlando (204746)
Matthew S. Hestand (439188)
Stephen J. Coleman (439187)
Ted Kalbfleisch (3386948)
Zheng Zeng (383660)

An example of three variant call format records. The first is a called heterozygote with an allele depth of 11: 6 reads containing the reference allele and 5 reads containing the non-reference allele. The second is a called homozygote in the current analysis with an allele depth of 4: three high-quality reads containing the non-reference allele and one low-quality read containing the reference allele. The third is a called homozygote with an allele depth of 5, where all 5 reads contained the non-reference allele. Images of the reads used to derive the VCF records are shown within the IGV Browser. Base calls within the reads that agree with the reference are not rendered.
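
To make the three cases concrete, the following sketch parses the genotype (`GT`) and per-allele depth (`AD`) fields of single-sample VCF lines and labels each call. The three records are hypothetical stand-ins built to match the read counts in the caption; their positions, alleles, and quality values are invented, and `classify_record` is an illustrative helper, not part of any VCF library. It also assumes a simple `GT:AD` FORMAT column with one sample.

```python
def classify_record(vcf_line: str) -> dict:
    """Classify one single-sample VCF data line using its GT and AD fields."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    # Column 9 (index 8) names the per-sample keys; column 10 holds the values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    alleles = sample["GT"].replace("|", "/").split("/")
    allele_depths = [int(x) for x in sample["AD"].split(",")]
    ref_reads = allele_depths[0]
    alt_reads = sum(allele_depths[1:])

    if "." in alleles:
        call = "no call"
    elif len(set(alleles)) > 1:
        call = "heterozygous"
    elif alleles[0] == "0":
        call = "homozygous reference"
    else:
        call = "homozygous non-reference"

    return {
        "chrom": chrom,
        "pos": pos,
        "call": call,
        "ref_reads": ref_reads,
        "alt_reads": alt_reads,
        "depth": ref_reads + alt_reads,
    }


# Hypothetical records mirroring the caption: 6 ref / 5 alt, 1 ref / 3 alt, 0 ref / 5 alt.
example_records = [
    "chr1\t1000\t.\tA\tG\t50\t.\t.\tGT:AD\t0/1:6,5",
    "chr1\t2000\t.\tC\tT\t50\t.\t.\tGT:AD\t1/1:1,3",
    "chr1\t3000\t.\tG\tA\t50\t.\t.\tGT:AD\t1/1:0,5",
]
for line in example_records:
    print(classify_record(line))
```

Note that a real caller may exclude low-quality reads from the reported depths, so the second toy record, which counts the single low-quality reference read, is a simplification.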

The Francis Crick Institute

DOI for the annotation data produced in this work.

Author: James N. MacLeod (3286986)
Jovan Rebolledo-Mendez (5664052)
Ludovic Orlando (204746)
Matthew S. Hestand (439188)
Stephen J. Coleman (439187)
Ted Kalbfleisch (3386948)
Zheng Zeng (383660)

The Francis Crick Institute

Variation identified in mapped Twilight Sanger data set.

Author: James N. MacLeod (3286986)
Jovan Rebolledo-Mendez (5664052)
Ludovic Orlando (204746)
Matthew S. Hestand (439188)
Stephen J. Coleman (439187)
Ted Kalbfleisch (3386948)
Zheng Zeng (383660)

The Francis Crick Institute

Incorrect Base Assignments.

Author: James N. MacLeod (3286986)
Jovan Rebolledo-Mendez (5664052)
Ludovic Orlando (204746)
Matthew S. Hestand (439188)
Stephen J. Coleman (439187)
Ted Kalbfleisch (3386948)
Zheng Zeng (383660)

An example of incorrect base assignments. A) Variants (red shading) called as homozygous differences (turquoise shading) by the UnifiedGenotyper in both the Illumina and Sanger datasets are shown. B) A single low-quality Sanger read was used as the basis for the consensus sequence in this region. The UnifiedGenotyper, however, ignores this read because of its low phred quality scores in the region. The phred-based quality scores are indicated for the 4 miscalled bases. The corresponding NCBI Trace Archive Trace Name and TI# are G836P5757RI23 and 1325049864, respectively, with bases 594–630 as the region of interest. This region may be viewed in IGV at http://dx.doi.org/10.13013/J6VD6WCM. The link downloads a JNLP file that launches IGV via Web Start, centered on the region of interest. Bases within the reads that agree with the reference are not rendered.
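
For readers unfamiliar with phred scores, a score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The sketch below shows that conversion and a simple quality filter of the kind a genotyper might apply. The helper names, the toy read, and the `min_quality` value of 20 are assumptions for illustration; they are not the UnifiedGenotyper's actual settings or the scores of the read in the figure.

```python
from typing import List, Tuple


def phred_to_error_prob(q: float) -> float:
    """Convert a phred quality score to its estimated error probability."""
    return 10 ** (-q / 10)


def confident_bases(
    bases: str, quals: List[int], min_quality: int = 20
) -> List[Tuple[int, str]]:
    """Return (index, base) pairs whose quality meets a hypothetical threshold."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    return [(i, b) for i, (b, q) in enumerate(zip(bases, quals)) if q >= min_quality]


# Toy read: the low-scoring positions would be ignored by the filter.
toy_bases = "ACGTAC"
toy_quals = [35, 8, 40, 6, 30, 9]
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
print(confident_bases(toy_bases, toy_quals))
```

Under this kind of filter, a consensus built from a single low-quality read would find no support among the bases a genotyper actually considers, which is the situation the caption describes.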

The Francis Crick Institute