Search CORE

74 research outputs found

A New Rhesus Macaque Assembly and Annotation for Next-Generation Sequencing Analyses

Author: Bosinger Steven E.
Cornish Adam S.
Ferguson Betsy
Fox Howard S.
Gibbs Robert M.
Johnson Zachary P.
Marçais Guillaume
Maudhoo Mnirnal D.
Meehan Daniel T.
Norgren Robert B.
Pandey Sanjit
Roberts Michael
Salzberg Steven L.
Tharp Gregory K.
Treangen Todd
Wipfler Kristin
Yorke James A.
Zhang Xiongfei
Zimin Aleksey V.
Publication venue: DigitalCommons@UNMC
Publication date: 01/01/2014
Field of study

BACKGROUND: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. RESULTS: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies. CONCLUSIONS: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. REVIEWERS: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova

Crossref

Springer - Publisher Connector

PubMed Central

Digital Repository at the University of Maryland

University of Nebraska Medical Center Research: DigitalCommons@UNMC

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Author: A Bowe
A Kirsch
B Chazelle
C Kingsford
C Ye
G Marçais
G Rizk
G Rizk
G Sacomoto
Guillaume Rizk
J Pell
JR Miller
JT Simpson
MG Grabherr
P Peterlongo
P Peterlongo
R Chikhi
R Li
Rayan Chikhi
RL Warren
RM Idury
SL Salzberg
TC Conway
Y Peng
Z Iqbal
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies

Author: Cardeno Charis
Crepeau Marc W
Davis John M
Dean Jeffrey FD
deJong Pieter J
Dougherty William M
Fuentes-Soriano Sara
Gilbert Don
Holt Carson
Holtz-Morris Ann E
Koriabine Maxim
Langley Charles H
Liechty John D
Lin Brian Y
Loopstra Carol A
Lorenz W Walter
Main Doreen
Martínez-García Pedro J
Marçais Guillaume
McGuire Patrick E
Mockaitis Keithanne
Neale David B
Puiu Daniela
Roberts Michael
Salzberg Steven L
Sederoff Ronald
Smith Katherine E
Stevens Kristian A
Vasquez-Gross Hans A
Wegrzyn Jill L
Wheeler Nicholas
Whetten Ross W
Wu Le-Shin
Yandell Mark
Yorke James A
Zieve Jacob J
Zimin Aleksey V
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

BACKGROUND: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. RESULTS: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. CONCLUSIONS: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied

Crossref

Springer - Publisher Connector

OAKTrust Digital Repository (Texas A&M Univ)

PubMed Central

eScholarship - University of California

Digital Repository at the University of Maryland

Digital.CSIC

Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length

Author: Carl Kingsford
Guillaume Marçais
Hongyu Zheng
Publication venue: Mary Ann Liebert Inc
Publication date: 01/04/2021
Field of study

Crossref

Sequence-specific minimizers via polar sets

Author: Carl Kingsford
Guillaume Marçais
Hongyu Zheng
Publication venue: Cold Spring Harbor Laboratory
Publication date: 02/02/2021
Field of study

AbstractMinimizers are efficient methods to samplek-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewerk-mers on a random sequence and use universal hitting sets (sets ofk-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewerk-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets arek-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are athttps://github.com/kingsford-group/polarset.</jats:p

Crossref

Improved design and analysis of practical minimizers

Author: Carl Kingsford
Guillaume Marçais
Hongyu Zheng
Publication venue: Cold Spring Harbor Laboratory
Publication date: 07/02/2020
Field of study

AbstractMotivationMinimizers are methods to sample k-mers from a sequence, with the guarantee that similar set of k-mers will be chosen on similar sequences. It is parameterized by the k-mer length k, a window length w and an order on the k-mers. Minimizers are used in a large number of softwares and pipelines to improve computation efficiency and decrease memory usage. Despite the method’s popularity, many theoretical questions regarding its performance remain open. The core metric for measuring performance of a minimizer is the density, which measures the sparsity of sampled k-mers. The theoretical optimal density for a minimizer is 1/w, provably not achievable in general. For given k and w, little is known about asymptotically optimal minimizers, that is minimizers with density O(1/w).ResultsWe derive a necessary and sufficient condition for existence of asymptotically optimal minimizers. We also provide a randomized algorithm, called the Miniception, to design minimizers with the best theoretical guarantee to date on density in practical scenarios. Constructing and using the Miniception is as easy as constructing and using a random minimizer, which allows the design of efficient minimizers that scale to the values of k and w used in current bioinformatics software programs.AvailabilityReference implementation of the Miniception and the codes for analysis can be found at https://github.com/kingsford-group/[email protected]</jats:sec

Crossref

Asymptotically optimal minimizers schemes

Author: Carl Kingsford
Dan DeBlasio
Guillaume Marçais
Publication venue: Cold Spring Harbor Laboratory
Publication date: 30/01/2018
Field of study

AbstractMotivationThe minimizers technique is a method to sample k-mers that is used in many bioinformatics software to reduce computation, memory usage and run time. The number of applications using minimizers keeps on growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density, and thereby making existing and future bioinformatics tools even more efficient.ResultsFrom the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the 3 type of [email protected]@cs.cmu.edu</jats:sec

Crossref