Informed and Automated k-Mer Size Selection for Genome Assembly
Genome assembly tools based on the de Bruijn graph framework rely on a
parameter k, which represents a trade-off between several competing effects
that are difficult to quantify. There is currently a lack of tools that would
automatically estimate the best k to use and/or quickly generate histograms of
k-mer abundances that would allow the user to make an informed decision.
We develop a fast and accurate sampling method that constructs approximate
abundance histograms with a several orders of magnitude performance improvement
over traditional methods. We then present a fast heuristic that uses the
generated abundance histograms for putative k values to estimate the best
possible value of k. We test the effectiveness of our tool using diverse
sequencing datasets and find that its choice of k leads to some of the best
assemblies.
Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/
Comment: HiTSeq 201
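The selection idea can be illustrated with a toy sketch: build a k-mer abundance histogram for each candidate k, discard the low-abundance peak (likely sequencing errors), and keep the k that maximizes the estimated number of distinct genomic k-mers. This is a simplified stand-in for KmerGenie's actual sampling and model-fitting heuristic; the function names and the fixed `error_cutoff` are illustrative assumptions, not taken from the tool.

```python
from collections import Counter

def abundance_histogram(reads, k):
    """histogram[a] = number of distinct k-mers observed exactly a times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(Counter(counts.values()))

def genomic_kmers(hist, error_cutoff=2):
    """Estimate distinct 'genomic' k-mers by ignoring low-abundance
    (likely erroneous) k-mers below the cutoff."""
    return sum(n for abundance, n in hist.items() if abundance >= error_cutoff)

def best_k(reads, k_values, error_cutoff=2):
    """Pick the candidate k whose histogram maximizes the estimated
    number of distinct genomic k-mers."""
    return max(k_values,
               key=lambda k: genomic_kmers(abundance_histogram(reads, k),
                                           error_cutoff))
```

In practice the histogram would be built by sampling rather than exhaustive counting, which is where the speedup described in the abstract comes from.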
Cultivar-specific transcriptome prediction and annotation in Ficus carica L.
The availability of transcriptomic sequence data is a key step for functional genomics studies. Recently, a repertoire of predicted genes of a Japanese cultivar of fig (Ficus carica L.) was released. Because of the great phenotypic variability found in this species, we decided to study another fig genotype, the Italian cv. Dottato, in order to perform comparative studies between the two cultivars and extend the pan-genome of this species. We isolated, sequenced and assembled fig genomic DNA from young fruits of cv. Dottato. Then, putative gene sequences were predicted and annotated. Finally, a comparison was performed between the predicted transcriptomes of cvs. Dottato and Horaishi. Our data provide a resource (available at the Sequence Read Archive database under SRP109082) to be used for functional genomics of fig, in order to fill the gap in knowledge still existing in this species concerning plant development, defense and adaptation to the environment.
Gerbil: A Fast and Memory-Efficient k-mer Counter with GPU-Support
A basic task in bioinformatics is the counting of k-mers in genome strings.
The k-mer counting problem is to build a histogram of all substrings of
length k in a given genome sequence. We present the open source k-mer
counting software Gerbil, which has been designed for the efficient counting
of k-mers for large k. Given the technology trend towards long reads from
next-generation sequencers, support for large k becomes increasingly
important. While existing k-mer counting tools suffer from excessive memory
consumption or degrading performance for large k, Gerbil is able to support
large k efficiently without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files that are stored on a working disk. In a
second step, the temporary files are read again, split into k-mers and
counted via a hash table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large k, we outperform state-of-the-art
open source k-mer counting tools on large genome data sets.
Comment: A short version of this paper will appear in the proceedings of WABI 201
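The two-disk pipeline described above can be sketched in a few lines: phase one streams the reads and routes each k-mer to one of several temporary files by hash; phase two counts each file independently with an in-memory hash table, so no bin ever has to fit the full k-mer set. This is a toy single-threaded sketch of the general partition-then-count pattern, not Gerbil's implementation; all names and the bin count are illustrative.

```python
import os
import tempfile
from collections import Counter

def count_kmers_two_phase(reads, k, n_bins=4):
    """Two-phase, disk-partitioned k-mer counting (partition first,
    then count each small bin independently)."""
    tmpdir = tempfile.mkdtemp()
    bins = [open(os.path.join(tmpdir, f"bin{i}.txt"), "w")
            for i in range(n_bins)]
    # Phase 1: stream reads, route each k-mer to a bin file by hash.
    # Identical k-mers always land in the same bin.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            bins[hash(kmer) % n_bins].write(kmer + "\n")
    for f in bins:
        f.close()
    # Phase 2: count each bin with an in-memory hash table; since bins
    # never mix, each k-mer's total count lives entirely in one bin.
    counts = Counter()
    for i in range(n_bins):
        with open(os.path.join(tmpdir, f"bin{i}.txt")) as f:
            counts.update(line.strip() for line in f)
    return counts
```

Note that Python's built-in `hash` for strings is randomized per process, which is harmless here because routing only needs to be consistent within one run.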
An insight into structure and composition of the fig genome
Ficus carica L. is a diploid species, with a genome size of 0.36 pg/2C, still poorly characterized at the genetic and genomic level. With the aim of analysing the fig genome structure, we used Illumina technology to produce 25.64 genome equivalents of 35-511 nt long MiSeq sequences and 12.96 genome equivalents of 25-100 nt long HiSeq paired-end reads. The two libraries were first assembled separately, then a hybrid assembly was performed; finally, contigs and supercontigs were scaffolded. This first rough assembly is composed of 264,088 scaffolds, up to 41,760 nt in length, covering 323,708,138 nt, which corresponds to 87.5% of the fig genome, with N50 = 2,523. Masking the scaffolds with a transcriptome of Rosaceae, from which sequences related to repetitive elements had been removed, allowed us to establish that coding genes account for at least 6.8% of the fig genome. Gene prediction analysis produced 44,419 putative genes. A sample of around 5,000 predicted genes was annotated with regard to gene ontology and function. Concerning the repetitive component, the fig genome turned out to be composed of 58.3% repeated sequences, none of which was especially redundant. Among the identified repeats, the most represented were LTR-retrotransposons, with Gypsy elements more frequent than Copia.
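The N50 statistic quoted above follows the standard definition: sort the scaffold lengths in decreasing order and report the length at which the running total first reaches half of the total assembly size. A minimal implementation:

```python
def n50(lengths):
    """N50: the largest length L such that scaffolds of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

For example, `n50([100, 50, 50, 10])` is 50: the 100 nt scaffold covers less than half of the 210 nt total, and adding the first 50 nt scaffold crosses the halfway mark.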
Hybrid genome assembly and annotation of Danionella translucida
Studying neuronal circuits at cellular resolution is very challenging in vertebrates due to the size and optical turbidity of their brains. Danionella translucida, a close relative of zebrafish, was recently introduced as a model organism for investigating neural network interactions in adult individuals. Danionella remains transparent throughout its life, has the smallest known vertebrate brain and possesses a rich repertoire of complex behaviours. Here we sequenced, assembled and annotated the Danionella translucida genome employing a hybrid Illumina/Nanopore read library as well as RNA-seq of embryonic, larval and adult mRNA. We achieved high assembly continuity using low-coverage long-read data and annotated a large fraction of the transcriptome. This dataset will pave the way for molecular research and targeted genetic manipulation of this novel model organism.
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substring
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed efficiently, in
small space in addition to the input, using just a single data structure built
on the Burrows-Wheeler transform of the input strings. The same holds for a
number of measures of compositional complexity based on multiple values of k,
like the k-mer profile and the k-th order empirical entropy, and for
calibrating the value of k using the data
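As a point of reference for the k-mer kernel mentioned above, here is the naive baseline the paper improves upon: build the k-mer profile of each string explicitly and take the inner product of the two count vectors. The cosine normalization and the function names are illustrative choices; this sketch deliberately ignores the paper's space-efficient BWT-based machinery.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k):
    """The k-mer profile: a sparse vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_kernel(s, t, k):
    """k-mer (spectrum) kernel: inner product of the two k-mer profiles,
    cosine-normalized here so that identical sequences score 1.0."""
    p, q = kmer_profile(s, k), kmer_profile(t, k)
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm = (sqrt(sum(c * c for c in p.values()))
            * sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0
```

The explicit profiles take space proportional to the number of distinct k-mers, which is exactly the cost the paper's BWT-based framework avoids.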
Deconvolute individual genomes from metagenome sequences through short read clustering.
Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.
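The core idea of k-mer-overlap read clustering can be sketched with a union-find structure: reads that share at least one k-mer are merged into the same cluster. This toy version omits everything that makes SpaRC practical (Apache Spark parallelism, abundance filters, and the multi-sample statistics introduced here); all names are illustrative.

```python
from collections import defaultdict

def cluster_reads(reads, k):
    """Group reads that share at least one k-mer, via union-find.
    Returns a list of clusters, each a list of read indices."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Index which reads contain each k-mer, then merge co-occurring reads.
    occurrences = defaultdict(list)
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            occurrences[read[i:i + k]].append(idx)
    for members in occurrences.values():
        for other in members[1:]:
            union(members[0], other)

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())
```

Sharing a single k-mer is a deliberately loose criterion, which is precisely why real tools add filters against the over-clustering caused by erroneous or repeat-derived k-mers.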