29,601 research outputs found
De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species but the size of the generated datasets
introduce a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we expand out approach
for de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal suffi-
ciently to accurately assemble reads to big contigs or complete
genomes
Graph theoretic methods for the analysis of structural relationships in biological macromolecules
Subgraph isomorphism and maximum common subgraph isomorphism algorithms from graph theory provide an effective and an efficient way of identifying structural relationships between biological macromolecules. They thus provide a natural complement to the pattern matching algorithms that are used in bioinformatics to identify sequence relationships. Examples are provided of the use of graph theory to analyze proteins for which three-dimensional crystallographic or NMR structures are available, focusing on the use of the Bron-Kerbosch clique detection algorithm to identify common folding motifs and of the Ullmann subgraph isomorphism algorithm to identify patterns of amino acid residues. Our methods are also applicable to other types of biological macromolecule, such as carbohydrate and nucleic acid structures
Spaced seeds improve k-mer-based metagenomic classification
Metagenomics is a powerful approach to study genetic content of environmental
samples that has been strongly promoted by NGS technologies. To cope with
massive data involved in modern metagenomic projects, recent tools [4, 39] rely
on the analysis of k-mers shared between the read to be classified and sampled
reference genomes. Within this general framework, we show in this work that
spaced seeds provide a significant improvement of classification accuracy as
opposed to traditional contiguous k-mers. We support this thesis through a
series a different computational experiments, including simulations of
large-scale metagenomic projects. Scripts and programs used in this study, as
well as supplementary material, are available from
http://github.com/gregorykucherov/spaced-seeds-for-metagenomics.Comment: 23 page
Highly Scalable Algorithms for Robust String Barcoding
String barcoding is a recently introduced technique for genomic-based
identification of microorganisms. In this paper we describe the engineering of
highly scalable algorithms for robust string barcoding. Our methods enable
distinguisher selection based on whole genomic sequences of hundreds of
microorganisms of up to bacterial size on a well-equipped workstation, and can
be easily parallelized to further extend the applicability range to thousands
of bacterial size genomes. Experimental results on both randomly generated and
NCBI genomic data show that whole-genome based selection results in a number of
distinguishers nearly matching the information theoretic lower bounds for the
problem
- …