
15 research outputs found

Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

Author: Cannon Charles H.
Deng Minghua
Reinert Gesine
Ren Jie
Song Kai
Sun Fengzhu
Publication venue
Publication date: 03/04/2015

Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Comment: accepted by RECOMB-SEQ 201
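As a rough illustration of the idea in this abstract, the sketch below pools word counts across short reads and turns them into plug-in transition probabilities for a Markov chain of a given order. It is not the estimator derived in the paper; the function names `word_counts` and `transition_probs` and the toy reads are assumptions made for this example only.

```python
from collections import Counter

def word_counts(reads, k):
    """Count every length-k word across all reads (words never span two reads)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def transition_probs(reads, order, alphabet="ACGT"):
    """Plug-in estimate of P(next base | previous `order` bases) from pooled counts."""
    counts = word_counts(reads, order + 1)
    contexts = {word[:-1] for word in counts}
    probs = {}
    for ctx in contexts:
        total = sum(counts[ctx + b] for b in alphabet)
        if total:  # skip contexts followed only by non-alphabet symbols
            probs[ctx] = {b: counts[ctx + b] / total for b in alphabet}
    return probs

# Hypothetical toy reads, used only to exercise the functions.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
probs = transition_probs(reads, order=1)
```

Pooling counts over reads sidesteps assembly, but it ignores the read-sampling variability that the abstract's normal and gamma approximations are meant to capture.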

arXiv.org e-Print Archive

CiteSeerX

Alignment-free sequence comparison with spaced k-mers

Author: Boden Marcus
Horwege Sebastian
Leimeister Chris
Lindner Sebastian
Morgenstern Burkhard
Publication venue: OASIcs - OpenAccess Series in Informatics. German Conference on Bioinformatics 2013
Publication date: 01/01/2013

Alignment-free methods are increasingly used for genome analysis and phylogeny reconstruction since they circumvent various difficulties of traditional approaches that rely on multiple sequence alignments. In particular, they are much faster than alignment-based methods. Most alignment-free approaches work by analyzing the k-mer composition of sequences. In this paper, we propose to use 'spaced k-mers', i.e. patterns of deterministic and 'don't care' positions instead of contiguous k-mers. Using simulated and real-world sequence data, we demonstrate that this approach produces better phylogenetic trees than alignment-free methods that rely on contiguous k-mers. In addition, distances calculated with spaced k-mers appear to be statistically more stable than distances based on contiguous k-mers.
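A minimal sketch of what counting spaced k-mers can look like, assuming the pattern is written as a string in which '1' marks a deterministic (match) position and '0' a don't-care position. The function name `spaced_kmers` and this pattern encoding are illustrative assumptions, not the interface of the authors' software.

```python
from collections import Counter

def spaced_kmers(seq, pattern):
    """Count spaced words: keep bases at '1' positions, skip '0' (don't-care) positions."""
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        counts["".join(window[i] for i in keep)] += 1
    return counts

# Pattern "1101": positions 0, 1 and 3 are compared, position 2 is ignored.
counts = spaced_kmers("ACGTACGT", "1101")
```

With an all-ones pattern the function reduces to ordinary contiguous k-mer counting, so both variants can be compared with the same code.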

Dagstuhl Research Online Publication Server

Ksak: A high-throughput tool for alignment-free phylogenetics

Author: Bozhen Ren
Dongmei Ai
Guohao Xu
Jiemin Xie
Li Charlie Xia
Xudong Liu
Xuemei Liu
Yangxin Chen
Ziqi Cheng
Publication venue: Frontiers Media SA
Publication date: 01/03/2023

Phylogenetic tools are fundamental to the study of evolutionary relationships. In this paper, we present Ksak, a novel high-throughput tool for alignment-free phylogenetic analysis. Ksak computes the pairwise distance matrix between molecular sequences, using seven widely accepted k-mer based distance measures. Based on the distance matrix, Ksak constructs the phylogenetic tree with standard algorithms. When benchmarked with a gold standard 16S rRNA dataset, Ksak was found to be the most accurate tool among all five tools compared and was 19% more accurate than ClustalW2, a high-accuracy multiple sequence aligner. Above all, Ksak was tens to hundreds of times faster than ClustalW2, which helps eliminate the computation limit currently encountered in large-scale multiple sequence alignment. Ksak is freely available at https://github.com/labxscut/ksak

Directory of Open Access Journals

Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns

Author
Publication venue: BioMed Central
Publication date: 10/09/2014

Springer - Publisher Connector

Comparison of metagenomic samples using sequence signatures

Author: Bai Jiang
Fengzhu Sun
Jie Ren
Kai Song
Minghua Deng
Xuegong Zhang
Publication venue: Springer Nature
Publication date: 27/12/2012

BACKGROUND: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied. RESULTS: We studied several dissimilarity measures, including $d_2$, $d_2^*$ and $d_2^S$ recently developed by our group, a measure (hereinafter denoted Hao) used in CVTree, developed by Hao’s group (Qi et al., 2004), measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009), as well as standard $l_p$ measures between the frequency vectors, for the comparison of metagenomic samples using sequence signatures. We compared their performance using a series of extensive simulations and three real next-generation sequencing (NGS) metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure $d_2^S$ can achieve superior performance when comparing metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature. CONCLUSIONS: Sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The $d_2^S$ dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
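For orientation, the simplest of the measures named above, the uncentered d2 dissimilarity, can be sketched as a normalized inner product of k-mer count vectors. The scaling by one half used here is one common normalization and is an assumption of this example. The centered variants d2* and d2S additionally subtract expected counts under a background model and are not shown. The function names are hypothetical.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count all contiguous k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_dissimilarity(x, y):
    """0.5 * (1 - cosine of the two count vectors); 0 means identical direction."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return 0.5 * (1.0 - dot / (norm_x * norm_y))

# Toy sequences, chosen only to show the two extremes of the measure.
same = d2_dissimilarity(kmer_counts("ACGTACGT", 2), kmer_counts("ACGTACGT", 2))
disjoint = d2_dissimilarity(kmer_counts("AAAA", 2), kmer_counts("CCCC", 2))
```

For non-negative count vectors this value lies between 0 (same k-mer profile) and 0.5 (no shared k-mers).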

Springer - Publisher Connector

PubMed Central

Clustering of reads with alignment-free measures and quality values

Author: A Solovyov
Andrea Leoni
B Ewing
BE Blaisdell
CA Albers
D Medini
DR Zerbino
E Bao
E Birney
G Reinert
GE Sims
H Li
J Göke
J Qi
K Song
L Gao
L Wan
M Comin
Matteo Comin
Michele Schimd
MO Carneiro
MR Kantorovitz
Q Dai
R Jothi
RA Lippert
S Vinga
SF Altschul
W Qu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

BACKGROUND: The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads in a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mer counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analyses of NGS data like read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%). RESULTS: In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge this is the first study that incorporates quality value information and k-mer counts, in the context of alignment-free measures, for the comparison of reads data. Based on these principles, in this paper we present a family of alignment-free measures called $D^q$-type. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Also results on de novo assembly and metagenomic reads classification show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in a software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html)
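One way to picture combining quality values with k-mer counts is to let each k-mer occurrence contribute the probability that all of its bases were called correctly, instead of contributing 1. The sketch below does this under the standard Phred relation (error probability 10^(-Q/10)) and assumes Phred+33 ASCII encoding. It is an illustrative assumption about the general idea, not the QCluster implementation, and the function names are hypothetical.

```python
from collections import defaultdict

def phred_to_correct_prob(qchar, offset=33):
    """Probability that a base call is correct, from its Phred+33 quality character."""
    q = ord(qchar) - offset
    return 1.0 - 10 ** (-q / 10)

def weighted_kmer_counts(seq, qual, k):
    """Sum, per k-mer, the probability that every base in each occurrence is correct."""
    counts = defaultdict(float)
    for i in range(len(seq) - k + 1):
        weight = 1.0
        for qchar in qual[i:i + k]:
            weight *= phred_to_correct_prob(qchar)
        counts[seq[i:i + k]] += weight
    return counts

# 'I' encodes Q40 (error 1e-4); '!' encodes Q0 (error 1.0), so k-mers touching it get weight 0.
counts = weighted_kmer_counts("ACGA", "II!I", 2)
```

The resulting weighted counts can be fed into the same inner-product style statistics used for plain k-mer counts.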

Crossref

PubMed Central

Archivio istituzionale della ricerca - Università di Padova

Centroid based clustering of high throughput sequencing reads based on -mer counts

Author
Publication venue: BioMed Central
Publication date

Springer - Publisher Connector

kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity

Author: Borevitz Justin
Murray Kevin
Ong Cheng Soon
Warthmann Norman
Webers Christfried (Chris)
Publication venue: Public Library of Science (PLoS)
Publication date: 23/11/2020

Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or “samples”) in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn using mislabelled, or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly-, and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes and applications include establishing sample identity and detecting mix-up, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip. This project was supported by the Australian Research Council Centre of Excellence in Plant Energy Biology (CE140100008) and by NICTA which was funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The research was undertaken with the assistance of resources from the National Computational Infrastructure (NCI), which is supported by the Australian Government. KDM is supported by an Australian Government Research Training Program (RTP) Scholarship.

The Australian National University

Alignment-free Phylogeny Reconstruction Based On Quartet Trees

Author: Dencker Thomas
Publication venue
Publication date: 04/03/2020

Georg-August-University Göttingen