Centroid based clustering of high throughput sequencing reads based on n-mer counts

Lipkin, W. Ian; Solovyov, Alexander

Authors: W. Ian Lipkin, Alexander Solovyov
Publication date: 1 January 2013
Publisher: Columbia University Libraries/Information Services

Abstract

Background: Many problems in computational biology require alignment-free sequence comparisons. Sequence clustering is one of the common tasks that involve sequence comparison. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering.

Results: We study several centroid-based algorithms for clustering sequences based on word counts. Their performance shows that the k-means algorithm, with or without data whitening, is computationally efficient. Higher clustering accuracy can be achieved with the soft expectation maximization method, in which each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering, publicly available on GitHub: https://github.com/luscinius/afcluster.

Conclusions: We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting the performance of alignment-free read clustering is the length of the read.
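
As an illustration of the general idea in the abstract (clustering reads by their word counts with a centroid-based method), here is a minimal sketch in plain Python. It is not the afcluster implementation. The function names, the toy reads, the word length n=2, and the choice of initial centroids are all assumptions made for this example, and it covers only hard k-means with Euclidean distance, without data whitening or the soft expectation maximization variant.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(seq, n=2, alphabet="ACGT"):
    """Return the normalized n-mer count vector of seq, in lexicographic word order."""
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[w] for w in words) or 1  # avoid division by zero
    return [counts[w] / total for w in words]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, initial_centroids, iterations=20):
    """Plain Lloyd's k-means (hypothetical helper); returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy reads invented for this sketch: two AT-rich, two GC-rich.
reads = [
    "ATATATATTATAATAT",
    "TATAATATATTATATA",
    "GCGCCGCGGCGCCGGC",
    "CGGCGCGCCGCGGCGC",
]
vectors = [nmer_frequencies(r, n=2) for r in reads]
labels, centroids = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

Each read is reduced to a 16-dimensional vector of 2-mer frequencies, so no alignment between reads is ever computed. Because the toy reads are short and compositionally extreme, they separate cleanly; the abstract's point that read length is the major factor affecting performance suggests that real reads would be harder.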
