Clustering exact matches of pairwise sequence alignments by weighted linear regression

Alvaro J González; F Sanger; Li Liao; PA Pevzner; S Kurtz; SF Altschul; TF Smith; WJ Kent; WR Pearson

Publication date: 1 January 2008
Publisher: BioMed Central

Abstract

Background: At intermediate stages of a genome assembly project, once a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when one is available. The goal is not a detailed base-level alignment between a contig and the reference genome, but a rough estimate of where the contig aligns to the reference, specifically the starting and ending positions of the aligned region. This information is useful for ordering the contigs and facilitates post-assembly analysis such as gap closure and repeat resolution. Programs such as BLAST and MUMmer can quickly identify high-similarity segments between two sequences. In a dot plot these segments tend to agglomerate along a diagonal, but they can also be disrupted by gaps or shifted away from the main diagonal because of mismatches between the contig and the reference. Visually inspecting dot plots to identify the regions covered by the large number of contigs in a sequence assembly project is tedious and practically impossible. A forced global alignment between a contig and the reference is not only time-consuming but often meaningless.

Results: We have developed an algorithm that takes the coordinates of all exact matches or high-similarity local alignments, clusters them with respect to the main diagonal of the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest.

Conclusion: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression on the alignment seeds. Experiments showed that the gain in execution time can be substantial without compromising accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
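The general idea of clustering matches around a diagonal with a weighted regression can be illustrated with a minimal sketch. This is not the authors' implementation: the match layout `(ref_start, contig_start, length)`, the choice of match length as the regression weight, the `max_residual` tolerance, and the rule of discarding the single worst match per round are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical match layout: (ref_start, contig_start, length)
Match = Tuple[int, int, int]


def weighted_line_fit(matches: List[Match]) -> Tuple[float, float]:
    """Weighted least-squares fit of contig_start = a + b * ref_start.

    Each match is weighted by its length (an assumption of this sketch).
    Returns (intercept a, slope b).
    """
    w_sum = sum(m[2] for m in matches)
    x_mean = sum(m[2] * m[0] for m in matches) / w_sum
    y_mean = sum(m[2] * m[1] for m in matches) / w_sum
    sxx = sum(m[2] * (m[0] - x_mean) ** 2 for m in matches)
    sxy = sum(m[2] * (m[0] - x_mean) * (m[1] - y_mean) for m in matches)
    # If all matches share one reference position, fall back to slope 1
    # (a line parallel to the main diagonal).
    slope = sxy / sxx if sxx > 0 else 1.0
    return y_mean - slope * x_mean, slope


def cluster_matches(matches: List[Match], max_residual: float = 50.0) -> List[Match]:
    """Refit repeatedly, dropping the worst-fitting match each round,
    until every remaining match lies within max_residual of the line."""
    kept = list(matches)
    while len(kept) > 2:
        a, b = weighted_line_fit(kept)
        residuals = [abs(m[1] - (a + b * m[0])) for m in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] <= max_residual:
            break
        del kept[worst]
    return kept


def aligned_region(matches: List[Match]) -> Tuple[int, int]:
    """Start and end coordinates on the reference covered by the cluster."""
    kept = cluster_matches(matches)
    ref_start = min(m[0] for m in kept)
    ref_end = max(m[0] + m[2] for m in kept)
    return ref_start, ref_end


# Four matches on one diagonal plus a short spurious match far away.
example_matches = [
    (1000, 0, 200),
    (1300, 300, 150),
    (1600, 600, 250),
    (2000, 1000, 300),
    (9000, 500, 20),
]
print(aligned_region(example_matches))
```

In this toy input the short match at reference position 9000 has the largest residual in the first fit, so it is discarded and the remaining four matches define the reported region. Removing one match per round is a simple trimming rule chosen for clarity; a fixed residual cutoff applied to the first fit would fail here because the distant match distorts the initial line.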
