Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs

Altschul; Altschul; Arun S. Konagurthu; Arunachalam; Bandyopadhyay; Bansal; Calabrese; Dehal; Dice; Edgar; Edgar; Flicek; Fukuhara; Geoffrey I. Webb; Gordân; Haas; Hachiya; James C. Whisstock; Jiangning Song; Jun; Khalid Mahmood; Koohy; Koonin; Kriventseva; Kuhn; Kärkkäinen; Li; Mahmood; Needleman; Papadimitriou; Pearson; Pruess; Remm; Sakarya; Sankoff; Santini; Sjolander; Smith; Smith; Sonnhammer; Sorensen; Swidan; Vandepoele; Vinga; Vingron; Widmann; Woolfe; Xu; Yu; Zhi

Publication date: 1 January 2012
Publisher: Oxford University Press

Abstract

Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify the best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (i) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts that rapidly compares all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families, in which complex evolutionary events such as duplications are prevalent, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2
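
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the afree or EGM2 implementation. It assumes that sequences are compared on their sets of distinct k-mers, that similarity is scored with a Dice-style coefficient, and that matching is done greedily one pair per iteration; the function names, the value of k and the scoring formula are all illustrative choices that the abstract does not specify.

```python
from collections import defaultdict
from itertools import groupby


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Count shared k-mers for every pair of sequences via sort-join.

    seqs maps a sequence id to its residue string. Instead of comparing
    all pairs directly, emit (k-mer, id) records, sort them so equal
    k-mers are adjacent, then pair up ids inside each run.
    """
    records = sorted((km, sid) for sid, s in seqs.items() for km in kmers(s, k))
    counts = defaultdict(int)
    for _, run in groupby(records, key=lambda r: r[0]):
        ids = [sid for _, sid in run]  # already sorted, so pairs are canonical
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                counts[(ids[i], ids[j])] += 1
    return dict(counts)


def dice_scores(seqs, k=3):
    """Assumed scoring: 2 * shared / (|kmers(a)| + |kmers(b)|)."""
    sizes = {sid: len(kmers(s, k)) for sid, s in seqs.items()}
    return {
        (a, b): 2 * c / (sizes[a] + sizes[b])
        for (a, b), c in shared_kmer_counts(seqs, k).items()
    }


def greedy_iterative_matches(scores):
    """Simplified stand-in for iterative matching (one-to-one only).

    At each step take the highest-scoring pair whose genes are both
    still unassigned. Unlike the strategy the abstract describes, this
    does not recover co-orthologs.
    """
    assigned = set()
    matches = []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), score in ranked:
        if a not in assigned and b not in assigned:
            matches.append((a, b, score))
            assigned.update((a, b))
    return matches


if __name__ == "__main__":
    toy = {"a": "MKVLA", "b": "MKVLG", "c": "TTTTT"}
    print(dice_scores(toy, k=3))
```

In this sketch the sort step is what avoids scoring sequence pairs that share no k-mer at all: such pairs never appear together in a run, so they never enter the result. Very common k-mers produce long runs and many pairs, so a practical version would probably need to cap or filter them.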