Münstersches Informations und Archivsystem für Multimediale Inhalte

Domain similarity based orthology detection

Author: Bitard-Feildel T. (Tristan)
Bornberg-Bauer E. (Erich)
Greenwood J.M. (Jenny)
Kemena C. (Carsten)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/05/2015
Field of study

Background: Orthologous protein detection software mostly uses pairwise comparisons of amino-acid sequences to determine whether two proteins are orthologous. Consequently, as the number of sequences to compare increases, the number of comparisons to compute grows quadratically. A current challenge in bioinformatics research, especially given the increasing number of sequenced organisms available, is to make this ever-growing number of comparisons computationally feasible in a reasonable amount of time. We propose to speed up the detection of orthologous proteins by using strings of domains to characterize the proteins. Results: We present two new protein similarity measures, a cosine score and a maximal weight matching score, both based on domain content similarity, and new software named porthoDom. The quality of the cosine and the maximal weight matching similarity measures is compared against curated datasets. The measures show that domain content similarity can correctly group proteins into their families. Accordingly, the cosine similarity measure is used inside porthoDom, the wrapper developed for proteinortho. porthoDom uses domain content similarity measures to group proteins together before searching for orthologs. Because domains are used instead of amino-acid sequences, the reduced search space lowers the computational complexity of an all-against-all sequence comparison. Conclusion: We demonstrate that representing and comparing proteins as strings of discrete domains, i.e. as a concatenation of their unique identifiers, allows a drastic simplification of the search space. porthoDom speeds up orthology detection while maintaining a degree of accuracy similar to proteinortho. porthoDom is implemented in Python and C++ and is available under the GNU GPL version 3 at http://www.bornberglab.org/pages/porthoda.
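
As a rough illustration of the cosine measure on domain content, the sketch below treats each protein as a bag of domain identifiers and compares the resulting count vectors. Several things here are assumptions rather than details from the abstract: the use of raw occurrence counts as vector entries, the function names, and the Pfam-style identifiers, which are placeholders. It is not porthoDom's implementation. The second helper only shows why grouping proteins before the all-against-all step reduces the number of sequence comparisons.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(domains_a, domains_b):
    """Cosine of the angle between two domain-count vectors.

    Assumption: each vector entry is the number of times a domain occurs
    in the protein; the published measure may weight domains differently.
    """
    counts_a, counts_b = Counter(domains_a), Counter(domains_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[d] * counts_b[d] for d in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a protein without annotated domains matches nothing
    return dot / (norm_a * norm_b)


def pairwise_comparisons(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical domain strings (identifiers are placeholders).
protein_x = ["PF00069", "PF07714", "PF00069"]
protein_y = ["PF00069", "PF00017"]
print(round(cosine_similarity(protein_x, protein_y), 3))

# 1000 proteins compared all-against-all versus ten groups of 100.
print(pairwise_comparisons(1000))
print(10 * pairwise_comparisons(100))
```

With these toy numbers, splitting 1000 proteins into ten equal groups cuts the pair count by roughly a factor of ten; real savings depend on how evenly the domain-based grouping partitions the input.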

Springer - Publisher Connector

Münstersches Informations und Archivsystem für Multimediale Inhalte

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

Author: A Löytynoja
B Sipos
BG Hall
BP Blackburne
C Chothia
C Dessimoz
C Kemena
C Notredame
CB Do
CL Strope
DA Dalquen
DA Morrison
DH Mathews
ER Mardis
G Blackshields
G Jordan
G Landan
GP Raghava
I Walle Van
J Kim
J Stoye
JD Thompson
JH Havgaard
JP Huelsenbeck
K Mizuguchi
LA Stebbings
M Anisimova
M Pop
MR Aniba
P Gardner
RA Cartwright
RB Russell
RC Edgar
SA Berger
SF Altschul
T Golubchik
T Koestler
T Lassmann
W Fletcher
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/11/2012

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Alignment accuracy is therefore crucial to a vast range of analyses, often in ways that are difficult to assess within those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose a benchmark according to the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.
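
To make reference-based benchmarking concrete, here is a minimal sketch of a sum-of-pairs style score: the fraction of residue pairs aligned in a reference alignment that a test alignment reproduces. The abstract does not prescribe this particular score; the function names and the toy alignments are illustrative assumptions, and gaps are assumed to be written as "-".

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped residue index),
    so the result does not depend on where gaps are placed.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    next_residue = [0] * len(alignment)
    pairs = set()
    for col in range(n_cols):
        residues = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                residues.append((seq_idx, next_residue[seq_idx]))
                next_residue[seq_idx] += 1
        # Residues are collected in sequence order, so pairs are canonical.
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs also aligned in the test."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy alignments of the same two sequences (ACGT and ACTGT).
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(sum_of_pairs_score(test, reference))
```

A score like this inherits whatever assumptions went into building the reference, which is the kind of dependency the abstract asks benchmark users to keep in mind.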

arXiv.org e-Print Archive

UCL Discovery

Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Author: A Heger
AG Murzin
AR Ortiz
B Bai
C Burges
C Kemena
C Yeats
Christina Leslie
D Grangier
I Melvin
Iain Melvin
J Soding
J Weston
Jason Weston
JD Storey
L Rychlewski
Michael Levitt
R Collobert
R Herbrich
SE Brenner
SF Altschul
SR Eddy
T Jaakkola
T Joachims
T Smith
William Stafford Noble
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2011

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
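
The idea that related proteins sit close together in an embedding space can be sketched as a nearest-neighbour ranking. The snippet below does not learn an embedding and is not ProtEmbed; it assumes the coordinates already exist, uses Euclidean distance as the proximity measure, and the protein names and two-dimensional vectors are made up for illustration.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def rank_by_proximity(query_vector, embedded_targets):
    """Return target names sorted from nearest to farthest from the query."""
    return sorted(
        embedded_targets,
        key=lambda name: dist(query_vector, embedded_targets[name]),
    )


# Hypothetical embedding: names and coordinates are placeholders.
embedding = {
    "protA": (0.10, 0.20),
    "protB": (0.90, 0.80),
    "protC": (0.15, 0.25),
}
query = (0.12, 0.22)
print(rank_by_proximity(query, embedding))
```

In this toy space the two nearby points rank ahead of the distant one, which is the behaviour a database search over an embedding would rely on.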

CiteSeerX

Alignathon: A competitive assessment of whole-genome alignment methods

Author: Beal K
Brudno M
Chang JM
Clawson H
Darling AE
Dubchak I
Earl D
Erb I
Fitzgerald S
Harris RS
Haussler D
Herrero J
Hickey G
Hou M
Kemena C
Kent WJ
Kim J
Ma J
Molodtsov V
Nguyen N
Notredame C
Paten B
Poliakov A
Raney BJ
Seledtsov I
Solovyev V
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 01/01/2014

© 2014 Earl et al. Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and assessments were then performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

OPUS - University of Technology Sydney

eScholarship - University of California

Cystinosin, MPDU1, SWEETs and KDELR Belong to a Well-Defined Protein Family with Putative Function of Cargo Receptors Involved in Vesicle Trafficking

Author: A Biegert
A Dereeper
AJ Stokes
B Gasnier
C Antignac
C Cole
C Kemena
C Notredame
CG Frank
CP Ponting
EN Moriyama
F Dubouloz
FM Townsley
Franca Fraternali
H Kim
H Schachter
I Letunic
J Dancourt
J Helenius
J Pei
J Soding
K Yoshiura
KB Nicholas
KG Hardwick
L Kall
LM Mashburn
LQ Chen
M Anand
M Binda
M Gao
M Podar
MJ Wilmer
ML Taub
RD Finn
S Hunter
S Sanyal
SF Altschul
SR Eddy
T Baroni
T Pomorski
T Rafnar
TJ Pucadyil
V Anantharaman
V Kalatzis
Vladimir Saudek
VW Hsu
WA Gahl
X Yu
XD Gao
Y Zhai
Publication venue: Public Library of Science
Publication date: 01/01/2011

Classification of proteins into families based on remote homology often helps predict their biological function. Here we describe the prediction of protein cargo receptors involved in vesicle formation and protein trafficking. Hidden Markov model profile-to-profile searches in protein databases using endoplasmic reticulum lumen protein retaining receptors (KDEL, Erd2) as query reveal a large and diverse family of proteins with seven transmembrane helices, a common topology and, most likely, a similar function. Their coding genes exist in all eukaryota and in several prokaryota. Some are responsible for metabolic diseases (cystinosis, congenital disorder of glycosylation), others are candidate genes for genetic disorders (cleft lip and palate, certain forms of cancer) or solute uptake and efflux (SWEETs), and many have not yet been assigned a function. Comparison with the properties of KDEL receptors suggests that the family members could be involved in protein trafficking and serve as cargo receptors. This prediction sheds new light on a range of biologically, medically and agronomically important proteins and could open the way to discovering the function of many genes not yet annotated. Experimental testing is suggested.

CiteSeerX

Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Author: A Delcher
A Smit
AC Darling
B Ma
C Kemena
CN Dewey
Darren P. Martin
DR Bentley
E Ohlebusch
EJ Vallender
FP Preparata
G Bejerano
G Bourque
Hachiya Tsuyoshi
I Tabus
JT Simpson
K Liolios
K Mathee
Kris Popendorf
LB Kish
M Blanchette
M Brudno
M Farach
P Pevzner
Pearson
R Rivest
RA Gibbs
RH Waterston
S Quinlan
S Schwartz
SF Altschul
T Hachiya
T Hubbard
TF Smith
W Miller
Y Osana
Yasubumi Sakakibara
Yasunori Osana
Publication venue: Public Library of Science
Publication date: 24/09/2010

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
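
The spaced-seed idea mentioned above can be sketched in a few lines: a pattern of 1s and 0s marks which positions of a window must match, windows are indexed by the bases under the 1s, and a key found in every input sequence is a candidate anchor. This is a simplified illustration under stated assumptions, not Murasaki's algorithm: a Python dictionary stands in for its adaptive hash functions, there is no parallel execution, and the function names, pattern, and sequences are made up.

```python
from collections import defaultdict


def spaced_seed_keys(sequence, pattern):
    """Yield (key, start) for each window, keeping only bases under a '1'."""
    width = len(pattern)
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield "".join(window[i] for i in keep), start


def find_anchors(sequences, pattern):
    """Map each seed key present in every sequence to its start positions."""
    table = defaultdict(lambda: defaultdict(list))
    for seq_idx, sequence in enumerate(sequences):
        for key, start in spaced_seed_keys(sequence, pattern):
            table[key][seq_idx].append(start)
    # Keep only keys that occur in all input sequences.
    return {
        key: {idx: hits[idx] for idx in sorted(hits)}
        for key, hits in table.items()
        if len(hits) == len(sequences)
    }


# The pattern "1101" ignores the third base of each four-base window,
# so ACGT and ACCT fall into the same bucket.
print(find_anchors(["ACGTAA", "GGACCTG"], "1101"))
```

Each sequence is scanned once, so building the table is linear in total input length for a fixed pattern; the cost of enumerating matches within crowded buckets is not addressed here.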

Improving the Alignment Quality of Consistency Based Aligners with an Evaluation Function Using Synonymous Protein Words

Author: AR Panchenko
B Morgenstern
B Rost
C Chothia
C Kemena
C Notredame
CB Do
Cédric Notredame
D Baker
DG Higgins
DT Jones
Eugene A. Permyakov
F Armougom
G Yona
GH Gonnet
H-N Lin
Hsin-Nan Lin
HY Zhou
J Skolnick
J Soding
JD Thompson
Jia-Ming Chang
JM Pei
K Katoh
L Rychlewski
L Wang
LA Kelley
MJ Sternberg
MO Dayhoff
O O'Sullivan
P Hogeweg
R Hagopian
R Sadreyev
RC Edgar
S Henikoff
SF Altschul
T Hara
T Müller
Ting-Yi Sung
U Roshan
VA Simossis
W Kabsch
Wen-Lian Hsu
Y Zhang
Publication venue: Public Library of Science
Publication date: 02/12/2011

Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly for distantly related sequences (<20% identity). In this range of identity, alignments optimized to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Currently available methods differ essentially in how they measure similarity between aligned residues: substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches.