UCL Discovery

Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Author: A Heger
AG Murzin
AR Ortiz
B Bai
C Burges
C Kemena
C Yeats
Christina Leslie
D Grangier
I Melvin
Iain Melvin
J Soding
J Weston
Jason Weston
JD Storey
L Rychlewski
Michael Levitt
R Collobert
R Herbrich
SE Brenner
SF Altschul
SR Eddy
T Jaakkola
T Joachims
T Smith
William Stafford Noble
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2011

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
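
The abstract states that related proteins end up close together in the learned space. As a rough illustration of how such a space could be queried, the sketch below ranks database entries by Euclidean distance to a query vector. It is not the ProtEmbed implementation: the function name, the three-dimensional coordinates, and the choice of Euclidean distance are all assumptions made for the example, and real coordinates would come from the learning step.

```python
import math


def rank_by_embedding_distance(query_vec, database):
    """Return (name, distance) pairs sorted by Euclidean distance to the query.

    `database` maps a protein name to its embedding vector. Both the names and
    the vectors are hypothetical; a real system would learn the vectors.
    """
    scored = [(name, math.dist(query_vec, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy 3-dimensional embeddings, invented for illustration only.
toy_database = {
    "protein_A": (0.9, 0.1, 0.0),
    "protein_B": (0.0, 1.0, 0.2),
    "protein_C": (0.8, 0.2, 0.1),
}
toy_query = (1.0, 0.0, 0.0)

ranking = rank_by_embedding_distance(toy_query, toy_database)
for name, distance in ranking:
    print(f"{name}\t{distance:.3f}")
```

Entries nearest the query are listed first, which mirrors the idea that proximity in the embedding stands in for evolutionary relatedness.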

CiteSeerX

Alignathon: A competitive assessment of whole-genome alignment methods

Author: Beal K
Brudno M
Chang JM
Clawson H
Darling AE
Dubchak I
Earl D
Erb I
Fitzgerald S
Harris RS
Haussler D
Herrero J
Hickey G
Hou M
Kemena C
Kent WJ
Kim J
Ma J
Molodtsov V
Nguyen N
Notredame C
Paten B
Poliakov A
Raney BJ
Seledtsov I
Solovyev V
Publication venue: Cold Spring Harbor Laboratory
Publication date: 01/01/2014

© 2014 Earl et al. Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

OPUS - University of Technology Sydney

eScholarship - University of California

Murasaki: A Fast, Parallelizable Algorithm to Find Anchors from Multiple Genomes

Author: A Delcher
A Smit
AC Darling
B Ma
C Kemena
CN Dewey
Darren P. Martin
DR Bentley
E Ohlebusch
EJ Vallender
FP Preparata
G Bejerano
G Bourque
Hachiya Tsuyoshi
I Tabus
JT Simpson
K Liolios
K Mathee
Kris Popendorf
LB Kish
M Blanchette
M Brudno
M Farach
P Pevzner
Pearson
R Rivest
RA Gibbs
RH Waterston
S Quinlan
S Schwartz
SF Altschul
T Hachiya
T Hubbard
TF Smith
W Miller
Y Osana
Yasubumi Sakakibara
Yasunori Osana
Publication venue: Public Library of Science
Publication date: 24/09/2010

BACKGROUND: With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow. METHODOLOGY/PRINCIPAL FINDINGS: Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
CONCLUSIONS/SIGNIFICANCE: Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net
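
The abstract mentions mismatch patterns (spaced seeds) hashed to find anchors shared by every genome. The sketch below shows the general idea only, under simplifying assumptions: a pattern such as `1101` marks which window positions must match, the kept bases form the lookup key, and a plain Python dictionary stands in for the adaptive hash functions the abstract describes. The function names and toy sequences are invented for the example and are not taken from Murasaki.

```python
from collections import defaultdict


def spaced_seed_keys(sequence, pattern):
    """Yield (key, start) for every window; the key keeps only bases where pattern has '1'."""
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield "".join(window[i] for i in keep), start


def find_anchor_candidates(sequences, pattern):
    """Group window starts by key and keep only keys that occur in every sequence."""
    table = defaultdict(lambda: defaultdict(list))
    for name, seq in sequences.items():
        for key, start in spaced_seed_keys(seq, pattern):
            table[key][name].append(start)
    return {
        key: dict(hits)
        for key, hits in table.items()
        if len(hits) == len(sequences)
    }


# Toy sequences, invented for illustration; real inputs would be whole genomes.
toy_sequences = {"genome_1": "ACGTACGA", "genome_2": "TTACCTAC"}
candidates = find_anchor_candidates(toy_sequences, "1101")
print(candidates)
```

With the pattern `1101`, the windows `ACGT` and `ACCT` produce the same key because the third position is ignored, so they are reported as a candidate anchor despite the mismatch.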

Cystinosin, MPDU1, SWEETs and KDELR Belong to a Well-Defined Protein Family with Putative Function of Cargo Receptors Involved in Vesicle Trafficking

Author: A Biegert
A Dereeper
AJ Stokes
B Gasnier
C Antignac
C Cole
C Kemena
C Notredame
CG Frank
CP Ponting
EN Moriyama
F Dubouloz
FM Townsley
Franca Fraternali
H Kim
H Schachter
I Letunic
J Dancourt
J Helenius
J Pei
J Soding
K Yoshiura
KB Nicholas
KG Hardwick
L Kall
LM Mashburn
LQ Chen
M Anand
M Binda
M Gao
M Podar
MJ Wilmer
ML Taub
RD Finn
S Hunter
S Sanyal
SF Altschul
SR Eddy
T Baroni
T Pomorski
T Rafnar
TJ Pucadyil
V Anantharaman
V Kalatzis
Vladimir Saudek
VW Hsu
WA Gahl
X Yu
XD Gao
Y Zhai
Publication venue: Public Library of Science
Publication date: 01/01/2011

Classification of proteins into families based on remote homology often helps predict their biological function. Here we describe the prediction of protein cargo receptors involved in vesicle formation and protein trafficking. Hidden Markov model profile-to-profile searches in protein databases, using endoplasmic reticulum lumen protein retaining receptors (KDEL, Erd2) as queries, reveal a large and diverse family of proteins with seven transmembrane helices, a common topology and, most likely, a similar function. Their coding genes exist in all eukaryota and in several prokaryota. Some are responsible for metabolic diseases (cystinosis, congenital disorder of glycosylation); others are candidate genes for genetic disorders (cleft lip and palate, certain forms of cancer) or are involved in solute uptake and efflux (SWEETs); many have not yet been assigned a function. Comparison with the properties of KDEL receptors suggests that the family members could be involved in protein trafficking and serve as cargo receptors. This prediction sheds new light on a range of biologically, medically and agronomically important proteins and could open the way to discovering the function of many genes not yet annotated. Experimental testing is suggested.

CiteSeerX

Improving the Alignment Quality of Consistency Based Aligners with an Evaluation Function Using Synonymous Protein Words

Author: AR Panchenko
B Morgenstern
B Rost
C Chothia
C Kemena
C Notredame
CB Do
Cédric Notredame
D Baker
DG Higgins
DT Jones
Eugene A. Permyakov
F Armougom
G Yona
GH Gonnet
H-N Lin
Hsin-Nan Lin
HY Zhou
J Skolnick
J Soding
JD Thompson
Jia-Ming Chang
JM Pei
K Katoh
L Rychlewski
L Wang
LA Kelley
MJ Sternberg
MO Dayhoff
O O'Sullivan
P Hogeweg
R Hagopian
R Sadreyev
RC Edgar
S Henikoff
SF Altschul
T Hara
T Müller
Ting-Yi Sung
U Roshan
VA Simossis
W Kabsch
Wen-Lian Hsu
Y Zhang
Publication venue: Public Library of Science
Publication date: 02/12/2011

Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in how they measure similarity between aligned residues: using substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches.
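
The "<20% identity" threshold refers to the percentage of identical residues between two aligned sequences. The sketch below computes it for a pair of pre-aligned strings. Several denominator conventions exist; this example assumes identity is counted over columns where neither sequence has a gap, which may differ from the convention used in the paper. The function name and toy sequences are invented for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (e.g. dividing by the alignment length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy alignment, invented for illustration.
identity = percent_identity("MKT-LV", "MRTALV")
print(f"{identity:.1f}% identity")
```

Here five columns are gap-free and four of them match, so the pair sits far above the low-identity range the abstract describes.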

Genome alignment with graph data structures: a comparison

Author: A Bergeron
A Löytynoja
ACE Darling
AE Darling
AL Halpern
B Kehr
B Paten
B Raphael
Birte Kehr
BP Blackburne
C Kemena
C Lee
C Notredame
CB Do
CN Dewey
D Sankoff
DF Feng
DR Zerbino
F Harary
I Dubchak
I Minkin
J Fostier
J Kececioglu
JD Kececioglu
JD Thompson
K Katoh
K Reinert
Kathrin Trappe
Knut Reinert
L Feuk
M Blanchette
M Brudno
M Höhl
MA Alekseyev
Manuel Holtgrewe
N El-Mabrouk
NA Belal
NG de Bruijn
O Gotoh
PA Pevzner
PEC Compeau
RC Edgar
RK Bradley
S Hannenhalli
S Yancopoulos
SK Pham
SV Angiuoli
T Rausch
TF Smith
TH Cormen
V Bafna
Publication venue: Springer Science and Business Media LLC

DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment

Author: A Godzik
A Kloczkowski
AS Konagurthu
C Kemena
D Kihara
DA Morrison
DR Noguera
DT Jones
E Bindewald
Erik S. Wright
ES Wright
F Armougom
F Morcos
F Sievers
G Blackshields
G Jordan
G Tan
GE Crooks
GPS Raghava
H Zhou
I Walle Van
J Garnier
J Jorda
J Pei
JD Thompson
JG Henikoff
JM Hancock
JM Sauder
K Boyce
K Katoh
K Mizuguchi
M Cline
MK Kalita
MR Aniba
MS Breen
MSS Chang
P Katsonis
Q Li
R Core Team
R Kim
R Szklarczyk
RC Edgar
RC Gentleman
RD Finn
S Iantorno
S Mirarab
S Pascarella
SF Altschul
TM Phuong
VA Simossis
W Fletcher
W Kabsch
X Deng
Y Wang
Publication venue: Springer Science and Business Media LLC