
Considering scores between unrelated proteins in the search database improves profile comparison

Author: AA Schaffer
DT Jones
G Yona
J Soding
L Rychlewski
M Frenkel-Morgenstern
M Madera
Nick V Grishin
R Sadreyev
RI Sadreyev
Ruslan I Sadreyev
S Karlin
S Pietrokovski
S Shi
SF Altschul
SF Altschul
Y Qi
Y Wang
Y Zhang
YK Yu
Yong Wang
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject). Results: Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all-known-non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect. Conclusion: The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.

Automatically extracting functionally equivalent proteins from SwissProt

Author: A Amores
A Meyer
A Wagner
AA Akindahunsi
Andrew CR Martin
CH Wu
E Kretschmann
EJ Stellwag
EV Koonin
F Chen
GX Yu
II Artamonova
JM Hurst
KP O'Brien
LB Koski
Lisa EM McMillan
MC Lill
MY Galperin
RA Notebaart
RL Tatusov
RL Tatusov
S Shibata
SB Rice
SF Altschul
T Hulsen
T Hulsen
V Kunin
V van Noort
WM Fitch
Y Lee
Y Yaron
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot to enable groups of proteins already annotated as functionally equivalent to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality, and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation. Most of these would have presented equal difficulties for manual interpretation of annotations. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format that would facilitate text-mining of UniProtKB/Swiss-Prot.

UCL Discovery

Public Library of Science (PLOS)

Enlighten

Genome Trees from Conservation Profiles

Author: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al.
Daubin V, Gouy M, Perriere G
Edouard Yeramian
Fredj Tekaia
Lawrence JG, Hendrickson H
Makarova KS, Wolf YI, Koonin EV
Philip Bourne
Publication venue: Public Library of Science
Publication date: 01/01/2005

The concept of the genome tree depends on the potential evolutionary significance in the clustering of species according to similarities in the gene content of their genomes. In this respect, genome trees have often been identified with species trees. With the rapid expansion of genome sequence data it becomes of increasing importance to develop accurate methods for grasping global trends for the phylogenetic signals that mutually link the various genomes. We therefore derive here the methodological concept of genome trees based on protein conservation profiles in multiple species. The basic idea in this derivation is that the multi-component “presence-absence” protein conservation profiles permit tracking of common evolutionary histories of genes across multiple genomes. We show that a significant reduction in informational redundancy is achieved by considering only the subset of distinct conservation profiles. Beyond these basic ideas, we point out various pitfalls and limitations associated with the data handling, paving the way for further improvements. As an illustration for the methods, we analyze a genome tree based on the above principles, along with a series of other trees derived from the same data and based on pair-wise comparisons (ancestral duplication-conservation and shared orthologs). In all trees we observe a sharp discrimination between the three primary domains of life: Bacteria, Archaea, and Eukarya. The new genome tree, based on conservation profiles, displays a significant correspondence with classically recognized taxonomical groupings, along with a series of departures from such conventional clusterings
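
The reduction to distinct presence-absence profiles described in this abstract can be illustrated with a minimal sketch. The protein names, the three-genome layout, and the Jaccard-style distance are assumptions made for illustration only; the abstract does not specify which distance the authors use to build their tree.

```python
from collections import Counter

def distinct_profiles(profiles):
    """Map each distinct presence-absence profile to the number of proteins sharing it."""
    return Counter(tuple(p) for p in profiles.values())

def genome_distance(distinct, i, j):
    """Jaccard distance between genomes i and j, computed over distinct profiles only.

    This distance is a hypothetical stand-in, not the measure from the paper.
    """
    both = sum(1 for p in distinct if p[i] and p[j])
    either = sum(1 for p in distinct if p[i] or p[j])
    return 1.0 - both / either if either else 0.0

# Hypothetical data: each protein is marked present (1) or absent (0) in three genomes.
profiles = {
    "protA": (1, 1, 0),
    "protB": (1, 1, 0),  # same profile as protA, so it adds no new information
    "protC": (1, 0, 1),
    "protD": (0, 1, 1),
}

distinct = distinct_profiles(profiles)
print(len(profiles), "proteins collapse to", len(distinct), "distinct profiles")
print("distance(genome 0, genome 1) =", round(genome_distance(distinct, 0, 1), 3))
```

Collapsing identical profiles before computing distances is what removes the redundancy: proteins with the same presence-absence pattern are counted once rather than once per protein.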

CiteSeerX

Public Library of Science (PLOS)

A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Author: A Krogh
A Marchler-Bauer
A Milosavljević
A Pertsemlidis
AA Schäffer
AY Mitrophanov
BJ Webb
Burkhard Rost
C Barrett
C Webber
D Drasdo
D Metzler
D Siegmund
DJC MacKay
EJ Gumbel
EP Nawrocki
ET Jaynes
I Letunic
J Park
JD Storey
JF Lawless
JS Liu
K Karplus
K Karplus
K Sjölander
M Madera
MG Kann
MQ Zhang
MS Waterman
N Chia
P Bucher
R Bundschuh
R Durbin
R Mott
R Mott
R Mott
R Olsen
RC Edgar
RD Finn
S Johnson
S Karlin
S Karlin
S Miyazawa
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
SR Eddy
TF Smith
WR Pearson
Y-K Yu
Y-K Yu
Y-K Yu
Y-K Yu
Publication venue: Public Library of Science
Publication date: 01/05/2008

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments
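
As a rough illustration of how a fixed λ = log 2 simplifies significance estimation, the sketch below turns bit scores into P-values and E-values. The location parameters `mu` and `tau`, the function names, and the example numbers are assumptions for illustration; in practice the location parameters would have to be fitted per model, and this is not code from the paper or from any HMMER release.

```python
import math

LAMBDA = math.log(2)  # slope conjectured in the abstract for bit scores

def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S >= score) for a Gumbel distribution with location mu and slope lam."""
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-lam * (score - mu)))

def exponential_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S >= score) for an exponential tail starting at tau (capped at 1 below tau)."""
    return min(1.0, math.exp(-lam * (score - tau)))

def evalue(pvalue, n_targets):
    """Expected number of hits at least this good in a search of n_targets sequences."""
    return pvalue * n_targets

# Hypothetical numbers: a 25-bit score, location parameters of 5 bits, 1,000,000 targets.
p_viterbi = gumbel_pvalue(25.0, mu=5.0)
p_forward = exponential_tail_pvalue(25.0, tau=5.0)
print("Viterbi E-value:", evalue(p_viterbi, 1_000_000))
print("Forward E-value:", evalue(p_forward, 1_000_000))
```

With λ fixed at log 2, each additional bit of score halves the tail probability, so only the location parameter remains to be determined for each model.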

Public Library of Science (PLOS)

Accelerated Profile HMM Searches

Author: A Jacob
A Krogh
A Milosavljević
A Wozniak
AA Schäffer
B Rekapalli
C Camacho
DR Horn
EK Freyhult
EM Gertz
G Chukkapalli
GA Price
J Landman
JP Walters
JP Walters
K Karplus
LR Rabiner
LS Johnson
M Farrar
M Madera
R Durbin
RD Finn
RP Maddimsetty
S Derrien
S Hunter
S Johnson
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SJ Melnikoff
SR Eddy
T Oliver
T Rognes
T Rognes
TF Smith
V Chaudhary
V Sachdeva
William R. Pearson
WN Grundy
WR Pearson
Y Sun
Y Sun
YK Yu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches

CiteSeerX

Colour break in reverse bicolour daffodils is associated with the presence of Narcissus mosaic virus

Author: A Chomič
AA Brunt
AA Brunt
AG Plakidas
AR Rees
AR Rees
CJ Asjes
CY Wan
DA Hunter
Donald A Hunter
EL Dekker
Huaibi Zhang
J Chen
J Hammond
JA Lesnaw
John D Fletcher
Kevin M Davies
LI Ward
RAA van der Vlugt
RT McMillan Jr
SF Altschul
Turner
VR Clark
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Daffodils (Narcissus pseudonarcissus) are one of the world's most popular ornamentals. They also provide a scientific model for studying the carotenoid pigments responsible for their yellow and orange flower colours. In reverse bicolour daffodils, the yellow flower trumpet fades to white with age. The flowers of this type of daffodil are particularly prone to colour break whereby, upon opening, the yellow colour of the perianth is observed to be 'broken' into patches of white. This colour break symptom is characteristic of potyviral infections in other ornamentals such as tulips whose colour break is due to alterations in the presence of anthocyanins. However, reverse bicolour flowers displaying colour break show no other virus-like symptoms such as leaf mottling or plant stunting, leading some to argue that the carotenoid-based colour breaking in reverse bicolour flowers may not be caused by virus infection. Results: Although potyviruses have been reported to cause colour break in other flower species, enzyme-linked-immunoassays with an antibody specific to the potyviral family showed that potyviruses were not responsible for the occurrence of colour break in reverse bicolour daffodils. Colour break in this type of daffodil was clearly associated with the presence of large quantities of rod-shaped viral particles of lengths 502-580 nm in tepals. Sap from flowers displaying colour break caused red necrotic lesions on Gomphrena globosa, suggesting the presence of potexvirus. Red necrotic lesions were not observed in this indicator plant when sap from reverse bicolour flowers not showing colour break was used. The reverse transcriptase polymerase reactions using degenerate primers to carla-, potex- and poty-viruses linked viral RNA with colour break and sequencing of the amplified products indicated that the potexvirus Narcissus mosaic virus was the predominant virus associated with the occurrence of the colour break.
Conclusions: High viral counts were associated with the reverse bicolour daffodil flowers that were displaying colour break but otherwise showed no other symptoms of infection. Narcissus mosaic virus was the virus that was clearly linked to the carotenoid-based colour break.

The Phyre2 web portal for protein modeling, prediction and analysis

Author: A González-Pérez
A Lobley
A Marchler-Bauer
A Roy
AA Canutescu
BR Jefferys
C Mao
Christopher M Yates
CM Yates
CT Porter
DT Jones
DT Jones
EV Koonin
G Fucile
IA Adzhubei
IW Davis
J Moult
J Söding
JA Capra
JJ Ward
K Arnold
LA Kelley
Lawrence A Kelley
M Higurashi
M Källberg
M Remmert
Mark N Wass
Michael J E Sternberg
MN Wass
N Siew
Ngak-Leng Sim
P Rotkiewicz
P Schmidtke
R Arjun
S Raman
SF Altschul
Stefans Mezulis
TE Lewis
X Wei
Publication venue: Springer
Publication date: 01/05/2015

Phyre2 is a suite of tools available on the web to predict and analyze protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a paper in Nature Protocols. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a user's protein sequence. Users are guided through results by a simple interface at a level of detail they determine. This protocol will guide users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit large number of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 min and 2 h after submission

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Kent Academic Repository

Spiral - Imperial College Digital Repository

Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

Author: AA Schäffer
AL Delcher
Alejandro A Schäffer
B Brejová
B Hao
BG Barrell
DJ States
E Birney
E Birney
E Boy-Marcotte
E Boy-Marcotte
E Halperin
E Michael Gertz
EM Gertz
F Damak
F Zinoni
G Macino
H Peltola
IG Young
J Hein
J Hein
JC Wootton
L Knecht
M Gribskov
MS Boguski
MS Boguski
MS Gelfand
O Gotoh
P Steneberg
P Steneberg
R Durbin
Richa Agarwala
S Henikoff
S Kurtz
SA Chervitz
SC Low
SF Altschul
SF Altschul
SF Altschul
SF Altschul
Stephen F Altschul
TF Smith
W Gish
WJ Kent
WR Pearson
WR Pearson
WR Pearson
X Guan
X Huang
Yi-Kuo Yu
YK Yu
YK Yu
Z Zhang
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms
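
The six-frame translation that the abstract mentions can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names and the frame labels (`+1` to `-3`) are assumptions, and the sketch does not reproduce TBLASTN's actual implementation or its composition-based statistics.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Translate a nucleotide sequence in three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

A protein query would then be aligned against each of the six translated frames; stop codons appear as `*` in the output.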