MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMMs, which can only model very short-range residue
correlations, MRFs can model long-range residue interaction patterns and thus
encode information about the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection should be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/. Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology
Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.
PhD
Understanding the etiology of complex disease remains a challenge in biology. In recent
years there has been an explosion in biological data; this study investigates machine
learning and network analysis methods as tools to aid candidate disease gene prioritisation,
specifically relating to hypertension and cardiovascular disease.
This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide
polymorphisms (nsSNPs) were analysed in terms of sequence and structure based properties
using a classifier to provide a model for predicting deleterious nsSNPs. The degree
of sequence conservation at the nsSNP position was found to be the single best attribute
but other sequence and structural attributes in combination were also useful. Predictions
for nsSNPs within Ensembl have been made publicly available.
Secondly, predicting protein function for proteins with an absence of experimental
data or lack of clear similarity to a sequence of known function was addressed. Protein
domain attributes based on physicochemical and predicted structural characteristics
of the sequence were used as input to classifiers for predicting membership of large and
diverse protein superfamilies from the SCOP database. An enrichment method was investigated
that involved adding domains to the training dataset that are currently absent
from SCOP. This analysis resulted in improved classifier accuracy: optimised classifiers
achieved 66.3% for single domain proteins and 55.6% when including domains from
multi domain proteins. Domains from superfamilies with low sequence similarity
share global sequence properties, enabling applications to be developed which complement
profile methods for detecting distant sequence relationships.
Thirdly, a topological analysis of the human protein interactome was performed. The
results were combined with functional annotation and sequence based properties to build
models for predicting hypertension associated proteins. The study found that predicted
hypertension related proteins are not generally associated with network hubs and do
not exhibit high clustering coefficients. Despite this, they tend to be closer and better
connected to other hypertension proteins on the interaction network than would be expected
by chance. Classifiers that combined PPI network, amino acid sequence and functional
properties produced a range of precision and recall scores according to the applied
weights.
Finally, interactome properties of proteins implicated in cardiovascular disease and
cancer were studied. The analysis quantified the influential (central) nature of each protein
and defined characteristics of functional modules and pathways in which the disease
proteins reside. Such proteins were found to be enriched 2-fold within proteins that are influential
(p<0.05) in the interactome. Additionally, they cluster in large, complex, highly
connected communities, acting as interfaces between multiple processes more often than
expected. An approach to prioritising disease candidates based on this analysis was proposed.
Each analysis can provide some new insights into the effort to identify novel disease-related
proteins for cardiovascular disease.
Word correlation matrices for protein sequence analysis and remote homology detection
Background: Classification of protein sequences is a central problem in computational biology. Currently, among computational methods discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive. Results: In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely-used benchmark setup for protein remote homology detection. Conclusion: Our word correlation approach provides highly competitive performance as compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.
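The idea of a kernel built from average word similarity can be illustrated with a minimal Python sketch. This is not the authors' kernel: the abstract does not define the word similarity, so the function `word_similarity` (fraction of identical positions between two equal-length words), the word length `k`, and all names below are assumptions made purely for illustration.

```python
def words(seq, k):
    """Return all overlapping words (k-mers) of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_similarity(u, v):
    """Assumed similarity: fraction of positions where two k-mers agree."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


def average_word_kernel(s, t, k=3):
    """Average the word similarity over every pair of words from s and t."""
    ws, wt = words(s, k), words(t, k)
    if not ws or not wt:
        # A sequence shorter than k has no words; treat similarity as zero.
        return 0.0
    total = sum(word_similarity(u, v) for u in ws for v in wt)
    return total / (len(ws) * len(wt))


# Hypothetical toy sequences, not taken from any benchmark.
same = average_word_kernel("AAAA", "AAAA")
different = average_word_kernel("AAAA", "CCCC")
forward = average_word_kernel("ACDE", "ACDF")
backward = average_word_kernel("ACDF", "ACDE")
```

In this sketch, identical homopolymer sequences score 1.0, sequences with no shared residues score 0.0, and the value is symmetric in its two arguments.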
Multiple graph regularized protein domain ranking
Background Protein domain ranking is a fundamental task in structural
biology. Most protein domain ranking methods rely on the pairwise comparison of
protein domains while neglecting the global manifold structure of the protein
domain database. Recently, graph regularized ranking that exploits the global
structure of the graph defined by the pairwise similarities has been proposed.
However, the existing graph regularized ranking methods are very sensitive to
the choice of the graph model and parameters, and this remains a difficult
problem for most of the protein domain ranking methods.
Results To tackle this problem, we have developed the Multiple Graph
regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to
regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold
of protein domain distribution by combining multiple initial graphs for the
regularization. Graph weights are learned with ranking scores jointly and
automatically, by alternately minimizing an objective function in an
iterative algorithm. Experimental results on a subset of the ASTRAL SCOP
protein domain database demonstrate that MultiG-Rank achieves a better ranking
performance than single graph regularized ranking methods and pairwise
similarity based ranking methods.
Conclusion The problem of graph model and parameter selection in graph
regularized protein domain ranking can be solved effectively by combining
multiple graphs. This aspect of generalization introduces a new frontier in
applying multiple graphs to solving protein domain ranking applications. Comment: 21 pages
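As a rough illustration of ranking regularized by a combination of graphs, the Python sketch below mixes several similarity graphs with fixed weights and then runs a generic manifold-ranking style propagation from a query node. It is not MultiG-Rank itself: the abstract says graph weights are learned jointly with the ranking scores, whereas here the weights are simply supplied by hand, and the update rule, the parameter `alpha`, and all function names are assumptions.

```python
import math


def combine_graphs(graphs, weights):
    """Weighted sum of adjacency matrices (weights are fixed here, not learned)."""
    n = len(graphs[0])
    return [[sum(w * g[i][j] for g, w in zip(graphs, weights))
             for j in range(n)] for i in range(n)]


def normalize(W):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2); isolated nodes stay zero."""
    n = len(W)
    degrees = [sum(row) for row in W]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if W[i][j] and degrees[i] > 0 and degrees[j] > 0:
                S[i][j] = W[i][j] / math.sqrt(degrees[i] * degrees[j])
    return S


def rank(S, query, alpha=0.5, iterations=100):
    """Iterate f = alpha * S f + (1 - alpha) * y, where y marks the query node."""
    n = len(S)
    y = [1.0 if i == query else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iterations):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


# Two hypothetical similarity graphs over three domains (0, 1, 2).
graph_a = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
graph_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
combined = combine_graphs([graph_a, graph_b], [0.5, 0.5])
scores = rank(normalize(combined), query=0)
```

With these toy graphs the query domain receives the highest score, its direct neighbour the next highest, and the domain reachable only through the second graph the lowest.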
The distance-profile representation and its application to detection of distantly related protein families
BACKGROUND: Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space and the computed scores are often noisy and frequently fail to recognize distantly related protein families. RESULTS: We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins measured by the statistical difference between their corresponding probability distributions. CONCLUSION: Using several performance measures we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSI-BLAST and FUGUE.
Methods for the refinement of genome-scale metabolic networks
More accurate metabolic networks of pathogens and parasites are required to support the
identification of important enzymes or transporters that could be potential targets for new
drugs. The overall aim of this thesis is to contribute towards a new level of quality for
metabolic network reconstruction, through the application of several different approaches.
After building a draft metabolic network using an automated method, a large amount of
manual curation effort is still necessary before an accurate model can be reached. PathwayBooster,
a standalone software package, which I developed in Python, supports the
first steps of model curation, providing easy access to enzymatic function information and
a visual pathway display to enable the rapid identification of inaccuracies in the model.
A major current problem in model refinement is the identification of genes encoding enzymes
which are believed to be present but cannot be found using standard methods.
Current searches for enzymes are mainly based on strong sequence similarity to proteins
of known function, although in some cases it may be appropriate to consider more distant
relatives as candidates for filling these pathway holes. With this objective in mind, a
protocol was devised to search a proteome for superfamily relatives of a given enzymatic
function, returning candidate enzymes to perform this function.
Another, related approach tackles the problem of misannotation errors in public gene
databases and their influence on metabolic models through the propagation of erroneous
annotations. I show that the topological properties of metabolic networks contain useful information about annotation quality and can therefore play a role in methods for gene
function assignment.
An evolutionary perspective into functional changes within homologous domains opens
up the possibility of integrating information from multiple genomes to support the reconstruction
of metabolic models. I have therefore developed a methodology to predict
functional change within a gene superfamily phylogeny.
A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis
Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is to find a suitable representation of protein sequences. Results: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. Conclusion: The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, the Top-n-gram is a good building block of protein sequences and can be widely used in many tasks of computational biology, such as sequence alignment, the prediction of domain boundaries, the designation of knowledge-based potentials and the prediction of protein binding sites.
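The conversion from a frequency profile to a fixed-dimension count vector can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not specify how a Top-n-gram is formed, so the sketch assumes each profile position yields one token made of its `n` most frequent amino acids (ties broken alphabetically), and the toy profile, the function names, and the vocabulary handling are all hypothetical. The LSA and SVM stages are not shown.

```python
from collections import Counter


def top_n_grams(profile, n=2):
    """Map each profile column to a token of its n most frequent amino acids.

    `profile` is a list of dicts (amino-acid letter -> frequency), one per
    sequence position. Ties are broken alphabetically (an assumption).
    """
    tokens = []
    for column in profile:
        ranked = sorted(column.items(), key=lambda kv: (-kv[1], kv[0]))
        tokens.append("".join(letter for letter, _ in ranked[:n]))
    return tokens


def feature_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


# Hypothetical three-position profile; real profiles would come from an MSA.
profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "S": 0.1},
]
tokens = top_n_grams(profile, n=2)
vocabulary = sorted(set(tokens))
vector = feature_vector(tokens, vocabulary)
```

Here `tokens` is `["AG", "LI", "AG"]` and `vector` is `[2, 1]` over the vocabulary `["AG", "LI"]`. In practice the vocabulary would be fixed across all sequences so that every vector has the same dimension.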