11 research outputs found
Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER
BACKGROUND: Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring. RESULTS: Using default settings, SAM was overall more sensitive. SAM's model estimation was superior, while HMMER's model scoring was more accurate. Critical features for model building were then analysed by comparing the two packages' algorithmic choices and parameters. The weighting between prior probabilities and multiple alignment counts provided the primary explanation for why SAM's model building was superior. Our analysis suggests that HMMER gives too much weight to the sequence counts. SAM's emission prior probabilities were also shown to be more sensitive. The relative sequence weighting schemes are different in the two packages but performed equivalently. CONCLUSION: SAM model estimation was more sensitive, while HMMER model scoring was more accurate. By combining the best algorithmic features from both packages, the accuracy was substantially improved compared to their default performance.
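The core finding above, that the balance between prior pseudocounts and observed alignment counts drives model quality, can be illustrated with a minimal sketch. This is not SAM's or HMMER's actual estimator; the blending formula, the toy counts, and the `prior_weight` parameter are assumptions for illustration only.

```python
def emission_probs(counts, prior, prior_weight):
    """Blend observed residue counts with a prior distribution.

    A larger prior_weight pulls the estimate toward the prior; the
    abstract's analysis suggests HMMER effectively lets the raw
    sequence counts dominate (too small a prior contribution).
    """
    total = sum(counts.values())
    return {
        residue: (counts.get(residue, 0) + prior_weight * p)
                 / (total + prior_weight)
        for residue, p in prior.items()
    }

# Toy example: a column with 10 observed residues and a uniform prior.
counts = {"A": 6, "C": 2, "G": 1, "T": 1}
prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
probs = emission_probs(counts, prior, prior_weight=4.0)
```

With `prior_weight=4.0` the estimate for "A" is (6 + 1)/(10 + 4) = 0.5; raising `prior_weight` flattens the distribution toward the prior, acting as a stronger regulariser.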
Subfamily specific conservation profiles for proteins based on n-gram patterns
BACKGROUND: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem, where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank-ordered by similarity with respect to the query. RESULTS: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily-specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to that of the multiple alignment approach. CONCLUSION: Subfamily-specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.
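The NP{n,m} pattern and the signal-to-noise framing can be sketched as follows. The '.'-as-wildcard syntax, the simple count ratio, and the toy sequences are illustrative assumptions; the paper itself separates signal from noise with singular value decomposition over rank-ordered target sets rather than a plain ratio.

```python
def count_pattern(pattern, sequence):
    """Count windows of len(pattern) in sequence that match it;
    '.' is a wildcard matching any residue."""
    w = len(pattern)
    return sum(
        all(p == "." or p == c for p, c in zip(pattern, sequence[i:i + w]))
        for i in range(len(sequence) - w + 1)
    )

def signal_to_noise(pattern, similar, all_targets):
    """Ratio of pattern counts in query-similar sequences (signal)
    to counts over all target sequences (noise)."""
    signal = sum(count_pattern(pattern, s) for s in similar)
    noise = sum(count_pattern(pattern, s) for s in all_targets)
    return signal / noise if noise else 0.0

# "A.C" is an NP{2,1} pattern: 2 fixed residues, 1 wildcard, window 3.
similar = ["AKCAQC", "AKCMMC"]               # sequences similar to the query
everything = similar + ["AXCGGG", "TTTTTT"]  # all target sequences
ratio = signal_to_noise("A.C", similar, everything)
```

Patterns whose counts concentrate in the query-similar set score high, marking positions conserved within the subfamily rather than across the whole family.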
Comparative Genomics Search for Losses of Long-Established Genes on the Human Lineage
Taking advantage of the complete genome sequences of several mammals, we developed a novel method to detect losses of well-established genes in the human genome through syntenic mapping of gene structures between the human, mouse, and dog genomes. Unlike most previous genomic methods for pseudogene identification, this analysis is able to differentiate losses of well-established genes from pseudogenes formed shortly after segmental duplication or generated via retrotransposition. Therefore, it enables us to find genes that were inactivated long after their birth, which were likely to have evolved nonredundant biological functions before being inactivated. The method was used to look for gene losses along the human lineage during the approximately 75 million years (My) since the common ancestor of primates and rodents (the Euarchontoglires crown group). We identified 26 losses of well-established genes in the human genome that were all lost at least 50 My after their birth. Many of them were previously characterized pseudogenes in the human genome, such as GULO and UOX. Our methodology is highly effective at identifying losses of single-copy genes of ancient origin, allowing us to find a few well-known pseudogenes in the human genome missed by previous high-throughput genome-wide studies. In addition to confirming previously known gene losses, we identified 16 previously uncharacterized human pseudogenes that are definitive losses of long-established genes. Among them is ACYL3, an ancient enzyme present in archaea, bacteria, and eukaryotes, but lost approximately 6 to 8 Mya in the ancestor of humans and chimps. Although losses of well-established genes do not equate to adaptive gene losses, they are a useful proxy to use when searching for such genetic changes. This is especially true for adaptive losses that occurred more than 250,000 years ago, since any genetic evidence of the selective sweep indicative of such an event has been erased.
Knowledge discovery in biological databases : a neural network approach
Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, feature selection, and dimensionality reduction to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms in an infrastructure, called Genome Mining, developed for biosequence classification and recognition.
On the molecular evolution of the Plasmodium falciparum
Research in the Plasmodium falciparum molecular evolution field has
predominantly comprised three distinct areas: phylogenetics, host-parasite coevolution
and evolutionary genomics. These areas have greatly enhanced our
understanding of the early origins of the phylum Apicomplexa, the emergence of
P. falciparum, and the co-evolution between parasite and human hereditary
erythrocyte disorders. In addition, the genome sequencing projects have
elucidated the complexity and extremely unusual nature of the parasite genome.
Some aspects of parasite molecular evolution, however, are controversial, such as
human pyruvate kinase (PK) deficiency and P. falciparum virulence coevolution.
Other aspects, like Plasmodium whole-genome evolution, have
remained unexplored.
This thesis includes a collection of manuscripts that address aspects of the broad
field of P. falciparum molecular evolution. The first deals with the limitations of
bioinformatic methods as applied to P. falciparum, which have arisen due to the
unusual nature of the parasite genome, such as the extreme nucleotide bias.
Although conventional bioinformatics can partially accommodate and
compensate for the genome idiosyncrasies, these limitations have hampered
progress significantly. A novel alignment method, termed FIRE (Functional
Inference using the Rates of Evolution) was therefore developed. FIRE uses the
evolutionary constraints at codon sites to align sequences and infer domain
function and overcomes the problem of poor sequence similarity, which is
commonly encountered between P. falciparum and other taxa. A second aspect
addressed in this thesis is the host-parasite relationship in the context of PK
deficiency. It was demonstrated that PK deficient erythrocytes are dramatically
resistant to parasite infection, providing in vitro evidence for this phenomenon
and confirming this aspect of host-parasite co-evolution.
The unexplored field of parasite genome evolution was initiated in this thesis by
investigating two major role-players in genome dynamics, mobile genetic elements (MGEs) and programmed cell death (PCD). MGEs were absent in P.
falciparum, possibly due to a geno-protective mechanism, which increased the
AT nucleotide bias. Interestingly, the parasite telomerase reverse transcriptase,
which is a domesticated MGE, was identified. In addition, there is genomic
evidence for the second determinant, a classical PCD pathway. Intriguingly,
functional and structural evidence for a p53-like DNA-binding domain, which
plays a key role in genome evolution, was obtained. Using MGEs and PCD as
examples, a theoretical framework for investigating genome dynamics was
developed. The framework proposes an ecological approach to genome evolution,
in which a trade-off exists between two opposing processes: the generation of
diversity by factors such as MGEs and the maintenance of integrity by factors
like PCD. The framework is suggested for proposing and testing hypotheses to
investigate the origins and evolution of the P. falciparum genome.
Finally, a novel approach, termed Evolutionary Patterning (EP), was developed to
limit the problem of parasite drug resistance, demonstrating the value of
employing molecular evolution to address biomedical challenges.
Some of this work, such as the FIRE method, the host-parasite co-evolution
studies, the PCD findings and the EP approach have been incorporated in grant
proposals and adopted in future projects. It is hoped that this research will be
used to further our understanding of P. falciparum evolution and advance the
efforts to control this deadly pathogen.
Integrating Protein Data Resources through Semantic Web Services
Understanding the function of every protein is one major objective of bioinformatics. Currently, a large amount of information (e.g., sequence, structure and dynamics) is being produced by experiments and predictions that are associated with protein function. Integrating these diverse data about protein sequence, structure, dynamics and other protein features allows further exploration and establishment of the relationships between protein sequence, structure, dynamics and function, and thereby control of the function of target proteins. However, information integration in protein data resources faces challenges at the technology level, for interfacing heterogeneous data formats and standards, and at the application level, for semantic interpretation of dissimilar data and queries. In this research, a semantic web services infrastructure, called Web Services for Protein data resources (WSP), for flexible and user-oriented integration of protein data resources, is proposed. This infrastructure includes a method for modeling protein web services, a service publication algorithm, an efficient service discovery (matching) algorithm, and an optimal service chaining algorithm. Rather than relying on syntactic matching, the matching algorithm discovers services based on their similarity to the requested service. Therefore, users can locate services that semantically match their data requirements even if they are syntactically distinctive. Furthermore, WSP supports a workflow-based approach for service integration. The chaining algorithm is used to select and chain services, based on the criteria of service accuracy and data interoperability. The algorithm generates a web services workflow which automatically integrates the results from individual services. A number of experiments are conducted to evaluate the performance of the matching algorithm. The results reveal that the algorithm can discover services with reasonable performance.
Also, a composite service, which integrates protein dynamics and conservation, was tested using the WSP infrastructure.
Exploring the function and evolution of proteins using domain families
Proteins are frequently composed of multiple domains which fold
independently. These are often evolutionarily distinct units which can be
adapted and reused in other proteins. The classification of protein domains
into evolutionary families facilitates the study of their evolution and function.
In this thesis such classifications are used firstly to examine methods for
identifying evolutionary relationships (homology) between protein domains.
Secondly a specific approach for predicting their function is developed.
Lastly they are used in studying the evolution of protein complexes.
Tools for identifying evolutionary relationships between proteins are
central to computational biology. They aid in classifying families of proteins,
giving clues about protein function, and supporting the study of molecular
evolution. The first chapter of this thesis examines the effectiveness of
cutting-edge methods in identifying evolutionary relationships between protein
domains.
The identification of evolutionary relationships between proteins can
give clues as to their function. The second chapter of this thesis concerns the
development of a method to identify proteins involved in the same biological
process. This method is based on the concept of domain fusion whereby
pairs of proteins from one organism with a concerted function are sometimes
found fused into single proteins in a different organism. Using protein
domain classifications it is possible to identify these relationships.
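The domain-fusion inference described above can be sketched with a toy example. The protein names and domain labels below are invented, and a real pipeline would draw domain assignments from a classification such as Pfam or CATH; this is only a minimal illustration of the inference step.

```python
def fusion_links(query_proteins, fused_proteins):
    """Predict functional links: two query proteins are linked if all
    of their domains co-occur in a single fused protein elsewhere."""
    links = set()
    names = list(query_proteins)
    for fused_domains in fused_proteins.values():
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if (query_proteins[a] <= fused_domains
                        and query_proteins[b] <= fused_domains):
                    links.add((a, b))
    return links

# Toy data: the domains of protA and protB appear fused into one
# protein in a second organism, suggesting a shared process.
organism_a = {"protA": {"D1"}, "protB": {"D2"}, "protC": {"D3"}}
organism_b = {"fused1": {"D1", "D2"}}
links = fusion_links(organism_a, organism_b)
```

Here only protA and protB are linked; protC's domain never co-occurs with theirs in a fused protein.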
Most proteins do not act in isolation but carry out their function by
binding to other proteins in complexes; little is understood about the
evolution of such complexes. In the third chapter of this thesis the evolution
of complexes is examined in two representative model organisms using
protein domain families. In this work, protein domain superfamilies allow
distantly related parts of complexes to be identified in order to determine
how homologous units are reused.
Weighting Hidden Markov Models For Maximum Discrimination
MOTIVATION: Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. RESULTS: One good method scales sequence weights by an exponential that ranges between 0.1 for the best-scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). AVAILABILITY: A World-Wide Web server, as well as information on obtaining the Sequence Alignme..
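The weighting scheme in the Results can be sketched as below. The abstract specifies only the endpoints (0.1 for the best-scoring sequence, 1.0 for the worst) and that the scaling is exponential; the exact interpolation used by SAM is an assumption here.

```python
def discrimination_weights(scores, w_best=0.1, w_worst=1.0):
    """Exponentially interpolate weights so the best-scoring sequence
    gets w_best and the worst gets w_worst; poorly scoring (under-
    represented) sequences are thus up-weighted during training."""
    s_best, s_worst = max(scores), min(scores)
    if s_best == s_worst:
        return [w_worst] * len(scores)
    weights = []
    for s in scores:
        t = (s - s_worst) / (s_best - s_worst)  # 1 = best, 0 = worst
        weights.append(w_worst * (w_best / w_worst) ** t)
    return weights

# Toy scores, best first; weights rise from 0.1 toward 1.0.
w = discrimination_weights([120.0, 80.0, 35.0, 10.0])
```

Down-weighting already well-modelled sequences counteracts the training-set bias toward one subgroup that the Motivation describes.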
Activity fingerprints in DNA based on a structural analysis of sequence information.
The function of a DNA sequence is commonly predicted by measuring its nucleotide
similarity to known functional sets. However, the use of structural properties to
identify patterns within families is justified by the discovery that many very different
sequences have similar structural properties. The aim of this thesis is to develop tools
that detect any unusual structural characteristics of a particular sequence or that
identify DNA structure-activity fingerprints common to a set.
This work uses the Octamer Database to describe DNA. The database's contents are
split into two categories: those parameters that describe minimum energy structure and
those that measure flexibility. Information from both of these categories has been
combined to describe structural tendencies, offering an alternative measure of sequence
similarity.
A structural DNA profile gives a graphical illustration of how a parameter from the
Octamer Database varies across either a single sequence's length or across a set of
sequences. Profile Manager is an application that has been developed to automate
single sequence profile generation and is used to study the A-tract phenomenon. The
use of profiles to explore patterns in flexibility across a set of pre-aligned promoters is
then investigated, revealing interesting transitions in decreasing twist flexibility.
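The single-sequence structural profile described above can be sketched as a sliding-window lookup. For brevity the toy table below maps 2-mers to made-up twist values; the thesis itself looks up octamer (8-mer) parameters from the Octamer Database.

```python
def structural_profile(sequence, table, k):
    """Profile of a structural parameter along a sequence: one table
    lookup per k-mer window."""
    return [table[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]

# Hypothetical 2-mer "twist" values, for illustration only.
twist = {"AA": 35.6, "AT": 31.5, "TA": 36.0, "TT": 35.6}
profile = structural_profile("AATT", twist, k=2)
```

Plotting such a profile along a sequence's length is what exposes local anomalies like A-tracts or transitions in flexibility.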
Multiple sequence queries are harder to solve than those of single sequences, due to the
inherent need for the sequences to be aligned. It is only under rare circumstances that
sequences are pre-aligned by an experimentally determined position. More commonly
a multiple alignment must be generated. An extended, structure-based hidden Markov
model technique that successfully generates structural alignments is presented. Its
application is tested on four DNA-protein binding site datasets, with comparisons made
to the traditional sequence method. Structural alignments of two of the four
datasets were comparable in performance to the sequence-based alignments, with
useful insights into the underlying structural mechanisms.