11 research outputs found

    Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER

    BACKGROUND: Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring. RESULTS: Using default settings, SAM was overall more sensitive. SAM's model estimation was superior, while HMMER's model scoring was more accurate. Critical features for model building were then analysed by comparing the two packages' algorithmic choices and parameters. The weighting between prior probabilities and multiple alignment counts held the primary explanation why SAM's model building was superior. Our analysis suggests that HMMER gives too much weight to the sequence counts. SAM's emission prior probabilities were also shown to be more sensitive. The relative sequence weighting schemes are different in the two packages but performed equivalently. CONCLUSION: SAM model estimation was more sensitive, while HMMER model scoring was more accurate. By combining the best algorithmic features from both packages the accuracy was substantially improved compared to their default performance
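The balance between prior probabilities and alignment counts described in this abstract can be illustrated with a minimal sketch. It assumes a single Dirichlet-style prior with one total weight, which is a simplification chosen for illustration; the abstract does not specify the priors SAM or HMMER use. The function name `emission_probs` and the parameter `prior_weight` are hypothetical.

```python
def emission_probs(counts, prior, prior_weight):
    """Posterior-mean estimate for one alignment column.

    counts:       observed (possibly weighted) residue counts, e.g. {"A": 3}
    prior:        background probabilities covering the whole alphabet (sum to 1)
    prior_weight: total pseudocount mass; a larger value trusts the prior more

    Assumption: a single Dirichlet-style prior, used here only for illustration.
    """
    total = sum(counts.get(a, 0.0) for a in prior)
    return {
        a: (counts.get(a, 0.0) + prior_weight * q) / (total + prior_weight)
        for a, q in prior.items()
    }


# Toy four-letter alphabet, four observed residues in the column.
prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
counts = {"A": 3, "G": 1}
counts_only = emission_probs(counts, prior, prior_weight=0.0)
balanced = emission_probs(counts, prior, prior_weight=4.0)
```

With `prior_weight=0.0` the estimate follows the counts alone, which corresponds to the situation the abstract describes as giving too much weight to the sequence counts; raising `prior_weight` pulls the estimate toward the prior.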

    Subfamily specific conservation profiles for proteins based on n-gram patterns

    BACKGROUND: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank ordered by similarity with respect to the query. RESULTS: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach. CONCLUSION: Subfamily specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.
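The NP{n,m} definition in this abstract (n residues and m wildcards in a window of size n+m) can be sketched as a simple enumeration. This is an illustrative reading of the definition only: it assumes wildcards may occupy any position in the window, and the function name `ngram_patterns` and the `.` wildcard symbol are hypothetical choices, not taken from the paper.

```python
from itertools import combinations


def ngram_patterns(seq, n, m, wildcard="."):
    """List every NP{n,m} pattern found in seq.

    Each window of length n + m yields one pattern per choice of m
    wildcard positions. Assumption: wildcards may fall anywhere in the
    window, including its ends.
    """
    width = n + m
    patterns = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for wild_positions in combinations(range(width), m):
            chars = list(window)
            for i in wild_positions:
                chars[i] = wildcard
            patterns.append("".join(chars))
    return patterns


plain = ngram_patterns("ACDE", n=2, m=0)
gapped = ngram_patterns("ACDE", n=2, m=1)
```

With `m=0` this reduces to ordinary n-grams; counting such patterns over similar versus all target sequences would give the signal and noise terms the abstract refers to.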

    Comparative Genomics Search for Losses of Long-Established Genes on the Human Lineage

    Taking advantage of the complete genome sequences of several mammals, we developed a novel method to detect losses of well-established genes in the human genome through syntenic mapping of gene structures between the human, mouse, and dog genomes. Unlike most previous genomic methods for pseudogene identification, this analysis is able to differentiate losses of well-established genes from pseudogenes formed shortly after segmental duplication or generated via retrotransposition. Therefore, it enables us to find genes that were inactivated long after their birth, which were likely to have evolved nonredundant biological functions before being inactivated. The method was used to look for gene losses along the human lineage during the approximately 75 million years (My) since the common ancestor of primates and rodents (the euarchontoglire crown group). We identified 26 losses of well-established genes in the human genome that were all lost at least 50 My after their birth. Many of them were previously characterized pseudogenes in the human genome, such as GULO and UOX. Our methodology is highly effective at identifying losses of single-copy genes of ancient origin, allowing us to find a few well-known pseudogenes in the human genome missed by previous high-throughput genome-wide studies. In addition to confirming previously known gene losses, we identified 16 previously uncharacterized human pseudogenes that are definitive losses of long-established genes. Among them is ACYL3, an ancient enzyme present in archaea, bacteria, and eukaryotes, but lost approximately 6 to 8 Mya in the ancestor of humans and chimps. Although losses of well-established genes do not equate to adaptive gene losses, they are a useful proxy to use when searching for such genetic changes. This is especially true for adaptive losses that occurred more than 250,000 years ago, since any genetic evidence of the selective sweep indicative of such an event has been erased

    Knowledge discovery in biological databases : a neural network approach

    Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, feature selection, and dimensionality reduction to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms in an infrastructure, called Genome Mining, developed for biosequence classification and recognition.

    On the molecular evolution of the Plasmodium falciparum

    Research in the Plasmodium falciparum molecular evolution field has predominantly comprised three distinct areas: phylogenetics, host-parasite coevolution and evolutionary genomics. These areas have greatly enhanced our understanding of the early origins of the phylum Apicomplexa, the emergence of P. falciparum, and the co-evolution between parasite and human hereditary erythrocyte disorders. In addition, the genome sequencing projects have elucidated the complexity and extremely unusual nature of the parasite genome. Some aspects of parasite molecular evolution, however, are controversial, such as human pyruvate kinase (PK) deficiency and P. falciparum virulence coevolution. Other aspects, like Plasmodium whole genome evolution, have remained unexplored. This thesis includes a collection of manuscripts that address aspects of the broad field of P. falciparum molecular evolution. The first deals with the limitations of bioinformatic methods as applied to P. falciparum, which have arisen due to the unusual nature of the parasite genome, such as the extreme nucleotide bias. Although conventional bioinformatics can partially accommodate and compensate for the genome idiosyncrasies, these limitations have hampered progress significantly. A novel alignment method, termed FIRE (Functional Inference using the Rates of Evolution), was therefore developed. FIRE uses the evolutionary constraints at codon sites to align sequences and infer domain function and overcomes the problem of poor sequence similarity, which is commonly encountered between P. falciparum and other taxa. A second aspect addressed in this thesis is the host-parasite relationship in the context of PK deficiency. It was demonstrated that PK-deficient erythrocytes are dramatically resistant to parasite infection, providing in vitro evidence for this phenomenon and confirming this aspect of host-parasite co-evolution.
    The unexplored field of parasite genome evolution was initiated in this thesis by investigating two major role-players in genome dynamics: mobile genetic elements (MGEs) and programmed cell death (PCD). MGEs were absent in P. falciparum, possibly due to a geno-protective mechanism, which increased the AT nucleotide bias. Interestingly, the parasite telomerase reverse transcriptase, which is a domesticated MGE, was identified. In addition, there is genomic evidence for the second determinant, a classical PCD pathway. Intriguingly, functional and structural evidence for a p53-like DNA-binding domain, which plays a key role in genome evolution, was obtained. Using MGEs and PCD as examples, a theoretical framework for investigating genome dynamics was developed. The framework proposes an ecological approach to genome evolution, in which a trade-off exists between two opposing processes: the generation of diversity by factors such as MGEs and the maintenance of integrity by factors like PCD. The framework is suggested for proposing and testing hypotheses to investigate the origins and evolution of the P. falciparum genome. Finally, a novel approach, termed Evolutionary Patterning (EP), was developed to limit the problem of parasite drug resistance and demonstrates the value of employing molecular evolution to address biomedical challenges. Some of this work, such as the FIRE method, the host-parasite co-evolution studies, the PCD findings and the EP approach, has been incorporated in grant proposals and adopted in future projects. It is hoped that this research will be used to further our understanding of P. falciparum evolution and advance the efforts to control this deadly pathogen.

    Integrating Protein Data Resources through Semantic Web Services

    Understanding the function of every protein is one major objective of bioinformatics. Currently, a large amount of information (e.g., sequence, structure and dynamics) is being produced by experiments and predictions that are associated with protein function. Integrating these diverse data about protein sequence, structure, dynamics and other protein features allows further exploration and establishment of the relationships between protein sequence, structure, dynamics and function, and thereby controlling the function of target proteins. However, information integration in protein data resources faces challenges at the technology level, in interfacing heterogeneous data formats and standards, and at the application level, in the semantic interpretation of dissimilar data and queries. In this research, a semantic web services infrastructure, called Web Services for Protein data resources (WSP), for flexible and user-oriented integration of protein data resources, is proposed. This infrastructure includes a method for modeling protein web services, a service publication algorithm, an efficient service discovery (matching) algorithm, and an optimal service chaining algorithm. Rather than relying on syntactic matching, the matching algorithm discovers services based on their similarity to the requested service. Therefore, users can locate services that semantically match their data requirements even if they are syntactically distinctive. Furthermore, WSP supports a workflow-based approach for service integration. The chaining algorithm is used to select and chain services, based on the criteria of service accuracy and data interoperability. The algorithm generates a web services workflow which automatically integrates the results from individual services. A number of experiments are conducted to evaluate the performance of the matching algorithm. The results reveal that the algorithm can discover services with reasonable performance.
    Also, a composite service, which integrates protein dynamics and conservation, is tested using the WSP infrastructure.

    Exploring the function and evolution of proteins using domain families

    Proteins are frequently composed of multiple domains which fold independently. These are often evolutionarily distinct units which can be adapted and reused in other proteins. The classification of protein domains into evolutionary families facilitates the study of their evolution and function. In this thesis such classifications are used firstly to examine methods for identifying evolutionary relationships (homology) between protein domains. Secondly a specific approach for predicting their function is developed. Lastly they are used in studying the evolution of protein complexes. Tools for identifying evolutionary relationships between proteins are central to computational biology. They aid in classifying families of proteins, giving clues about the function of proteins and the study of molecular evolution. The first chapter of this thesis concerns the effectiveness of cutting edge methods in identifying evolutionary relationships between protein domains. The identification of evolutionary relationships between proteins can give clues as to their function. The second chapter of this thesis concerns the development of a method to identify proteins involved in the same biological process. This method is based on the concept of domain fusion whereby pairs of proteins from one organism with a concerted function are sometimes found fused into single proteins in a different organism. Using protein domain classifications it is possible to identify these relationships. Most proteins do not act in isolation but carry out their function by binding to other proteins in complexes; little is understood about the evolution of such complexes. In the third chapter of this thesis the evolution of complexes is examined in two representative model organisms using protein domain families. In this work, protein domain superfamilies allow distantly related parts of complexes to be identified in order to determine how homologous units are reused

    Weighting Hidden Markov Models For Maximum Discrimination

    1.1 Motivation: Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. 1.2 Results: One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). 1.3 Availability: A World-Wide Web server, as well as information on obtaining the Sequence Alignme..
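The weighting rule mentioned in the results (an exponential ranging from 0.1 for the best-scoring sequence to 1.0 for the worst) might look roughly like the sketch below. The abstract does not give the functional form, so the interpolation used here is an assumption, as are the function name `exponential_weights` and the convention that a higher score is better.

```python
def exponential_weights(scores, w_best=0.1, w_worst=1.0):
    """Map scores to weights that decay exponentially from w_worst to w_best.

    Assumptions (not from the abstract): higher score means a better fit,
    and the exponent is the score's position between the worst and best
    score, scaled to the range 0..1.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All sequences score the same, so there is nothing to discriminate.
        return [w_worst for _ in scores]
    ratio = w_best / w_worst
    return [w_worst * ratio ** ((s - lo) / (hi - lo)) for s in scores]


weights = exponential_weights([10.0, 5.0, 0.0])
```

Sequences the model already scores well are down-weighted, so under-represented subgroups contribute relatively more during retraining, which matches the motivation stated in the abstract.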

    Activity fingerprints in DNA based on a structural analysis of sequence information.

    The function of a DNA sequence is commonly predicted by measuring its nucleotide similarity to known functional sets. However, the use of structural properties to identify patterns within families is justified by the discovery that many very different sequences have similar structural properties. The aim of this thesis is to develop tools that detect any unusual structural characteristics of a particular sequence or that identify DNA structure-activity fingerprints common to a set. This work uses the Octamer Database to describe DNA. The database's contents are split into two categories: those parameters that describe minimum energy structure and those that measure flexibility. Information from both of these categories has been combined to describe structural tendencies, offering an alternative measure of sequence similarity. A structural DNA profile gives a graphical illustration of how a parameter from the Octamer Database varies across either a single sequence's length or across a set of sequences. Profile Manager is an application that has been developed to automate single sequence profile generation and is used to study the A-tract phenomenon. The use of profiles to explore patterns in flexibility across a set of pre-aligned promoters is then investigated, with interesting transitions in decreasing twist flexibility discovered. Multiple sequence queries are harder to solve than those of single sequences, due to the inherent need for the sequences to be aligned. It is only under rare circumstances that sequences are pre-aligned by an experimentally determined position. More commonly a multiple alignment must be generated. An extended, structure-based, hidden Markov model technique that successfully generates structural alignments is presented. Its application is tested on four DNA protein binding site datasets with comparisons made to the traditional sequence method.
    Structural alignments of two out of the four datasets were comparable in performance to the sequence method, with useful insights into underlying structural mechanisms.