3 research outputs found

    The effectiveness of position- and composition-specific gap costs for protein similarity searches

    The flexibility in gap cost enjoyed by Hidden Markov Models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments. We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance.
    Comment: 17 pages, 4 figures, 2 tables

    Identification, Characterization, and Life Cycle of Intein-Associated Homing Endonucleases

    Inteins are molecular parasites that have been identified in unicellular organisms from the three domains of life. The intein self-excises following translation of the host gene, and therefore incurs a fitness cost for its carrier. The symbiotic state of the intein to its host is dependent on the presence or absence of a homing endonuclease domain, which facilitates horizontal transfer of the molecule. Identification of this domain provides information on the evolutionary history of the intein, as well as patterns of horizontal gene transfer in microbial communities. I have therefore developed Hidden Markov Models (HMMs) to identify homing endonuclease domains in biological sequence data. Following validation, the HMMs were used to assign symbiotic states to inteins found in the haloarchaea. This search method expands upon previous approaches to characterizing inteins, and provides molecular evidence for the presence of homing endonuclease domains. I have also created an agent-based model for the competition between intein states in a simulated microbial population. The model incorporates spatial interactions, measured efficiencies of gene transfer, and environmental perturbations to determine the conditions under which inteins spread. These simulations determined that inteins actively spread in a population that is in stationary growth phase, while carriers are outcompeted during exponential phases of growth. My computational analysis provides a new method for assessing the symbiotic state of inteins, as well as a platform for exploring the life cycle of inteins under a variety of environmental scenarios.

    More Than 1,001 Problems with Protein Domain Databases: Transmembrane Regions, Signal Peptides and the Issue of Sequence Homology

    Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. We suggest a workflow of flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.