812 research outputs found

    Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty

    Background: Accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than zero penalty. Conclusion: The fact that homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. Parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be used effectively to determine the relatedness of one pair, or a few pairs, of sequences without performing a time-consuming database search.
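The extreme value assumption mentioned in this abstract can be illustrated with a small sketch. It uses the widely known Karlin-Altschul form, P(S >= x) = 1 - exp(-K m n e^(-lambda x)), which is one common parameterization and not necessarily the fitting procedure used by the authors. The function name `evd_pvalue` and the values of `lam` and `k` are hypothetical placeholders; in practice these parameters would be estimated for the sequence pair and scoring scheme in question.

```python
import math


def evd_pvalue(score: float, lam: float, k: float, m: int, n: int) -> float:
    """Return P(S >= score) under a Gumbel-type extreme value distribution.

    lam and k are the scale and pre-factor parameters of the distribution;
    m and n are the lengths of the two sequences being aligned.
    """
    expected_hits = k * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # lam and k are illustrative placeholders, not fitted parameters.
    for s in (30, 50, 80):
        print(s, evd_pvalue(s, lam=0.27, k=0.04, m=200, n=250))
```

Higher alignment scores give smaller p-values, and for very high scores the p-value is approximately equal to the expected number of chance hits.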

    Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

    Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.

    The Personal Sequence Database: a suite of tools to create and maintain web-accessible sequence databases

    Background: Large molecular sequence databases are fundamental resources for modern bioscientists. Whether for project-specific purposes or sharing data with colleagues, it is often advantageous to maintain smaller sequence databases. However, this is usually not an easy task for the average bench scientist. Results: We present the Personal Sequence Database (PSD), a suite of tools to create and maintain small- to medium-sized web-accessible sequence databases. All interactions with PSD tools occur via the internet with a web browser. Users may define sequence groups within their database that can be maintained privately or published to the web for public use. A sequence group can be downloaded, browsed, searched by keyword or searched for sequence similarities using BLAST. Publishing a sequence group extends these capabilities to colleagues and collaborators. In addition to being able to manage their own sequence databases, users can enroll sequences in BLASTAgent, a BLAST hit tracking system, to monitor NCBI databases for new entries displaying a specified level of nucleotide or amino acid similarity. Conclusion: The PSD offers a valuable set of resources unavailable elsewhere. In addition to managing sequence data and BLAST search results, it facilitates data sharing with colleagues, collaborators and public users. The PSD is hosted by the authors and is available at http://bioinfo.cgrb.oregonstate.edu/psd/

    CrossHybDetector: detection of cross-hybridization events in DNA microarray experiments

    Background: DNA microarrays contain thousands of different probe sequences represented on their surface. These are designed in such a way that potential cross-hybridization reactions with non-target sequences are minimized. However, given the large number of probes, the occurrence of cross-hybridization events cannot be excluded. This problem can dramatically affect the data quality and cause false positive or false negative results. Results: CrossHybDetector is a software package aimed at identifying cross-hybridization events that occurred during individual array hybridization, using the probe sequences and the array intensity values. As output, the software provides the user with a list of potentially 'corrupted' array spots and their associated p-values calculated by Monte Carlo simulations. Graphical plots are also generated, which provide a visual and global overview of the quality of the microarray experiment with respect to cross-hybridization issues. Conclusion: CrossHybDetector is implemented as a package for the statistical computing environment R and is freely available under the LGPL license within the CRAN project.

    Testing statistical significance scores of sequence comparison methods with structure similarity

    BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
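To make the Monte Carlo Z-score idea concrete, here is a minimal sketch: the observed Smith-Waterman score is compared with the scores obtained after repeatedly shuffling one sequence, giving Z = (observed - mean of shuffled scores) / standard deviation of shuffled scores. Everything specific here is an assumption for illustration: the function names, the simple match/mismatch scoring with a linear gap penalty, and the number of shuffles. Real implementations typically use a substitution matrix such as BLOSUM62 with affine gap penalties, and this sketch does not reproduce the procedure used by CluSTr or Protein World.

```python
import random
import statistics


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score using a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def monte_carlo_z_score(query: str, target: str,
                        shuffles: int = 200, seed: int = 0) -> float:
    """Z-score of the observed score against scores of shuffled targets."""
    rng = random.Random(seed)
    observed = smith_waterman_score(query, target)
    residues = list(target)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # keeps composition, destroys residue order
        null_scores.append(smith_waterman_score(query, "".join(residues)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        raise ValueError("shuffled scores have zero variance; Z-score undefined")
    return (observed - statistics.mean(null_scores)) / spread
```

Each Z-score requires one alignment per shuffle, which is why the abstract describes the approach as compute-intensive compared with an e-value derived from a fitted distribution.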

    Increased S-nitrosylation and proteasomal degradation of caspase-3 during infection contribute to the persistence of adherent invasive Escherichia coli (AIEC) in immune cells

    Adherent invasive Escherichia coli (AIEC) have been implicated as a causative agent of Crohn's disease (CD) due to their isolation from the intestines of CD sufferers and their ability to persist in macrophages, inducing granulomas. The rapid intracellular multiplication of AIEC sets it apart from other enteric pathogens such as Salmonella Typhimurium, which induce programmed cell death (PCD) after limited replication. Understanding the response of infected cells to the increased AIEC bacterial load and associated metabolic stress may offer insights into AIEC pathogenesis and its association with CD. Here we show that AIEC persistence within macrophages and dendritic cells is facilitated by increased proteasomal degradation of caspase-3. In addition, S-nitrosylation of pro- and active forms of caspase-3, which can inhibit the enzyme's activity, is increased in AIEC-infected macrophages. This S-nitrosylated caspase-3 was seen to accumulate upon inhibition of the proteasome, indicating an additional role for S-nitrosylation in inducing caspase-3 degradation in a manner independent of ubiquitination. In addition to the autophagic genetic defects that are linked to CD, this delay in apoptosis, mediated in AIEC-infected cells through increased degradation of caspase-3, may be an essential factor in its prolonged persistence in CD patients.

    A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

    Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. Results: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. Conclusion: The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of computational biology, such as sequence alignment, the prediction of domain boundaries, the design of knowledge-based potentials and the prediction of protein binding sites.
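The conversion from frequency profile to fixed-dimension feature vector described in this abstract might be sketched as follows. The abstract does not define a Top-n-gram precisely, so this sketch assumes it is the n most frequent amino acids at a profile column, ordered by decreasing frequency. The function names, the dictionary-based profile format, and the alphabetical tie-breaking are all hypothetical, and the LSA and SVM stages are not shown.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_grams(profile, n=2):
    """Map each profile column to its Top-n-gram.

    profile is a list of dicts, one per sequence position, mapping an
    amino acid letter to its frequency in that column (assumed format).
    """
    grams = []
    for column in profile:
        # Descending frequency; ties are broken alphabetically for determinism.
        ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
        grams.append("".join(ranked[:n]))
    return grams


def feature_vector(profile, n=2):
    """Count occurrences of every possible Top-n-gram in a fixed order."""
    counts = Counter(top_n_grams(profile, n))
    # Ordered selections of n distinct residues: 20 * 19 = 380 entries for n = 2.
    vocabulary = ["".join(p) for p in permutations(AMINO_ACIDS, n)]
    return [counts[gram] for gram in vocabulary]


if __name__ == "__main__":
    toy_profile = [
        {"A": 0.6, "G": 0.3, "S": 0.1},
        {"L": 0.5, "I": 0.4, "V": 0.1},
        {"A": 0.7, "G": 0.2, "T": 0.1},
    ]
    print(top_n_grams(toy_profile))
    print(len(feature_vector(toy_profile)))
```

Because the vocabulary is fixed, sequences of any length map to vectors of the same dimension, which is what allows them to be passed to a classifier such as an SVM.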

    Evolutionary relationships among barley and Arabidopsis core circadian clock and clock-associated genes

    The circadian clock regulates a multitude of plant developmental and metabolic processes. In crop species, it contributes significantly to plant performance and productivity and to the adaptation and geographical range over which crops can be grown. To understand the clock in barley and how it relates to the components in the Arabidopsis thaliana clock, we have performed a systematic analysis of core circadian clock and clock-associated genes in barley, Arabidopsis and another eight species including tomato, potato, a range of monocotyledonous species and the moss, Physcomitrella patens. We have identified orthologues and paralogues of Arabidopsis genes which are conserved in all species, monocot/dicot differences, species-specific differences and variation in gene copy number (e.g. gene duplications among the various species). We propose that the common ancestor of barley and Arabidopsis had two-thirds of the key clock components identified in Arabidopsis prior to the separation of the monocot/dicot groups. After this separation, multiple independent gene duplication events took place in both monocot and dicot ancestors. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s00239-015-9665-0) contains supplementary material, which is available to authorized users