6 research outputs found

    A comprehensive software suite for protein family construction and functional site prediction.

    In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user's query sequence, while the NO partition is depleted of them. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the process of using SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions.
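
The idea of a region whose closest matches "overwhelmingly favor" the YES partition can be made concrete with a small sketch. The statistic below (a binomial tail probability on a negative log10 scale) is an assumption chosen for illustration and is not necessarily the one SIMBAL uses; the function names and the `background_yes_fraction` parameter are hypothetical.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    total = sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )
    return min(total, 1.0)  # guard against floating-point overshoot


def enrichment_score(top_hits, yes_ids, background_yes_fraction):
    """Score how strongly the top hits favor the YES partition.

    top_hits: identifiers of the closest matches to one subsequence.
    yes_ids: set of identifiers in the YES partition.
    background_yes_fraction: assumed chance that a random training-set
        protein is a YES member (hypothetical parameter).
    """
    n = len(top_hits)
    k = sum(1 for hit in top_hits if hit in yes_ids)
    return -math.log10(binomial_tail(k, n, background_yes_fraction))


# Illustrative, made-up identifiers: 3 of 4 top hits come from the YES set.
example = enrichment_score(["y1", "y2", "n1", "y3"], {"y1", "y2", "y3"}, 0.2)
```

Under this assumed model a higher score means the observed share of YES hits would be less likely to arise by chance.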

    Graphical output from SIMBAL computation and post-processing.

    Fig 2A shows the triangle heat map obtained for query sequence NP_718091.1, a rhombosortase from Shewanella oneidensis MR-1, as described in the text. Each colored pixel in the heat map conveys three pieces of information: a SIMBAL score (color, where red indicates greater statistical significance), the length of the subsequence being scored (height on the Y-axis), and the location of the middle of the subsequence along the length of the complete protein (position on the X-axis). Subsequences are evaluated from a minimum length of 9 (bottom of the heat map) to a maximum of 204, the full length of the protein (top of the heat map). Scores are computed as the negative log10 of the odds against encountering, purely by chance, at least as great a preponderance of YES set-derived sequences among the top BLAST hits. Note that nearby pixels may differ sharply in score, and that the deepest red colors appear in a “plume” of pixels whose corresponding subsequences all contain the same key small region. Fig 2B shows a smoothed heat map that results from re-processing SIMBAL scores so that each pixel represents a blend of its own score and those of longer sequences that contain it, performed iteratively starting with the longest sequences. The heritability parameter used was 0.93. Fig 2C shows data in numerical form corresponding to a line passing through the heat map of Fig 2A very near its base, at a subsequence length of 9, with height rather than color showing the score. Fig 2D shows the corresponding slice through the rescored heat map of Fig 2B, with greatly reduced jitter in scores from one pixel to the next, and a clear indication of which short subsequences most likely contain key sites that discriminate rhombosortases from other rhomboid family proteases.
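
The triangle layout and the top-down rescoring described in this caption can be sketched as follows. The caption does not give the blending formula, so the rule used here is an assumption: each window keeps a fraction (1 - heritability) of its own score and takes the rest from the mean of the already-smoothed windows one residue longer that contain it. The names `windows` and `smooth_triangle` are hypothetical.

```python
def windows(seq_len, min_len):
    """List every (start, length) subsequence window, longest first."""
    return [
        (start, length)
        for length in range(seq_len, min_len - 1, -1)
        for start in range(seq_len - length + 1)
    ]


def smooth_triangle(raw, seq_len, min_len, heritability=0.93):
    """Blend each window's score with those of longer containing windows.

    raw: dict mapping (start, length) to a raw score.
    The blending rule is an assumed reading of the caption, not the
    published formula.
    """
    smoothed = {}
    for start, length in windows(seq_len, min_len):
        own = raw[(start, length)]
        # The windows one residue longer that fully contain this one.
        parent_keys = ((start - 1, length + 1), (start, length + 1))
        parents = [smoothed[key] for key in parent_keys if key in smoothed]
        if parents:
            inherited = sum(parents) / len(parents)
            smoothed[(start, length)] = (
                (1 - heritability) * own + heritability * inherited
            )
        else:
            # The full-length window has no parent and keeps its raw score.
            smoothed[(start, length)] = own
    return smoothed


# Toy example: a 3-residue sequence where only the full window scores highly.
toy_raw = {key: 0.0 for key in windows(3, 1)}
toy_raw[(0, 3)] = 10.0
toy_smoothed = smooth_triangle(toy_raw, seq_len=3, min_len=1, heritability=0.5)
```

Processing the longest windows first means every window's parents are already smoothed when it is visited, which matches the caption's "starting with the longest sequences".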

    Training Set Construction.

    Fig 1A illustrates the simplest method for training set construction. Each genome (gray circles) is treated as a “bag of genes”; distance relationships between genes are ignored. One hidden Markov model (HMM) identifies target family proteins (orange squares) in the corresponding proteome. A second HMM finds proteins from a second family (yellow stars) whose presence or absence in the proteome is the attribute that controls how target family proteins are sorted. If an attribute family protein is found, members of the target family are sorted to the YES set (green container). If not, target family proteins go to the NO set (red container). The training set builder (TSB) always works on one target protein family at a time, but more complicated rules may require multiple attributes to be jointly present for the YES set, and multiple attributes to be jointly absent for the NO set. Fig 1B shows training set construction using a distance rule. The S-shaped curve represents a long segment of genomic DNA. A target protein is sorted to the YES set if and only if its gene lies within a user-specified distance from the attribute protein’s gene. Target proteins from genomes that lack the attribute completely go to the NO set. A target protein goes to the FAR set if and only if its gene lies sufficiently far from the nearest attribute gene and the genome has already sent a target protein to the YES set. If a genome encodes an attribute family protein, but no target family protein qualifies for the YES set, then target family proteins are not sorted to any bin.
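
The two sorting rules in this caption can be sketched in a few lines. This is a minimal illustration, not the TSB implementation: the `Genome` class, the function names, and the `near` and `far` thresholds are hypothetical, gene positions are treated as single coordinates on one linear replicon, and targets that fall between the two thresholds are assumed to stay unsorted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Genome:
    targets: dict[str, int]  # target protein id -> gene coordinate
    attributes: list[int]    # coordinates of attribute-family genes


def sort_by_presence(genomes):
    """Bag-of-genes rule: attribute present -> YES, absent -> NO."""
    yes, no = [], []
    for genome in genomes:
        (yes if genome.attributes else no).extend(genome.targets)
    return yes, no


def sort_by_distance(genomes, near, far):
    """Distance rule with YES, NO, and FAR bins."""
    bins = {"YES": [], "NO": [], "FAR": []}
    for genome in genomes:
        if not genome.attributes:
            # Genome lacks the attribute completely: all targets go to NO.
            bins["NO"].extend(genome.targets)
            continue
        nearest = {
            tid: min(abs(pos - attr) for attr in genome.attributes)
            for tid, pos in genome.targets.items()
        }
        yes_here = [tid for tid, dist in nearest.items() if dist <= near]
        if not yes_here:
            # Attribute present but no target qualifies: leave unsorted.
            continue
        bins["YES"].extend(yes_here)
        bins["FAR"].extend(tid for tid, dist in nearest.items() if dist >= far)
    return bins


# Made-up coordinates for illustration only.
g1 = Genome(targets={"a": 100, "b": 50000}, attributes=[1100])
g2 = Genome(targets={"c": 10}, attributes=[])
g3 = Genome(targets={"d": 90000}, attributes=[100])
result = sort_by_distance([g1, g2, g3], near=5000, far=20000)
```

In this toy input, genome `g3` contributes nothing to any bin because it encodes the attribute but has no target close enough to qualify for YES.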

    Sequential steps in a SIMBAL analysis.


    Greek art: Classical to Hellenistic
