
    Multiple structure alignment with msTALI

    BACKGROUND: Multiple structure alignments have received increasing attention in recent years as an alternative to multiple sequence alignments. Although multiple structure alignment algorithms can potentially be applied to a number of problems, they have primarily been used for protein core identification. A method capable of solving a variety of problems using structure comparison is still absent. Here we introduce msTALI, a program for aligning multiple protein structures. Our algorithm uses several informative features to guide its alignments: torsion angles, backbone Cα atom positions, secondary structure, residue type, surface accessibility, and properties of nearby atoms. The algorithm allows the user to weight the types of information used to generate the alignment, which expands its utility to a wide variety of problems. RESULTS: msTALI exhibits competitive results on 824 families from the Homstrad and SABmark databases when compared to Matt and Mustang. We also demonstrate success at building a database of protein cores using 341 randomly selected CATH domains and highlight the contribution of msTALI compared to the CATH classifications. Finally, we present an example applying msTALI to the problem of detecting hinges in a protein undergoing rigid-body motion. CONCLUSIONS: msTALI is an effective algorithm for multiple structure alignment. In addition to its performance on standard comparison databases, it utilizes clear, informative features, allowing further customization for domain-specific applications. The C++ source code for msTALI is available for Linux on the web at http://ifestos.cse.sc.edu/mstali
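
The user-weighted combination of features described in this abstract can be illustrated with a small sketch. This is not msTALI's scoring function, which the abstract does not specify; the function name `weighted_similarity`, the feature names, and the assumption that each feature yields a similarity in [0, 1] are all hypothetical.

```python
def weighted_similarity(feature_scores, weights):
    """Combine per-feature similarities into one score via a weighted average.

    feature_scores: dict mapping a feature name to a similarity in [0, 1]
    weights:        dict mapping the same feature names to non-negative weights
    Both the feature names and the averaging rule are illustrative assumptions.
    """
    total_weight = sum(weights[name] for name in feature_scores)
    if total_weight <= 0:
        raise ValueError("at least one feature must have a positive weight")
    weighted_sum = sum(weights[name] * score for name, score in feature_scores.items())
    return weighted_sum / total_weight


# Hypothetical residue pair: torsion angles agree moderately, secondary structure matches.
scores = {"torsion": 0.8, "secondary_structure": 1.0}
emphasis = {"torsion": 1.0, "secondary_structure": 3.0}
print(weighted_similarity(scores, emphasis))
```

Raising one weight relative to the others shifts the combined score toward that feature, which is the kind of tuning the abstract says users can perform for different problems.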

    Empirical comparison of cross-platform normalization methods for gene expression data

    Background: Simultaneous measurement of gene expression on a genomic scale can be accomplished using microarray technology or by sequencing based methods. Researchers who perform high throughput gene expression assays often deposit their data in public databases, but heterogeneity of measurement platforms leads to challenges for the combination and comparison of data sets. Researchers wishing to perform cross platform normalization face two major obstacles. First, a choice must be made about which method or methods to employ. Nine are currently available, and no rigorous comparison exists. Second, software for the selected method must be obtained and incorporated into a data analysis workflow. Results: Using two publicly available cross-platform testing data sets, cross-platform normalization methods are compared based on inter-platform concordance and on the consistency of gene lists obtained with transformed data. Scatter and ROC-like plots are produced and new statistics based on those plots are introduced to measure the effectiveness of each method. Bootstrapping is employed to obtain distributions for those statistics. The consistency of platform effects across studies is explored theoretically and with respect to the testing data sets. Conclusions: Our comparisons indicate that four methods, DWD, EB, GQ, and XPN, are generally effective, while the remaining methods do not adequately correct for platform effects. Of the four successful methods, XPN generally shows the highest inter-platform concordance when treatment groups are equally sized, while DWD is most robust to differently sized treatment groups and consistently shows the smallest loss in gene detection. We provide an R package, CONOR, capable of performing the nine cross-platform normalization methods considered. The package can be downloaded at http://alborz.sdsu.edu/conor and is available from CRAN.

    The structure of the KlcA and ArdB proteins reveals a novel fold and antirestriction activity against Type I DNA restriction systems in vivo but not in vitro

    Plasmids, conjugative transposons and phage frequently encode anti-restriction proteins to enhance their chances of entering a new bacterial host that is highly likely to contain a Type I DNA restriction and modification (RM) system. The RM system usually destroys the invading DNA. Some of the anti-restriction proteins are DNA mimics and bind to the RM enzyme to prevent it binding to DNA. In this article, we characterize ArdB anti-restriction proteins and their close homologues, the KlcA proteins, from a range of mobile genetic elements, including an ArdB encoded on a pathogenicity island from uropathogenic Escherichia coli and a KlcA from an IncP-1b plasmid, pBP136, isolated from Bordetella pertussis. We show that all the ArdB and KlcA proteins act as anti-restriction proteins and inhibit the four main families of Type I RM systems in vivo, but fail to block the restriction endonuclease activity of the archetypal Type I RM enzyme, EcoKI, in vitro, indicating that the action of ArdB is indirect and very different from that of the DNA mimics. We also present the structure of the pBP136 KlcA protein, determined by NMR spectroscopy. The structure shows a novel protein fold and is clearly not a DNA structural mimic.

    Post hoc pattern matching: assigning significance to statistically defined expression patterns in single channel microarray data

    Background: Researchers using RNA expression microarrays in experimental designs with more than two treatment groups often identify statistically significant genes with ANOVA approaches. However, the ANOVA test does not discriminate which of the multiple treatment groups differ from one another. Thus, post hoc tests, such as linear contrasts, template correlations, and pairwise comparisons, are used. Linear contrasts and template correlations work extremely well, especially when the researcher has a priori information pointing to a particular pattern/template among the different treatment groups. Further, all pairwise comparisons can be used to identify particular, treatment group-dependent patterns of gene expression. However, these approaches are biased by the researcher's assumptions, and some treatment-based patterns may fail to be detected using these approaches. Finally, different patterns may have different probabilities of occurring by chance, importantly influencing researchers' conclusions about a pattern and its constituent genes. Results: We developed a four-step post hoc pattern matching (PPM) algorithm to automate single channel gene expression pattern identification/significance. First, a one-way analysis of variance (ANOVA), coupled with post hoc 'all pairwise' comparisons, is calculated for all genes. Second, for each ANOVA-significant gene, all pairwise contrast results are encoded to create unique pattern ID numbers. The number of genes found in each pattern in the data is identified as that pattern's 'actual' frequency. Third, using Monte Carlo simulations, those patterns' frequencies are estimated in random data ('random' gene pattern frequency). Fourth, a Z-score for overrepresentation of the pattern is calculated ('actual' against 'random' gene pattern frequencies). We wrote a Visual Basic program (StatiGen) that automates the PPM procedure and constructs an Excel workbook with standardized graphs of overrepresented patterns and lists of the genes comprising each pattern. The Visual Basic code, installation files for StatiGen, and sample data are available as supplementary material. Conclusion: The PPM procedure is designed to augment current microarray analysis procedures by allowing researchers to incorporate all of the information from post hoc tests to establish unique, overarching gene expression patterns in which there is no overlap in gene membership. In our hands, PPM works well for studies using from three to six treatment groups in which the researcher is interested in treatment-related patterns of gene expression. Hardware/software limitations and the extreme number of theoretical expression patterns limit utility for larger numbers of treatment groups. Applied to a published microarray experiment, the StatiGen program successfully flagged patterns that had been manually assigned in prior work, and further identified other gene expression patterns that may be of interest. Thus, over a moderate range of treatment groups, PPM appears to work well. It allows researchers to assign statistical probabilities to patterns of gene expression that fit a priori expectations/hypotheses, it preserves the data's ability to show the researcher interesting, yet unanticipated gene expression patterns, and it assigns the majority of ANOVA-significant genes to non-overlapping patterns.
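
Steps two and four of the PPM procedure (pattern encoding and the overrepresentation Z-score) can be sketched as follows. This is not StatiGen's code: the base-3 encoding, the fixed-threshold stand-in for real post hoc tests, and the names `pairwise_outcomes`, `pattern_id`, and `z_score` are assumptions made for illustration, since the abstract does not say how pattern IDs are constructed.

```python
from itertools import combinations
from statistics import mean, stdev


def pairwise_outcomes(group_means, threshold):
    """Toy stand-in for post hoc tests (assumption): call a pair of groups
    different when their means differ by at least `threshold`.
    Returns one value per group pair: -1 (lower), 0 (no difference), +1 (higher)."""
    outcomes = []
    for i, j in combinations(range(len(group_means)), 2):
        diff = group_means[i] - group_means[j]
        if abs(diff) < threshold:
            outcomes.append(0)
        else:
            outcomes.append(1 if diff > 0 else -1)
    return outcomes


def pattern_id(outcomes):
    """Encode a sequence of -1/0/+1 outcomes as a unique base-3 integer."""
    pid = 0
    for outcome in outcomes:
        if outcome not in (-1, 0, 1):
            raise ValueError("outcomes must be -1, 0, or +1")
        pid = pid * 3 + (outcome + 1)
    return pid


def z_score(actual_count, random_counts):
    """Z-score of the observed pattern frequency against frequencies
    obtained from simulated random data sets."""
    return (actual_count - mean(random_counts)) / stdev(random_counts)


# Three treatment groups give three pairwise comparisons per gene.
outcomes = pairwise_outcomes([1.0, 1.1, 3.0], threshold=0.5)
print(outcomes, pattern_id(outcomes))
print(z_score(10, [2, 4, 6]))
```

With k treatment groups there are k(k-1)/2 pairwise comparisons and therefore up to 3^(k(k-1)/2) codes under this encoding, which is consistent with the abstract's remark that the number of theoretical patterns becomes extreme for larger numbers of groups.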

    Datamining Protein Structure Databanks for Crystallization Patterns of Proteins

    A study of 345 protein structures, selected from among 1,500 structures determined by nuclear magnetic resonance (NMR) methods, revealed useful correlations between crystallization properties and several parameters of the studied proteins. NMR methods of structure determination do not require the growth of protein crystals, and hence allow comparison of properties of proteins that have or have not been the subject of crystallographic approaches. One‐ and two‐dimensional statistical analyses of the data confirmed a hypothesized relation between the size of the molecule and its crystallization potential. Furthermore, two‐dimensional Bayesian analysis revealed a significant relationship between the relative ratio of different secondary structures and the likelihood of success for crystallization trials. The most immediate result is an apparent correlation of crystallization potential with protein size. Further analysis of the data revealed a relationship between the unstructured fraction of a protein and the success of its crystallization. Bayesian analysis of the latter correlation resulted in a prediction performance of about 64%, whereas a two‐dimensional Bayesian analysis achieved a performance of about 75%.

    Rapid Classification of a Protein Fold Family Using a Statistical Analysis of Dipolar Couplings

    Motivation: One of the primary aims of the structural genomics initiative is the determination of representative structures from each protein fold family. Given this objective, it is important to rapidly identify proteins that belong to a family that is already well populated (so they can be eliminated from further studies), or, more importantly, to identify proteins that represent new fold families. Results: A method for rapid classification to a fold family by statistical analysis of unassigned 15N–1H residual dipolar couplings is presented. The required NMR data can be quickly acquired and analyzed. Using this method, structure determination efforts can be focused on more unique and interesting structures, and the overall efficiency in the construction of an information-rich library can be increased.

    REDCAT: A Residual Dipolar Coupling Analysis Tool

    Recent advancements in the utilization of residual dipolar couplings (RDCs) as a means of structure validation and elucidation have demonstrated the need for an RDC analysis tool that is not only more user-friendly but also more powerful. In this paper, we introduce a software package named REsidual Dipolar Coupling Analysis Tool (REDCAT), designed to address these issues. REDCAT is a user-friendly program with a graphical user interface (GUI) developed in Tcl/Tk, which is highly portable. Furthermore, the computational engine behind this GUI is written in C/C++, and its computational performance is therefore excellent. The modular implementation of REDCAT’s algorithms, with separation of the computational engine from the graphical engine, allows for flexible and easy command line interaction. This feature can be utilized for the design of automated data analysis sessions. Furthermore, this software package is portable to Linux clusters for high throughput applications. In addition to basic utilities to solve for order tensors and back-calculate couplings from a given order tensor and proposed structure, a number of improved algorithms have been incorporated. These include the proper sampling of the null space (when the system of linear equations is under-determined), more sophisticated filters for invalid order-tensor identification, error analysis for the identification of problematic measurements, and simulation of the effects of dynamic averaging processes.
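
The back-calculation of couplings from an order tensor mentioned in this abstract can be illustrated with a minimal sketch. It is not REDCAT's implementation or interface; it assumes the standard relation D = D_max · uᵀ S u, where u is the unit internuclear vector in the molecular frame and S is a symmetric, traceless 3×3 order tensor. The function name `back_calculate_rdc` and the choice to fold all physical constants into the single parameter `d_max` are assumptions.

```python
import math


def back_calculate_rdc(vector, order_tensor, d_max=1.0):
    """Predict a residual dipolar coupling for one internuclear vector.

    vector:       (x, y, z) internuclear vector in the molecular frame
    order_tensor: 3x3 nested list, assumed symmetric and traceless
    d_max:        scaling constant (assumption: absorbs gyromagnetic ratios
                  and the internuclear distance dependence)
    """
    x, y, z = vector
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        raise ValueError("internuclear vector must be non-zero")
    u = (x / norm, y / norm, z / norm)
    # Quadratic form u^T S u, summed over all nine tensor elements.
    return d_max * sum(
        u[i] * order_tensor[i][j] * u[j] for i in range(3) for j in range(3)
    )


# Hypothetical axially symmetric order tensor (trace is zero).
S = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, -2.0]]
print(back_calculate_rdc((0.0, 0.0, 2.0), S, d_max=10.0))
print(back_calculate_rdc((1.0, 1.0, 0.0), S))
```

Because the prediction is linear in the elements of S, and tracelessness plus symmetry leave five independent elements, a set of measured couplings yields a linear system in five unknowns. With fewer than five independent measurements that system is under-determined, which is the situation in which the abstract's null-space sampling applies.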