112 research outputs found

    Simultaneous regression and classification for drug sensitivity prediction using an advanced random forest method

    Machine learning methods trained on cancer cell line panels are intensively studied for the prediction of optimal anti-cancer therapies. While classification approaches distinguish effective from ineffective drugs, regression approaches aim to quantify the degree of drug effectiveness. However, the high specificity of most anti-cancer drugs induces a skewed distribution of drug response values in favor of the more drug-resistant cell lines, negatively affecting the classification performance (class imbalance) and regression performance (regression imbalance) for the sensitive cell lines. Here, we present a novel approach called SimultAneoUs Regression and classificatiON Random Forests (SAURON-RF) based on the idea of performing a joint regression and classification analysis. We demonstrate that SAURON-RF improves the classification and regression performance for the sensitive cell lines at the expense of a moderate loss for the resistant ones. Furthermore, our results show that simultaneous classification and regression can be superior to regression or classification alone.

    NightShift: NMR shift inference by general hybrid model training - a framework for NMR chemical shift prediction

    BACKGROUND: NMR chemical shift prediction plays an important role in various applications in computational biology. Among others, structure determination, structure optimization, and the scoring of docking results can profit from efficient and accurate chemical shift estimation from a three-dimensional model. A variety of NMR chemical shift prediction approaches have been presented in the past, but nearly all of these rely on laborious manual data set preparation, and the training itself is not automated, making retraining the model (e.g., when new data becomes available) or testing new models a time-consuming manual chore. RESULTS: In this work, we present the framework NightShift (NMR Shift Inference by General Hybrid Model Training), which enables automated data set generation as well as model training and evaluation of protein NMR chemical shift prediction. In addition to this main result – the NightShift framework itself – we describe the resulting, automatically generated data set and, as a proof of concept, a random forest model called Spinster that was built using the pipeline. CONCLUSION: By demonstrating that the performance of the automatically generated predictors is at least on par with the state of the art, we conclude that automated data set and predictor generation is well-suited for the design of NMR chemical shift estimators. The framework can be downloaded from https://bitbucket.org/akdehof/nightshift. It requires the open source Biochemical Algorithms Library (BALL), and is available under the conditions of the GNU Lesser General Public License (LGPL). We additionally offer a browser-based user interface to our NightShift instance employing the Galaxy framework via https://ballaxy.bioinf.uni-sb.de/

    Glycosylation Patterns of Proteins Studied by Liquid Chromatography-Mass Spectrometry and Bioinformatic Tools

    Due to their extensive structural heterogeneity, the elucidation of glycosylation patterns in glycoproteins such as the subunits of chorionic gonadotropin (CG), CG-alpha and CG-beta, remains one of the most challenging problems in the proteomic analysis of posttranslational modifications. Consequently, glycosylation is usually studied after decomposition of the intact proteins to the proteolytic peptide level. However, with this approach all information about the combination of the different glycopeptides in the intact protein is lost. In this study we have, therefore, attempted to combine the results of glycan identification after tryptic digestion with molecular mass measurements on the intact glycoproteins. Despite the extremely high number of possible combinations of the glycans identified in the tryptic peptides by high-performance liquid chromatography-mass spectrometry (> 1000 for CG-alpha and > 10,000 for CG-beta), the mass spectra of intact CG-alpha and CG-beta revealed only a limited number of glycoforms present in CG preparations from pools of pregnancy urines. Peak annotations for CG-alpha were performed with the help of an algorithm that generates a database containing all possible modifications of the proteins (including possible artificial modifications such as oxidation or truncation) and subsequently searches for combinations fitting the mass difference between the polypeptide backbone and the measured molecular masses. Fourteen different glycoforms of CG-alpha, including methionine-oxidized and N-terminally truncated forms, were readily identified. For CG-beta, however, the relatively high mass accuracy of ± 2 Da was still insufficient to unambiguously assign the possible combinations of posttranslational modifications. Finally, the mass spectrometric fingerprints of the intact molecules were shown to be very useful for the characterization of glycosylation patterns in different CG preparations.
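The combination search described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it simply enumerates one modification per site and keeps the combinations whose summed mass fits the difference between the measured mass and the backbone mass. The function name `matching_combinations`, the site labels, and all masses below are hypothetical values chosen for illustration; only the ± 2 Da tolerance comes from the abstract.

```python
from itertools import product


def matching_combinations(site_options, backbone_mass, measured_mass, tol=2.0):
    """Return modification combinations that explain a measured intact mass.

    site_options: one dict per modification site, mapping a (hypothetical)
    modification name to its mass contribution in Da.
    """
    target = measured_mass - backbone_mass  # mass to be explained by modifications
    hits = []
    # Brute-force enumeration: one option per site, all sites combined.
    for combo in product(*(options.items() for options in site_options)):
        total = sum(mass for _, mass in combo)
        if abs(total - target) <= tol:
            hits.append(tuple(name for name, _ in combo))
    return hits


# Made-up example: two sites with two candidate glycans each.
sites = [
    {"glycan_A": 1000.0, "glycan_B": 1200.0},
    {"glycan_C": 500.0, "glycan_D": 705.0},
]
print(matching_combinations(sites, backbone_mass=10000.0, measured_mass=11701.0))
```

The number of combinations grows multiplicatively with the number of sites, which is why the abstract reports more than 10,000 candidates for CG-beta and why a ± 2 Da tolerance can leave several combinations indistinguishable.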

    GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences

    Caspases and granzyme B are proteases that share the primary specificity to cleave at the carboxyl terminal of aspartate residues in their substrates. Both caspases and granzyme B are involved in fundamental cellular processes and play a central role in apoptotic cell death. Although various targets are described, many substrates still await identification, and many cleavage sites of known substrates are not identified or experimentally verified. A more comprehensive knowledge of caspase and granzyme B substrates is essential to understand the biological roles of these enzymes in more detail. The relatively high variability in the cleavage site recognition sequence often complicates the identification of cleavage sites. As of yet there is no software available that allows identification of caspase and/or granzyme B cleavage sites that differ from the consensus sequence. Here, we present a bioinformatics tool ‘GraBCas’ that provides score-based prediction of potential cleavage sites for the caspases 1–9 and granzyme B including an estimation of the fragment size. We tested GraBCas on already known substrates and showed its usefulness for protein sequence analysis. GraBCas is available at
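To make the idea of score-based cleavage site prediction concrete, here is a minimal sketch under stated assumptions. It is not GraBCas's scoring scheme: the per-position scores, the default score of 0.1 for unlisted residues, the threshold, and the function name `predict_cleavage_sites` are all hypothetical. The sketch scans for aspartate (D) at the P1 position, multiplies position-specific scores for P4 to P2, and reports fragment sizes as residue counts.

```python
def predict_cleavage_sites(sequence, scores, threshold):
    """Score every aspartate as a candidate P1 residue.

    scores: list of three dicts (P4, P3, P2) mapping amino acids to a
    hypothetical preference score; unlisted residues get a default of 0.1.
    Returns (1-based P1 position, P4-P1 tetrapeptide, score, fragment lengths).
    """
    hits = []
    for i, residue in enumerate(sequence):
        if residue != "D" or i < 3:
            continue  # need D at P1 and three residues upstream
        tetrapeptide = sequence[i - 3 : i + 1]  # P4 P3 P2 P1
        score = 1.0
        for position, amino_acid in enumerate(tetrapeptide[:3]):
            score *= scores[position].get(amino_acid, 0.1)
        if score >= threshold:
            n_terminal = i + 1  # cleavage occurs after P1
            hits.append((i + 1, tetrapeptide, score,
                         (n_terminal, len(sequence) - n_terminal)))
    return hits


# Toy matrix that only rewards the DEVD motif; values are illustrative.
toy_scores = [{"D": 1.0}, {"E": 1.0}, {"V": 1.0}]
print(predict_cleavage_sites("MKDEVDGSAD", toy_scores, threshold=0.5))
```

Lowering the threshold would also report sites that deviate from the consensus, which is the kind of use the abstract has in mind.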

    Transcriptome analysis by GeneTrail revealed regulation of functional categories in response to alterations of iron homeostasis in Arabidopsis thaliana

    BACKGROUND: High-throughput technologies have opened new avenues to study biological processes and pathways. The interpretation of the immense number of data sets generated nowadays needs to be facilitated in order to enable biologists to identify complex gene networks and functional pathways. To cope with this task, multiple computer-based programs have been developed. GeneTrail is a freely available online tool that screens comparative transcriptomic data for differentially regulated functional categories and biological pathways extracted from common databases like KEGG, Gene Ontology (GO), TRANSPATH and TRANSFAC. Additionally, GeneTrail offers a feature that allows screening of individually defined biological categories that are relevant for the respective research topic. RESULTS: We have set up GeneTrail for use with Arabidopsis thaliana. To test the functionality of this tool for plant analysis, we generated transcriptome data of root and leaf responses to Fe deficiency and the Arabidopsis metal homeostasis mutant nas4x-1. We performed Gene Set Enrichment Analysis (GSEA) with eight meaningful pairwise comparisons of transcriptome data sets. We were able to uncover several functional pathways including metal homeostasis that were affected in our experimental situations. Representation of the differentially regulated functional categories in Venn diagrams uncovered regulatory networks at the level of whole functional pathways. Over-Representation Analysis (ORA) of differentially regulated genes identified in pairwise comparisons revealed specific functional plant physiological categories as major targets upon Fe deficiency and in nas4x-1. CONCLUSION: Here, we obtained supporting evidence that the nas4x-1 mutant was defective in metal homeostasis. It was confirmed that nas4x-1 showed Fe deficiency in roots and signs of Fe deficiency and Fe sufficiency in leaves. Besides metal homeostasis, biotic stress, root carbohydrate, leaf photosystem and specific cell biological categories were discovered as main targets for regulated changes in response to -Fe and nas4x-1. Among 258 differentially expressed genes in response to -Fe and nas4x-1, five functional categories were enriched covering metal homeostasis, redox regulation, cell division and histone acetylation. We proved that GeneTrail offers a flexible and user-adapted way to identify functional categories in large-scale plant transcriptome data sets. The distinguishing feature that allowed analysis of individually assembled functional categories facilitated the study of the Arabidopsis thaliana transcriptome.
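Over-Representation Analysis is commonly carried out with a one-sided hypergeometric test. The sketch below shows that standard test as an illustration only; the abstract does not say which statistic or multiple-testing correction GeneTrail applies, and the function name `ora_p_value` and the numbers in the usage example (apart from the 258 differentially expressed genes) are assumptions.

```python
from math import comb


def ora_p_value(n_universe, n_category, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least n_overlap category members when
    n_selected genes are drawn without replacement from n_universe genes,
    of which n_category belong to the category.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_category, n_selected)
    tail = sum(
        comb(n_category, i) * comb(n_universe - n_category, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical usage: 258 differentially expressed genes, an assumed
# universe of 20000 genes, and an assumed category of 100 genes with
# 8 of them among the differentially expressed set.
p = ora_p_value(n_universe=20000, n_category=100, n_selected=258, n_overlap=8)
print(f"p = {p:.3g}")
```

When many categories are tested at once, the resulting p-values would normally be adjusted for multiple testing before categories are called enriched.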

    Efficient Interpretation of Tandem Mass Tags in Top-Down Proteomics

    Mass spectrometry is the major analytical tool for the identification and quantification of proteins in biological samples. In so-called top-down proteomics, separation and mass spectrometric analysis are performed at the level of intact proteins, without preparatory digestion steps. It has been shown that the tandem mass tag (TMT) labeling technology, which is often used for quantification based on digested proteins (bottom-up studies), can be applied in top-down proteomics as well. This, however, leads to a complex interpretation problem, where we need to annotate measured peaks with their respective generating protein, the number of charges, and the a priori unknown number of TMT-groups attached to this protein. In this work, we give an algorithm for the efficient enumeration of all valid annotations that fulfill available experimental constraints. Applying the algorithm to real-world data, we show that the annotation problem can indeed be efficiently solved. However, our experiments also demonstrate that reliable annotation in complex mixtures requires at least partial sequence information and high mass accuracy and resolution to go beyond the proof-of-concept stage.
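The annotation problem stated in this abstract can be written down as a small brute-force search. This is a naive sketch, not the efficient enumeration algorithm of the paper: it tries every (protein, charge, tag count) triple and keeps those whose predicted m/z falls within a tolerance of the measured peak. The function name `annotate_peak`, the protein masses, the search bounds, and the tolerance are assumptions; the proton mass and the per-tag mass shift are the values commonly quoted for TMT reagents and should be checked against the reagent actually used.

```python
PROTON_MASS = 1.007276   # Da, mass added per positive charge
TMT_DELTA = 229.162932   # Da, assumed mass shift per attached TMT group


def predicted_mz(protein_mass, charge, n_tags):
    """m/z of a protein carrying n_tags TMT groups at the given charge."""
    return (protein_mass + n_tags * TMT_DELTA + charge * PROTON_MASS) / charge


def annotate_peak(mz, proteins, max_charge=30, max_tags=10, tol=0.01):
    """Return all (protein, charge, n_tags) triples consistent with a peak.

    proteins: dict mapping a protein name to its neutral mass in Da.
    """
    hits = []
    for name, mass in proteins.items():
        for charge in range(1, max_charge + 1):
            for n_tags in range(max_tags + 1):
                if abs(predicted_mz(mass, charge, n_tags) - mz) <= tol:
                    hits.append((name, charge, n_tags))
    return hits


# Hypothetical candidate list and a peak simulated from it.
candidates = {"protein_1": 10000.0, "protein_2": 15250.0}
peak = predicted_mz(10000.0, charge=10, n_tags=2)
print(annotate_peak(peak, candidates))
```

With many candidate proteins and a loose tolerance, several triples can match the same peak, which is consistent with the abstract's conclusion that reliable annotation needs high mass accuracy and at least partial sequence information.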

    A minimally invasive multiple marker approach allows highly efficient detection of meningioma tumors

    BACKGROUND: The development of effective frameworks that permit an accurate diagnosis of tumors, especially in their early stages, remains a grand challenge in the field of bioinformatics. Our approach applies statistical learning techniques to multiple tumor antigen markers, utilizing the immune system as a very sensitive marker of molecular pathological processes. For validation purposes we chose intracranial meningioma tumors as a model system since they occur very frequently, are mostly benign, and are genetically stable. RESULTS: A total of 183 blood samples from 93 meningioma patients (WHO stages I-III) and 90 healthy controls were screened for seroreactivity with a set of 57 meningioma-associated antigens. We tested several established statistical learning methods on the resulting reactivity patterns using 10-fold cross validation. The best performance was achieved by Naïve Bayes Classifiers. With this classification method, our framework, called Minimally Invasive Multiple Marker (MIMM) approach, yielded a specificity of 96.2%, a sensitivity of 84.5%, and an accuracy of 90.3%; the respective area under the ROC curve was 0.957. Detailed analysis revealed that prediction performs particularly well on low-grade (WHO I) tumors, consistent with our goal of early stage tumor detection. For these tumors the best classification result, with a specificity of 97.5%, a sensitivity of 91.3%, an accuracy of 95.6%, and an area under the ROC curve of 0.971, was achieved using a set of only 12 antigen markers. This antigen set was detected by a subset selection method based on Mutual Information. Remarkably, our study proves that the inclusion of non-specific antigens, detected not only in tumor but also in normal sera, increases the performance significantly, since non-specific antigens contribute additional diagnostic information. CONCLUSION: Our approach offers the possibility to screen members of risk groups as a matter of routine such that tumors hopefully can be diagnosed immediately after their genesis. Early detection will ultimately result in a higher cure rate and a lower morbidity rate.
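The reported figures can be cross-checked with a short calculation: accuracy is the class-size-weighted average of sensitivity and specificity. The sketch below uses only numbers from the abstract (93 patients, 90 controls, sensitivity 84.5%, specificity 96.2%); the function name `accuracy_from_rates` is an illustrative assumption, and the fractional counts arise because the reported rates are presumably averages over cross-validation runs.

```python
def accuracy_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Overall accuracy implied by sensitivity, specificity and class sizes."""
    true_positives = sensitivity * n_positive   # correctly detected patients
    true_negatives = specificity * n_negative   # correctly cleared controls
    return (true_positives + true_negatives) / (n_positive + n_negative)


accuracy = accuracy_from_rates(0.845, 0.962, n_positive=93, n_negative=90)
print(f"{accuracy:.3f}")  # prints 0.903
```

The result agrees with the 90.3% accuracy stated in the abstract, so the three reported values are mutually consistent.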

    BNDB – The Biochemical Network Database

    BACKGROUND: Technological advances in high-throughput techniques and efficient data acquisition methods have resulted in a massive amount of life science data. The data is stored in numerous databases that have been established over the last decades and are essential resources for scientists nowadays. However, the diversity of the databases and the underlying data models make it difficult to combine this information for solving complex problems in systems biology. Currently, researchers typically have to browse several, often highly focused, databases to obtain the required information. Hence, there is a pressing need for more efficient systems for integrating, analyzing, and interpreting these data. The standardization and virtual consolidation of the databases is a major challenge resulting in unified access to a variety of data sources. DESCRIPTION: We present the Biochemical Network Database (BNDB), a powerful relational database platform, allowing a complete semantic integration of an extensive collection of external databases. BNDB is built upon a comprehensive and extensible object model called BioCore, which is powerful enough to model most known biochemical processes and at the same time easily extensible to be adapted to new biological concepts. Besides a web interface for the search and curation of the data, a Java-based viewer (BiNA) provides a powerful platform-independent visualization and navigation of the data. BiNA uses sophisticated graph layout algorithms for an interactive visualization and navigation of BNDB. CONCLUSION: BNDB allows simple, unified access to a variety of external data sources. Its tight integration with the biochemical network library BN++ offers the possibility for import, integration, analysis, and visualization of the data. BNDB is freely accessible at http://www.bndb.org.