4 research outputs found

    Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data

    Get PDF
    BACKGROUND: Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interactions terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. RESULTS: We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge are interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. CONCLUSION: Screening interactions through random forests is feasible and useful, when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones

    PRETICTIVE BIOINFORMATIC METHODS FOR ANALYZING GENES AND PROTEINS

    Get PDF
    Since large amounts of biological data are generated using various high-throughput technologies, efficient computational methods are important for understanding the biological meanings behind the complex data. Machine learning is particularly appealing for biological knowledge discovery. Tissue-specific gene expression and protein sumoylation play essential roles in the cell and are implicated in many human diseases. Protein destabilization is a common mechanism by which mutations cause human diseases. In this study, machine learning approaches were developed for predicting human tissue-specific genes, protein sumoylation sites and protein stability changes upon single amino acid substitutions. Relevant biological features were selected for input vector encoding, and machine learning algorithms, including Random Forests and Support Vector Machines, were used for classifier construction. The results suggest that the approaches give rise to more accurate predictions than previous studies and can provide valuable information for further experimental studies. Moreover, seeSUMO and MuStab web servers were developed to make the classifiers accessible to the biological research community. Structure-based methods can be used to predict the effects of amino acid substitutions on protein function and stability. The nonsynonymous Single Nucleotide Polymorphisms (nsSNPs) located at the protein binding interface have dramatic effects on protein-protein interactions. To model the effects, the nsSNPs at the interfaces of 264 protein-protein complexes were mapped on the protein structures using homology-based methods. The results suggest that disease-causing nsSNPs tend to destabilize the electrostatic component of the binding energy and nsSNPs at conserved positions have significant effects on binding energy changes. The structure-based approach was developed to quantitatively assess the effects of amino acid substitutions on protein stability and protein-protein interaction. It was shown that the structure-based analysis could help elucidate the mechanisms by which mutations cause human genetic disorders. These new bioinformatic methods can be used to analyze some interesting genes and proteins for human genetic research and improve our understanding of their molecular mechanisms underlying human diseases

    Random forest-based prediction of protein sumoylation sites from sequence features

    No full text
    corecore