Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. The sheer volume of structural data has made traditional methods
such as manual inspection of protein structures impractical. Machine learning has been
widely applied in bioinformatics, with considerable success. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have
compared our approach with PART and show that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging, and the approach can be applied to other
bioinformatics problems similarly characterised by multi-class imbalanced data sets held in multiple
data sources.
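The abstract above evaluates classifiers by sensitivity under multi-class imbalance. As a minimal sketch of why that metric matters, the Python below computes per-class sensitivity (recall) and its unweighted mean. The fold labels and predictions are hypothetical toy values, not SCOP data, and the function names are illustrative only.

```python
from collections import Counter


def per_class_sensitivity(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def macro_sensitivity(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    scores = per_class_sensitivity(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels: class "a" dominates, class "c" is rare.
y_true = ["a"] * 8 + ["b"] * 3 + ["c"] * 1
y_pred = ["a"] * 8 + ["a", "b", "b"] + ["a"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_sensitivity(y_true, y_pred)
macro = macro_sensitivity(y_true, y_pred)
```

In this toy case overall accuracy looks respectable while the rare class is never recovered, which is the kind of gap a sensitivity-oriented comparison is meant to expose.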
Utilizing Data Mining Techniques and Ensemble Learning to Predict Development of Surgical Site Infections in Gynecologic Cancer Patients
Surgical site infections are costly to both patients and hospitals, increase patient mortality, and are the most common form of hospital-acquired infection. Gynecological cancer surgery patients are already at higher risk of developing an infection due to the suppression of their immune system. This research leverages popular data mining techniques to create a prediction model that identifies high-risk patients. Implemented techniques include logistic regression, naive Bayes, recursive partitioning and regression trees, random forest, feed-forward neural network, k-nearest neighbor, and support vector machines with linear kernel. Weighted stacked generalization was implemented to improve upon the performance of the individual base-level models. The chosen meta-level classifiers were support vector machines with linear kernel, logistic regression, and k-nearest neighbor. The result is a model that identifies high-risk patients immediately following a surgical procedure with an AUC of 0.6864, accuracy of 0.6744, sensitivity of 0.7, and specificity of 0.6728.
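Stacked generalization, as named in the abstract above, feeds the outputs of base-level models into a meta-level classifier. The Python below is a minimal sketch of that idea, not the study's pipeline. The two base models, their weights, the toy data, and the way weights scale the meta-features are all assumptions, since the abstract does not say how the weighting was done. The meta-level learner here is a small k-nearest-neighbor vote.

```python
import math


# Hypothetical, already-fitted base-level models: each maps a feature
# vector to a risk score in (0, 1). These are illustrative stand-ins.
def base_a(x):
    return 1 / (1 + math.exp(-(x[0] - 1.0)))


def base_b(x):
    return 1 / (1 + math.exp(-(x[1] - 0.5)))


BASE_MODELS = [base_a, base_b]
WEIGHTS = [0.6, 0.4]  # assumed weights; the abstract does not give any


def meta_features(x):
    """Weighted base-model scores form the input to the meta-level model."""
    return [w * model(x) for w, model in zip(WEIGHTS, BASE_MODELS)]


def knn_predict(train_z, train_y, z, k=3):
    """Meta-level k-nearest-neighbor majority vote over binary labels."""
    order = sorted(range(len(train_z)), key=lambda i: math.dist(train_z[i], z))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0


# Hypothetical held-out data used only to fit the meta level.
X_meta = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [2.0, 1.0], [2.2, 1.5], [1.8, 1.2]]
y_meta = [0, 0, 0, 1, 1, 1]
Z_meta = [meta_features(x) for x in X_meta]

high_risk = knn_predict(Z_meta, y_meta, meta_features([2.1, 1.3]))
low_risk = knn_predict(Z_meta, y_meta, meta_features([0.1, 0.1]))
```

The meta level is fitted on data the base models were not trained on, which is the usual guard against the meta-learner simply rewarding overfitted base models.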
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can handle large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, and to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
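The core step of recursive partitioning is choosing the split that leaves the purest child nodes. A tree is grown by repeating that step on each child. The sketch below shows one such step in Python (not the R implementations the abstract refers to), using Gini impurity, the criterion commonly associated with CART. The data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted_impurity) for one numeric predictor.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the parent node, the threshold is None.
    """
    best = (None, gini(ys))
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: one predictor that separates the classes cleanly.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(xs, ys)
```

Bagging and random forests build many such trees on bootstrap samples and aggregate their predictions. Random forests additionally restrict each split to a random subset of predictors.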