    Feature selection and classification for high-dimensional biological data under cross-validation framework

    This research applies statistical learning methods to the analysis of high-dimensional biological data. We use these methods primarily to select important predictors and to build predictive classification models. Traditionally, cross-validation has been used to determine the tuning or threshold parameter for feature selection. We propose improvements to this approach by adding repeated and nested cross-validation techniques. Several machine learning methods, such as the lasso, support vector machines, and random forests, have been used in many previous studies, and each has its own merits and demerits. We also propose an ensemble feature selection that combines the results of these three methods, capturing their strengths in order to find a more stable feature subset and to optimize prediction accuracy. We use DNA microarray gene expression datasets to illustrate our methods. Our work is organized as follows: (1) the structure of high-dimensional biological datasets and the statistical methods used to analyze such data; (2) several statistical and machine learning algorithms for analyzing high-dimensional biological datasets; (3) improved cross-validation and ensemble learning methods to achieve better prediction accuracy; and (4) examples using the DNA microarray data to describe our metho
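The two ideas in the abstract, nested cross-validation and ensemble feature selection across lasso, SVM, and random forest, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names (`k_fold_indices`, `nested_cv_splits`, `majority_vote_features`), the fold counts, the majority-vote rule with `min_votes=2`, and the gene identifiers are all assumptions made for illustration. The abstract does not say how the three methods' results are combined.

```python
import random
from collections import Counter


def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and yield (train, test) index lists for k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits).

    Inner splits are drawn only from outer_train, so the tuning parameter
    is chosen without ever seeing the outer test fold.
    """
    rng = random.Random(seed)
    for outer_train, outer_test in k_fold_indices(range(n_samples), outer_k, rng):
        inner_splits = list(k_fold_indices(outer_train, inner_k, rng))
        yield outer_train, outer_test, inner_splits


def majority_vote_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` selectors (assumed rule)."""
    votes = Counter()
    for features in selections.values():
        votes.update(set(features))  # each method votes once per feature
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical feature subsets, standing in for the output of each method.
selected = {
    "lasso": ["GENE_A", "GENE_B", "GENE_C"],
    "svm": ["GENE_B", "GENE_C", "GENE_D"],
    "random_forest": ["GENE_C", "GENE_E"],
}
stable_features = majority_vote_features(selected, min_votes=2)
print(stable_features)  # ['GENE_B', 'GENE_C']
```

Repeated cross-validation would correspond to calling `nested_cv_splits` with several different `seed` values and averaging the outer-fold results. In practice, a library such as scikit-learn would supply the splitters and the models themselves.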