805 research outputs found

    Gene set based ensemble methods for cancer classification

    Cancer diagnosis often depends on conclusions drawn from both clinical and microscopic examination of tissue, which together characterize how the disease manifests so that tumors can be placed in known categories. One factor that determines the categorization of a cancer is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not completely predictive of a specific category of cancer. Further complicating the categorization of tumors, the histological classification of the cancer tissue and the description of its course of development may be atypical. Gene expression data from microarray analysis hold considerable promise for more accurate cancer diagnosis. One hurdle in classifying tumors from gene expression data is that the data space is ultra-high-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism by which groups of differentially expressed genes can be identified. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with the feature space restricted to biologically relevant genes. Some researchers have already explored the use of ensemble classifiers to classify cancer, but the effect of the underlying base classifiers in conjunction with biologically derived gene sets on cancer classification has not been explored.
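
    For reference, the textbook objective of an ℓ1-regularized logistic regression restricted to one gene set can be written as below. The notation is assumed here and does not come from the abstract: S is the index set of genes in a gene set, x_{i,S} is the expression vector of sample i restricted to S, y_i in {0, 1} is the class label, and λ ≥ 0 is the regularization strength.

```latex
\hat{\beta}^{(S)} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,\big(\beta_0 + x_{i,S}^{\top}\beta\big)
  - \log\!\big(1 + e^{\beta_0 + x_{i,S}^{\top}\beta}\big) \Big]
  + \lambda\,\lVert \beta \rVert_1
```

    The ℓ1 penalty drives many coefficients to exactly zero, which is why it is a common choice when genes far outnumber samples. How the per-gene-set models are combined into an ensemble is not stated in the abstract.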

    An adaptive ensemble learner function via bagging and rank aggregation with applications to high dimensional data.

    An ensemble consists of a set of individual predictors whose predictions are combined. Different classification and regression models tend to work well for different types of data, and it is usually not known which algorithm will be optimal in a given application. This thesis presents an ensemble regression function adapted from Datta et al. 2010. The ensemble function is constructed by combining bagging and rank aggregation, and it is capable of adapting its performance to the type of data being used. In the classification approach, the results can be optimized with respect to performance measures such as accuracy, sensitivity, specificity, and area under the curve (AUC), whereas in the regression approach they can be optimized with respect to measures such as mean squared error and mean absolute error. The ensemble classifier and ensemble regressor perform at the level of the best individual classifier or regression model. For complex high-dimensional datasets, it may be advisable to combine a number of classification or regression algorithms rather than using one specific algorithm.
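
    The rank-aggregation step can be illustrated with a minimal sketch. It uses an unweighted Borda count as a simple stand-in; the abstract does not say which aggregation scheme the thesis uses, and the model names and scores below are hypothetical.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float], higher_is_better: bool = True) -> List[str]:
    """Return model names ordered from best to worst for one performance measure."""
    # Sort by score, then by name so ties are broken deterministically.
    sign = -1.0 if higher_is_better else 1.0
    return sorted(scores, key=lambda name: (sign * scores[name], name))


def borda_aggregate(rankings: List[List[str]]) -> List[str]:
    """Combine several best-to-worst rankings with an unweighted Borda count."""
    points: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, name in enumerate(ranking):
            # The top position earns n - 1 points, the last position earns 0.
            points[name] = points.get(name, 0) + (n - 1 - position)
    # Highest total first; ties are broken alphabetically.
    return sorted(points, key=lambda name: (-points[name], name))


# Hypothetical scores for three candidate classifiers on held-out samples.
accuracy = {"svm": 0.91, "random_forest": 0.93, "knn": 0.88}
sensitivity = {"svm": 0.90, "random_forest": 0.89, "knn": 0.85}
specificity = {"svm": 0.92, "random_forest": 0.95, "knn": 0.90}

rankings = [rank_models(m) for m in (accuracy, sensitivity, specificity)]
best_first = borda_aggregate(rankings)
print(best_first)
```

    In a bagging setting, a ranking like this would be computed on the out-of-bag samples of each bootstrap replicate, and the top-ranked model would be kept for that replicate. For error measures such as mean squared error, pass `higher_is_better=False`.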

    Breast Cancer Classification: Features Investigation using Machine Learning Approaches

    Breast cancer is the second most common cancer after lung cancer and one of the main causes of death worldwide. Women have a higher risk of breast cancer than men. Thus, early diagnosis with an accurate and reliable system is critical in breast cancer treatment. Machine learning techniques are well known and popular among researchers, especially for classification and prediction. An investigation was conducted to evaluate the performance of breast cancer classification for malignant and benign tumors using various machine learning techniques, namely k-Nearest Neighbors (k-NN), Random Forest, Support Vector Machine (SVM), and ensemble techniques, to compute the prediction of breast cancer survival with 10-fold cross-validation. This study used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset with 23 selected features measured from 569 patients, of whom 212 have malignant tumors and 357 have benign tumors. The analysis investigated the tumor features based on their mean, standard error, and worst values. Each of these has ten properties: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The selection of features was considered to have a significant influence on breast cancer classification. The analysis is compared and evaluated against all thirty features to determine the features used for breast cancer classification. The results show that AdaBoost obtained the highest accuracy for thirty features at 98.95%, for the ten mean features at 98.07%, and for the ten worst features at 98.77%, with the lowest error rate. Additionally, the proposed methods were evaluated using 2-fold, 3-fold, and 5-fold cross-validation to find the best accuracy rate. Comparison of all methods shows that the AdaBoost ensemble method gave the highest accuracy at 98.77% for 10-fold cross-validation, and at 98.41% and 98.24% for 2-fold and 3-fold cross-validation, respectively. Nevertheless, with 5-fold cross-validation SVM produced the best accuracy rate at 98.60% with the lowest error rate.
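
    The k-fold cross-validation protocol mentioned above can be sketched in plain Python. This is an illustrative splitter only, not the study's code: the function names, the shuffling seed, and the pooled-accuracy helper are assumptions, and only the sample count of 569 comes from the abstract.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, test) index pairs after shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % k folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


def pooled_accuracy(fold_correct: List[int], fold_sizes: List[int]) -> float:
    """Pool per-fold counts of correct predictions into one overall accuracy."""
    return sum(fold_correct) / sum(fold_sizes)


# 569 samples, matching the WDBC patient count (212 malignant + 357 benign).
folds = k_fold_indices(569, k=10)
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

    Each sample appears in exactly one test fold, so every prediction that contributes to the reported accuracy comes from a model that did not see that sample during training. With imbalanced classes, a stratified split that preserves the malignant-to-benign ratio in each fold is a common refinement; the abstract does not say whether one was used.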

    Likelihood Adaptively Modified Penalties

    A new family of penalty functions, adaptive to likelihood, is introduced for model selection in general regression models. It arises naturally through assuming certain types of prior distribution on the regression parameters. To study stability properties of the penalized maximum likelihood estimator, two types of asymptotic stability are defined. Theoretical properties, including the parameter estimation consistency, model selection consistency, and asymptotic stability, are established under suitable regularity conditions. An efficient coordinate-descent algorithm is proposed. Simulation results and real data analysis show that the proposed method has competitive performance in comparison with existing ones. Comment: 42 pages, 4 figures