23 research outputs found

    An adaptive ensemble learner function via bagging and rank aggregation with applications to high dimensional data.

    An ensemble consists of a set of individual predictors whose predictions are combined. Different classification and regression models tend to work well for different types of data, and it is usually not known which algorithm will be optimal in a given application. In this thesis, an ensemble regression function is presented, adapted from Datta et al. 2010. The ensemble function is constructed by combining bagging and rank aggregation, and it is capable of adapting its performance to the type of data being used. In the classification approach, the results can be optimized with respect to performance measures such as accuracy, sensitivity, specificity and area under the curve (AUC), whereas in the regression approach, they can be optimized with respect to measures such as mean square error and mean absolute error. The ensemble classifier and ensemble regressor perform at the level of the best individual classifier or regression model. For complex high-dimensional datasets, it may be advisable to combine a number of classification or regression algorithms rather than using one specific algorithm.
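    As a rough illustration of how several performance measures can be merged into one model ranking, the sketch below uses a plain Borda count. This is a simplification chosen for brevity, not necessarily the rank aggregation scheme used in the thesis or in Datta et al.; the model names and out-of-bag scores are hypothetical.

```python
from typing import Dict, List


def borda_aggregate(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Order models from best to worst by summed per-measure ranks (Borda count).

    Assumes every measure is "higher is better" and that all models
    report the same set of measures.
    """
    models = sorted(scores)
    measures = sorted(next(iter(scores.values())))
    total = {m: 0 for m in models}
    for measure in measures:
        # Rank 0 is the best model for this measure; ties fall back to name order.
        ordered = sorted(models, key=lambda m: (-scores[m][measure], m))
        for rank, model in enumerate(ordered):
            total[model] += rank
    # A smaller rank sum means a better overall position.
    return sorted(models, key=lambda m: (total[m], m))


# Hypothetical out-of-bag scores for one bootstrap sample (illustrative values only).
oob_scores = {
    "svm": {"accuracy": 0.90, "sensitivity": 0.85, "specificity": 0.92},
    "rf": {"accuracy": 0.88, "sensitivity": 0.91, "specificity": 0.86},
    "lda": {"accuracy": 0.80, "sensitivity": 0.78, "specificity": 0.81},
}
ranking = borda_aggregate(oob_scores)
print(ranking)
```

    In a bagging setting, one would repeat this ranking for each bootstrap sample and let the top-ranked model from each sample contribute to the final prediction.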

    Computational protein biomarker prediction: a case study for prostate cancer

    BACKGROUND: Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates. RESULTS: Thorough cross-validation studies and randomization tests are performed on a prostate cancer dataset with over 300 patients, obtained at the Eastern Virginia Medical School using SELDI-TOF mass spectrometry. We obtain average classification accuracies of 87% on a four-group classification problem using a two-stage linear SVM-based procedure and just 13 peaks, with other methods performing comparably. CONCLUSIONS: Modern feature selection and classification methods are powerful techniques for both the identification of biomarker candidates and the related problem of building predictive models from protein mass spectrometric profiles. Cross-validation and randomization are essential tools that must be performed carefully in order not to bias the results unfairly. However, only a biological validation and identification of the underlying proteins will ultimately confirm the actual value and power of any computational predictions.

    Using Decision Forest to Classify Prostate Cancer Samples on the Basis of SELDI-TOF MS Data: Assessing Chance Correlation and Prediction Confidence

    Class prediction using “omics” data is playing an increasing role in toxicogenomics, diagnosis/prognosis, and risk assessment. These data are usually noisy and represented by relatively few samples and a very large number of predictor variables (e.g., genes of DNA microarray data or m/z peaks of mass spectrometry data). These characteristics underscore the importance of assessing potential random correlation and overfitting of noise for a classification model based on omics data. We present a novel classification method, decision forest (DF), for class prediction using omics data. DF combines the results of multiple heterogeneous but comparable decision tree (DT) models to produce a consensus prediction. The method is less prone to overfitting of noise and chance correlation. A DF model was developed to predict presence of prostate cancer using a proteomic data set generated from surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS). The degree of chance correlation and the prediction confidence of the model were rigorously assessed by extensive cross-validation and randomization testing. Comparison of model prediction with imposed random correlation demonstrated biologic relevance of the model and the reduction of overfitting in DF. Furthermore, two confidence levels (high and low confidence) were assigned to each prediction, where most misclassifications were associated with the low-confidence region. For the high-confidence predictions, the model achieved 99.2% sensitivity and 98.2% specificity. The model also identified a list of significant peaks that could be useful for biomarker identification. DF should be equally applicable to other omics data such as gene expression data or metabolomic data. The DF algorithm is available upon request.
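    The idea of a consensus prediction with two confidence levels can be sketched as follows. This is a minimal illustration, assuming each tree outputs a probability for the positive class and that the consensus is their mean; the `margin` threshold is a hypothetical parameter, not a value taken from the abstract.

```python
from statistics import mean
from typing import List, Tuple


def consensus(tree_probs: List[float], margin: float = 0.2) -> Tuple[int, str]:
    """Combine per-tree positive-class probabilities into one labelled call.

    The consensus probability is the mean over trees. A prediction is called
    "high" confidence when that mean lies at least `margin` away from 0.5
    (an assumed rule for illustration), and "low" confidence otherwise.
    """
    if not tree_probs:
        raise ValueError("need at least one tree probability")
    p = mean(tree_probs)
    label = 1 if p >= 0.5 else 0
    confidence = "high" if abs(p - 0.5) >= margin else "low"
    return label, confidence


# Hypothetical outputs of three trees for three samples.
print(consensus([0.90, 0.80, 0.95]))  # trees agree on the positive class
print(consensus([0.60, 0.40, 0.55]))  # trees disagree, mean is near 0.5
print(consensus([0.10, 0.20, 0.15]))  # trees agree on the negative class
```

    Reporting performance separately for the high-confidence and low-confidence groups then shows whether misclassifications concentrate where the trees disagree.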

    Mass spectrometry data mining for cancer detection

    Early detection of cancer is crucial for successful intervention strategies. Mass spectrometry-based high-throughput proteomics is recognized as a major breakthrough in cancer detection. Many machine learning methods have been used to construct classifiers based on mass spectrometry data for discriminating between cancer stages, yet the classifiers so constructed generally lack biological interpretability. To better assist clinical uses, a key step is to discover “biomarker signature profiles”, i.e. combinations of a small number of protein biomarkers strongly discriminating between cancer states. This dissertation introduces two innovative algorithms to automatically search for a signature and to construct a high-performance signature-based classifier for cancer discrimination tasks based on mass spectrometry data, such as data acquired by MALDI or SELDI techniques. Our first algorithm assumes that homogeneous groups of mass spectra can be modeled by (unknown) Gibbs distributions to generate an optimal signature and an associated signature-based classifier by robust log-likelihood analysis; our second algorithm uses a stochastic optimization algorithm to search for two lists of biomarkers, and then constructs a signature-based classifier. To support these two algorithms theoretically, this dissertation also studies the empirical probability distributions of mass spectrometry data and implements the actual fitting of Markov random fields to these high-dimensional distributions. We have validated our two signature discovery algorithms on several mass spectrometry datasets related to ovarian cancer and colorectal cancer patient groups. For these cancer discrimination tasks, our algorithms have yielded better classification performance than existing machine learning algorithms and, in addition, have generated more interpretable explicit signatures.

    Computational diagnosis and risk evaluation for canine lymphoma

    The canine lymphoma blood test detects the levels of two biomarkers, the acute phase proteins (C-Reactive Protein and Haptoglobin). This test can be used for diagnostics, for screening, and for remission monitoring as well. We analyze clinical data, test various machine learning methods and select the best approach to these problems. Three families of methods, decision trees, kNN (including advanced and adaptive kNN) and probability density evaluation with radial basis functions, are used for classification and risk estimation. Several pre-processing approaches were implemented and compared. The best of them are used to create the diagnostic system. For the differential diagnosis, the best solution gives a sensitivity and specificity of 83.5% and 77%, respectively (using three input features: CRP, Haptoglobin and standard clinical symptom). For the screening task, the decision tree method provides the best result, with sensitivity and specificity of 81.4% and >99%, respectively (using the same input features). If the clinical symptoms (Lymphadenopathy) are considered as unknown, then a decision tree with CRP and Hapt only provides sensitivity 69% and specificity 83.5%. The lymphoma risk evaluation problem is formulated and solved. The best models are selected as the system for computational lymphoma diagnosis and for evaluation of the risk of lymphoma as well. These methods are implemented in special web-accessed software and are applied to the problem of monitoring dogs with lymphoma after treatment. It detects recurrence of lymphoma up to two months prior to the appearance of clinical signs. The risk map visualisation provides a friendly tool for explanatory data analysis. Comment: 24 pages, 86 references in the bibliography. Significantly extended version with review of lymphoma biomarkers and data mining methods (Three new sections are added: 1.1. Biomarkers for canine lymphoma, 1.2. Acute phase proteins as lymphoma biomarkers and 3.1. Data mining methods for biomarker cancer diagnosis. Flowcharts of data analysis are included as supplementary material (20 pages)).
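    For readers unfamiliar with the two measures quoted above, the following minimal sketch shows how sensitivity and specificity are computed from binary labels. The label vectors are made up for illustration and are unrelated to the canine lymphoma data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, with 1 = diseased.

    sensitivity = TP / (TP + FN): fraction of diseased cases detected.
    specificity = TN / (TN + FP): fraction of healthy cases correctly cleared.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative labels: 4 diseased and 6 healthy cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

    A screening test typically favours high specificity to limit false alarms in a mostly healthy population, which matches the >99% specificity reported for the screening task.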

    Early Detection of Ovarian Cancer Using Gabor Wavelet Phase Quantization and Binary Coding

    Ovarian cancer is the 5th most common cancer in women, but it is the most difficult to detect in its early stages. Early detection and treatment of ovarian cancer have been shown to increase the five-year survival rate from 12% if caught in stage four of the disease up to 92% if caught in stage one. Using signal processing, pattern classification and a learning algorithm, it is possible to identify patterns in high-dimensional mass spectrometry data that distinguish between cancer and non-cancer ovarian samples. For our research, proteomic spectra were generated using SELDI-TOF mass spectrometry; the dataset was composed of 162 ovarian cancer and 91 non-ovarian cancer samples. We apply a Gabor filter to the mass spectrometry data and design a binary coding scheme for phase quantization encoding that is used for the pattern classification. This pattern will expose crucial features in the data that can be used to correctly classify unmasked samples for the presence or absence of ovarian cancer. Our proposed algorithm successfully discriminated ovarian cancer from non-ovarian cancer samples, with sensitivities, specificities and accuracies in the 90% to 100% range.
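    One common way to turn Gabor filter responses into a binary code is to quantize the phase of the complex response into quadrants, giving two bits per position. The sketch below follows that generic scheme on a synthetic 1-D signal; whether the paper uses exactly this encoding is an assumption, and the kernel parameters and function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple

Bits = Tuple[int, int]


def gabor_kernel(sigma: float, freq: float, half_width: int) -> List[complex]:
    """Complex 1-D Gabor kernel: a Gaussian envelope times a complex sinusoid."""
    return [
        math.exp(-(t * t) / (2.0 * sigma * sigma)) * cmath.exp(2j * math.pi * freq * t)
        for t in range(-half_width, half_width + 1)
    ]


def phase_code(signal: Sequence[float], kernel: Sequence[complex]) -> List[Bits]:
    """Quantize the phase of each filter response into a quadrant (two bits)."""
    half = len(kernel) // 2
    code = []
    for i in range(half, len(signal) - half):
        window = signal[i - half : i + half + 1]
        response = sum(s * k for s, k in zip(window, kernel))
        # The signs of the real and imaginary parts identify the phase quadrant.
        code.append((1 if response.real >= 0 else 0, 1 if response.imag >= 0 else 0))
    return code


def hamming_fraction(a: Sequence[Bits], b: Sequence[Bits]) -> float:
    """Fraction of differing bits between two equal-length codes."""
    if len(a) != len(b) or not a:
        raise ValueError("codes must be non-empty and of equal length")
    diff = sum((x[0] != y[0]) + (x[1] != y[1]) for x, y in zip(a, b))
    return diff / (2 * len(a))


# Synthetic stand-in for a spectrum segment (not real SELDI-TOF data).
signal = [math.sin(2 * math.pi * 0.1 * n) for n in range(64)]
kernel = gabor_kernel(sigma=3.0, freq=0.1, half_width=8)
code = phase_code(signal, kernel)
print(len(code), hamming_fraction(code, code))
```

    Codes from two spectra can then be compared by Hamming distance, so a new sample could be assigned to the class whose reference codes it matches most closely.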

    Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

    A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the diseases. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data, specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface-enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.

    Proteomic mass spectra classification using decision tree based ensemble methods.

    MOTIVATION: Modern mass spectrometry allows the determination of proteomic fingerprints of body fluids like serum, saliva or urine. These measurements can be used in many medical applications in order to diagnose the current state or predict the evolution of a disease. Recent developments in machine learning allow one to exploit such datasets, characterized by small numbers of very high-dimensional samples. RESULTS: We propose a systematic approach based on decision tree ensemble methods, which is used to automatically determine proteomic biomarkers and predictive models. The approach is validated on two datasets of surface-enhanced laser desorption/ionization time of flight measurements, for the diagnosis of rheumatoid arthritis and inflammatory bowel diseases. The results suggest that the methodology can handle a broad class of similar problems.