49 research outputs found

    Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data

    BACKGROUND: Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. RESULTS: We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination, or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and to robustness to outliers in the data. Simulation experiments show that a 5% to 20% improvement over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method, and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments. CONCLUSION: The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data, and it outperforms SVM-RFE in robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in classification performance, but univariate methods can reveal more of the differentially expressed features, especially when there are correlations between the features.

    Classification and biomarker identification using gene network modules and support vector machines

    BACKGROUND: Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high-dimensional data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination) suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes. We now demonstrate that an algorithm which integrates network information with recursive feature elimination based on SVM exhibits good performance and improves the biological interpretability of the results. We refer to the method as SVM with Recursive Network Elimination (SVM-RNE). RESULTS: Initially, one thousand genes selected by t-test from a training set are filtered so that only genes that map to a gene network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form n clusters of genes that are highly connected in the network. Linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed while retaining the remainder for the next classification step. This process is repeated until an optimal classification is obtained. CONCLUSION: More than 90% accuracy can be obtained in classification of selected microarray datasets by integrating the interaction network information with the gene expression information from the microarrays. The Matlab version of SVM-RNE can be downloaded from http://web.macam.ac.il/~myousef
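
The score-and-eliminate loop described in this abstract can be sketched as follows. This is only an illustration of the loop structure, not the authors' implementation: the abstract weights clusters with a linear SVM that is retrained at every step, whereas this sketch swaps in a much simpler stand-in score (the absolute difference between class means of each cluster's average expression). The names `cluster_score`, `recursive_cluster_elimination`, `keep` and `drop_fraction` are hypothetical.

```python
from statistics import mean


def cluster_score(genes, samples, labels):
    """Stand-in importance score (an assumption, NOT the SVM weight from the abstract):
    absolute difference between the two class means of the cluster's average expression."""
    avg = [mean(sample[g] for g in genes) for sample in samples]
    pos = [a for a, y in zip(avg, labels) if y == 1]
    neg = [a for a, y in zip(avg, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def recursive_cluster_elimination(clusters, samples, labels, keep=1, drop_fraction=0.5):
    """Score all remaining clusters, drop the lowest-scoring fraction, and repeat
    until only `keep` clusters remain. Returns the survivors and the history."""
    current = dict(clusters)
    history = [sorted(current)]
    while len(current) > keep:
        # Rescoring every round mirrors the retraining step of the original method.
        scores = {name: cluster_score(genes, samples, labels)
                  for name, genes in current.items()}
        ranked = sorted(current, key=lambda name: scores[name], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(keep, len(ranked) - n_drop)
        current = {name: current[name] for name in ranked[:n_keep]}
        history.append(sorted(current))
    return current, history


# Toy data: each sample maps gene name -> expression value; labels are 0/1.
samples = [
    {"g1": 5.0, "g2": 6.0, "g3": 1.0, "g4": 2.0},
    {"g1": 6.0, "g2": 5.0, "g3": 2.0, "g4": 2.5},
    {"g1": 1.0, "g2": 1.0, "g3": 1.5, "g4": 1.0},
    {"g1": 0.0, "g2": 2.0, "g3": 1.5, "g4": 1.5},
]
labels = [1, 1, 0, 0]
clusters = {"A": ["g1", "g2"], "B": ["g3"], "C": ["g4"]}
survivors, history = recursive_cluster_elimination(clusters, samples, labels)
```

In a fuller version, the classification accuracy at each step would be recorded so that the best-performing set of clusters can be chosen, as the abstract describes.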

    Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data

    BACKGROUND: Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early-stage cancers. However, dimensionality reduction and classification are considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it using published high-resolution SELDI-TOF data for ovarian cancer. RESULTS: We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved an average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 for 100 five-fold cross validations. Furthermore, only one control was misclassified in leave-one-out cross validation. CONCLUSION: The proposed method is suitable for analyzing high-throughput proteomics data.
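
The moment-based reduction described in this abstract can be illustrated with a minimal sketch. It assumes equal-width intervals, population (divide-by-n) moments, and non-excess kurtosis, none of which the abstract specifies; the function names and the convention of returning zero skewness and kurtosis for a constant interval are hypothetical choices made here.

```python
from statistics import fmean


def moments(values):
    """Return (mean, variance, skewness, kurtosis) using population formulas."""
    n = len(values)
    mu = fmean(values)
    var = sum((v - mu) ** 2 for v in values) / n
    if var == 0:
        # Skewness and kurtosis are undefined for a constant interval;
        # returning 0.0 is a convention assumed for this sketch.
        return mu, 0.0, 0.0, 0.0
    sd = var ** 0.5
    skew = sum(((v - mu) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mu) / sd) ** 4 for v in values) / n
    return mu, var, skew, kurt


def interval_features(spectrum, n_intervals):
    """Split a spectrum into n_intervals contiguous chunks and replace each
    chunk by its four moments, giving 4 * n_intervals features."""
    size = len(spectrum) // n_intervals
    features = []
    for i in range(n_intervals):
        start = i * size
        # The last interval absorbs any leftover points.
        end = len(spectrum) if i == n_intervals - 1 else start + size
        features.extend(moments(spectrum[start:end]))
    return features


spectrum = [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0]
features = interval_features(spectrum, n_intervals=2)
```

A spectrum of any length is reduced to four values per interval, which is what makes the reduction fast; the resulting vectors would then be passed to a classifier such as the kernel PLS models mentioned in the abstract.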

    The Validation and Assessment of Machine Learning: A Game of Prediction from High-Dimensional Data

    In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide the choice of an appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting, where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study, where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.
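
A minimal sketch of how two strategies might be compared under one scoring rule and one resampling scheme is given below. The abstract does not say which strictly proper scoring rule is used; the Brier score is assumed here as one well-known example, and a binary outcome is assumed for simplicity. The out-of-bag form of bootstrap cross-validation, the two toy strategies, and all function names are likewise assumptions for illustration.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def bootstrap_cv(data, fit, n_boot=200, seed=0):
    """Train on a bootstrap sample, score on the subjects left out of it,
    and average the score over n_boot repetitions.
    `data` is a list of (x, y); `fit(train)` must return a predict(x) function."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        test = [data[i] for i in range(n) if i not in in_bag]
        if not test:
            continue  # skip the rare resample that leaves nothing out
        predict = fit([data[i] for i in idx])
        scores.append(brier_score([predict(x) for x, _ in test],
                                  [y for _, y in test]))
    return sum(scores) / len(scores)


# Two toy "players": one ignores x, the other copies the nearest training subject.
def base_rate(train):
    p = sum(y for _, y in train) / len(train)
    return lambda x: p


def nearest_neighbour(train):
    def predict(x):
        _, y = min(train, key=lambda pair: abs(pair[0] - x))
        return float(y)
    return predict


data = [(x, 1 if x >= 10 else 0) for x in range(20)]
score_base = bootstrap_cv(data, base_rate, n_boot=100, seed=1)
score_nn = bootstrap_cv(data, nearest_neighbour, n_boot=100, seed=1)
```

Passing the same seed to both calls gives every strategy the same resamples, which is one simple way to keep the comparison fair; a lower Brier score is better.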

    Extraction of protein patterns from laser mass spectrometry data for breast cancer diagnosis using a data mining algorithm

    Background and objective: One of the fundamental problems in treating cancer is the lack of a suitable method for its early detection. Breast cancer is one of the common diseases among women, and diagnosis at an early stage can have a substantial effect on mortality. At present, there are no suitable tumor markers for early detection of this disease. Chemical reactions within a living organ can be reflected as protein patterns in fluids such as blood, sputum, and urine. Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry is a suitable tool for producing protein profiles from biological samples. Providing a data mining method for selecting biomarkers that discriminate healthy from cancerous groups is one of the major challenges in analyzing protein patterns. Methods: In this study, serum protein profile data from breast cancer patients were analyzed. Using a mathematical model and the discrete wavelet transform, baseline disturbances and electrical noise were removed in the preprocessing stage, and all mass spectrum signals were then normalized. A hybrid data mining algorithm based on three criteria (a statistical test, a class separability measure, and point scoring) is introduced. With the proposed method, the best subset of proteins was selected from the 13488 available points while preserving information value and discriminative power, and was used to determine biomarkers. Using K-fold cross-validation, the samples in the data set were randomly divided into training and test sets. The minimum threshold for the T statistic was set to 1.96. The data mining algorithm was applied to the points remaining after the thresholding step, and the best feature subset, comprising biomarkers with high discriminative power, was selected. Results: Using linear discriminant analysis, 19 proteins were selected as biomarkers, which distinguished healthy from cancerous samples with a diagnostic accuracy of 100%, sensitivity of 100%, and specificity of 100%. Discussion and conclusion: By producing complete information from biological samples, they can be used to diagnose diseases with weak diagnostic factors, such as cancer. Disease diagnosis is an instance of pattern discrimination. In this paper, a data mining algorithm for selecting the best subset of proteins was introduced. The proposed method showed that discriminative power remains at a suitable level with a reduced number of selected biomarkers, which is one of the advantages of the method. The results emphasize that appropriate selection of the subset of indicator proteins has a substantial effect on determining biomarkers for correct diagnosis of the disease.
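
The thresholding step in this abstract (keeping only spectrum points whose T statistic reaches 1.96) can be sketched as below. The abstract does not say which form of the t statistic is used; Welch's unequal-variance form is assumed here, and the function names and the handling of zero-variance points are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic with unequal variances (assumed variant)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No within-group spread: treat equal means as t = 0, otherwise as infinite.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def threshold_features(healthy, cancer, threshold=1.96):
    """Return the indices of spectrum points whose |t| is at least `threshold`.
    `healthy` and `cancer` are lists of spectra (equal-length lists of intensities)."""
    kept = []
    for j in range(len(healthy[0])):
        t = welch_t([s[j] for s in healthy], [s[j] for s in cancer])
        if abs(t) >= threshold:
            kept.append(j)
    return kept


# Toy spectra with two points each: point 0 differs between groups, point 1 does not.
healthy = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.9]]
cancer = [[3.0, 5.0], [3.1, 5.2], [2.9, 4.8]]
kept = threshold_features(healthy, cancer)
```

Only the points that survive this filter would be passed on to the later selection stages described in the abstract.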

    Predictive Power Estimation Algorithm (PPEA) - A New Algorithm to Reduce Overfitting for Genomic Biomarker Discovery

    Toxicogenomics promises to aid in predicting adverse effects, understanding the mechanisms of drug action or toxicity, and uncovering unexpected or secondary pharmacology. However, modeling adverse effects using high-dimensional and high-noise genomic data is prone to over-fitting. Models constructed from such data sets often consist of a large number of genes with no obvious functional relevance to the biological effect the model is intended to predict, which can make the modeling results challenging to interpret. To address these issues, we developed a novel algorithm, the Predictive Power Estimation Algorithm (PPEA), which estimates the predictive power of each individual transcript through an iterative two-way bootstrapping procedure. By repeatedly enforcing that the sample number is larger than the transcript number in each iteration of modeling and testing, PPEA reduces the potential risk of overfitting. We show with three different case studies that: (1) PPEA can quickly derive a reliable rank order of predictive power of individual transcripts in a relatively small number of iterations, (2) the top-ranked transcripts tend to be functionally related to the phenotype they are intended to predict, (3) using only the most predictive top-ranked transcripts greatly facilitates development of multiplex assays such as qRT-PCR as a biomarker, and (4) more importantly, a small number of genes identified from the top-ranked transcripts are highly predictive of phenotype, as their expression changes distinguished adverse from nonadverse effects of compounds in completely independent tests. Thus, we believe that the PPEA model effectively addresses the over-fitting problem and can be used to facilitate genomic biomarker discovery for predictive toxicology and drug responses.

    Segmentation of Multi-Isotope Imaging Mass Spectrometry Data for Semi-Automatic Detection of Regions of Interest

    Multi-isotope imaging mass spectrometry (MIMS) associates secondary ion mass spectrometry (SIMS) with detection of several atomic masses, the use of stable isotopes as labels, and affiliated quantitative image-analysis software. By associating image and measure, MIMS allows one to obtain quantitative information about biological processes in sub-cellular domains. MIMS can be applied to a wide range of biomedical problems, in particular metabolism and cell fate [1], [2], [3]. In order to obtain morphologically pertinent data from MIMS images, we have to define regions of interest (ROIs). ROIs are drawn by hand, a tedious and time-consuming process. We have developed and successfully applied a support vector machine (SVM) for segmentation of MIMS images that allows fast, semi-automatic boundary detection of regions of interest. Using the SVM, high-quality ROIs (as compared to an expert's manual delineation) were obtained for 2 types of images derived from unrelated data sets. This automation simplifies, accelerates and improves the post-processing analysis of MIMS images. This approach has been integrated into “Open MIMS,” an ImageJ plugin for comprehensive analysis of MIMS images that is available online at http://www.nrims.hms.harvard.edu/NRIMS_ImageJ.php