13 research outputs found

    Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data

    Background: Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early stage cancers. However, dimensionality reduction and classification are considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it using published high-resolution SELDI-TOF data for ovarian cancer. Results: We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved an average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 for 100 five-fold cross validations. Furthermore, only one control was misclassified in leave-one-out cross validation. Conclusion: The proposed method is suitable for analyzing high-throughput proteomics data.
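    The interval-moment reduction described in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `interval_moments` and `moment_features`, the equal-width split, and the use of population (divide-by-n) moments are all assumptions, and the refining, t-testing and kernel PLS steps are left out.

```python
from statistics import fmean


def interval_moments(values):
    """Return (mean, variance, skewness, kurtosis) for one interval."""
    n = len(values)
    mu = fmean(values)
    # Population central moments (divide by n); an assumed convention.
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        # A flat interval has no shape: report zero skewness and kurtosis.
        return (mu, 0.0, 0.0, 0.0)
    return (mu, m2, m3 / m2 ** 1.5, m4 / m2 ** 2)


def moment_features(spectrum, n_intervals):
    """Replace a spectrum by four moments per (hypothetical) equal-width interval."""
    if not 0 < n_intervals <= len(spectrum):
        raise ValueError("n_intervals must be between 1 and len(spectrum)")
    size = len(spectrum) // n_intervals
    features = []
    for i in range(n_intervals):
        start = i * size
        # The last interval absorbs any leftover points.
        end = len(spectrum) if i == n_intervals - 1 else start + size
        features.extend(interval_moments(spectrum[start:end]))
    return features
```

    A spectrum of any length is thus reduced to `4 * n_intervals` values, which is the sense in which the dimensionality drops quickly.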

    A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection

    Background: Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and to detect discriminatory signals, are high dimensional data. A large number of local maxima (a.k.a. peaks) have to be analyzed as part of computational pipelines aimed at the realization of efficient predictive and screening protocols. With data of this dimensionality and sample size, the risk of over-fitting and selection bias is pervasive. Therefore the development of bioinformatics methods based on unsupervised feature extraction can lead to general tools which can be applied to several fields of predictive proteomics. Results: We propose a method for feature selection and extraction grounded on the theory of multi-scale spaces for high resolution spectra derived from analysis of serum. We then use support vector machines for classification. In particular we use a database containing 216 sample spectra divided into 115 cancer and 91 control samples. The overall accuracy averaged over a large cross validation study is 98.18. The area under the ROC curve of the best selected model is 0.9962. Conclusion: We improved on previously known results for this problem on the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm

    Extraction of protein patterns from laser mass spectrometry data for breast cancer detection using a data-mining algorithm

    Background and objective: One of the main problems in treating cancer is the lack of a suitable method for early detection. Breast cancer is one of the most common diseases among women, and detection at an early stage can substantially affect mortality. At present there are no suitable tumor markers for early detection of this disease. Chemical reactions inside a living organ can be reflected as protein patterns in fluids such as blood, sputum and urine. Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry is a suitable tool for obtaining protein profiles from biological samples. Providing a data-mining method that selects biomarkers separating healthy from cancerous groups is one of the major challenges in the analysis of protein patterns. Methods: In this study, serum protein profile data from breast cancer patients were analyzed. Using a mathematical model and the discrete wavelet transform, baseline disturbances and electrical noise were removed in the preprocessing stage, and all mass spectrum signals were then normalized. The paper introduces a hybrid data-mining algorithm based on three criteria: a statistical test, a class separability measure and point scoring. With the proposed method, the best subset of proteins was selected from the 13488 available points while preserving information value and discriminative power, and was used to determine biomarkers. Using K-fold cross-validation, the samples in the dataset were randomly divided into training and test sets. The minimum threshold for the T statistic was set to 1.96. The data-mining algorithm was applied to the points remaining after thresholding, and the best feature subset, comprising biomarkers with high discriminative power, was selected. Results: Using linear discriminant analysis, 19 proteins were selected as biomarkers; they distinguished healthy from cancerous samples with 100% accuracy, 100% sensitivity and 100% specificity. Discussion and conclusion: Complete information produced from biological samples can be used to diagnose diseases with weak diagnostic factors, such as cancer. Disease diagnosis is an instance of pattern discrimination. In this paper, a data-mining algorithm for selecting the best subset of proteins was introduced. The proposed method showed that, with a reduced number of selected biomarkers, which is one of its advantages, discriminative power remains at an adequate level. The results emphasize that appropriate selection of the subset of indicator proteins has a substantial effect on determining biomarkers for correct diagnosis of the disease.
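    The thresholding step mentioned in this abstract (keep only the spectrum points whose T statistic reaches 1.96) might look like the sketch below. The abstract does not say which form of the t statistic was used, so Welch's unequal-variance form is an assumption here, as are the names `t_statistic` and `select_points` and the row-per-sample data layout. The later scoring and linear discriminant analysis stages are not shown.

```python
from math import inf, sqrt
from statistics import fmean, variance


def t_statistic(a, b):
    """Welch's two-sample t statistic (an assumed choice of t-test)."""
    diff = fmean(a) - fmean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both groups are constant: no separation, or perfect separation.
        return 0.0 if diff == 0 else inf
    return diff / se


def select_points(healthy, cancer, threshold=1.96):
    """Return indices of spectrum points with |t| >= threshold.

    healthy and cancer are lists of spectra (one list of intensities
    per sample), all of the same length.
    """
    kept = []
    for j in range(len(healthy[0])):
        t = t_statistic([s[j] for s in healthy], [s[j] for s in cancer])
        if abs(t) >= threshold:
            kept.append(j)
    return kept
```

    To avoid selection bias, a filter like this would normally be fitted on the training folds only and then applied unchanged to the test fold.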

    A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels

    Background: Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using the mixture samples only. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue. Results: The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd=2.7%), 97.6% (sd=2.8%) and 90.8% (sd=5.5%) and average specificities of 93.6% (sd=4.1%), 99% (sd=2.2%) and 79.4% (sd=9.8%) in 100 independent two-fold cross-validations. Conclusions: We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis. Existing methods, by contrast, factorize the complete dataset simultaneously. The sample model is composed of a reference sample representing the control and/or case (disease) groups and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated directly from each sample. Because the decomposition is local, the strength of expression of each feature can vary across samples, yet the features are still allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification, which is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. Unlike standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from the disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.

    Application of feature selection and machine learning models to identify potent tyrosinase inhibitors

    Tyrosinase inhibitors are drugs used for the treatment of skin hyperpigmentation, but the low effectiveness and safety of current inhibitors require the discovery of new compounds of this kind. However, current in vitro and in silico (computational) methods for this purpose present high costs and limited efficiency...

    Investigation into the use of support vector machine for -omics applications

    Master's, Master of Science

    Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data

    MOTIVATION: High-throughput and high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, the analysis of the immense amount of data they produce poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested it on a published ovarian high-resolution SELDI-TOF dataset. RESULTS: We have developed a four-step strategy for data preprocessing based on: (1) binning, (2) the Kolmogorov-Smirnov test, (3) restriction of the coefficient of variation and (4) wavelet analysis. Subsequently, support vector machines were used for classification. The developed method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10.
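    The first two preprocessing steps named in this abstract, binning and the Kolmogorov-Smirnov test, can be illustrated with the sketch below. It is written under stated assumptions and is not the authors' code: `bin_spectrum` and `ks_statistic` are hypothetical names, bins are formed by plain averaging of consecutive points, and only the two-sample KS statistic D is computed (no p-value or cutoff, which the abstract does not give). The coefficient-of-variation and wavelet steps are omitted.

```python
from bisect import bisect_right


def bin_spectrum(intensities, bin_size):
    """Average each run of bin_size consecutive intensities into one value."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    chunks = (intensities[i:i + bin_size]
              for i in range(0, len(intensities), bin_size))
    # The final chunk may be shorter than bin_size; it is averaged as-is.
    return [sum(chunk) / len(chunk) for chunk in chunks]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute gap between the empirical cumulative
    distribution functions of a and b, checked at every observed value.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d
```

    Applied per bin to the cancer and control groups, a large D marks a bin whose intensity distribution differs between the groups and is therefore a candidate feature.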