13 research outputs found
Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data
Background: Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early-stage cancers. However, dimensionality reduction and classification remain considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it using published high-resolution SELDI-TOF data for ovarian cancer. Results: We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved an average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 over 100 five-fold cross-validations. Furthermore, only one control was misclassified in leave-one-out cross-validation. Conclusion: The proposed method is suitable for analyzing high-throughput proteomics data.
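The moment-based reduction described in this abstract can be illustrated with a minimal sketch. It assumes equal-width contiguous intervals and population (biased) moments; the abstract does not state how the intervals were chosen or which estimators were used, and the function names `interval_moments` and `reduce_spectrum` are hypothetical.

```python
from statistics import fmean


def interval_moments(values):
    """Return (mean, variance, skewness, kurtosis) for one interval.

    Assumption: population (biased) central moments; kurtosis is the
    non-excess fourth standardized moment (3.0 for a normal distribution).
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        # A flat interval has no defined shape; report zeros by convention.
        return mean, 0.0, 0.0, 0.0
    return mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2


def reduce_spectrum(intensities, n_intervals):
    """Split a spectrum into contiguous intervals and concatenate their moments.

    Requires n_intervals <= len(intensities) so that no interval is empty.
    """
    n = len(intensities)
    features = []
    for i in range(n_intervals):
        start = i * n // n_intervals
        stop = (i + 1) * n // n_intervals
        features.extend(interval_moments(intensities[start:stop]))
    return features
```

A spectrum with thousands of m/z points reduced to, say, 50 intervals yields 200 features, which could then be passed to a classifier; the kernel PLS step itself is not sketched here.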
A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection
Background: Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and to detect discriminatory signals, are high-dimensional data. A large number of local maxima (a.k.a. peaks) have to be analyzed as part of computational pipelines aimed at the realization of efficient predictive and screening protocols. With data of this dimensionality and sample size, the risk of over-fitting and selection bias is pervasive. Therefore, the development of bioinformatics methods based on unsupervised feature extraction can lead to general tools that can be applied to several fields of predictive proteomics. Results: We propose a method for feature selection and extraction grounded in the theory of multi-scale spaces for high-resolution spectra derived from the analysis of serum. We then use support vector machines for classification. In particular, we use a database containing 216 sample spectra divided in 115 cancer and 91 control samples. The overall accuracy averaged over a large cross-validation study is 98.18%. The area under the ROC curve of the best selected model is 0.9962. Conclusion: We improved on previously known results for the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm
Extracting protein patterns from laser mass spectrometry data for breast cancer diagnosis using a data mining algorithm
Background and objective: One of the fundamental problems in treating cancer is the lack of a suitable method for its early detection. Breast cancer is one of the most common diseases among women, and detection at an early stage can considerably affect mortality among women. At present, no suitable tumor markers exist for the early detection of this disease. Chemical reactions within a living organ can be reflected as protein patterns in fluids such as blood, sputum and urine. Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry is a suitable tool for obtaining protein profiles from biological samples. Providing a data mining method for selecting biomarkers that separate healthy from cancerous groups is one of the important challenges in the analysis of protein patterns.
Methods: In this study, serum protein profile data from breast cancer patients were analyzed. Using a mathematical model and the discrete wavelet transform, baseline disturbances and electrical noise were removed in the preprocessing stage, and all mass spectrum signals were then normalized. This paper introduces a hybrid data mining algorithm based on three criteria: a statistical test, a class separability measure and point scoring. With the proposed method, the best subset of proteins was selected from the 13488 available points while preserving information value and discriminative power, and was used to determine biomarkers. Using K-fold cross-validation, the samples in the dataset were randomly divided into training and test sets. The minimum threshold for the T statistic was set to 1.96. The data mining algorithm was applied to the points remaining after thresholding, and the best feature subset, containing biomarkers with high discriminative power, was selected.
Findings: Using linear discriminant analysis, 19 proteins were selected as biomarkers, which distinguished healthy from cancerous samples with 100% accuracy, 100% sensitivity and 100% specificity.
Discussion and conclusion: By generating complete information from biological samples, they can be used to diagnose diseases with weak diagnostic factors, such as cancer. Disease diagnosis is an instance of pattern discrimination. In this paper, a data mining algorithm for selecting the best subset of proteins was introduced. The proposed method showed that, with a reduced number of selected biomarkers, which is one of its advantages, discriminative power remains at an appropriate level. The results emphasize that proper selection of the subset of indicator proteins has a considerable effect on determining biomarkers for correct diagnosis of the disease.
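The thresholding step in the methods above (keeping only points whose T statistic reaches 1.96) can be sketched as follows. This is an illustration under stated assumptions: the abstract does not say which form of the t-test was used, so Welch's unequal-variance statistic is assumed here, and `welch_t` and `select_points` are hypothetical names. The class separability and point scoring criteria are not reproduced.

```python
from math import copysign, inf, sqrt
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic (assumed variant; needs >= 2 values per group)."""
    diff = fmean(a) - fmean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both groups are constant: no evidence if equal, unbounded if not.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def select_points(healthy, cancer, threshold=1.96):
    """Return indices of spectrum points with |t| >= threshold.

    healthy, cancer: lists of spectra, each a list of intensities
    of the same length.
    """
    kept = []
    for j in range(len(healthy[0])):
        h = [spectrum[j] for spectrum in healthy]
        c = [spectrum[j] for spectrum in cancer]
        if abs(welch_t(h, c)) >= threshold:
            kept.append(j)
    return kept
```

To avoid selection bias, a filter like this would normally be fitted on the training folds only and then applied unchanged to the test fold.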
A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels
Background: Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using mixture samples only. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue. Results: The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd=2.7%), 97.6% (sd=2.8%) and 90.8% (sd=5.5%) and average specificities of 93.6% (sd=4.1%), 99% (sd=2.2%) and 79.4% (sd=9.8%) in 100 independent two-fold cross-validations. Conclusions: We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis. By contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample representing the control and/or case (disease) groups and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated from each sample directly. Due to the locality of the decomposition, the strength of expression of each feature can vary across samples, yet features are still allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification. That is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. As opposed to standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.
Application of feature selection and machine learning models to identify potent tyrosinase inhibitors
Tyrosinase inhibitors are drugs used for the treatment of skin hyperpigmentation, but the low effectiveness and safety of current inhibitors require the discovery of new compounds of this kind. However, current in vitro and in silico (computational) methods for this purpose present high costs and limited efficiency...
Investigation into the use of support vector machine for -omics applications
Master's, MASTER OF SCIENCE
Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data
MOTIVATION:
High-throughput and high-resolution mass spectrometry instruments are increasingly used for disease classification and therapeutic guidance. However, the analysis of immense amounts of data poses considerable challenges. We have therefore developed a novel method for dimensionality reduction and tested it on a published high-resolution SELDI-TOF dataset for ovarian cancer.
RESULTS:
We have developed a four-step strategy for data preprocessing based on: (1) binning, (2) the Kolmogorov-Smirnov test, (3) restriction of the coefficient of variation and (4) wavelet analysis. Subsequently, support vector machines were used for classification. The developed method achieves an average sensitivity of 97.38% (sd = 0.0125) and an average specificity of 93.30% (sd = 0.0174) in 1000 independent k-fold cross-validations, where k = 2, ..., 10.
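The first three preprocessing steps named above can be sketched in a few lines. This is only an illustration: the abstract gives no bin widths, test thresholds or coefficient-of-variation limits, so fixed-width mean binning is an assumption and the helper names `bin_spectrum`, `ks_statistic` and `coefficient_of_variation` are hypothetical. The wavelet analysis and SVM classification steps are omitted.

```python
from bisect import bisect_right
from statistics import fmean, pstdev


def bin_spectrum(intensities, bin_size):
    """Average consecutive intensities in fixed-width bins (last bin may be shorter)."""
    bins = []
    for i in range(0, len(intensities), bin_size):
        chunk = intensities[i:i + bin_size]
        bins.append(sum(chunk) / len(chunk))
    return bins


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = fmean(values)
    return pstdev(values) / mean if mean != 0 else float("inf")
```

In a pipeline of this kind, each binned feature would be scored across the cancer and control groups with `ks_statistic`, and features with an unstable within-group `coefficient_of_variation` would be dropped before the remaining steps.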