    Biomarker discovery and redundancy reduction towards classification using a multi-factorial MALDI-TOF MS T2DM mouse model dataset

    Diabetes, like many diseases and biological processes, is not mono-causal. On the one hand, multifactorial studies with complex experimental designs are required for its comprehensive analysis. On the other hand, the data from these studies often include a substantial amount of redundancy, such as proteins that are typically represented by a multitude of peptides. Coping simultaneously with both complexities (experimental and technological) makes data analysis a challenge for bioinformatics.

    Developing a discrimination rule between breast cancer patients and controls using proteomics mass spectrometric data: A three-step approach

    To discriminate between breast cancer patients and controls, we used a three-step approach to obtain our decision rule. First, we ranked the mass/charge values using random forests, because they generate importance indices that take possible interactions into account. We observed that the top-ranked variables consisted of highly correlated contiguous mass/charge values, which were grouped in the second step into new variables. Finally, these newly created variables were used as predictors to find a suitable discrimination rule. In this last step, we compared three different methods, namely Classification and Regression Tree (CART), logistic regression and penalized logistic regression. Logistic regression and penalized logistic regression performed equally well and both had a higher classification accuracy than CART. The model obtained with penalized logistic regression was chosen as we hypothesized that this model would provide a better classification accuracy in the validation set. The solution performed well on the training set, with a classification accuracy of 86.3%, and a sensitivity and specificity of 86.8% and 85.7%, respectively.
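The second step of the approach above (grouping highly correlated contiguous mass/charge values into new variables) can be sketched roughly as follows. The abstract does not say how the grouping was done, so the Pearson correlation criterion, the 0.9 threshold, the averaging of grouped columns, and the function names `pearson`, `group_contiguous` and `merge_groups` are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def group_contiguous(columns, threshold=0.9):
    """Group neighbouring columns (ordered by m/z) whose adjacent correlation >= threshold.

    `columns[j]` holds the intensities of variable j across all samples.
    The threshold is an assumed value, not one taken from the paper.
    """
    if not columns:
        return []
    groups = [[0]]
    for j in range(1, len(columns)):
        if pearson(columns[j - 1], columns[j]) >= threshold:
            groups[-1].append(j)   # extend the current run of correlated neighbours
        else:
            groups.append([j])     # start a new group
    return groups


def merge_groups(columns, groups):
    """Build one new variable per group by averaging its columns sample by sample."""
    n_samples = len(columns[0])
    return [
        [sum(columns[j][i] for j in g) / len(g) for i in range(n_samples)]
        for g in groups
    ]


columns = [[1, 2, 3, 4], [2, 4, 6, 8.1], [4, 1, 3, 2], [8, 2, 6, 4]]
groups = group_contiguous(columns)
merged = merge_groups(columns, groups)
```

The merged variables would then be passed to whichever classifier is being compared (CART, logistic regression or penalized logistic regression in the abstract).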

    Mass spectral imaging of clinical samples using deep learning

    A better interpretation of tumour heterogeneity and variability is vital for the improvement of novel diagnostic techniques and personalized cancer treatments. Tumour tissue heterogeneity is characterized by biochemical heterogeneity, which can be investigated by unsupervised metabolomics. Mass Spectrometry Imaging (MSI) combined with machine learning techniques has generated increasing interest as an analytical and diagnostic tool for the analysis of spatial molecular patterns in tissue samples. Considering the high complexity of data produced by the application of MSI, which can consist of many thousands of spectral peaks, statistical analysis and in particular machine learning and deep learning have been investigated as novel approaches to deduce the relationships between the measured molecular patterns and the local structural and biological properties of the tissues. Machine learning has historically been divided into two main categories: supervised and unsupervised learning. In MSI, supervised learning methods may be used to segment tissues into histologically relevant areas, e.g. the classification of tissue regions in H&E (Haematoxylin and Eosin) stained samples. Initial classification by an expert histopathologist through visual inspection enables the development of univariate or multivariate models, based on tissue regions that have significantly up/down-regulated ions. However, complex data may result in underdetermined models, and alternative methods that can cope with high dimensionality and noisy data are required. Here, we describe, apply, and test a novel diagnostic procedure built using a combination of MSI and deep learning with the objective of delineating and identifying biochemical differences between cancerous and non-cancerous tissue in metastatic liver cancer and epithelial ovarian cancer.
The workflow investigates the robustness of single (1D) to multidimensional (3D) tumour analyses and also highlights possible biomarkers which are not accessible from classical visual analysis of the H&E images. The identification of key molecular markers may provide a deeper understanding of tumour heterogeneity and potential targets for intervention.

    Kernel methods in genomics and computational biology

    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable for a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future.

    Feed Forward Artificial Neural Network: Tool for Early Detection of Ovarian Cancer

    Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. The early detection of cancer is crucial for successful treatment. Some cancers affect the concentration of certain molecules in the blood, which allows early diagnosis by analyzing the blood mass spectrum. It is possible that exclusive serum proteomic patterns could be used to differentiate cancer samples from non-cancer ones. Several techniques have been developed for the analysis of the mass-spectrum curve and have been used for the detection of prostate, ovarian, breast, bladder, pancreatic, kidney, liver, and colon cancers. In the present study, we applied data mining to the diagnosis of ovarian cancer and identified the most informative points of the mass-spectrum curve, then used Student's t-test and neural networks to determine the differences between the curves of cancer patients and healthy people. Two serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. Statistical testing and genetic algorithm-based methods were each used for feature selection. The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) the discriminatory features (proteomic patterns) can be very different from one selection method to another.

    Feature selection and nearest centroid classification for protein mass spectrometry

    BACKGROUND: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry. RESULTS: This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. Student's t-test, the Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting-based feature selection we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To fairly assess each algorithm, evaluation was done using stratified cross-validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets.
CONCLUSION: This study tested a number of popular feature selection methods using the nearest centroid classifier and found that several reportedly state-of-the-art algorithms in fact perform rather poorly when tested via stratified cross-validation. The revealed inconsistencies provide clear evidence that algorithm evaluation should be performed on several data sets using a consistent (i.e., non-randomized, stratified) cross-validation procedure in order for the conclusions to be statistically sound.
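As a rough illustration of the evaluation idea in this abstract, the sketch below pairs a filter-based t-statistic ranking with a nearest centroid classifier and runs feature selection inside each training fold of a deterministic stratified split. It is a simplification under stated assumptions: it uses Welch's form of the t-statistic, a fixed number of features `k_features` instead of the internal leave-one-out loop the authors describe, binary labels 0 and 1, and at least two training samples per class. All function names are hypothetical.

```python
import math


def welch_t(a, b):
    """Welch t-statistic between two samples (each needs at least 2 values)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(va / len(a) + vb / len(b))
    if denom == 0:
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / denom


def rank_features(X, y, k):
    """Indices of the k features with the largest |t| between classes 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append((abs(welch_t(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, feats):
    """Per-class mean vector restricted to the selected features."""
    centroids = {}
    for lab in set(y):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroids[lab] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return centroids


def predict(centroids, row, feats):
    """Label of the centroid closest to `row` in squared Euclidean distance."""
    def dist2(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(feats, c))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def stratified_folds(y, n_folds):
    """Non-randomized stratified split: deal each class's indices round-robin."""
    folds = [[] for _ in range(n_folds)]
    for lab in sorted(set(y)):
        idx = [i for i, l in enumerate(y) if l == lab]
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def cv_accuracy(X, y, k_features, n_folds):
    """Cross-validated accuracy with feature selection done on training folds only."""
    correct = 0
    for test_idx in stratified_folds(y, n_folds):
        held = set(test_idx)
        X_tr = [X[i] for i in range(len(X)) if i not in held]
        y_tr = [y[i] for i in range(len(y)) if i not in held]
        feats = rank_features(X_tr, y_tr, k_features)
        cents = fit_centroids(X_tr, y_tr, feats)
        correct += sum(predict(cents, X[i], feats) == y[i] for i in test_idx)
    return correct / len(y)


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5], [0.2, 1], [0.0, 3], [0.3, 4],
     [1.1, 2], [0.9, 5], [1.0, 1], [1.2, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = cv_accuracy(X, y, k_features=1, n_folds=2)
```

Ranking features inside the loop, rather than once on the full data set, keeps the held-out samples from influencing which features are chosen, which is the kind of consistency the conclusion argues for.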

    MSI-based mapping strategies in tumour-heterogeneity

    Since the early 2000s, considerable innovations in MS technology and associated gene sequencing systems have enabled the "-omics" revolution. The data collected from multiple omics research can be combined to gain a better understanding of cancer's biological activity. Breast and ovarian cancer are among the most common cancers worldwide in women. Despite significant advances in diagnosis, treatment, and subtype identification, breast cancer remains the world's second leading cause of cancer-related deaths in women, with ovarian cancer ranking fifth. Tumour heterogeneity is a significant hurdle in cancer patient prognosis, response to therapy, and metastasis. As such, heterogeneity is one of the most significant and clinically relevant areas of cancer research nowadays. Metabolic reprogramming is a hallmark of malignancy that has been widely acknowledged in recent literature. Metabolic heterogeneity in tumours poses a challenge in developing therapies that exploit metabolic vulnerabilities. Consequently, it is crucial to approach tumour heterogeneity with an unlabeled yet spatially specific read-out of metabolic and genetic information. The advantage of DESI-MSI technology originates from its untargeted nature, which allows for the investigation of thousands of component distributions, at a micrometre scale, in a single experiment. Most notably, using a DESI-MSI clustering approach could potentially offer novel insights into metabolism, providing a method to characterise metabolically distinct sub-regions and subsequently delineate the underlying genetic drivers through genomic analyses. Hence, in this study, we aim to map the inter- and intra-tumour metabolic heterogeneity in breast and ovarian cancer by integrating multimodal MSI-based mapping strategies, comprising DESI and MALDI, with IMC (Imaging Mass Cytometry) analysis of the tumour section, using CyTOF, and high-throughput genetic characterisation of metabolically distinct regions by transcriptomics.
The multimodal analysis workflow was initially performed using sequential breast cancer Patient-Derived Xenograft (PDX) models and was expanded to primary tumour sections. Moreover, a newly developed DESI-MSI-friendly, hydroxypropyl-methylcellulose and polyvinylpyrrolidone (HPMC/PVP) hydrogel-based embedding was successfully established to allow simultaneous preparation and analysis of numerous fresh frozen core-size biopsies in the same Tissue Microarray (TMA) block for the investigation of tumour heterogeneity. Additionally, a single section strategy was combined with DESI-MSI coupled to Laser Capture Microdissection (LCM) application to integrate gene expression analysis and Liquid Chromatography-Mass Spectrometry (LC-MS) on the same tissue segment. The developed single section methodology was then tested with multi-region collected ovarian tumours. DESI-MSI-guided spatial transcriptomics was performed for co-registration of different omics datasets on the same regions of interest (ROIs). This co-registration of various omics could unravel possible interactions between distinct metabolic profiles and specific genetic drivers that can lead to intra-tumour heterogeneity. Linking all these findings from the various MSI-based or MSI-guided strategies allows for a transition from a qualitative approach to a conceptual understanding of the architecture of multiple molecular networks responsible for cellular metabolism in tumour heterogeneity.

    Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data

    BACKGROUND: Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early stage cancers. However, dimensionality reduction and classification are considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it using published high-resolution SELDI-TOF data for ovarian cancer. RESULTS: We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 for 100 five-fold cross validations. Furthermore, only one control was misclassified in leave-one-out cross validation. CONCLUSION: The proposed method is suitable for analyzing high-throughput proteomics data.
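The moment-based reduction described in this abstract can be illustrated with a minimal sketch. It assumes fixed-width intervals, population (divide-by-n) moments, and non-excess kurtosis; the abstract does not specify any of these choices, and the names `moments` and `interval_features` are hypothetical. The refining, t-testing and kernel PLS steps are not shown.

```python
def moments(values):
    """Return (mean, variance, skewness, kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Constant interval: skewness and kurtosis are undefined, report 0.0.
        return (mean, 0.0, 0.0, 0.0)
    return (mean, m2, m3 / m2 ** 1.5, m4 / m2 ** 2)


def interval_features(spectrum, width):
    """Split a spectrum into consecutive intervals of `width` points and
    replace each interval by its four moments (the last one may be shorter)."""
    features = []
    for start in range(0, len(spectrum), width):
        features.extend(moments(spectrum[start:start + width]))
    return features


spectrum = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
features = interval_features(spectrum, width=5)  # 2 intervals x 4 moments
```

With this layout a spectrum of n intensity values shrinks to roughly 4 * n / width features, which is where the rapid dimensionality reduction comes from.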

    A data review and re-assessment of ovarian cancer serum proteomic profiling

    BACKGROUND: The early detection of ovarian cancer has the potential to dramatically reduce mortality. Recently, the use of mass spectrometry to develop profiles of patient serum proteins, combined with advanced data mining algorithms, has been reported as a promising method to achieve this goal. In this report, we analyze the Ovarian Dataset 8-7-02 downloaded from the Clinical Proteomics Program Databank website, using nonparametric statistics and stepwise discriminant analysis to develop rules to diagnose patients, as well as to understand general patterns in the data that may guide future research. RESULTS: The mass spectrometry serum profiles derived from cancer and controls exhibited numerous statistical differences. For example, use of the Wilcoxon test in comparing the intensity at each of the 15,154 mass to charge (M/Z) values between the cancer and controls resulted in the detection of 3,591 M/Z values whose intensities differed with a p-value of 10^-6 or less. The region containing the M/Z values of greatest statistical difference between cancer and controls occurred at M/Z values less than 500. For example, the M/Z values of 2.7921478 and 245.53704 could be used to significantly separate the cancer from control groups. Three other sets of M/Z values were developed using a training set that could distinguish between cancer and control subjects in a test set with 100% sensitivity and specificity. CONCLUSION: The ability to discriminate between cancer and control subjects based on the M/Z values of 2.7921478 and 245.53704 reveals the existence of a significant non-biologic experimental bias between these two groups. This bias may invalidate attempts to use this dataset to find patterns of reproducible diagnostic value. To minimize false discovery, results using mass spectrometry and data mining algorithms should be carefully reviewed and benchmarked with routine statistical methods.
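The per-M/Z screening reported in this abstract can be sketched as a Wilcoxon rank-sum test applied to every position of the spectra. The version below is an assumption-laden illustration, not the authors' procedure: it uses the normal approximation without continuity or tie correction in the variance, assigns midranks to tied values, and the names `rank_sum_p` and `screen` are hypothetical. Only the 10^-6 default threshold is taken from the abstract.

```python
import math


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2          # Mann-Whitney U for sample a
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))     # equals 2 * (1 - Phi(|z|))


def screen(cancer, control, alpha=1e-6):
    """Indices of M/Z positions whose intensities differ with p <= alpha.

    `cancer` and `control` are lists of spectra, one list of intensities each.
    """
    return [
        j for j in range(len(cancer[0]))
        if rank_sum_p([s[j] for s in cancer], [s[j] for s in control]) <= alpha
    ]


# Toy spectra with two M/Z positions: position 0 differs, position 1 does not.
cancer = [[i, 5] for i in range(1, 11)]
control = [[i + 10, 5] for i in range(1, 11)]
hits = screen(cancer, control, alpha=0.001)
```

As the abstract's conclusion stresses, a small p-value at a given M/Z position shows only that the groups differ there; it does not show that the difference is biological rather than an experimental artefact.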