1,416 research outputs found

    Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum

    We have developed an algorithm called Q5 for probabilistic classification of healthy vs. disease whole serum samples using mass spectrometry. The algorithm employs Principal Components Analysis (PCA) followed by Linear Discriminant Analysis (LDA) on whole spectrum Surface-Enhanced Laser Desorption/Ionization Time of Flight (SELDI-TOF) Mass Spectrometry (MS) data, and is demonstrated on four real datasets from complete, complex SELDI spectra of human blood serum. Q5 is a closed-form, exact solution to the problem of classification of complete mass spectra of a complex protein mixture. Q5 employs a novel probabilistic classification algorithm built upon a dimension-reduced linear discriminant analysis. Our solution is computationally efficient; it is non-iterative and computes the optimal linear discriminant using closed-form equations. The optimal discriminant is computed and verified for datasets of complete, complex SELDI spectra of human blood serum. Replicate experiments of different training/testing splits of each dataset are employed to verify robustness of the algorithm. The probabilistic classification method achieves excellent performance. We achieve sensitivity, specificity, and positive predictive values above 97% on three ovarian cancer datasets and one prostate cancer dataset. The Q5 method outperforms previous full-spectrum complex sample spectral classification techniques, and can provide clues as to the molecular identities of differentially-expressed proteins and peptides.
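As a rough illustration of the "closed-form linear discriminant" idea in the abstract above, the sketch below computes a two-class Fisher discriminant on two features that are assumed to be already dimension-reduced (for example, two principal component scores). It is not the Q5 algorithm: the PCA stage and the probabilistic output are omitted, and the function names (`fisher_discriminant`, `classify`) and the midpoint threshold are assumptions made for this example.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_discriminant(healthy, disease):
    """Closed-form Fisher direction w = Sw^-1 (m1 - m0) and a midpoint threshold."""
    m0, m1 = mean2(healthy), mean2(disease)
    a0, b0, c0 = scatter2(healthy, m0)
    a1, b1, c1 = scatter2(disease, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1  # pooled within-class scatter
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of [[a, b], [b, c]] is (1/det) * [[c, -b], [-b, a]].
    w = ((c * dx - b * dy) / det, (-b * dx + a * dy) / det)
    # Assumed decision rule: threshold halfway between the projected class means.
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold


def classify(x, w, threshold):
    """Label a 2-D point by which side of the threshold its projection falls on."""
    return "disease" if w[0] * x[0] + w[1] * x[1] > threshold else "healthy"
```

No iteration is involved: the discriminant follows directly from the class means and the pooled scatter matrix, which is the property the abstract emphasises.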

    Cross-platform Analysis of Cancer Biomarkers: A Bayesian Network Approach to Incorporating Mass Spectrometry and Microarray Data

    Many studies have reported inconsistent cancer biomarkers because of bioinformatics artifacts. In this paper we use multiple data sets from microarrays, mass spectrometry, protein sequences, and other biological knowledge in order to improve the reliability of cancer biomarkers. We present a novel Bayesian network (BN) model which integrates and cross-annotates multiple data sets related to prostate cancer. The main contribution of this study is a method designed to find cancer biomarkers whose presence is supported by multiple data sources and biological knowledge. Relevant biological knowledge is explicitly encoded into the model parameters, and the biomarker finding problem is formulated as a Bayesian inference problem. Besides diagnostic accuracy, we introduce reliability as another quality measurement of the biological relevance of biomarkers. Based on the proposed BN model, we develop an empirical scoring scheme and a simulation algorithm for inferring biomarkers. Fourteen genes/proteins including prostate specific antigen (PSA) are identified as reliable serum biomarkers which are insensitive to the model assumptions. The computational results show that our method is able to find biologically relevant biomarkers with the highest reliability while maintaining competitive predictive power. In addition, by combining biological knowledge and data from multiple platforms, the number of putative biomarkers is greatly reduced to allow more-focused clinical studies.

    Computational diagnosis and risk evaluation for canine lymphoma

    The canine lymphoma blood test detects the levels of two biomarkers, the acute phase proteins C-Reactive Protein (CRP) and Haptoglobin. This test can be used for diagnostics, for screening, and for remission monitoring. We analyze clinical data, test various machine learning methods and select the best approach to these problems. Three families of methods, decision trees, kNN (including advanced and adaptive kNN) and probability density evaluation with radial basis functions, are used for classification and risk estimation. Several pre-processing approaches were implemented and compared. The best of them are used to create the diagnostic system. For the differential diagnosis the best solution gives a sensitivity and specificity of 83.5% and 77%, respectively (using three input features: CRP, Haptoglobin and a standard clinical symptom). For the screening task, the decision tree method provides the best result, with sensitivity and specificity of 81.4% and >99%, respectively (using the same input features). If the clinical symptom (lymphadenopathy) is considered unknown, then a decision tree with CRP and Haptoglobin only provides sensitivity 69% and specificity 83.5%. The lymphoma risk evaluation problem is formulated and solved. The best models are selected as the system for computational lymphoma diagnosis and for evaluating the risk of lymphoma. These methods are implemented in special web-accessed software and are applied to the problem of monitoring dogs with lymphoma after treatment. It detects recurrence of lymphoma up to two months prior to the appearance of clinical signs. The risk map visualisation provides a friendly tool for explanatory data analysis.
    Comment: 24 pages, 86 references in the bibliography. Significantly extended version with review of lymphoma biomarkers and data mining methods. Three new sections are added: 1.1. Biomarkers for canine lymphoma, 1.2. Acute phase proteins as lymphoma biomarkers, and 3.1. Data mining methods for biomarker cancer diagnosis. Flowcharts of data analysis are included as supplementary material (20 pages).
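The sensitivity and specificity figures quoted in the abstract above are standard confusion-matrix ratios. The sketch below shows how they, together with positive predictive value, can be computed from true and predicted labels. The function name `diagnostic_metrics` and the 1 = diseased / 0 = healthy encoding are assumptions for this example, and the labels used with it are made up rather than taken from the study.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity and PPV for binary labels (1 = diseased, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    def ratio(num, den):
        # A ratio is undefined when its denominator is zero; report None then.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # diseased cases correctly flagged
        "specificity": ratio(tn, tn + fp),  # healthy cases correctly cleared
        "ppv": ratio(tp, tp + fp),          # flagged cases that are truly diseased
    }
```

A screening test typically favours high specificity to limit false alarms in a mostly healthy population, which matches the >99% specificity reported for the screening task.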

    Feed Forward Artificial Neural Network: Tool for Early Detection of Ovarian Cancer

    Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. The early detection of cancer is crucial for successful treatment. Some cancers affect the concentration of certain molecules in the blood, which allows early diagnosis by analyzing the blood mass spectrum. It is possible that exclusive serum proteomic patterns could be used to differentiate cancer samples from non-cancer ones. Several techniques have been developed for the analysis of the mass-spectrum curve and used for the detection of prostate, ovarian, breast, bladder, pancreatic, kidney, liver, and colon cancers. In the present study, we applied data mining to the diagnosis of ovarian cancer and identified the most informative points of the mass-spectrum curve, then used Student's t-test and neural networks to determine the differences between the curves of cancer patients and healthy people. Two serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. Statistical testing and genetic algorithm-based methods are used for feature selection. The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) the discriminatory features (proteomic patterns) can be very different from one selection method to another.
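The t-test-based feature selection mentioned in the abstract above can be sketched as ranking each point of the spectrum by how strongly its intensity differs between the two groups. The example below uses Welch's unequal-variance t-statistic, which is a substitution made here and not necessarily the variant used in the study. The function names and the list-of-lists spectrum layout are likewise assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (each needs at least two values)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Both groups are constant: no separation, or perfect separation.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_features(cancer, control):
    """Return feature indices ordered from most to least discriminative.

    `cancer` and `control` are lists of spectra; every spectrum is a list of
    intensities sampled at the same m/z positions.
    """
    n_features = len(cancer[0])
    scored = []
    for j in range(n_features):
        t = welch_t([s[j] for s in cancer], [s[j] for s in control])
        scored.append((abs(t), j))
    # Largest |t| first; ties broken by the lower index for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored]
```

The top-ranked indices would then be the candidate inputs for a downstream classifier such as the neural network described in the abstract.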

    Pilot multi-omic analysis of human bile from benign and malignant biliary strictures: a machine-learning approach

    Cholangiocarcinoma (CCA) and pancreatic adenocarcinoma (PDAC) may lead to the development of extrahepatic obstructive cholestasis. However, biliary stenoses can also be caused by benign conditions, and the identification of their etiology remains a clinical challenge. We performed metabolomic and proteomic analyses of bile from patients with benign (n = 36) and malignant conditions, CCA (n = 36) or PDAC (n = 57), undergoing endoscopic retrograde cholangiopancreatography with the aim of characterizing bile composition in biliopancreatic disease and identifying biomarkers for the differential diagnosis of biliary strictures. Comprehensive analyses of lipids, bile acids and small molecules were carried out using mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (1H-NMR) in all patients. MS analysis of bile proteome was performed in five patients per group. We implemented artificial intelligence tools for the selection of biomarkers and algorithms with predictive capacity. Our machine-learning pipeline included the generation of synthetic data with properties of real data, the selection of potential biomarkers (metabolites or proteins) and their analysis with neural networks (NN). Selected biomarkers were then validated with real data. We identified panels of lipids (n = 10) and proteins (n = 5) that when analyzed with NN algorithms discriminated between patients with and without cancer with an unprecedented accuracy.
    This research was funded by: Instituto de Salud Carlos III (ISCIII) co-financed by Fondo Europeo de Desarrollo Regional (FEDER) Una manera de hacer Europa, grant numbers: PI16/01126 (M.A.A.), PI19/00819 (M.J.M. and J.J.G.M.), PI15/01132, PI18/01075 and Miguel Servet Program CON14/00129 (J.M.B.); Fundación Científica de la Asociación Española Contra el Cáncer (AECC Scientific Foundation), grant name: Rare Cancers 2017 (J.M.U., M.L.M., J.M.B., M.J.M., R.I.R.M., M.G.F.-B., C.B., M.A.A.); Gobierno de Navarra Salud, grant number 58/17 (J.M.U., M.A.A.); La Caixa Foundation, grant name: HEPACARE (C.B., M.A.A.); AMMF The Cholangiocarcinoma Charity, UK, grant number: 2018/117 (F.J.C. and M.A.A.); PSC Partners US, PSC Supports UK, grant number 06119JB (J.M.B.); Horizon 2020 (H2020) ESCALON project, grant number H2020-SC1-BHC-2018–2020 (J.M.B.); BIOEF (Basque Foundation for Innovation and Health Research): EiTB Maratoia, grant numbers BIO15/CA/016/BD (J.M.B.) and BIO15/CA/011 (M.A.A.). Department of Health of the Basque Country, grant number 2017111010 (J.M.B.). La Caixa Foundation, grant number: LCF/PR/HP17/52190004 (M.L.M.), Mineco-Feder, grant number SAF2017-87301-R (M.L.M.), Fundación BBVA grant name: Ayudas a Equipos de Investigación Científica Umbrella 2018 (M.L.M.). MCIU, grant number: Severo Ochoa Excellence Accreditation SEV-2016-0644 (M.L.M.). Part of the equipment used in this work was co-funded by the Generalitat Valenciana and European Regional Development Fund (FEDER) funds (PO FEDER of Comunitat Valenciana 2014–2020). Gobierno de Navarra fellowship to L.C. (Leticia Colyn); AECC post-doctoral fellowship to M.A.; Ramón y Cajal Program contracts RYC-2014-15242 and RYC2018-024475-1 to F.J.C. and M.G.F.-B., respectively. The generous support from Fundación Eugenio Rodríguez Pascual, Fundación Echébano, Fundación Mario Losantos, Fundación M Torres and Mr. Eduardo Avila is acknowledged. The CNB-CSIC Proteomics Unit belongs to ProteoRed, PRB3-ISCIII, supported by grant PT17/0019/0001 (F.J.C.). Comunidad de Madrid Grant B2017/BMD-3817 (F.J.C.).
    Peer reviewed

    A Bayesian framework for statistical signal processing and knowledge discovery in proteomic engineering

    Thesis (Ph.D.), Harvard-MIT Division of Health Sciences and Technology, February 2006. Includes bibliographical references (leaves 73-85).
    Proteomics has been revolutionized in the last couple of years through integration of new mass spectrometry technologies such as Surface-Enhanced Laser Desorption/Ionization (SELDI) mass spectrometry. As data is generated in an increasingly rapid and automated manner, novel and application-specific computational methods will be needed to deal with all of this information. This work seeks to develop a Bayesian framework in mass-based proteomics for protein identification. Using the Bayesian framework in a statistical signal processing manner, mass spectrometry data is filtered and analyzed in order to estimate protein identity. This is done by a multi-stage process which compares probabilistic networks generated from mass spectrometry-based data with a mass-based network of protein interactions. In addition, such models can provide insight on features of existing models by identifying relevant proteins. This work finds that the search space of potential proteins can be reduced such that simple antibody-based tests can be used to validate protein identity. This is done with real proteins as a proof of concept. Regarding protein interaction networks, the largest human protein interaction meta-database was created as part of this project, containing over 162,000 interactions. A further contribution is the implementation of the massome network database of mass-based interactions, which is used in the protein identification process. This network is explored in terms of its potential usefulness for protein identification. The framework provides an approach to a number of core issues in proteomics. Besides providing these tools, it yields a novel way to approach statistical signal processing problems in this domain in a way that can be adapted as proteomics-based technologies mature.
    By Gil Alterovitz. Ph.D.

    On consensus biomarker selection

    Background: Recent development of mass spectrometry technology enabled the analysis of complex peptide mixtures. A lot of effort is currently devoted to the identification of biomarkers in human body fluids like serum or plasma, based on which new diagnostic tests for different diseases could be constructed. Various biomarker selection procedures have been exploited in recent studies. It has been noted that they often lead to different biomarker lists and, as a consequence, the patient classification may also vary.
    Results: Here we propose a new approach to the biomarker selection problem: to apply several competing feature ranking procedures and compute a consensus list of features based on their outcomes. We validate our methods on two proteomic datasets for the diagnosis of ovarian and prostate cancer.
    Conclusion: The proposed methodology can improve the classification results and at the same time provide a unified biomarker list for further biological examinations and interpretation.
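To make the idea of a consensus list from several competing rankings concrete, the sketch below merges rankings with a Borda count. This is one common rank-aggregation rule chosen for illustration. The abstract does not say which aggregation rule the authors use, so both the rule and the function name `borda_consensus` are assumptions.

```python
def borda_consensus(rankings):
    """Merge several full rankings (best first) of the same features.

    Each ranking awards n points to its top feature, n - 1 to the next, and so
    on. Features are returned by total points, with ties broken by name.
    """
    features = list(rankings[0])
    n = len(features)
    points = {f: 0 for f in features}
    for ranking in rankings:
        if len(ranking) != n or set(ranking) != set(features):
            raise ValueError("every ranking must order the same set of features")
        for position, feature in enumerate(ranking):
            points[feature] += n - position
    return sorted(features, key=lambda f: (-points[f], f))
```

With three hypothetical rankings of four peaks, a feature that is placed near the top by most procedures ends up first even when one procedure disagrees, which is the stabilising effect the abstract aims for.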

    Informed baseline subtraction of proteomic mass spectrometry data aided by a novel sliding window algorithm

    Proteomic matrix-assisted laser desorption/ionisation (MALDI) linear time-of-flight (TOF) mass spectrometry (MS) may be used to produce protein profiles from biological samples with the aim of discovering biomarkers for disease. However, the raw protein profiles suffer from several sources of bias or systematic variation which need to be removed via pre-processing before meaningful downstream analysis of the data can be undertaken. Baseline subtraction, an early pre-processing step that removes the non-peptide signal from the spectra, is complicated by the following: (i) each spectrum has, on average, wider peaks for peptides with higher mass-to-charge ratios (m/z), and (ii) the time-consuming and error-prone trial-and-error process for optimising the baseline subtraction input arguments. With reference to the aforementioned complications, we present an automated pipeline that includes (i) a novel 'continuous' line segment algorithm that efficiently operates over data with a transformed m/z-axis to remove the relationship between peptide mass and peak width, and (ii) an input-free algorithm to estimate peak widths on the transformed m/z scale. The automated baseline subtraction method was deployed on six publicly available proteomic MS datasets using six different m/z-axis transformations. Optimality of the automated baseline subtraction pipeline was assessed quantitatively using the mean absolute scaled error (MASE) when compared to a gold-standard baseline subtracted signal. Near-optimal baseline subtraction was achieved using the automated pipeline. The advantages of the proposed pipeline include informed and data-specific input arguments for baseline subtraction methods, the avoidance of time-intensive and subjective piecewise baseline subtraction, and the ability to automate baseline subtraction completely. Moreover, individual steps can be adopted as stand-alone routines.
    Comment: 50 pages, 19 figures
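For readers unfamiliar with baseline subtraction, the sketch below shows a generic sliding-window minimum baseline. It is a simple textbook-style approach and not the 'continuous' line segment algorithm described in the abstract above. The function names and the `half_width` parameter are assumptions for this example.

```python
def sliding_min_baseline(intensities, half_width):
    """Estimate the baseline as the minimum within a window around each point."""
    n = len(intensities)
    baseline = []
    for i in range(n):
        lo = max(0, i - half_width)      # window is clipped at the spectrum edges
        hi = min(n, i + half_width + 1)
        baseline.append(min(intensities[lo:hi]))
    return baseline


def subtract_baseline(intensities, half_width):
    """Return the spectrum with the sliding-minimum baseline removed."""
    baseline = sliding_min_baseline(intensities, half_width)
    return [y - b for y, b in zip(intensities, baseline)]
```

The window has to be wider than the peaks, otherwise part of a peak is absorbed into the baseline. Because peak width grows with m/z on the raw axis, a single fixed `half_width` is a compromise, which is the difficulty that motivates transforming the m/z axis and estimating peak widths in the abstract.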