
    Latent protein trees

    Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. The output of these measurements after preprocessing is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways; there is therefore the potential for very complex and informative correlational structure inherent in these data. Recent attempts to utilize these data often focus on the identification of single features that are associated with a particular phenotype that is relevant to the experiment. However, to date, there have been no published approaches that directly model what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model which is specifically designed to model such correlation structure in unbiased, label-free proteomics. This model utilizes partial identification information from peptide sequencing and database lookup as well as the observed correlation in the data to appropriately compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza. Comment: Published at http://dx.doi.org/10.1214/13-AOAS639 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Bayesian nonparametric models for peak identification in MALDI-TOF mass spectroscopy

    We present a novel nonparametric Bayesian approach based on Lévy Adaptive Regression Kernels (LARK) to model spectral data arising from MALDI-TOF (Matrix Assisted Laser Desorption Ionization Time-of-Flight) mass spectrometry. This model-based approach provides identification and quantification of proteins through model parameters that are directly interpretable as the number of proteins, mass and abundance of proteins and peak resolution, while having the ability to adapt to unknown smoothness as in wavelet-based methods. Informative prior distributions on resolution are key to distinguishing true peaks from background noise and resolving broad peaks into individual peaks for multiple protein species. Posterior distributions are obtained using a reversible jump Markov chain Monte Carlo algorithm and provide inference about the number of peaks (proteins), their masses and abundance. We show through simulation studies that the procedure has desirable true-positive and false-discovery rates. Finally, we illustrate the method on five example spectra: a blank spectrum, a spectrum with only the matrix of a low-molecular-weight substance used to embed target proteins, a spectrum with known proteins, and a single spectrum and average of ten spectra from an individual lung cancer patient. Comment: Published at http://dx.doi.org/10.1214/10-AOAS450 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Peaks detection and alignment for mass spectrometry data

    The goal of this paper is to review existing methods for protein mass spectrometry data analysis, and to present a new methodology for automatic extraction of significant peaks (biomarkers). For the pre-processing step required for data from MALDI-TOF or SELDI-TOF spectra, we use a purely nonparametric approach that combines stationary invariant wavelet transform for noise removal and penalized spline quantile regression for baseline correction. We further present a multi-scale spectra alignment technique that is based on identification of statistically significant peaks from a set of spectra. This method allows one to find common peaks in a set of spectra that can subsequently be mapped to individual proteins. These may serve as useful biomarkers in medical applications, or as individual features for further multidimensional statistical analysis. MALDI-TOF spectra obtained from serum samples are used throughout the paper to illustrate the methodology.
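The idea of finding peaks common to a set of spectra can be illustrated with a deliberately simplified sketch. It is not the method of the paper above: thresholded local maxima stand in for wavelet-based multi-scale detection, and peaks are grouped across spectra by a fixed m/z tolerance. The function names `pick_peaks` and `common_peaks` and the parameters `k`, `tol` and `min_fraction` are hypothetical, chosen only for illustration.

```python
from statistics import median


def pick_peaks(intensities, k=3.0):
    """Return indices of local maxima above median + k * MAD.

    Assumption: a robust noise threshold replaces the wavelet-based
    significance test described in the abstract.
    """
    med = median(intensities)
    mad = median(abs(x - med) for x in intensities)
    threshold = med + k * mad
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append(i)
    return peaks


def common_peaks(peak_lists, tol, min_fraction=0.5):
    """Group peak positions (m/z) from several spectra.

    Neighbouring positions closer than `tol` are chained into one group.
    A group is kept when it contains peaks from at least `min_fraction`
    of the spectra; its mean position is returned.
    """
    tagged = sorted((mz, s) for s, peaks in enumerate(peak_lists) for mz in peaks)
    groups, current = [], []
    for mz, s in tagged:
        if current and mz - current[-1][0] > tol:
            groups.append(current)
            current = []
        current.append((mz, s))
    if current:
        groups.append(current)

    result = []
    for group in groups:
        spectra_seen = {s for _, s in group}
        if len(spectra_seen) / len(peak_lists) >= min_fraction:
            result.append(sum(mz for mz, _ in group) / len(group))
    return result
```

Chaining by a fixed tolerance can merge neighbouring peaks when spectra are dense, so a real pipeline would likely need a mass-dependent tolerance and an explicit alignment step.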

    Novel Algorithms and Datamining for Clustering Massive Datasets

    Clustering proteomics data is a challenging problem for any traditional clustering algorithm. Usually, the number of samples is much smaller than the number of protein peaks. A clustering algorithm that does not depend on the number of features or variables (here, the number of peaks) is needed. An innovative hierarchical clustering algorithm may be a good approach. This work proposes a new dissimilarity measure for hierarchical clustering combined with functional data analysis (FDA), and presents a specific application of FDA to a high-throughput proteomics study. The performance of the proposed algorithm is compared to two popular dissimilarity measures in the clustering of samples from normal and Human T Cell Leukemia Virus Type 1 (HTLV-1)-infected patients. The difficulty in clustering spatial data is that the data are multi-dimensional and massive. Sometimes, an automated clustering algorithm may not be sufficient to cluster this type of data. An iterative clustering algorithm along with the capability of visual steering may be a good approach. This case study proposes a new iterative algorithm that combines automated clustering methods such as Bayesian clustering, detection of multivariate outliers, and visual clustering. Simulated data from a plasma experiment and real astronomical data are used to test the performance of the algorithm.

    New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics

    Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. Ongoing progress in more sensitive machines and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching. Analyzing biological samples (e.g. blood) by mass spectrometry generates mass spectra that represent the components (molecules) contained in a sample as masses and their respective relative concentrations. In this work, we are interested in those components that are constant within a group of individuals but differ markedly between individuals of two distinct groups. These distinguishing components, which depend on a particular medical condition, are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, as a combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra. In this thesis we have developed new algorithms for automatic extraction of disease-specific fingerprints from mass spectrometry data. Special emphasis has been put on designing highly sensitive methods with respect to signal detection. Thanks to our statistically based approach, our methods are able to detect signals, such as hormones, even below the noise level inherent in data acquired by common MS machines. To provide access to these new classes of algorithms to collaborating groups, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis and result inspection. To prove the platform's practical relevance, it has been utilized in several clinical studies, two of which are presented in this thesis.
In these studies it could be shown that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon and thyroid) have been detected and validated. The clinical partners emphasize that these results would be impossible with a less sensitive analysis tool (such as the currently available systems). In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset of an individual is about 2.5 gigabytes in size and we have data from hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc Grid, an infrastructure that allows the integration of thousands of computing resources (e.g. desktop computers, computing clusters or specialized hardware, such as IBM's Cell processor in a PlayStation 3).

    Ovarian Cancer Classification based on Mass Spectrometry Analysis of Sera

    In our previous study [1], we compared the performance of a number of widely used discrimination methods for classifying ovarian cancer using Matrix Assisted Laser Desorption Ionization (MALDI) mass spectrometry data on serum samples obtained in Reflectron mode. Our results demonstrated good performance with a random forest classifier. In this follow-up study, to improve the molecular classification power of the MALDI platform for ovarian cancer, we expanded the mass range of the MS data by adding data acquired in Linear mode and evaluated the resultant decrease in classification error. A general statistical framework is proposed to obtain unbiased classification error estimates and to analyze the effects of sample size and number of selected m/z features on classification errors. We also emphasize the importance of combining biological knowledge and statistical analysis to obtain both biologically and statistically sound results.
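One common source of optimistic bias in classification error is selecting m/z features on the full dataset before cross-validating. The sketch below shows the usual remedy, repeating feature selection inside every fold on the training part only. It is not necessarily the framework the authors propose: a nearest-centroid rule stands in for the random forest, features are ranked by the difference of class means, and all function names and parameters (`select_features`, `cv_error`, `n_keep`, `n_folds`) are hypothetical.

```python
from statistics import mean


def select_features(X, y, n_keep):
    """Rank features by |mean(class 1) - mean(class 0)| and keep the top n_keep."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]


def fit_centroids(X, y, feats):
    """Class centroids restricted to the selected features."""
    return {
        c: [mean(row[j] for row, label in zip(X, y) if label == c) for j in feats]
        for c in (0, 1)
    }


def predict(row, cents, feats):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c]))
    return min((0, 1), key=dist)


def cv_error(X, y, n_folds, n_keep):
    """Cross-validated error with feature selection redone in every fold.

    Assumption: every training split contains both classes (labels 0 and 1).
    """
    n = len(X)
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection sees training samples only, so held-out samples stay unseen.
        feats = select_features(X_tr, y_tr, n_keep)
        cents = fit_centroids(X_tr, y_tr, feats)
        errors += sum(predict(X[i], cents, feats) != y[i] for i in test_idx)
    return errors / n
```

Calling `cv_error` over a grid of `n_keep` values and subsample sizes would give a rough picture of how the error depends on the number of selected features and on sample size.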

    Pinpointing new protein and phosphoprotein biomarkers in rheumatoid arthritis by high-resolution label-free mass spectrometry analysis of liquid biopsies

    Rheumatoid arthritis is an autoimmune inflammatory disease that attacks the joints, leading to joint destruction if left untreated. The disease affects 0.5 to 1% of the population in developed countries, causes great impairment to those affected and may even lead to early mortality, since the disease is usually associated with cardiovascular events. Early diagnosis and treatment are of the utmost importance, since it has been shown that people who began treatment within 3 months of disease onset had better outcomes, usually being able to avoid cartilage destruction in the joints. This work has the primary goal of identifying potential biomarkers in serum samples of patients with rheumatoid arthritis. Another objective was to check the phosphoproteome for differentially phosphorylated peptides. To achieve these goals, we employed liquid chromatography-mass spectrometry to perform label-free quantification of the non-phosphorylated fraction of the proteome, as well as to identify the peptide sequences and phosphorylation present in the phosphoproteome of the different conditions. We compared serum samples from patients with rheumatoid arthritis to those from healthy donors and from patients with systemic lupus erythematosus, another autoimmune disease characterised by chronic inflammation. We were able to identify 43 proteins that were differentially expressed between rheumatoid arthritis and healthy subjects; 12 of these were specific to rheumatoid arthritis. We were also able to identify 41 peptides that possessed different phosphorylation patterns between rheumatoid arthritis and healthy subjects. It was also possible to identify a kinase that appears to be active in rheumatoid arthritis, but not in healthy subjects.

    Pilot multi-omic analysis of human bile from benign and malignant biliary strictures: a machine-learning approach

    Cholangiocarcinoma (CCA) and pancreatic adenocarcinoma (PDAC) may lead to the development of extrahepatic obstructive cholestasis. However, biliary stenoses can also be caused by benign conditions, and the identification of their etiology remains a clinical challenge. We performed metabolomic and proteomic analyses of bile from patients with benign (n = 36) and malignant conditions, CCA (n = 36) or PDAC (n = 57), undergoing endoscopic retrograde cholangiopancreatography with the aim of characterizing bile composition in biliopancreatic disease and identifying biomarkers for the differential diagnosis of biliary strictures. Comprehensive analyses of lipids, bile acids and small molecules were carried out using mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (1H-NMR) in all patients. MS analysis of bile proteome was performed in five patients per group. We implemented artificial intelligence tools for the selection of biomarkers and algorithms with predictive capacity. Our machine-learning pipeline included the generation of synthetic data with properties of real data, the selection of potential biomarkers (metabolites or proteins) and their analysis with neural networks (NN). Selected biomarkers were then validated with real data. We identified panels of lipids (n = 10) and proteins (n = 5) that when analyzed with NN algorithms discriminated between patients with and without cancer with an unprecedented accuracy.
    This research was funded by: Instituto de Salud Carlos III (ISCIII) co-financed by Fondo Europeo de Desarrollo Regional (FEDER) Una manera de hacer Europa, grant numbers: PI16/01126 (M.A.A.), PI19/00819 (M.J.M. and J.J.G.M.), PI15/01132, PI18/01075 and Miguel Servet Program CON14/00129 (J.M.B.); Fundación Científica de la Asociación Española Contra el Cáncer (AECC Scientific Foundation), grant name: Rare Cancers 2017 (J.M.U., M.L.M., J.M.B., M.J.M., R.I.R.M., M.G.F.-B., C.B., M.A.A.); Gobierno de Navarra Salud, grant number 58/17 (J.M.U., M.A.A.); La Caixa Foundation, grant name: HEPACARE (C.B., M.A.A.); AMMF The Cholangiocarcinoma Charity, UK, grant number: 2018/117 (F.J.C. and M.A.A.); PSC Partners US, PSC Supports UK, grant number 06119JB (J.M.B.); Horizon 2020 (H2020) ESCALON project, grant number H2020-SC1-BHC-2018–2020 (J.M.B.); BIOEF (Basque Foundation for Innovation and Health Research): EiTB Maratoia, grant numbers BIO15/CA/016/BD (J.M.B.) and BIO15/CA/011 (M.A.A.). Department of Health of the Basque Country, grant number 2017111010 (J.M.B.). La Caixa Foundation, grant number: LCF/PR/HP17/52190004 (M.L.M.), Mineco-Feder, grant number SAF2017-87301-R (M.L.M.), Fundación BBVA grant name: Ayudas a Equipos de Investigación Científica Umbrella 2018 (M.L.M.). MCIU, grant number: Severo Ochoa Excellence Accreditation SEV-2016-0644 (M.L.M.). Part of the equipment used in this work was co-funded by the Generalitat Valenciana and European Regional Development Fund (FEDER) funds (PO FEDER of Comunitat Valenciana 2014–2020). Gobierno de Navarra fellowship to L.C. (Leticia Colyn); AECC post-doctoral fellowship to M.A.; Ramón y Cajal Program contracts RYC-2014-15242 and RYC2018-024475-1 to F.J.C. and M.G.F.-B., respectively. The generous support from Fundación Eugenio Rodríguez Pascual, Fundación Echébano, Fundación Mario Losantos, Fundación M Torres and Mr. Eduardo Avila is acknowledged. The CNB-CSIC Proteomics Unit belongs to ProteoRed, PRB3-ISCIII, supported by grant PT17/0019/0001 (F.J.C.). Comunidad de Madrid Grant B2017/BMD-3817 (F.J.C.). Peer reviewed.