7 research outputs found

    Latent protein trees

    Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. The output of these measurements after preprocessing is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways; therefore, there is the potential for very complex and informative correlational structure inherent in these data. Recent attempts to utilize these data often focus on the identification of single features that are associated with a particular phenotype relevant to the experiment. However, to date, there have been no published approaches that directly model what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model which is specifically designed to model such correlation structure in unbiased, label-free proteomics. This model utilizes partial identification information from peptide sequencing and database lookup as well as the observed correlation in the data to appropriately compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/13-AOAS639.
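
    The abstract describes compressing correlated peptide-level features into latent proteins but includes no code, so the following is a minimal, hypothetical sketch of that idea rather than the authors' hierarchical Bayesian model: peptide intensities assumed to map to one protein are summarized by their leading principal component. The function name compress_to_latent_protein and the simulated data are illustrative assumptions.

```python
# Illustrative sketch only: compress correlated peptide features into a
# single latent "protein" score per sample via the leading principal
# component of the peptide block. This is NOT the hierarchical Bayesian
# model described in the abstract, just a simple analogue of the idea.
import numpy as np

def compress_to_latent_protein(peptide_intensities):
    """peptide_intensities: (n_samples, n_peptides) log-intensity matrix
    for peptides putatively mapping to one protein.
    Returns a length-n_samples latent protein score."""
    X = peptide_intensities - peptide_intensities.mean(axis=0)
    # Leading left-singular vector scaled by its singular value captures
    # the dominant shared signal across the peptides.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    latent = U[:, 0] * s[0]
    # Orient the score so it agrees with the mean peptide intensity.
    if np.corrcoef(latent, peptide_intensities.mean(axis=1))[0, 1] < 0:
        latent = -latent
    return latent

# Toy usage: 12 samples, 5 peptides sharing one underlying protein signal.
rng = np.random.default_rng(0)
true_protein = rng.normal(size=12)
peptides = true_protein[:, None] * rng.uniform(0.5, 1.5, size=5) \
    + rng.normal(scale=0.3, size=(12, 5))
score = compress_to_latent_protein(peptides)
print(np.round(np.corrcoef(score, true_protein)[0, 1], 2))  # close to 1
```

    In this toy setting the recovered score tracks the simulated protein signal closely, which is all the sketch is intended to show; the published model additionally uses identification information from peptide sequencing and database lookup, which this sketch ignores.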

    Statistical Methods for Mass Spectrometry Proteomics

    "DNA makes RNA makes proteins" is the central dogma of molecular biology. While the measurement of RNA has dominated the landscape of scientific inquiry for many years, often the true outcome of interest is the final protein product. Microarray and RNA-seq studies do not tell researchers anything about what happens during and after translation. For this reason, interest in directly measuring the proteome has flourished. Unfortunately, the direct analysis of proteins often creates a complicated inferential situation. When scientists want to see the whole proteome (or at least a large unknown sample of the proteome), mass spectrometry is often the most powerful technology available. Mass spectrometers allow researchers to separate proteins from complex samples and obtain information about the relative abundance of around 10,000 proteins in a given experiment. However, the analysis of mass spectrometry proteomics data involves a complicated statistical inference problem. Inference is made on relative protein abundance by examining protein fragments called peptides. This inference problem is complicated by the two intrinsic statistical difficulties of proteomics: matched pairs and non-ignorable missingness, which combine to create unexpected challenges for statisticians. Here I will discuss the complexities of modeling mass spectrometry proteomics and provide new methods to improve both the accuracy and depth of protein estimation. Beyond point estimation, great interest has developed in the proteomics community regarding the clustering of high-throughput data. Although the strange nature of proteomics data likely causes unique problems for clustering algorithms, we found that work needed to be done on the statistical interpretation of clustering before any special cases could be considered. For this reason we have explored clustering from a statistical framework and used this foundation to establish new measures of clustering performance. These indices allow for the interpretation of a clustering problem in the commonly understood framework of sensitivity and specificity.
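
    The abstract frames clustering performance in the familiar language of sensitivity and specificity; the particular indices developed in the thesis are not reproduced here, so the sketch below only shows the standard pair-counting construction of that idea, in which every pair of items sharing a reference label counts as a positive. The function name and toy partitions are illustrative assumptions.

```python
# Illustrative pair-counting view of clustering accuracy: a pair of items
# is a "positive" if it shares a label in the reference partition; the
# candidate clustering's pairs are then scored as TP/FP/TN/FN. This is a
# generic construction, not necessarily the indices proposed in the thesis.
from itertools import combinations

def pairwise_sensitivity_specificity(reference, clustering):
    """reference, clustering: equal-length sequences of cluster labels."""
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_clu = clustering[i] == clustering[j]
        if same_ref and same_clu:
            tp += 1
        elif same_clu:          # clustered together but not in reference
            fp += 1
        elif same_ref:          # together in reference but split apart
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Toy partitions of five items: one true pair is split, one false pair added.
print(pairwise_sensitivity_specificity([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))
```

    For these toy partitions the function returns (0.5, 0.875): half of the truly co-clustered pairs are recovered, and one of the eight non-pairs is wrongly merged.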

    Metaprotein expression modeling for label-free quantitative proteomics

    Background: Label-free quantitative proteomics holds a great deal of promise for the future study of both medicine and biology. However, the data generated is extremely intricate in its correlation structure, and its proper analysis is complex. There are issues with missing identifications. There are high levels of correlation between many, but not all, of the peptides derived from the same protein. Additionally, there may be systematic shifts in the sensitivity of the machine between experiments or even through time within the duration of a single experiment.

    Results: We describe a hierarchical model for analyzing unbiased, label-free proteomics data which utilizes the covariance of peptide expression across samples as well as MS/MS-based identifications to group peptides, a strategy we call metaprotein expression modeling. Our metaprotein model acknowledges the possibility of misidentifications, post-translational modifications and systematic differences between samples due to changes in instrument sensitivity or differences in total protein concentration. In addition, our approach allows us to validate findings from unbiased, label-free proteomics experiments with further unbiased, label-free proteomics experiments. Finally, we demonstrate the clinical/translational utility of the model for building predictors capable of differentiating biological phenotypes as well as for validating those findings in the context of three novel cohorts of patients with Hepatitis C.

    Conclusions: Mass-spectrometry proteomics is quickly becoming a powerful tool for studying biological and translational questions. Making use of all of the information contained in a particular set of data will be critical to the success of those endeavors. Our proposed model represents an advance in the ability of statistical models of proteomic data to identify and utilize correlation between features. This allows validation of predictors without translation to targeted assays, in addition to informing the choice of targets when it is appropriate to generate those assays.
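
    As a rough, purely illustrative analogue of the grouping step described above (not the paper's metaprotein model, which also uses MS/MS-based identifications and allows for misidentifications), the sketch below groups peptides by the correlation of their intensity profiles across samples, using average-linkage hierarchical clustering on a 1 - correlation distance. The function name, threshold, and toy data are assumptions made for illustration.

```python
# Illustrative only: group peptides into candidate "metaprotein"-like sets
# by the correlation of their intensity profiles across samples. MS/MS
# identifications, which the metaprotein model also exploits, are ignored.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_groups(intensities, max_dissimilarity=0.3):
    """intensities: (n_samples, n_peptides) matrix of log intensities.
    Returns one group label per peptide."""
    corr = np.corrcoef(intensities, rowvar=False)       # peptide x peptide
    dist = np.clip(1.0 - corr, 0.0, 2.0)                # 1 - correlation
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=max_dissimilarity, criterion="distance")

# Toy usage: peptides 0-2 share one latent signal, peptides 3-4 another.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 20))
X = np.column_stack([a, a, a, b, b]) + rng.normal(scale=0.2, size=(20, 5))
print(correlation_groups(X))  # e.g. [1 1 1 2 2]
```

    Cutting the tree at a small dissimilarity keeps only tightly co-varying peptides in the same candidate group, which is the intuition behind letting covariance across samples, rather than identifications alone, drive the grouping.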

    Analysing datafied life

    Our life is being increasingly quantified by data. To obtain information from quantitative data, we need to develop various analysis methods, which can be drawn from diverse fields, such as computer science, information theory and statistics. This thesis investigates methods for analysing data generated for medical research, with particular attention to how various kinds of data can be used to quantify patients for personalized treatment. From the perspective of data type, this thesis proposes analysis methods for data from the fields of bioinformatics and medical imaging. We discuss the need to use data from the molecular level to the pathway level and to incorporate medical imaging data. Different preprocessing methods should be developed for different data types, while some post-processing steps for various data types, such as classification and network analysis, can be handled by a generalized approach. From the perspective of research questions, this thesis studies methods for answering five typical questions, ordered from simple to complex. These questions are detecting associations, identifying groups, constructing classifiers, deriving connectivity and building dynamic models. Each research question is studied in a specific field. For example, detecting associations is investigated for fMRI signals. However, the proposed methods can be naturally extended to solve questions in other fields. This thesis demonstrates that applying a method traditionally used in one field to a new field can yield many new insights. Five main research contributions, addressing the different research questions, have been made in this thesis. First, to detect active brain regions associated with tasks from fMRI signals, a new significance index, the CR-value, has been proposed; it originates from the idea of using sparse modelling in gene association studies. Secondly, in quantitative proteomics analysis, a clustering-based method has been developed to extract more information from large-scale datasets than traditional methods: clustering methods, which are usually used to find subgroups of samples or features, are used to match similar identities across samples. Thirdly, a pipeline originally proposed in bioinformatics has been adapted to the multivariate analysis of fMRI signals. Fourthly, the concept of elastic computing in computer science has been used to develop a new method for generating functional connectivity from fMRI data. Finally, sparse signal recovery methods from the domain of signal processing are suggested to solve the underdetermined problem of network model inference.
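
    The final contribution mentions sparse signal recovery for the underdetermined problem of network model inference; one common, generic reading of that idea (not necessarily the formulation used in the thesis) regresses each node's signal on all other nodes with an L1 penalty and treats nonzero coefficients as candidate edges. The sketch below uses scikit-learn's Lasso purely as an illustration, and the function name and toy data are assumptions.

```python
# Generic sketch of sparse network inference: regress each node on the
# remaining nodes with an L1 penalty; nonzero coefficients suggest edges.
# This only illustrates the general "sparse signal recovery" idea.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_connectivity(signals, alpha=0.1):
    """signals: (n_timepoints, n_nodes) array.
    Returns an (n_nodes, n_nodes) weight matrix with a zero diagonal."""
    n_nodes = signals.shape[1]
    weights = np.zeros((n_nodes, n_nodes))
    for target in range(n_nodes):
        predictors = np.delete(signals, target, axis=1)
        model = Lasso(alpha=alpha).fit(predictors, signals[:, target])
        weights[target, np.arange(n_nodes) != target] = model.coef_
    return weights

# Toy usage: node 2 is (noisily) driven by nodes 0 and 1.
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 3))
x[:, 2] = 0.8 * x[:, 0] - 0.6 * x[:, 1] + rng.normal(scale=0.1, size=200)
print(np.round(sparse_connectivity(x, alpha=0.05), 2))
```

    With the L1 penalty, weakly supported coefficients are shrunk to zero, which is what makes such a regression usable even when the number of potential connections exceeds the number of observations.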