3,636 research outputs found

    Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data

    Background: High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional datasets. In a clinical setting, one is often interested in how mass spectra differ between patients of different classes, for example spectra from healthy patients vs. spectra from patients having a particular disease. Machine learning algorithms are needed to (a) identify these discriminating features and (b) classify unknown spectra based on this feature set. Since the acquired data are usually noisy, the algorithms should be robust against noise and outliers, while the identified feature set should be as small as possible. Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based on the theory of compressed sensing, that allows us to identify a minimal discriminating set of features from mass spectrometry datasets. We show (1) how our method performs on artificial and real-world datasets, (2) that its performance is competitive with standard (and widely used) algorithms for analyzing proteomics data, and (3) that it is robust against random and systematic noise. We further demonstrate the applicability of our algorithm to two previously published clinical datasets.
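    The core idea SPA builds on is shared by many compressed-sensing-style approaches: a sparsity-inducing penalty leaves only a small set of discriminating m/z features with non-zero weight. The sketch below is a minimal illustration of that general idea using an l1-penalised logistic regression on synthetic binned spectra; the data shapes, the injected signal, and the regularization strength are assumptions for demonstration and do not reproduce the SPA algorithm itself.

```python
# Minimal sketch (not the SPA implementation): an l1-penalised logistic
# regression drives most feature weights to zero, leaving a small set of
# discriminating m/z bins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_spectra, n_bins = 200, 5000                      # assumed toy dimensions
X = rng.normal(size=(n_spectra, n_bins))           # binned spectra (rows = patients)
y = rng.integers(0, 2, size=n_spectra)             # 0 = healthy, 1 = diseased
informative = [10, 250, 1999, 4200]                # assumed informative bins
X[np.ix_(y == 1, informative)] += 1.5              # inject a weak class signal

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
selected = np.flatnonzero(clf.coef_[0])            # indices of non-zero weights
print(f"{selected.size} of {n_bins} bins selected:", selected[:20])
```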

    The Path to Clinical Proteomics Research: Integration of Proteomics, Genomics, Clinical Laboratory and Regulatory Science

    Better biomarkers are urgently needed for cancer detection, diagnosis, and prognosis. While the genomics community is making significant advances in understanding the molecular basis of disease, proteomics will delineate the functional units of a cell: proteins, their intricate interaction networks, and the signaling pathways underlying disease. Great progress has been made in characterizing thousands of proteins qualitatively and quantitatively in complex biological systems by utilizing multi-dimensional sample fractionation strategies, mass spectrometry, and protein microarrays. Comparative/quantitative analysis of high-quality clinical biospecimens (e.g., tissue and biofluids) covering the human cancer proteome landscape has the potential to reveal protein/peptide biomarkers responsible for this disease by means of their altered levels of expression, post-translational modifications, as well as different forms of protein variants. Despite technological advances in proteomics, major hurdles still exist in every step of the biomarker development pipeline. The National Cancer Institute's Clinical Proteomic Technologies for Cancer initiative (NCI-CPTC) has taken a critical step to close the gap between biomarker discovery and qualification by introducing a pre-clinical "verification" stage in the pipeline, partnering with clinical laboratory organizations to develop and implement common standards, and developing regulatory science documents with the US Food and Drug Administration to educate the proteomics community on analytical evaluation requirements for multiplex assays, in order to ensure the safety and effectiveness of these tests for their intended use.

    Evaluation of peak-picking algorithms for protein mass spectrometry

    Peak picking is an early key step in MS data analysis. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. The methods investigated encompass a signal-to-noise ratio criterion, the continuous wavelet transform, and a correlation-based approach using a Gaussian template. The functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set acquired with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
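    Two of the compared strategies, signal-to-noise thresholding and continuous-wavelet-transform peak detection, can be sketched in a few lines. The example below is a hedged illustration on a synthetic one-dimensional trace using SciPy; the noise estimate (median absolute deviation), the SNR threshold, and the wavelet widths are assumptions rather than the settings used in the paper.

```python
# Illustrative peak picking on a synthetic 1-D intensity trace (not the
# paper's exact implementations).
import numpy as np
from scipy.signal import find_peaks, find_peaks_cwt

rng = np.random.default_rng(1)
mz = np.linspace(1000, 1100, 2000)
trace = (50 * np.exp(-(mz - 1020) ** 2 / 0.5)      # two synthetic peaks
         + 30 * np.exp(-(mz - 1063) ** 2 / 0.5)
         + rng.normal(scale=1.0, size=mz.size))    # additive noise

# (1) Signal-to-noise ratio: robust noise estimate via the median absolute
# deviation, keep local maxima above a fixed SNR threshold.
noise = 1.4826 * np.median(np.abs(trace - np.median(trace)))
snr_peaks, _ = find_peaks(trace, height=5 * noise)

# (2) Continuous wavelet transform: keep peaks that persist across widths.
cwt_peaks = find_peaks_cwt(trace, widths=np.arange(2, 20))

print("SNR picks at m/z:", mz[snr_peaks])
print("CWT picks at m/z:", mz[cwt_peaks])
```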

    DPHL: A DIA Pan-human Protein Mass Spectrometry Library for Robust Biomarker Discovery

    To address the increasing need for detecting and validating protein biomarkers in clinical specimens, mass spectrometry (MS)-based targeted proteomic techniques, including selected reaction monitoring (SRM), parallel reaction monitoring (PRM), and massively parallel data-independent acquisition (DIA), have been developed. For optimal performance, they require the fragment ion spectra of targeted peptides as prior knowledge. In this report, we describe an MS pipeline and spectral resource to support targeted proteomics studies of human tissue samples. To build the spectral resource, we integrated common open-source MS computational tools to assemble a freely accessible, Docker-based computational workflow. We then applied the workflow to generate DPHL, a comprehensive DIA pan-human library, from 1096 data-dependent acquisition (DDA) MS raw files for 16 types of cancer samples. This extensive spectral resource was then applied to a proteomic study of 17 prostate cancer (PCa) patients. Thereafter, PRM validation was applied to a larger study of 57 PCa patients, and the differential expression of three proteins in prostate tumors was validated. As a second application, the DPHL spectral resource was applied to a study consisting of plasma samples from 19 diffuse large B cell lymphoma (DLBCL) patients and 18 healthy control subjects. Differentially expressed proteins between DLBCL patients and healthy control subjects were detected by DIA-MS and confirmed by PRM. These data demonstrate that DPHL supports DIA and PRM MS pipelines for robust protein biomarker discovery. DPHL is freely accessible at https://www.iprox.org/page/project.html?id=IPX0001400000
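    DPHL itself is a spectral library plus a Docker-packaged workflow; the downstream group comparisons described above (DLBCL plasma vs. healthy controls) amount to a standard differential-expression test on a protein quantification matrix. The following sketch shows one hypothetical way such a test could look, using Welch's t-test with Benjamini-Hochberg correction; the data-frame layout, column names, and thresholds are assumptions and not part of the published DPHL pipeline.

```python
# Hypothetical sketch of a group comparison on a DIA protein quantification
# matrix (e.g. DLBCL plasma vs. healthy controls); the data layout and the
# 5% FDR threshold are assumptions, not part of the DPHL resource itself.
import numpy as np
import pandas as pd
from scipy import stats

def differential_proteins(quant: pd.DataFrame, case_cols, control_cols,
                          alpha: float = 0.05) -> pd.DataFrame:
    """quant: proteins x samples matrix of log2 intensities."""
    case = quant[case_cols].to_numpy()
    ctrl = quant[control_cols].to_numpy()
    t, p = stats.ttest_ind(case, ctrl, axis=1, equal_var=False)  # Welch's t-test

    # Benjamini-Hochberg FDR correction
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]           # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(ranked, 0, 1)

    result = pd.DataFrame({"log2FC": case.mean(axis=1) - ctrl.mean(axis=1),
                           "p": p, "q": q}, index=quant.index)
    return result[result["q"] < alpha].sort_values("q")
```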

    Revisiting biomarker discovery by plasma proteomics

    Clinical analysis of blood is the most widespread diagnostic procedure in medicine, and blood biomarkers are used to categorize patients and to support treatment decisions. However, existing biomarkers are far from comprehensive and often lack specificity, and new ones are being developed at a very slow rate. As described in this review, mass spectrometry (MS)-based proteomics has become a powerful technology in biological research, and it is now poised to allow the characterization of the plasma proteome in great depth. Previous "triangular" strategies aimed at discovering single biomarker candidates in small cohorts, followed by classical immunoassays in much larger validation cohorts. We propose a "rectangular" plasma proteome profiling strategy, in which the proteome patterns of large cohorts are correlated with their phenotypes in health and disease. Translating such concepts into clinical practice will require restructuring several aspects of diagnostic decision-making, and we discuss some first steps in this direction.
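    The "rectangular" strategy described above correlates cohort-wide proteome patterns with phenotypes rather than chasing single candidates. A toy, hypothetical sketch of that idea, ranking every plasma protein by its association with a phenotype across a cohort, might look as follows; the data layout and the choice of Spearman correlation are assumptions for illustration only.

```python
# Toy sketch of "rectangular" profiling: rank every plasma protein by its
# association with a phenotype across the whole cohort. Data layout and the
# use of Spearman correlation are assumptions for illustration.
import pandas as pd
from scipy import stats

def rank_proteins_by_phenotype(quant: pd.DataFrame,
                               phenotype: pd.Series) -> pd.DataFrame:
    """quant: samples x proteins; phenotype: one numeric value per sample."""
    rows = []
    for protein in quant.columns:
        r, p = stats.spearmanr(quant[protein], phenotype.loc[quant.index])
        rows.append((protein, r, p))
    return (pd.DataFrame(rows, columns=["protein", "spearman_r", "p_value"])
              .sort_values("p_value"))
```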

    Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine

    Despite tremendous successes, pitfalls have been observed in every step of a clinical metabolomics workflow, which impedes the internal validity of such studies. Furthermore, the demand for logistics, instrumentation, and computational resources for metabolic phenotyping studies has far exceeded expectations. In this conceptual review, we cover the barriers common to metabolomics-based clinical studies and suggest potential solutions in the hope of enhancing study robustness, usability, and transferability. The importance of quality assurance and quality control procedures is discussed, followed by a practical rule containing five phases, including two additional "pre-pre-" and "post-post-" analytical steps. In addition, we elucidate the potential involvement of machine learning and demonstrate that the need for automated data mining algorithms to improve the quality of future research is undeniable. Consequently, we propose a comprehensive metabolomics framework, along with an appropriate checklist refined from current guidelines and our previously published assessment, in an attempt to accurately translate achievements in metabolomics into clinical and epidemiological research. Furthermore, the integration of multifaceted multi-omics approaches, with metabolomics as the pillar member, is urgently needed. By combining these with other social or nutritional factors, we can gather complete omics profiles for a particular disease. Our discussion reflects the current obstacles and potential solutions toward the progressing trend of utilizing metabolomics in clinical research to create the next-generation healthcare system.

    Biomarker Discovery and Validation for Proteomics and Genomics: Modeling And Systematic Analysis

    Discovery and validation of protein biomarkers with high specificity is the main challenge of current proteomics studies. Different mass spectrometry models are used as shotgun tools for the discovery of biomarkers, which is usually done on a small number of samples. In the discovery phase, feature selection plays a key role. The first part of this work focuses on the feature selection problem and proposes a new branch-and-bound algorithm based on the U-curve assumption. The U-curve branch-and-bound algorithm (UBB) for optimization was introduced recently by Barrera and collaborators. In this work we introduce an improved algorithm (IUBB) for finding the optimal set of features based on the U-curve assumption. The results for a set of U-curve problems, generated from a cost model, show that the IUBB algorithm makes fewer evaluations and is more robust than the original UBB algorithm. The two algorithms are also compared in finding the optimal features of a real classification problem designed using the data model. The results show that IUBB outperforms UBB in finding the optimal feature sets. On the other hand, the results indicate that the performance of the error estimator is crucial to the success of the feature selection algorithm. The next section of the work therefore studies the effect of the complexity of the decision boundary on the performance of error estimation methods. First, a model is developed which quantifies the complexity of a classification problem purely in terms of the geometry of the decision boundary, without relying on the Bayes error. Then, this model is used in a simulation study to analyze the bias and root-mean-square error (RMS) of several widely used error estimation methods relative to the complexity of the decision boundary. The results show that all the estimation methods lose accuracy as complexity increases. Validation of a set of selected biomarkers from a list of candidates is an important stage in the biomarker identification pipeline and is the focus of the next section of this work. This section analyzes the selected reaction monitoring (SRM) pipeline in a systematic fashion, by modeling the main stages of the biomarker validation process. The proposed models for SRM and the protein mixture are then used to study the effect of different parameters on the final performance of biomarker validation. We focus on the sensitivity of the SRM pipeline to the working parameters, in order to identify the bottlenecks where time and energy should be spent in designing the experiment.
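    The U-curve assumption underlying UBB and IUBB states that, along a nested chain of feature sets, the estimated error first falls and then rises, which is what allows branches of the search to be pruned. The sketch below illustrates only that assumption with a simple chain search that stops once the cross-validated error starts to climb; it is not the UBB or IUBB branch-and-bound algorithm, and the classifier, synthetic data set, and patience parameter are assumptions.

```python
# Illustration of the U-curve assumption only (not the UBB/IUBB algorithms):
# grow a nested chain of feature sets and stop once the cross-validated
# error starts to rise.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def chain_search(X, y, candidate_order, patience=1):
    """Grow a feature chain in the given order; stop when the CV error rises."""
    best_feats, best_err = [], 1.0
    feats, rises = [], 0
    for f in candidate_order:
        feats = feats + [int(f)]
        err = 1.0 - cross_val_score(KNeighborsClassifier(3),
                                    X[:, feats], y, cv=5).mean()
        if err < best_err:
            best_feats, best_err, rises = list(feats), err, 0
        else:
            rises += 1
            if rises > patience:          # U-curve: the cost began its ascent
                break
    return best_feats, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = (X[:, 3] + X[:, 7] > 0).astype(int)                     # two informative features
order = np.argsort(-np.abs(np.corrcoef(X.T, y)[-1, :-1]))   # univariate ranking
print(chain_search(X, y, candidate_order=order))
```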

    Quantification and Simulation of Liquid Chromatography-Mass Spectrometry Data

    Computational mass spectrometry is a fast-evolving field that has attracted increased attention over the last couple of years. The performance of software solutions determines the success of analysis to a great extent. New algorithms are required to reflect new experimental procedures and deal with new instrument generations. One essential component of algorithm development is the validation (as well as comparison) of software on a broad range of data sets. This requires a gold standard (or so-called ground truth), which is usually obtained by manual annotation of a real data set. Comprehensive manually annotated public data sets for mass spectrometry data are labor-intensive to produce, and their quality strongly depends on the skill of the human expert. Some parts of the data may even be impossible to annotate due to high levels of noise or other ambiguities. Furthermore, manually annotated data is usually not available for all steps in a typical computational analysis pipeline. We thus developed the most comprehensive simulation software to date, which can generate multiple levels of ground truth and features a plethora of settings to reflect experimental conditions and instrument settings. The simulator is used to generate several distinct types of data. The data are subsequently employed to evaluate existing algorithms. Additionally, we employ simulation to determine the influence of instrument attributes and sample complexity on the ability of algorithms to recover information. The results give valuable hints on how to optimize experimental setups. Furthermore, this thesis introduces two quantitative approaches, namely a decharging algorithm based on integer linear programming and a new workflow for identification of differentially expressed proteins for a large in vitro study on toxic compounds. Decharging infers the uncharged mass of a peptide (or protein) by clustering all its charge variants, which occur frequently under certain experimental conditions. We employ simulation to show that decharging is robust against missing values even for high-complexity data and that the algorithm outperforms other solutions in terms of mass accuracy and run time on real data. The last part of this thesis deals with a new state-of-the-art workflow for protein quantification based on isobaric tags for relative and absolute quantitation (iTRAQ). We devise a new approach to isotope correction, propose an experimental design, introduce new metrics of iTRAQ data quality, and confirm putative properties of iTRAQ data using a novel approach. All tools developed as part of this thesis are implemented in OpenMS, a C++ library for computational mass spectrometry.
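    Decharging, as described above, infers the neutral mass of a peptide by grouping its charge variants. A minimal sketch of that idea, assuming positive-mode [M + zH]z+ ions and a simple tolerance-based binning, is given below; the thesis formulates this as an integer linear program, which the greedy grouping here does not reproduce.

```python
# Hedged sketch of decharging (not the thesis' integer-linear-programming
# formulation): try candidate charge states for each feature, compute the
# implied neutral mass, and group features whose masses agree within a
# tolerance.
from collections import defaultdict

PROTON = 1.007276466  # proton mass in Da

def neutral_mass(mz: float, z: int) -> float:
    """Neutral mass of an [M + zH]^z+ ion observed at m/z."""
    return z * mz - z * PROTON

def decharge(mz_values, charges=(1, 2, 3, 4), tol=0.01):
    """Group observed m/z values by the neutral mass they imply."""
    clusters = defaultdict(list)
    for mz in mz_values:
        for z in charges:
            mass = neutral_mass(mz, z)
            clusters[round(mass / tol)].append((mz, z, mass))  # coarse binning
    # keep mass bins supported by at least two distinct charge states
    return {k: v for k, v in clusters.items()
            if len({z for _, z, _ in v}) >= 2}

# 1001.5073 at z=1 and 501.2573 at z=2 both imply a neutral mass of ~1000.50 Da
print(decharge([1001.5073, 501.2573, 768.40]))
```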