5 research outputs found

    Knowledge-based variable selection for learning rules from proteomic data

    <p>Abstract</p> <p>Background</p> <p>The incorporation of biological knowledge can enhance the analysis of biomedical data. We present a novel method that uses a proteomic knowledge base to enhance the performance of a rule-learning algorithm in identifying putative biomarkers of disease from high-dimensional proteomic mass spectral data. In particular, we use the Empirical Proteomics Ontology Knowledge Base (EPO-KB) that contains previously identified and validated proteomic biomarkers to select <it>m/z</it>s in a proteomic dataset prior to analysis to increase performance.</p> <p>Results</p> <p>We show that using EPO-KB as a pre-processing method, specifically selecting all biomarkers found only in the biofluid of the proteomic dataset, reduces the dimensionality by 95% and provides a statistically significantly greater increase in performance over no variable selection and random variable selection.</p> <p>Conclusion</p> <p>Knowledge-based variable selection even with a sparsely-populated resource such as the EPO-KB increases overall performance of rule-learning for disease classification from high-dimensional proteomic mass spectra.</p>
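The selection step described in this abstract can be illustrated with a minimal sketch. It assumes the knowledge base is reduced to a plain list of biomarker m/z values and that a feature is kept when it lies within a relative tolerance of one of them. The function name `select_known_mz`, the 0.5% tolerance, and the toy values are hypothetical and are not taken from EPO-KB or from the paper.

```python
from bisect import bisect_left


def select_known_mz(mz_values, known_mz, rel_tol=0.005):
    """Return indices of features whose m/z lies within rel_tol
    (relative tolerance, an assumed value) of any known biomarker m/z."""
    known = sorted(known_mz)
    selected = []
    for i, mz in enumerate(mz_values):
        window = mz * rel_tol
        # First known m/z that is not below the lower edge of the window.
        j = bisect_left(known, mz - window)
        if j < len(known) and known[j] <= mz + window:
            selected.append(i)
    return selected


# Toy data: five spectrum features and two knowledge-base entries (hypothetical).
spectrum_mz = [2010.0, 4150.0, 6632.0, 8930.0, 11700.0]
kb_mz = [6631.0, 11728.0]

kept = select_known_mz(spectrum_mz, kb_mz)
print(kept)  # indices of the retained m/z features
```

Only the retained columns would then be passed to the rule learner, which is how a filter of this kind reduces dimensionality before analysis.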

    A Perl procedure for protein identification by Peptide Mass Fingerprinting

    <p>Abstract</p> <p>Background</p> <p>One of the topics of major interest in proteomics is protein identification. Protein identification can be achieved by analyzing the mass spectrum of a protein sample through different approaches. One of them, called Peptide Mass Fingerprinting (PMF), combines mass spectrometry (MS) data with searching strategies in a suitable database of known proteins to provide a list of candidate proteins ranked by a score. To this end, several algorithms and software tools have been proposed. However, the scoring methods, and especially the statistical evaluation of the results, can be significantly improved.</p> <p>Results</p> <p>In this work, a Perl procedure for protein identification by PMF, called MsPI (Mass spectrometry Protein Identification), is presented. The implemented scoring methods were derived from the literature. MsPI implements a strategy to remove the contaminant masses present in the acquired spectra. Moreover, MsPI includes a statistical method that assigns a p-value to each candidate protein in addition to its score. Results obtained by MsPI on a dataset of 10 protein samples were compared with those achieved using two other software tools, i.e. Piums and Mascot. Piums implements one of the scoring methods available in MsPI, while Mascot is one of the most frequently used software tools in the protein identification field. MsPI scripts are available for download from the web site <url>http://aimed11.unipv.it/MsPI</url>.</p> <p>Conclusion</p> <p>The performance of MsPI appears to be better than that of Piums and Mascot. On the considered dataset, MsPI includes the "true" proteins in its candidate list nine times out of ten, whereas Piums does so only four times out of ten. Although Mascot also includes the "true" proteins in its candidate list nine times out of ten, it produces longer candidate lists, and therefore more false positives, when the molecular weight of the proteins in the sample is approximately known (e.g. from a 1-D/2-D electrophoresis gel). Moreover, because MsPI is a Perl tool, it can be easily extended and customized by end users.</p>
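The general idea of PMF scoring can be shown with a deliberately simple sketch: each candidate protein is scored by the number of observed peaks that fall within a mass tolerance of one of its theoretical peptide masses, and candidates are ranked by that count. This is not MsPI's scoring method or its p-value computation, which the abstract does not specify. The function names, the 0.2 Da tolerance, and the toy database are assumptions made for illustration, and Python is used here only for consistency with the other example on this page.

```python
def count_matches(observed, theoretical, tol_da=0.2):
    """Count observed peaks lying within tol_da (assumed tolerance, in Da)
    of at least one theoretical peptide mass."""
    return sum(
        any(abs(o - t) <= tol_da for t in theoretical)
        for o in observed
    )


def rank_candidates(observed, database, tol_da=0.2):
    """Score every candidate by its match count and sort best-first."""
    scores = {
        name: count_matches(observed, masses, tol_da)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy inputs: observed peak masses and theoretical digests (all hypothetical).
observed_peaks = [842.5, 1045.6, 1479.8, 2211.1]
toy_database = {
    "protein_A": [842.51, 1045.55, 1479.79, 1800.0],
    "protein_B": [842.4, 1300.2, 2211.0],
    "protein_C": [500.0, 600.0],
}

ranking = rank_candidates(observed_peaks, toy_database)
for name, score in ranking:
    print(name, score)
```

A real tool would also filter contaminant masses before matching and attach a significance estimate to each score, as the abstract describes for MsPI.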

    BIOINFORMATICS APPROACHES TO MALDI-TOF MASS SPECTROMETRY DATA ANALYSIS

    Despite the increasing performance of mass spectrometry (MS) and other analytical tools, only a few biomarkers have been validated and proved to be robust and clinically relevant; indeed, a large number of proteomic biomarkers have been described, but they have not yet been clinically implemented [1]. MALDI-TOF MS appears to be one of the more powerful tools for biomarker discovery [2, 3], and it has attractive clinical properties, for instance the possibility of searching directly in peripheral fluids for proteins related to an altered physiological state: samples (urine, plasma, serum, etc.) can be collected easily and cheaply by non-invasive, or minimally invasive, methods [4]. A combination of biomarkers is now considered more informative than a single biomarker [5, 6], and improvements in the bioinformatics analysis of MS data could help this investigation by decreasing the cost and time necessary for each discovery [7]. The problems related to the analysis of (MALDI-TOF) MS data can be approached in two ways, either by increasing the number of available samples or by reducing the complexity of the problem [8]. In the first case, we developed an approach to compare small datasets from different sources (i.e. hospitals), based on mutual information and mass spectra alignment, that showed a significant performance increase compared to the competing approaches tested. In the second case, we developed novel methods and approaches to compare MALDI-TOF MS profiles of normal subjects and Renal Cell Carcinoma (RCC) patients, with the goal of isolating the most interesting subset of small proteins and peptides from the whole analysed peptidome. MS-based profiling is in fact able to detect differentially expressed proteins or peptides during physiological and pathological processes. Every MALDI-TOF MS spectrum, which reports the relative abundance of sample analytes, can be considered a snapshot of the sample's peptidome in a defined mass range. The relationship between mass/charge ratio (m/z) and the concentration of detected peptides can be represented by networks. Tumor cases and control subjects show different peptidome profiles, due to differences in the biomolecular and/or biochemical features of cancer cells, so the networks that describe them will differ. We use graphs to create network representations of the data and to evaluate network properties. We explore these properties by comparing case and control datasets, and by subdividing cases into the different histological subtypes of RCC, clear cell RCC (ccRCC) and not-ccRCC, using different methods for network creation and analysis and for the evaluation of results. We identify, for each dataset (controls, ccRCC and not-ccRCC), some interesting mass ranges within which we believe biomarker signals should be sought. In conclusion, we have developed a set of methods which we believe improve on current computational approaches for the analysis of mass spectrometry data. These results have been published or presented at workshops and conferences.

    Analytical Techniques for the Improvement of Mass Spectrometry Protein Profiling

    Bioinformatics is rapidly advancing through the "post-genomic" era following the sequencing of the human genome. In preparation for studying the inner workings behind genes, proteins and even smaller biological elements, several subdivisions of bioinformatics have developed. The subdivision of proteomics, concerning the structure and function of proteins, has been aided by the mass spectrometry data source. Biofluid or tissue samples are rapidly assayed for their protein composition. The resulting mass spectra are analyzed using machine learning techniques to discover reliable patterns which discriminate samples from two populations, for example, healthy or diseased, or treatment responders versus non-responders. However, this data source is imperfect and faces several challenges: unwanted variability arising from the data collection process, obtaining a robust discriminative model that generalizes well to future data, and validating a predictive pattern statistically and biologically. This thesis presents several techniques which attempt to intelligently deal with the problems facing each stage of the analytical process. First, an automatic preprocessing method selection system is demonstrated. This system learns from data and selects a combination of preprocessing methods which is most appropriate for the task at hand. This reduces the noise affecting potential predictive patterns. Our results suggest that this method can help adapt to data from different technologies, improving downstream predictive performance. Next, the issues of feature selection and predictive modeling are revisited with respect to the unique challenges posed by proteomic profile data. Approaches to model selection through kernel learning are also investigated. Key insights are obtained for designing the feature selection and predictive modeling portion of the analytical framework. Finally, methods for interpreting the results of predictive modeling are demonstrated. These methods are used to assure the user of various desirable properties: validation of the strength of a predictive model, validation of reproducible signal across multiple data generation sessions and generalizability of predictive models to future data. A method for labeling profile features with biological identities is also presented, which aids in the interpretation of the data. Overall, these novel techniques give the protein profiling community additional support and leverage to aid the predictive capability of the technology.

    A New Approach for the Analysis of Mass Spectrometry Data for Biomarker Discovery

    In the last few years a growing interest has been devoted to disease diagnosis based on proteomic profiles of body fluids generated by mass spectrometry. In this work, we will present a new approach for their analysis for biomarker discovery. In particular, we will describe a new strategy for the analysis of SELDI/MALDI-TOF serum data based on the following three steps: i) data-preprocessing, ii) feature (mass/charge ratio, m/z) reduction and selection, iii) association of the selected features to a list of compatible known proteins. The method is applied to an ovarian cancer dataset