755 research outputs found

    EFSIS: Ensemble Feature Selection Integrating Stability

    Ensemble learning, which combines the predictions from multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than the individual learners. This ensemble logic has recently also been applied more widely in feature selection. There are two basic strategies for ensemble feature selection: data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features consistently ranked highly across those data subsets. This has been found to improve both the stability of the selector and the prediction accuracy for a classifier. Function perturbation frees the user from having to decide on the most appropriate selector for any given situation and works by aggregating multiple selectors. This has been found to maintain or improve classification performance. Here we propose a framework, EFSIS, combining these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability. Comment: 20 pages, 3 figures.
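The two strategies named in this abstract can be illustrated with a minimal sketch. This is not the EFSIS algorithm itself, whose details the abstract does not give: the two scoring functions, the bootstrap resampling, the rank-sum aggregation, and every name below (`mean_diff_score`, `correlation_score`, `ensemble_select`) are assumptions chosen only to show data perturbation and function perturbation working together.

```python
import random
from statistics import mean


def mean_diff_score(column, labels):
    """Hypothetical selector 1: absolute difference between class means."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:  # a resample may contain only one class
        return 0.0
    return abs(mean(pos) - mean(neg))


def correlation_score(column, labels):
    """Hypothetical selector 2: absolute Pearson correlation with the label."""
    mx, my = mean(column), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
    vx = sum((x - mx) ** 2 for x in column)
    vy = sum((y - my) ** 2 for y in labels)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def ranks(scores):
    """Convert scores to ranks, where rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for position, j in enumerate(order):
        rank[j] = position
    return rank


def ensemble_select(X, y, k, n_resamples=20, seed=0):
    """Return the indices of the k features with the lowest summed rank."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    selectors = [mean_diff_score, correlation_score]  # function perturbation
    rank_sum = [0] * d
    for _ in range(n_resamples):  # data perturbation: bootstrap resamples
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        columns = [[row[j] for row in Xb] for j in range(d)]
        for selector in selectors:
            scores = [selector(col, yb) for col in columns]
            for j, r in enumerate(ranks(scores)):
                rank_sum[j] += r
    return sorted(range(d), key=lambda j: rank_sum[j])[:k]
```

Calling `ensemble_select(X, y, k)` with `X` as a list of feature rows and `y` as a list of 0/1 labels returns the `k` feature indices that rank highest on average across all resamples and both selectors. Summing ranks rather than raw scores is one simple way to combine selectors whose scores live on different scales.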

    Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data

    Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. To classify the MS samples, the proteins or peptides that are expressed differently between the classes must be identified (biomarker detection). However, due to the high dimensionality of the data and the small number of samples, classification of MS data is extremely challenging. Another important aspect of biomarker detection is the verification of the detected biomarkers, which acts as an intermediate step before passing these biomarkers to the experimental validation stage. Biomarker detection aims at altering the input space of the learning algorithm to improve the classification of proteomic or metabolomic data. This task is performed through feature manipulation. Feature manipulation consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm that has the intrinsic capability for the three aspects of feature manipulation. The ability of GP for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis, therefore, proposes an embedded methodology for these three aspects of feature manipulation in high dimensional MS data using GP. The thesis also presents a method for biomarker verification using GP. The thesis investigates the use of GP for both single-objective and multi-objective feature selection and construction. In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses GP's capability to combine the advantages of different feature ranking metrics and evolve a new ranking scheme for the subset of features selected from the top-ranked features.
The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers. In feature construction, this thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed new algorithm outperforms two feature selection algorithms. In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method can obtain better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. This thesis also develops the first multi-objective multiple feature construction algorithm for MS data. The proposed method aims at both maximising the classification performance and minimising the cardinality of the constructed new high-level features. The results show that GP can discover the complex relationships between the features and can significantly improve classification performance and reduce the cardinality. For biomarker verification, the thesis proposes the first GP biomarker verification method through measuring the peptide detectability. The method solves the imbalance problem in the data and shows improvement over the benchmark algorithms. Also, the algorithm outperforms a well-known peptide detection method. The thesis also introduces a new GP method for alignment of MS data as a preprocessing stage, which will further help in improving the biomarker detection process.
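The multi-objective goal described in this abstract, higher classification accuracy with fewer features, is commonly expressed through Pareto dominance. The sketch below shows only that comparison step, not the thesis's GP method; the function names and the `(accuracy, n_features)` tuple layout are assumptions made for illustration.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    and strictly better on one. Candidates are (accuracy, n_features):
    higher accuracy is better, fewer features is better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

A multi-objective search would return the whole front, so that the user can choose between, say, a slightly more accurate subset and a much smaller one, instead of collapsing both objectives into a single score.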

    Updates in metabolomics tools and resources: 2014-2015

    Data processing and interpretation represent the most challenging and time-consuming steps in high-throughput metabolomic experiments, regardless of the analytical platforms (MS or NMR spectroscopy based) used for data acquisition. Improved machinery in metabolomics generates increasingly complex datasets that create the need for more and better processing and analysis software and in silico approaches to understand the resulting data. However, a comprehensive source of information describing the utility of the most recently developed and released metabolomics resources (in the form of tools, software, and databases) is currently lacking. Thus, here we provide an overview of freely available and open-source tools, algorithms, and frameworks to make both upcoming and established metabolomics researchers aware of the recent developments, in an attempt to advance and facilitate data processing workflows in their metabolomics research. The major topics include tools and research for data processing, data annotation, and data visualization in MS and NMR-based metabolomics. Most tools described in this review are dedicated to untargeted metabolomics workflows; however, some more specialist tools are described as well. All tools and resources described, including their analytical and computational platform dependencies, are summarized in an overview table.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    Machine learning applications in proteomics research: How the past can boost the future

    Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn to solve a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, provided that enough data are available to train and subsequently evaluate an algorithm. Since MS-based proteomics has no shortage of complex problems, and since public data are becoming available in ever-growing amounts, machine learning is fast becoming a very popular tool in the field. We therefore present here an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.

    Quantitative analysis of mass spectrometry proteomics data : Software for improved life science

    The rapid advances in life science, including the sequencing of the human genome and numerous other techniques, have given an extraordinary ability to acquire data on biological systems and human disease. Even so, drug development costs are higher than ever, while the rate of newly approved treatments is historically low. A potential explanation for this discrepancy might be the difficulty of understanding the biology underlying the acquired data, that is, the difficulty of refining the data into useful knowledge through interpretation. In this thesis the refinement of the complex data from mass spectrometry proteomics is studied. A number of new algorithms and programs are presented and demonstrated to provide increased analytical ability over previously suggested alternatives. With the higher goal of increasing the scientific output of the mass spectrometry laboratory, pragmatic studies were also performed to create a new set of compression algorithms for reduced storage requirements of mass spectrometry data, and also to characterize instrument stability. The final components of this thesis are the discussion of the technical and instrumental weaknesses associated with the currently employed mass spectrometry proteomics methodology, and the discussion of the currently lacking quality of academic software and the reasons for it. As a whole, the primary algorithms, the enabling technology, and the weakness discussions all aim to improve the current capability to perform mass spectrometry proteomics. As this technology is crucial to understand the main functional components of biology, proteins, this quest should allow better and higher quality life science data, and ultimately increase the chances of developing new treatments or diagnostics.

    Evaluation of computer methods for biomarker discovery on computational grids

    Background: Discovering biomarkers is a fundamental step in understanding and dealing with genetic diseases. Methods using classic Computer Science algorithms have been adapted to support the processing of large biological data sets, aiming to find information useful for understanding the conditions that cause diseases such as cancer. Results: This paper describes some promising biomarker discovery methods based on several grid architectures. Each technique has some features that make it more suitable for a particular grid architecture. This matching depends on the parallelizing capabilities of the method and the resource availability in each processing/storage node. Conclusion: The study described in this paper analyzed the performance of biomarker discovery methods in different grid architectures. We have found that some methods are more suited to certain grid architectures, resulting in significant performance improvement and producing more accurate results.

    MSQBAT - A Software Suite for LC-MS Protein Quantification

    Assessing the relative changes in protein abundance is essential for a proper understanding of the various processes underlying disease progression and development. Nowadays, mass spectrometry-based proteomics allows for the identification of several thousand proteins in a single analysis. Unfortunately, mass spectrometry is inherently not quantitative, which is why additional techniques for protein quantification have to be developed. To measure quantitative changes in protein abundance, either biological samples need to be labeled using stable isotopes or protein abundances have to be computed using so-called label-free techniques. Label-based quantification approaches are costly and the number of samples that can be quantified against each other is limited. Furthermore, depending on the sample, the introduction of the labels can be elaborate. Label-free quantification is not confronted with these limitations; in principle, an unlimited number of samples can be quantified without the introduction of isotopes. Yet these advantages have their price: the development of label-free quantification algorithms is not trivial and requires profound knowledge of both bioinformatics and mass spectrometry. In particular, the design of systems flexible enough to quantify data deriving from different mass spectrometric systems and proteomic workflows requires additional experience and time. In order to quantify data acquired by LC-MALDI-MS, a novel software suite termed MSQBAT was developed and evaluated. MSQBAT is a platform-independent software suite for MS1-based, label-free protein quantification. In contrast to other software solutions, MSQBAT is highly flexible and suited for the quantification of mass spectrometric data from various instrumental setups and proteomic workflows, such as (Ge)LC-MALDI-MS and (Ge)LC-ESI-MS. Quantification capabilities were evaluated using spike-in experiments analyzed with different proteomic workflows and instruments.
Human proteins were spiked in variable concentrations into a complex E. coli background proteome and processed using both an LC-MS and a GeLC-MS approach. Samples were chromatographically separated on a nanoACQUITY UPLC system using a 120-minute gradient and subsequently analyzed by an AB SCIEX TOF/TOF 5800 system and an AB SCIEX QTRAP 6500 system. Furthermore, a publicly available quantification benchmark data set was used to evaluate LC-ESI-MS quantification capabilities. The results obtained show that MSQBAT can be applied to quantify data deriving from both LC-/GeLC-MALDI-MS and LC-/GeLC-ESI-MS workflows with high accuracy. Therefore, this software suite has a range of application exceeding that of all currently available solutions.

    Novel urinary and serological markers of prostate cancer using proteomics techniques: an important tool for early cancer diagnosis and treatment monitoring

    In Africa, prostate cancer (PCa) is the most frequently diagnosed solid organ tumour in males, and use of prostate specific antigen (PSA) is presently fraught with diagnostic inaccuracies. Not least, in a multi-ethnic society like South Africa, proteome differences between African, Caucasian and Mixed-Ancestry PCa patients are largely unknown. Hence, discovery and validation of affordable, non-invasive and reliable diagnostic biomarkers of PCa would expand the frontiers of PCa management. We have employed two high-throughput proteomics technologies to identify novel urine- and blood-based biomarkers for early diagnosis and treatment monitoring of prostate cancer in a South African cohort, as well as to elucidate proteome differences in patients from our heterogeneous cohort. We compared the urinary proteomes of PCa, Benign Prostatic Hyperplasia (BPH), disease controls comprising patients with other uropathies (DC) and normal healthy controls (NC), both by pooling and by individual discovery shotgun proteomic assessment on a nano-Liquid chromatography (nLC) coupled Hybrid Quadrupole-Orbitrap Mass Spectrometer platform. In-silico verification of identified biomarkers was performed using the Human Protein Atlas (HPA) as well as SRMAtlas, and verified potential biomarkers were experimentally prevalidated using a targeted parallel reaction monitoring (PRM) proteomics approach. Further, we employed the CT100+ antigen microarray platform to assess the differential humoral antibody response of PCa, DC and BPH patients in our cohort to a panel of 123 tumour-associated cancer antigens. Candidate antigen biomarkers were analyzed for ethnic group variation in our cohort and potential cancer diagnostic and immunotherapeutic inferences were drawn. Using these approaches, we identified 5595 and 9991 non-redundant peptides from the pooled and individual experiments respectively.
While nine proteins demonstrated an ethnic trend, 37 and 73 proteins were differentially expressed by pooled and individual analysis respectively. All 32 verified biomarkers were prevalidated with parallel reaction monitoring. Good PRM signals were observed for the 12 top-ranking biomarkers, including PSA and prostatic acid phosphatase. We also identified 41 potential diagnostic and immunotherapeutic antigen biomarkers. Proteogenomic functional pathway analyses of differentially expressed antigens showed similar enrichments of biologic processes. We identified herein novel urinary and blood-based potential diagnostic biomarkers and immunotherapeutic targets of PCa in a South African PCa cohort using multiple proteomics approaches.