
    Protein Bioinformatics Infrastructure for the Integration and Analysis of Multiple High-Throughput “omics” Data

    High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
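
    A minimal sketch of the protein-centric integration idea described above, assuming hypothetical input files and column names (uniprot_ac as the shared key): heterogeneous “omics” tables are reduced to a common protein accession and joined, so that measurements from different platforms and laboratories can be queried together.

        # Minimal sketch of a protein-centric join; file and column names are
        # illustrative assumptions, not taken from the paper.
        import pandas as pd

        microarray = pd.read_csv("microarray_expression.csv")   # uniprot_ac, log2_expression
        proteomics = pd.read_csv("ms_protein_abundance.csv")    # uniprot_ac, spectral_counts
        annotation = pd.read_csv("protein_annotations.csv")     # uniprot_ac, pdb_id, interactions

        # Outer joins keep proteins observed by only one platform, which is common
        # when integrating data produced in different laboratories.
        integrated = (microarray
                      .merge(proteomics, on="uniprot_ac", how="outer")
                      .merge(annotation, on="uniprot_ac", how="outer"))

        # Example question: transcripts that are highly expressed but whose protein
        # product was not detected by mass spectrometry.
        candidates = integrated[(integrated["log2_expression"] > 2.0)
                                & integrated["spectral_counts"].isna()]
        print(candidates.head())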

    Survey: Data Mining Techniques in Medical Data Field

    Nowadays, much research focuses on applying data mining techniques to medical data. Knowledge discovery and data mining have found numerous applications in business and scientific domains, and valuable knowledge can be discovered by applying data mining techniques to healthcare systems. In this study, we briefly examine the potential use of classification-based data mining techniques such as rule-based methods, decision trees, machine learning algorithms like Support Vector Machines and Principal Component Analysis, Rough Set Theory, and fuzzy logic. In particular, we consider a case study that applies classification techniques to a medical data set of diabetic patients.
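
    As a hedged illustration of the kind of classification experiment the survey discusses, the sketch below trains a decision tree and a Support Vector Machine on a labelled patient table; a synthetic data set stands in for the diabetic-patient data, and all names here are assumptions rather than the survey's actual setup.

        # Sketch only: synthetic data stands in for a real diabetic-patient data set.
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        # Eight hypothetical clinical features, binary outcome (diabetic / not diabetic).
        X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

        for name, model in [("decision tree", DecisionTreeClassifier(max_depth=4)),
                            ("SVM (RBF kernel)", SVC(kernel="rbf", C=1.0))]:
            model.fit(X_train, y_train)
            print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")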

    Methods for peptide identification by spectral comparison

    BACKGROUND: Tandem mass spectrometry followed by database search is currently the predominant technology for peptide sequencing in shotgun proteomics experiments. Most methods compare experimentally observed spectra to the theoretical spectra predicted from the sequences in protein databases. There is a growing interest, however, in comparing unknown experimental spectra to a library of previously identified spectra. This approach has the advantage of taking into account instrument-dependent factors and peptide-specific differences in fragmentation probabilities. It is also computationally more efficient for high-throughput proteomics studies. RESULTS: This paper investigates computational issues related to this spectral comparison approach. Different methods have been empirically evaluated over several large sets of spectra. First, we illustrate that the peak intensities follow a Poisson distribution. This implies that applying a square root transform will optimally stabilize the peak intensity variance. Our results show that the square root transform did indeed outperform other transforms, resulting in improved accuracy of spectral matching. Second, different measures of spectral similarity were compared, and the results illustrated that the correlation coefficient was the most robust. Finally, we examine how to assemble multiple spectra associated with the same peptide to generate a synthetic reference spectrum. Ensemble averaging is shown to provide the best combination of accuracy and efficiency. CONCLUSION: Our results demonstrate that, when combined, these methods can boost the sensitivity and specificity of spectral comparison. They are therefore capable of enhancing and complementing existing tools for consistent and accurate peptide identification.
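
    The two core steps evaluated in the paper, a square-root transform to stabilize Poisson-distributed peak intensities and a correlation coefficient as the similarity measure, can be sketched as follows; the fixed-width binning scheme and the bin width are assumptions made for illustration.

        import numpy as np

        def bin_spectrum(mz, intensity, bin_width=1.0, mz_max=2000.0):
            """Convert a peak list into a fixed-length intensity vector (assumed binning)."""
            bins = np.zeros(int(mz_max / bin_width) + 1)
            for m, i in zip(mz, intensity):
                if m < mz_max:
                    bins[int(m / bin_width)] += i
            return bins

        def spectral_similarity(spec_a, spec_b):
            """Pearson correlation of square-root transformed, binned spectra."""
            a = np.sqrt(bin_spectrum(*spec_a))
            b = np.sqrt(bin_spectrum(*spec_b))
            return np.corrcoef(a, b)[0, 1]

        def ensemble_average(binned_replicates):
            """Consensus reference spectrum from replicate spectra of the same peptide."""
            return np.mean(np.vstack(binned_replicates), axis=0)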

    Computational methods and tools for protein phosphorylation analysis

    Signaling pathways represent a central regulatory mechanism of biological systems, and a key event in their correct functioning is the reversible phosphorylation of proteins. Protein phosphorylation affects at least one-third of all proteins and is the most widely studied posttranslational modification. Phosphorylation analysis is still generally perceived as difficult or cumbersome and is not readily attempted by many, despite the high value of such information. In particular, determining the exact location of a phosphorylation site is currently considered a major hurdle, so reliable approaches are necessary for the detection and localization of protein phosphorylation. The goal of this PhD thesis was to develop computational methods and tools for mass spectrometry-based protein phosphorylation analysis, particularly the validation of phosphorylation sites. In the first two studies, we developed methods for improved identification of phosphorylation sites in MALDI-MS. In the first study this was achieved through the automatic combination of spectra from multiple matrices, while in the second study an optimized protocol for sample loading and washing conditions was suggested. In the third study, we proposed and evaluated the hypothesis that in ESI-MS, tandem CID and HCD spectra of phosphopeptides can be accurately predicted and used in spectral library searching. This novel strategy for phosphosite validation and identification offered accuracy that outperformed other currently popular methods and proved applicable to complex biological samples. Finally, we significantly improved the performance of our command-line prototype tool and added a graphical user interface as well as options for customizable simulation parameters and filtering of selected spectra, peptides or proteins. The new software, SimPhospho, is open source and can be easily integrated into a phosphoproteomics data analysis workflow. Together, these bioinformatics methods and tools enable confident phosphosite assignment and improve reliable phosphoproteome identification and reporting.
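
    The sketch below illustrates the general idea behind spectral-library phosphosite validation, not SimPhospho's actual implementation: an observed spectrum is scored against predicted spectra for each candidate phosphosite isomer, and the best-scoring isomer supports the site assignment. The binned-vector representation and the cosine score are assumptions.

        import numpy as np

        def cosine_score(observed, predicted):
            """Normalized dot product between two binned intensity vectors."""
            a, b = np.asarray(observed, float), np.asarray(predicted, float)
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0

        def rank_phosphosite_isomers(observed, predicted_by_site):
            """predicted_by_site: dict mapping candidate site position -> predicted spectrum."""
            scores = {site: cosine_score(observed, spec)
                      for site, spec in predicted_by_site.items()}
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)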

    Runtime Optimizations for Prediction with Tree-Based Models

    Tree-based models have proven to be an effective solution for web ranking as well as other problems in diverse domains. This paper focuses on optimizing the runtime performance of applying such models to make predictions, given an already-trained model. Although exceedingly simple conceptually, most implementations of tree-based models do not efficiently utilize modern superscalar processor architectures. By laying out data structures in memory in a more cache-conscious fashion, removing branches from the execution flow using a technique called predication, and micro-batching predictions using a technique called vectorization, we are able to better exploit modern processor architectures and significantly improve the speed of tree-based models over hard-coded if-else blocks. Our work contributes to the exploration of architecture-conscious runtime implementations of machine learning algorithms.
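
    The predication idea can be sketched in Python syntax, with the caveat that the reported speedups come from compiled, cache-conscious implementations: the tree is flattened into breadth-first arrays and each level is traversed with index arithmetic instead of an if/else branch. The toy tree below is an assumption for illustration only.

        import numpy as np

        # Hypothetical complete binary tree of depth 3, laid out breadth-first.
        feature    = np.array([0, 1, 2, 0, 1, 2, 0])                      # feature tested at each internal node
        threshold  = np.array([0.5, 0.3, 0.7, 0.2, 0.6, 0.4, 0.8])
        leaf_value = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])   # 2**depth leaves
        DEPTH = 3

        def predict_predicated(x):
            """Branch-free traversal: the comparison result becomes an index offset."""
            node = 0
            for _ in range(DEPTH):
                go_right = int(x[feature[node]] > threshold[node])   # 0 or 1, no if/else
                node = 2 * node + 1 + go_right
            return leaf_value[node - len(feature)]                   # map node id to leaf slot

        print(predict_predicated(np.array([0.4, 0.9, 0.1])))         # -> 0.4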

    Chemical rule-based filtering of MS/MS spectra

    Motivation: Identification of proteins by mass spectrometry–based proteomics requires automated interpretation of peptide tandem mass spectrometry spectra. The effectiveness of peptide identification can be greatly improved by filtering out extraneous noise peaks before the subsequent database searching steps. Results: Here we present a novel chemical rule-based filtering algorithm, termed CRF, which makes use of the predictable patterns (rules) of collision-induced peptide fragmentation. The algorithm selects peak pairs that obey the common fragmentation rules within plausible limits of mass tolerance as well as peak intensity, and produces spectra that can be subsequently submitted to any search engine. CRF increases the positive predictive value and decreases the number of random matches, and thus improves performance by 15–20% in terms of peptide annotation using search engines such as X!Tandem. Importantly, the algorithm also achieves data compression rates of ∼75%. Availability: The MATLAB source code and a web server are available at http://hydrax.icgeb.trieste.it/CRFilter/. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
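
    One fragmentation rule of the kind the algorithm exploits can be sketched as follows; this is a simplified illustration rather than the full CRF rule set: for singly charged fragments, a b ion and its complementary y ion sum to the neutral peptide mass plus two protons, so peaks participating in such a pair are retained and the rest are treated as noise. The tolerance value is an assumption.

        PROTON = 1.007276  # mass of a proton in Da

        def filter_complementary_pairs(peaks, neutral_peptide_mass, tol=0.5):
            """peaks: list of (mz, intensity); keep peaks that form a b/y-complementary pair."""
            target = neutral_peptide_mass + 2 * PROTON
            keep = set()
            for i, (mz_i, _) in enumerate(peaks):
                for j, (mz_j, _) in enumerate(peaks):
                    if i < j and abs((mz_i + mz_j) - target) <= tol:
                        keep.update((i, j))
            return [peaks[k] for k in sorted(keep)]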

    Anatomy and evolution of database search engines — a central component of mass spectrometry based proteomic workflows

    Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
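
    The common paradigm referred to above can be reduced to a toy example: theoretical b/y fragment masses are generated for a candidate peptide from the database and compared with the observed peak list. The plain shared-peak count used here is a deliberate simplification; real search engines use far more elaborate scoring.

        # Monoisotopic residue masses (Da); C is the unmodified residue.
        MONO = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276, 'V': 99.06841,
                'T': 101.04768, 'C': 103.00919, 'L': 113.08406, 'I': 113.08406, 'N': 114.04293,
                'D': 115.02694, 'Q': 128.05858, 'K': 128.09496, 'E': 129.04259, 'M': 131.04049,
                'H': 137.05891, 'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931}
        PROTON, WATER = 1.007276, 18.010565

        def theoretical_by_ions(peptide):
            """Singly charged b and y ion m/z values for a peptide sequence."""
            residues = [MONO[aa] for aa in peptide]
            b = [sum(residues[:i]) + PROTON for i in range(1, len(residues))]
            y = [sum(residues[i:]) + WATER + PROTON for i in range(1, len(residues))]
            return b + y

        def shared_peak_count(peptide, observed_mz, tol=0.5):
            """Score a candidate peptide by how many theoretical ions match observed peaks."""
            return sum(any(abs(t - o) <= tol for o in observed_mz)
                       for t in theoretical_by_ions(peptide))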

    Ultra Performance Liquid Chromatography and High Resolution Mass Spectrometry for the Analysis of Plant Lipids

    Holistic analysis of lipids is becoming increasingly popular in the life sciences. Recently, several interesting mass spectrometry-based studies have been conducted, especially in plant biology. However, while great advancements have been made, we are still far from detecting all the lipid species in an organism. In this study we developed an ultra performance liquid chromatography-based method using a high-resolution, accurate-mass mass spectrometer for the comprehensive profiling of more than 260 polar and non-polar Arabidopsis thaliana leaf lipids. The method is fully compatible with commonly used lipid extraction protocols and provides a viable alternative to the widely used direct infusion-based shotgun lipidomics approaches. The whole process is described in detail and compared to alternative lipidomic approaches. In addition to the developed method, we also introduce in-house developed database search software (GoBioSpace), which allows one to perform targeted or untargeted lipidomic and metabolomic analysis on mass spectrometric data of every kind.
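
    Conceptually, the database search step amounts to accurate-mass annotation within a ppm tolerance; the sketch below is a generic illustration and not GoBioSpace's actual code, and the example lipid entries and adduct masses are assumptions for demonstration only.

        def ppm_error(measured, theoretical):
            return (measured - theoretical) / theoretical * 1e6

        def annotate(measured_mz, database, ppm_tol=5.0):
            """database: list of (name, theoretical_mz); returns candidate matches per peak."""
            hits = {}
            for mz in measured_mz:
                hits[mz] = [(name, round(ppm_error(mz, t), 2)) for name, t in database
                            if abs(ppm_error(mz, t)) <= ppm_tol]
            return hits

        # Hypothetical [M+H]+ entries for demonstration only.
        lipid_db = [("PC 34:2", 758.5694), ("DGDG 36:6", 937.5807)]
        print(annotate([758.5691], lipid_db))   # the PC 34:2 entry matches within ~0.4 ppm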

    Bridging data management platforms and visualization tools to enable ad-hoc and smart analytics in life sciences

    Core facilities have to offer technologies that best serve the needs of their users and provide them with a competitive advantage in research. They have to set up and maintain between ten and a hundred instruments, which produce large amounts of data and serve thousands of active projects and customers. Particular emphasis has to be placed on the reproducibility of the results. More and more, the entire process, from building the research hypothesis, through conducting the experiments and taking the measurements, to data exploration and analysis, is driven by only a few experts in various scientific fields. Still, the ability to perform the entire data exploration in real time on a personal computer is often hampered by the heterogeneity of the software, the data structure formats of the output, and the enormous data sizes. These factors impact the design and architecture of the implemented software stack. At the Functional Genomics Center Zurich (FGCZ), a joint state-of-the-art research and training facility of ETH Zurich and the University of Zurich, we have developed the B-Fabric system, which has served an entire life sciences community with fundamental data science support for more than a decade. In this paper, we sketch how such a system can be used to glue together data (including metadata), computing infrastructures (clusters and clouds), and visualization software to support instant data exploration and visual analysis. We illustrate our approach, which is in daily use, with visualization applications for mass spectrometry data.
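
    A purely illustrative glue sketch follows; the endpoint names and payload layout are assumptions and not the real B-Fabric API: a thin client pulls resource metadata from the data management platform and streams the referenced measurement file to a local visualization tool, so the exploration can run on a personal computer.

        import requests

        BASE_URL = "https://datamanagement.example.org/api"   # hypothetical endpoint

        def list_resources(project_id, token):
            """Fetch resource metadata for a project (assumed JSON layout)."""
            r = requests.get(f"{BASE_URL}/projects/{project_id}/resources",
                             headers={"Authorization": f"Bearer {token}"}, timeout=30)
            r.raise_for_status()
            return r.json()   # assumed: list of {"name": ..., "download_url": ...}

        def fetch_for_visualization(resource, local_path):
            """Stream the raw file locally and hand the path to a visualization application."""
            with requests.get(resource["download_url"], stream=True, timeout=300) as r:
                r.raise_for_status()
                with open(local_path, "wb") as fh:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            return local_path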

    Sharing of worldwide distributed carbohydrate-related digital resources: online connection of the Bacterial Carbohydrate Structure DataBase and GLYCOSCIENCES.de

    Functional glycomics, the scientific attempt to identify and assign functions to all glycan molecules synthesized by an organism, is an emerging field of science. In recent years, several databases have been started, all aiming to support deciphering the biological function of carbohydrates. However, diverse encoding and storage schemes are in use amongst these databases, significantly hampering the interchange of data. We describe the mutual online access between the Bacterial Carbohydrate Structure DataBase (BCSDB) and the GLYCOSCIENCES.de portal, the first reported attempt at a structure-based direct interconnection of two glyco-related databases. In this approach, users have to learn only one interface, will always have access to the latest data of both services, and will have the results of both searches presented in a consistent way. The establishment of this connection helped to find shortcomings and inconsistencies in the database design and functionality related to underlying data concepts and structural representations. For the maintenance of the databases, duplication of work can be easily avoided, which will hopefully lead to better worldwide acceptance of both services within the community of glycoscientists. BCSDB is available at and the GLYCOSCIENCES.de portal a