    Chemical rule-based filtering of MS/MS spectra

    Abstract
    Motivation: Identification of proteins by mass spectrometry–based proteomics requires automated interpretation of peptide tandem mass spectra. The effectiveness of peptide identification can be greatly improved by filtering out extraneous noise peaks before the subsequent database-searching steps.
    Results: Here we present a novel chemical rule-based filtering algorithm, termed CRF, which makes use of the predictable patterns (rules) of collision-induced peptide fragmentation. The algorithm selects peak pairs that obey the common fragmentation rules within plausible limits of mass tolerance and peak intensity, and produces spectra that can subsequently be submitted to any search engine. CRF increases the positive predictive value and decreases the number of random matches, thus improving peptide annotation performance by 15–20% with search engines such as X!Tandem. Importantly, the algorithm also achieves data compression rates of ∼75%.
    Availability: The MATLAB source code and a web server are available at http://hydrax.icgeb.trieste.it/CRFilter/
    Contact: [email protected]
    Supplementary information: Supplementary data are available at Bioinformatics online.
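
    The following is a minimal, hypothetical sketch of the kind of fragmentation rule the abstract describes, written in Python rather than the authors' MATLAB (the function and defaults are illustrative, not taken from CRF): consecutive b- or y-ions differ by exactly one amino-acid residue mass, so a peak is kept only if it pairs with another peak at such a distance within the mass tolerance.

    ```python
    # Illustrative peak-pair filter (not the CRF implementation): keep
    # only peaks that form at least one pair whose m/z difference matches
    # an amino-acid residue mass, as consecutive b- or y-ions would.

    # Monoisotopic residue masses in Da (a representative subset).
    RESIDUE_MASSES = {
        "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
        "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
        "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
        "Y": 163.06333, "W": 186.07931,
    }

    def filter_spectrum(peaks, tol=0.5):
        """peaks: list of (mz, intensity) tuples; returns filtered list."""
        peaks = sorted(peaks)
        max_gap = max(RESIDUE_MASSES.values()) + tol
        keep = set()
        for i in range(len(peaks)):
            for j in range(i + 1, len(peaks)):
                diff = peaks[j][0] - peaks[i][0]
                if diff > max_gap:
                    break  # peaks are sorted; no larger match possible
                if any(abs(diff - m) <= tol for m in RESIDUE_MASSES.values()):
                    keep.update((i, j))
        return [peaks[k] for k in sorted(keep)]
    ```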

    A compatible exon-exon junction database for the identification of exon skipping events using tandem mass spectrum data

    Abstract
    Background: Alternative splicing is an important gene-regulation mechanism. It is estimated that about 74% of multi-exon human genes undergo alternative splicing. High-throughput tandem mass spectrometry (MS/MS) provides valuable information for rapidly identifying potentially novel alternatively spliced protein products from experimental datasets. However, the ability to identify alternative splicing events through tandem mass spectrometry depends on the database against which the spectra are searched.
    Results: We wrote scripts using Perl, BioPerl, MySQL and the Ensembl API, and built a theoretical exon-exon junction protein database from the Ensembl Core Database that accounts for all possible combinations of exons of a gene while preserving the frame of translation (i.e., keeping only in-phase exon-exon combinations). Using our liver cancer MS/MS dataset, we identified a total of 488 non-redundant peptides that represent putative exon skipping events.
    Conclusion: Our exon-exon junction database provides the scientific community with an efficient means to identify novel alternatively spliced (exon-skipping) protein isoforms using mass spectrometry data. This database will be useful for annotating genome structures using rapidly accumulating proteomics data.
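
    As a hedged illustration of the in-phase constraint described above (a Python sketch with hypothetical names, not the authors' Perl scripts): an exon-skipping junction from exon i to exon j preserves the reading frame only when the translation phase at the end of exon i equals the phase at the start of exon j.

    ```python
    # Hypothetical sketch of enumerating in-phase exon-exon junctions
    # that would represent exon skipping events.

    def junction_candidates(exon_lengths, start_phase=0):
        """exon_lengths: coding lengths (nt) of exons in transcript order.
        Returns (i, j) pairs with j > i + 1 whose junction keeps frame."""
        # Phase at the start of each exon in the canonical transcript.
        start_phases, phase = [], start_phase
        for length in exon_lengths:
            start_phases.append(phase)
            phase = (phase + length) % 3
        end_phases = [(start_phases[i] + exon_lengths[i]) % 3
                      for i in range(len(exon_lengths))]
        return [(i, j)
                for i in range(len(exon_lengths))
                for j in range(i + 2, len(exon_lengths))  # skip >= 1 exon
                if end_phases[i] == start_phases[j]]

    # Example: skipping the 60 nt middle exon keeps frame only if
    # 100 % 3 == (100 + 60) % 3; both are 1 here, so (0, 2) qualifies.
    print(junction_candidates([100, 60, 50]))  # -> [(0, 2)]
    ```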

    Big Data Proteogenomics and High Performance Computing: Challenges and Opportunities

    Proteogenomics is an emerging field of systems biology research at the intersection of proteomics and genomics. Two high-throughput technologies, Mass Spectrometry (MS) for proteomics and Next Generation Sequencing (NGS) for genomics, are required to conduct proteogenomics studies. Independently, both MS and NGS are afflicted with a data deluge that creates problems of storage, transfer, analysis and visualization. Integrating these big datasets (NGS+MS) for proteogenomics studies compounds all of the associated computational problems. Existing sequential algorithms for analyzing these proteogenomics datasets are inadequate for big data, and high-performance computing (HPC) solutions are almost non-existent. The purpose of this paper is to introduce the big data problem of proteogenomics and the associated challenges in analyzing, storing and transferring these datasets. Further, opportunities for the high-performance computing research community are identified, and possible future directions are discussed.

    Data access and integration in the ISPIDER proteomics grid

    Grid computing has great potential for supporting the integration of complex, fast-changing biological data repositories to enable distributed data analysis. One scenario where Grid computing has such potential is provided by proteomics resources, which are being developed rapidly as affordable, reliable methods to study the proteome emerge. The protein identifications arising from these methods derive from multiple repositories, which need to be integrated to enable uniform access to them. A number of technologies exist that enable these resources to be accessed in a Grid environment, but the independent development of these resources means that significant data integration challenges, such as heterogeneity and schema evolution, have to be met. This paper presents an architecture that supports the combined use of Grid data access (OGSA-DAI), Grid distributed querying (OGSA-DQP) and data integration (AutoMed) software tools to support distributed data analysis. We discuss the application of this architecture to the integration of several autonomous proteomics data resources.
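
    To make the heterogeneity problem concrete, here is a deliberately simplified Python sketch (the source names and field mappings are invented for illustration; this is not the OGSA-DAI, OGSA-DQP or AutoMed API): two autonomous repositories expose protein identifications under different local schemas, and a mediator rewrites both into one global schema so a single query can span them.

    ```python
    # Toy schema-integration example; real Grid middleware handles this
    # declaratively, but the mapping problem is the same.

    # Records as they might arrive from two hypothetical sources.
    SOURCE_A = [{"acc": "P12345", "score": 42.0, "engine": "X!Tandem"}]
    SOURCE_B = [{"protein_id": "P67890", "ident_score": 17.5}]

    # Schema mappings: local field name -> global field name.
    MAPPINGS = {
        "A": {"acc": "accession", "score": "score", "engine": "search_engine"},
        "B": {"protein_id": "accession", "ident_score": "score"},
    }

    def to_global(record, source):
        """Rewrite one local record into the global schema."""
        mapping = MAPPINGS[source]
        return {mapping[k]: v for k, v in record.items() if k in mapping}

    # A uniform query over the virtual integrated resource.
    unified = ([to_global(r, "A") for r in SOURCE_A] +
               [to_global(r, "B") for r in SOURCE_B])
    print([r for r in unified if r["score"] > 20.0])
    ```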

    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The large densities of modern FPGAs and the capability of on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap, on demand, idle components of the application with active components. This research uses PR to swap components to improve the performance of the application given the limited logic resources available with smaller but economical FPGAs; the swap is called "resource sharing PR". In a pipelined design of multiple hardware modules (pipeline stages), resource sharing PR is a technique that uses PR to improve the performance of pipeline bottlenecks. This is done by reconfiguring other pipeline stages, typically those that are idle waiting for data from a bottleneck, into an additional parallel bottleneck module. The target pipeline of this research is a two-stage "slow-to-fast" pipeline, where the flow of data traversing the pipeline transitions from a relatively slow, bottleneck stage to a fast stage. A two-stage pipeline that combines FPGA-based hardware implementations of two well-known bioinformatics search algorithms, the X!Tandem algorithm and the Smith-Waterman algorithm, was implemented for this research; the implemented pipeline demonstrates the characteristics of these algorithms. The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or more, performing resource sharing PR to instantiate a parallel X!Tandem module is worth the cost of PR. In addition, from timings gathered during experiments, a general formula was derived for determining the value of performing PR upon a fast module.
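
    The derived formula itself appears in the thesis; as a hedged back-of-envelope model only (the cost model and numbers below are assumptions, not the thesis' result), PR pays off when the one-off reconfiguration overhead is smaller than the time saved by running a second bottleneck module in parallel.

    ```python
    # Simplified cost model: duplicating the bottleneck via PR roughly
    # halves its per-item time, at a one-off reconfiguration cost t_pr.

    def pr_worthwhile(n_items, t_slow, t_pr):
        """True if reconfiguring an idle stage into a second bottleneck
        module beats leaving the pipeline as-is.

        n_items: items queued at the bottleneck
        t_slow:  bottleneck processing time per item (seconds)
        t_pr:    partial reconfiguration overhead (seconds)
        """
        baseline = n_items * t_slow        # single bottleneck module
        with_pr = t_pr + baseline / 2.0    # two parallel modules
        return with_pr < baseline          # i.e. t_pr < n_items*t_slow/2

    # Example: 1000 spectra at 10 ms each, 2 s reconfiguration cost:
    # 2 + 5 = 7 s with PR versus 10 s without, so PR is worthwhile.
    print(pr_worthwhile(1000, 0.010, 2.0))  # True
    ```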

    Quantitative analysis of mass spectrometry proteomics data: Software for improved life science

    The rapid advances in life science, including the sequencing of the human genome and numerous other techniques, have given us an extraordinary ability to acquire data on biological systems and human disease. Even so, drug development costs are higher than ever, while the rate of newly approved treatments is historically low. A potential explanation for this discrepancy might be the difficulty of understanding the biology underlying the acquired data: the difficulty of refining the data into useful knowledge through interpretation. This thesis studies the refinement of the complex data produced by mass spectrometry proteomics. A number of new algorithms and programs are presented and demonstrated to provide increased analytical ability over previously suggested alternatives. With the higher goal of increasing the scientific output of the mass spectrometry laboratory, pragmatic studies were also performed to create a new set of compression algorithms that reduce the storage requirements of mass spectrometry data, and to characterize instrument stability. The final components of this thesis are a discussion of the technical and instrumental weaknesses associated with the currently employed mass spectrometry proteomics methodology, and a discussion of the currently lacking quality of academic software and the reasons for it. As a whole, the primary algorithms, the enabling technology and the weakness discussions all aim to improve the current capability to perform mass spectrometry proteomics. As this technology is crucial for understanding proteins, the main functional components of biology, this quest should enable better and higher-quality life science data, and ultimately increase the chances of developing new treatments or diagnostics.
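
    As a hedged illustration of why mass spectrometry data compresses well (a generic Python sketch, not one of the thesis' algorithms): centroided m/z values are sorted, so delta-encoding them on a fixed grid yields small integers that a generic compressor shrinks far better than the raw floating-point values.

    ```python
    import struct
    import zlib

    # Toy compressor, lossless on a 1e-4 Da grid: delta-encode sorted,
    # quantized m/z values, then deflate the integer stream.

    def compress_mz(mz_values):
        scaled = [round(mz * 10000) for mz in mz_values]
        deltas = [scaled[0]] + [b - a for a, b in zip(scaled, scaled[1:])]
        return zlib.compress(struct.pack(f"<{len(deltas)}i", *deltas), 9)

    def decompress_mz(blob, n):
        deltas = struct.unpack(f"<{n}i", zlib.decompress(blob))
        scaled, total = [], 0
        for d in deltas:
            total += d
            scaled.append(total)
        return [s / 10000 for s in scaled]

    mz = [100.0 + 0.05 * i for i in range(1000)]
    blob = compress_mz(mz)
    recovered = decompress_mz(blob, len(mz))
    assert all(abs(a - b) < 1e-4 for a, b in zip(recovered, mz))
    print(len(blob), "bytes vs", 8 * len(mz), "bytes raw")
    ```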