480 research outputs found

    LBE: A Computational Load Balancing Algorithm for Speeding up Parallel Peptide Search in Mass-Spectrometry based Proteomics

    The most commonly employed method for peptide identification in mass-spectrometry based proteomics compares experimentally obtained tandem MS/MS spectra against a set of theoretical MS/MS spectra. The theoretical MS/MS spectra are predicted from a protein sequence database. Most state-of-the-art peptide search algorithms index the theoretical spectra so that only the relevant (similar) indexed spectra are retained when searching an experimental MS/MS spectrum. This filtering substantially reduces the number of computationally expensive spectrum-to-spectrum comparisons. However, the number of predicted (and indexed) theoretical spectra grows exponentially as post-translational modifications are added, creating a memory and I/O bottleneck. In this paper, we present a parallel algorithm, called LBE, for efficient partitioning of theoretical spectra data on a distributed-memory architecture. Our proposed algorithm first groups similar theoretical spectra. The groups are then finely split across the system, allowing machines to perform an almost equal amount of work when querying an MS/MS spectrum. Our results show that the compute load imbalance using LBE-based data distribution is 20%, allowing speedups of orders of magnitude over existing methods. The proposed algorithm has been implemented on a compute cluster using the MPI library. Experimental results for increasing index sizes are reported in terms of execution time, speedup and memory footprint. To the best of our knowledge, LBE is the first load-balancing technique for MS/MS proteomics data on distributed-memory clusters that incorporates proteomics domain knowledge for efficient load balancing. Source code is made available at: https://github.com/pcdslab/lbdslim/tree/mp
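
The group-then-split idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' LBE implementation: the grouping key (a precursor-mass bin), the function names `partition_spectra` and `load_imbalance`, and the `bin_width` parameter are all assumptions made for illustration, and the workers are simulated as plain lists rather than MPI ranks.

```python
from collections import defaultdict


def partition_spectra(masses, n_workers, bin_width=1.0):
    """Group entries by mass bin, then deal each group out across workers.

    Hypothetical stand-in for "group similar spectra, then split finely":
    entries in the same mass bin are treated as similar, and consecutive
    members of a group go to consecutive workers.
    """
    groups = defaultdict(list)
    for idx, mass in enumerate(masses):
        groups[int(mass // bin_width)].append(idx)

    parts = [[] for _ in range(n_workers)]
    cursor = 0  # keeps running across groups so totals stay balanced
    for key in sorted(groups):
        for idx in groups[key]:
            parts[cursor % n_workers].append(idx)
            cursor += 1
    return parts


def load_imbalance(parts):
    """Relative excess of the busiest worker over the mean load."""
    sizes = [len(p) for p in parts]
    mean = sum(sizes) / len(sizes)
    return (max(sizes) - mean) / mean if mean else 0.0


# Illustrative masses only, not data from the paper.
example_masses = [100.1, 100.2, 100.3, 100.4, 250.0, 250.5, 900.7, 900.8]
example_parts = partition_spectra(example_masses, n_workers=4)
```

Because every group is spread over all workers, a query that hits only one group still touches each worker's share of it, which is the property the abstract attributes to fine splitting.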

    SpectroGene: A Tool for Proteogenomic Annotations Using Top-Down Spectra

    In the past decade, proteogenomics has emerged as a valuable technique that contributes to the state-of-the-art in genome annotation; however, previous proteogenomic studies were limited to bottom-up mass spectrometry and did not take advantage of top-down approaches. We show that top-down proteogenomics allows one to address the problems that remained beyond the reach of traditional bottom-up proteogenomics. In particular, we show that top-down proteogenomics leads to the discovery of previously unannotated genes even in extensively studied bacterial genomes and present SpectroGene, a software tool for genome annotation using top-down tandem mass spectra. We further show that top-down proteogenomics searches (against the six-frame translation of a genome) identify nearly all proteoforms found in traditional top-down proteomics searches (against the annotated proteome). SpectroGene is freely available at http://github.com/fenderglass/SpectroGene

    High Performance Computing Algorithms for Accelerating Peptide Identification from Mass-Spectrometry Data Using Heterogeneous Supercomputers

    Fast and accurate identification of peptides and proteins from mass spectrometry (MS) data is a critical problem in modern systems biology. Database peptide search is the most commonly used computational method to identify peptide sequences from MS data. In this method, gigabytes of experimentally generated MS data are compared against terabyte-sized databases of theoretically simulated MS data, resulting in a compute- and data-intensive problem that requires days or weeks of computation time on desktop machines. Existing serial and high performance computing (HPC) algorithms strive to accelerate and improve the computational efficiency of the search, but exhibit suboptimal performance due to their inefficient parallelization models, low resource utilization and high overhead costs

    Systematic Evaluation of Protein Sequence Filtering Algorithms for Proteoform Identification Using Top-Down Mass Spectrometry

    Complex proteoforms contain various primary structural alterations resulting from variations in genes, RNA, and proteins. Top-down mass spectrometry is commonly used for analyzing complex proteoforms because it provides whole sequence information of the proteoforms. Proteoform identification by top-down mass spectral database search is a challenging computational problem because the types and/or locations of some alterations in target proteoforms are in general unknown. Although spectral alignment and mass graph alignment algorithms have been proposed for identifying proteoforms with unknown alterations, they are extremely slow to align millions of spectra against tens of thousands of protein sequences in high throughput proteome level analyses. Many software tools in this area combine efficient protein sequence filtering algorithms and spectral alignment algorithms to speed up database search. As a result, the performance of these tools heavily relies on the sensitivity and efficiency of their filtering algorithms. Here, we propose two efficient approximate spectrum-based filtering algorithms for proteoform identification. We evaluated the performances of the proposed algorithms and four existing ones on simulated and real top-down mass spectrometry data sets. Experiments showed that the proposed algorithms outperformed the existing ones for complex proteoform identification. In addition, combining the proposed filtering algorithms and mass graph alignment algorithms identified many proteoforms missed by ProSightPC in proteome-level proteoform analyses

    NBPMF: Novel Network-Based Inference Methods for Peptide Mass Fingerprinting

    Proteins are large, complex molecules that perform a vast array of functions in every living cell. A proteome is the set of proteins produced in an organism, and proteomics is the large-scale study of proteomes. Several high-throughput technologies have been developed in proteomics, of which the most commonly applied are mass spectrometry (MS) based approaches. MS is an analytical technique for determining the composition of a sample. Recently it has become a primary tool for protein identification, quantification, and post-translational modification (PTM) characterization in proteomics research. There are usually two different ways to identify proteins: top-down and bottom-up. Top-down approaches subject intact protein ions and large fragment ions to tandem MS directly, while bottom-up methods rely on mass spectrometric analysis of peptides derived from proteolytic digestion, usually with trypsin. In bottom-up techniques, peptide mass fingerprinting (PMF) is widely used to identify proteins from MS datasets. Conventional PMF representatives, such as the probabilistic MOWSE algorithm, are based on the mass distribution of tryptic peptides. In this thesis, we developed a novel network-based inference software termed NBPMF. By analyzing the peptide-protein bipartite network, we designed new peptide-protein matching score functions. We present two methods: the static one, ProbS, is based on an independent probability framework; and the dynamic one, HeatS, models the input dataset as dependent peptides. Moreover, we use linear regression to adjust the matching score according to the masses of proteins. In addition, we consider the order of retention time to further correct the score function. In the post-processing, we design two algorithms: assignment of peaks, and protein filtration. The former restricts each peak to a single peptide in order to reduce random matches; the latter assumes each peak can only be assigned to one protein.
In the result validation, we propose two new target-decoy search strategies to estimate the false discovery rate (FDR). Experiments on simulated, authentic, and simulated authentic datasets demonstrate that our NBPMF approaches lead to significantly improved performance compared to several state-of-the-art methods
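
For readers unfamiliar with target-decoy validation, the conventional estimate can be sketched as follows. This shows only the standard decoys-over-targets ratio at a score threshold; it is not either of the two new strategies the thesis proposes, which the abstract does not specify. The function name `estimate_fdr` and the example scores are hypothetical.

```python
def estimate_fdr(scores, is_decoy, threshold):
    """Conventional target-decoy FDR estimate at a score threshold.

    FDR is approximated as (# decoy hits) / (# target hits) among matches
    scoring at or above the threshold, capped at 1.0.
    """
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    if targets == 0:
        return 0.0  # no accepted target hits, so nothing to report
    return min(1.0, decoys / targets)


# Made-up scores for illustration only.
example_scores = [9.0, 8.0, 7.0, 6.0, 5.0]
example_is_decoy = [False, False, True, False, True]
```

Raising the threshold trades identifications for a lower estimated FDR: with the values above, a threshold of 6.0 accepts three targets and one decoy, while a threshold of 8.0 accepts two targets and no decoys.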

    New insights on the empirical predictability of spectral indicators for PV performance

    Accurate estimation of produced PV energy is critical to business decisions on long-term, utility-scale PV investments. PV energy yield is affected by site-specific conditions. The variability of the spectral distribution, after temperature and irradiation, is a site condition that impacts energy yield estimates. Evaluating the impact of the spectral irradiance distribution on PV performance generally requires accurate information about the PV device's spectral response and the site's measured spectra. Detailed spectral and device information may not always be available. This study analyzes the interrelations between device-dependent and device-independent energetic spectral indicators with spectral data from nine sites with different climates and latitudes, aiming to relax the requirement for detailed device and spectral information. First, an apparent correlation between the yearly Average Photon Energy of each site's spectral distributions and the corresponding latitude is observed. For the commonly applied device-dependent spectral indicator, the mismatch factor, the monthly values of all nine sites exhibit a global linear relationship with the monthly average photon energies. This linear relationship with measured spectral data provides a predictive character for each PV device technology by allowing the estimation of the annual spectral impact from the annual Average Photon Energy, potentially for any site. This work also analyzes the validity of the Spectral Average Useful Fraction and the Spectral Enhancement Factor as alternative device-dependent spectral indicators. These require average spectra and, thus, would reduce the calculation complexity for spectral indicators. Finally, the proposed method was validated qualitatively using synthetic spectral data from the National Solar Radiation Database.
The trends of the scatter plot between the synthetic Spectral Mismatch Factor and the Average Photon Energy that follow the experimental linear regression give an idea of the proposed method's functionality, despite the synthetic data's uncertainties.
Accurate estimation of produced photovoltaic energy is fundamental to business decisions on long-term investments in photovoltaic projects. Photovoltaic energy yield is affected by site-specific conditions. The variability of the spectral distribution, after temperature and irradiation, is a site condition that affects energy yield estimates. One of the challenges in evaluating the impact of spectral variability is reducing the calculation complexity. This means estimating the spectral impact accurately and quickly with the minimum information required a priori. To this end, this thesis analyzes the interrelations between device-dependent and device-independent energetic spectral indicators using spectral data from several climates and latitudes around the world. Given the reductionist approach that spectral indicators provide, we analyze the dependence of the spectral distribution, represented by the average photon energy, on latitude at monthly and annual time scales. Among the device-dependent spectral indicators, the Spectral Mismatch Factor stands out, exhibiting a global linear relationship with the average photon energy on a monthly scale. The exhaustive analysis of this relationship with measured spectral data also provides a predictive character by allowing the annual spectral impact to be calculated from the annual average photon energy; this linear relationship is therefore proposed as an empirical model for the direct and simple calculation of the annual spectral impact.
Additionally, we analyze the global validity of two device-dependent spectral indicators, the Spectral Enhancement Factor and the Spectral Average Useful Fraction, for the selected sites. The aim is to offer an integrated multi-climate catalog of the interrelations of spectral indicators at annual and monthly time scales. Finally, a qualitative validation of the proposed method was carried out using synthetic spectral data from the National Solar Radiation Database. The trends of the annual Spectral Mismatch Factor and Average Photon Energy based on synthetic spectral data follow the experimental linear regression and therefore give an idea of the functionality of the proposed method, despite the uncertainties inherent in the synthetic spectral data
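
The empirical model described here, a straight line relating the mismatch factor to the average photon energy, can be sketched with an ordinary least-squares fit. The function names `fit_line` and `predict`, and all numbers below, are hypothetical placeholders chosen for illustration; they are not values from the study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x_value):
    """Evaluate the fitted line at a new point."""
    return slope * x_value + intercept


# Placeholder monthly values (average photon energy in eV, mismatch factor).
# They lie exactly on a line by construction, purely to show the mechanics.
example_ape = [1.80, 1.85, 1.90, 1.95]
example_mm = [0.5 * a + 0.05 for a in example_ape]
example_slope, example_intercept = fit_line(example_ape, example_mm)
```

Under this reading, a fit to monthly measured pairs for one device technology would then be evaluated at a site's annual average photon energy, via `predict`, to estimate its annual spectral impact.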

    An Induced Environment Contamination Monitor for the Space Shuttle

    The Induced Environment Contamination Monitor (IECM), a set of ten instruments integrated into a self-contained unit and scheduled to fly on shuttle Orbital Flight Tests 1 through 6 and on Spacelabs 1 and 2, is described. The IECM is designed to measure the actual environment to determine whether the strict controls placed on the shuttle system have solved the contamination problem. Measurements are taken during prelaunch, ascent, on-orbit, descent, and postlanding. The on-orbit measurements are molecular return flux, background spectral intensity, molecular deposition, and optical surface effects. During the other mission phases dew point, humidity, aerosol content, and trace gas are measured as well as optical surface effects and molecular deposition. The IECM systems and thermal design are discussed. Preflight and ground operations are presented together with associated ground support equipment. Flight operations and data reduction plans are given

    Graphics Processing Units: Abstract Modelling and Applications in Bioinformatics

    The Graphical Processing Unit is a specialised piece of hardware that contains many low powered cores, available on both the consumer and industrial market. The original Graphical Processing Units were designed for processing high quality graphical images, for presentation to the screen, and were therefore marketed to the computer games market segment. More recently, frameworks such as CUDA and OpenCL allowed the specialised highly parallel architecture of the Graphical Processing Unit to be used for not just graphical operations, but for general computation. This is known as General Purpose Programming on Graphical Processing Units, and it has attracted interest from the scientific community, looking for ways to exploit this highly parallel environment, which was cheaper and more accessible than the traditional High Performance Computing platforms, such as the supercomputer. This interest in developing algorithms that exploit the parallel architecture of the Graphical Processing Unit has highlighted the need for scientists to be able to analyse proposed algorithms, just as happens for proposed sequential algorithms. In this thesis, we study the abstract modelling of computation on the Graphical Processing Unit, and the application of Graphical Processing Unit-based algorithms in the field of bioinformatics, the field of using computational algorithms to solve biological problems. We show that existing abstract models for analysing parallel algorithms on the Graphical Processing Unit are not able to sufficiently and accurately model all that is required. We propose a new abstract model, called the Abstract Transferring Graphical Processing Unit Model, which is able to provide analysis of Graphical Processing Unit-based algorithms that is more accurate than existing abstract models. It does this by capturing the data transfer between the Central Processing Unit and the Graphical Processing Unit. 
We demonstrate the accuracy and applicability of our model with several computational problems, showing that our model provides greater accuracy than the existing models, and verify these claims experimentally. We also contribute novel Graphics Processing Unit-based solutions to two bioinformatics problems: DNA sequence alignment and protein spectral identification, demonstrating promising levels of improvement against the sequential Central Processing Unit experiments

    Unraveling the metabolome of grapevine through FT-ICR-MS: from nutritional value to pathogen resistance

    Grapevine (Vitis vinifera L.) is one of the most important fruit crops in the world due to its numerous food products, namely fresh and dried table grapes, wine and intermediate products, with a high economic importance worldwide. Concerning nutritional value, grapes are highly studied and a great diversity of secondary bioactive metabolites has already been identified. However, grapevine leaves are an important by-product that also has a high nutritional value but is sometimes disregarded. They are an abundant source of compounds of interest for human health and are already included in the human diet in several countries. The study of the nutritional value of this by-product is essential to the improvement of food systems. Hence, in this PhD dissertation an untargeted metabolomic profiling of the leaves of Vitis vinifera cultivar ‘Pinot noir’ was performed by Fourier-transform ion cyclotron-resonance mass spectrometry (FT-ICR-MS) (CHAPTER II). Numerous compounds with diverse nutritional and pharmacological properties, particularly polyphenols and phenolic compounds, several phytosterols and fatty acids (the most represented lipids’ secondary class), were identified. Grapevine leaves were also evaluated for their antioxidant capacity. It was found that leaves present a high antioxidant capacity, similar to berries, putting grapevine leaves at the top of the list of foods with the highest antioxidant activity. Traditional premium cultivars of wine and table grapes are highly susceptible to various diseases. Grapevine downy mildew, powdery mildew and gray mold are caused, respectively, by the biotrophic oomycete Plasmopara viticola (Berk. & Curt.) Berl. & de Toni, by the biotrophic fungus Erysiphe necator (Schweinf.) Burrill and by the necrotrophic fungus Botrytis cinerea Pers.
In Europe, disease management has become one of the main tasks in viticulture; the current control strategy is the massive use of fungicides and pesticides in each growing season. This practice has several associated problems, ranging from environmental impact to economic cost and even human health. The alternative approach to the application of pesticides is breeding for resistance, clearly the most effective and sustainable approach, particularly if coupled with the selection of desirable traits from local grapevine cultivars. However, a successful breeding program of grape plants with increased resistance traits against pathogens requires not only an understanding of the innate resistance mechanisms of cultivars against fungi/oomycetes, but also the identification of biomarkers of tolerance or susceptibility. Among these, metabolic biomarkers may prove particularly useful, not only because they can be determined in a high-throughput way but, above all, because metabolites provide an accurate image of the metabolic state of the plant. To better understand the metabolic differences associated with intrinsic defence mechanisms of grapevine against pathogens, the metabolomes of several genotypes with different degrees of tolerance to fungal/oomycete pathogens were compared through an untargeted metabolomics approach by FT-ICR-MS (CHAPTERS III, IV and V). First, two Vitis vinifera cultivars (V. vinifera cv. Trincadeira and V. vinifera cv. Regent, susceptible and tolerant, respectively, to pathogens) were compared and discriminatory compounds between these two cultivars were identified (CHAPTER III). Also, through the comparison of the metabolome of one Vitis vinifera (V. vinifera cv. 
Cabernet Sauvignon, susceptible to pathogens) and one Vitis species (Vitis rotundifolia, tolerant), it was possible to distinguish both genotypes and determine that the Vitis rotundifolia metabolome appeared to be more complex according to the chemical formulas analysed (CHAPTER IV). Although the grapevine metabolome is complex, it is possible to distinguish Vitis species and different genotypes within the same species. Ultimately, to identify compounds that contribute to the segregation between susceptible and tolerant grapevines, eleven Vitis genotypes were compared at the metabolite level (CHAPTER V). From all the metabolites identified, seven compounds with a higher accumulation in susceptible genotypes were selected. Their metabolic pathways were analysed and the expression profile of the genes coding for biosynthesis and/or degradation enzymes was evaluated by Real-time Polymerase Chain Reaction (qPCR). qPCR studies require one or more reference genes as internal controls. Hence, in this study, ten possible reference genes were tested and the three most stable reference genes (ubiquitin-conjugating enzyme - UBQ, SAND family protein - SAND and elongation factor 1-alpha - EF1α) were established for our analysis and selected for qPCR data normalization. Our data revealed that the leucoanthocyanidin reductase 2 gene (LAR2) presented a significant increase of expression in susceptible genotypes, in accordance with catechin accumulation in this analysis group, making it a possible constitutive metabolic biomarker associated with susceptibility. The grapevine-P. viticola interaction was also analysed by FT-ICR-MS (CHAPTERS VI and VII). The metabolome of Vitis vinifera cv. Trincadeira at 24 hours post-infection (hpi) was analysed and, based only on the chemical profile and representation plots, the discrimination between infected and non-infected grapevine leaves was possible (CHAPTER VI). A further analysis of Vitis vinifera cv. Trincadeira infected with P. 
viticola was performed through Matrix-assisted laser desorption/ionization (MALDI) FT-ICR-MS imaging, to identify leaf surface compounds related to the grapevine-pathogen interaction (CHAPTER VII). Putatively identified sucrose ions were more abundant on P. viticola infected leaves when compared to control ones. Also, sucrose was mainly located around the veins, which is an indicator of the correlation of putatively identified sucrose at P. viticola infection sites, leading to the hypothesis that the pathogen is extracting sucrose from grapevine to reproduce. Each chapter was written as a scientific article and has its own abstract, introduction, materials and methods, results and discussion, conclusion, acknowledgments and references. The results obtained in this PhD thesis are a starting point on the elucidation of the molecular mechanisms related to the intrinsic tolerance/susceptibility to different pathogens. Also, these results can be used for the development of new approaches and help to improve breeding and introgression line programs

    Sample Preparation in Metabolomics

    Metabolomics is increasingly being used to explore the dynamic responses of living systems in biochemical research. The complexity of the metabolome is outstanding, requiring the use of complementary analytical platforms and methods for its quantitative or qualitative profiling. In alignment with the selected analytical approach and the study aim, sample collection and preparation are critical steps that must be carefully selected and optimized to generate high-quality metabolomic data. This book showcases some of the most recent developments in the field of sample preparation for metabolomics studies. Novel technologies presented include electromembrane extraction of polar metabolites from plasma samples and guidelines for the preparation of biospecimens for the analysis with high-resolution μ magic-angle spinning nuclear magnetic resonance (HR-μMAS NMR). In the following chapters, the spotlight is on sample preparation approaches that have been optimized for diverse bioanalytical applications, including the analysis of cell lines, bacteria, single spheroids, extracellular vesicles, human milk, plant natural products and forest trees