157 research outputs found

    An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics

    Modern high-resolution mass spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most sequential noise-reducing algorithms are impractical to use as a pre-processing step because of their high time complexity. In this paper, we present a GPU-based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures that optimize memory and computational operations inside the GPU. These novel data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps communicate essential information between CPU and GPU using a minimal amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based on the size of the input data. G-MSR achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open source on GitHub at the following link: https://github.com/pcdslab/G-MSR
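
    The abstract above mentions packing a nested spectra structure into a 1-D array and switching between in-core and out-of-core modes. The following is only a minimal illustrative sketch of that general idea, not the G-MSR implementation or its actual QIS layout: the function names (`flatten_spectra`, `spectrum_at`, `choose_mode`), the `mz_step` quantization width, and the byte-budget rule are all assumptions made for this example.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); hypothetical layout for this sketch


def flatten_spectra(spectra: List[List[Peak]], mz_step: float = 0.01):
    """Pack a ragged list of spectra into flat 1-D lists plus an offsets list.

    offsets[i]..offsets[i+1] delimits the peaks of spectrum i, so the nested
    structure can be recovered from the flat arrays. m/z values are quantized
    to integer bins of width mz_step (an assumed parameter).
    """
    offsets = [0]
    mz_bins: List[int] = []
    intensities: List[float] = []
    for peaks in spectra:
        for mz, intensity in peaks:
            mz_bins.append(int(round(mz / mz_step)))
            intensities.append(intensity)
        offsets.append(len(mz_bins))
    return offsets, mz_bins, intensities


def spectrum_at(i: int, offsets, mz_bins, intensities, mz_step: float = 0.01):
    """Recover spectrum i (with quantized m/z) from the flat representation."""
    start, end = offsets[i], offsets[i + 1]
    return [(mz_bins[k] * mz_step, intensities[k]) for k in range(start, end)]


def choose_mode(n_values: int, bytes_per_value: int, device_bytes: int) -> str:
    """Pick a processing mode from a simple (assumed) memory-budget rule."""
    needed = n_values * bytes_per_value
    return "in-core" if needed <= device_bytes else "out-of-core"


if __name__ == "__main__":
    spectra = [[(100.004, 5.0), (200.5, 1.0)], [], [(300.25, 2.0)]]
    offsets, mz_bins, intensities = flatten_spectra(spectra)
    print(offsets, mz_bins, intensities)
    print(choose_mode(len(mz_bins), 8, 16))
```

    The offsets list plays the same role as the row-pointer array in a compressed sparse row layout: a flat buffer can be copied to a device in one transfer, and an empty spectrum simply contributes two equal consecutive offsets.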

    Signal and image processing methods for imaging mass spectrometry data

    Imaging mass spectrometry (IMS) has evolved as an analytical tool for many biomedical applications. This thesis focuses on algorithms for the analysis of IMS data produced by a matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) mass spectrometer. IMS provides mass spectra acquired at a grid of spatial points that can be represented as hyperspectral data or a so-called datacube. Analysis of this large and complex data requires efficient computational methods for matrix factorization and for spatial segmentation. In this thesis, state-of-the-art processing methods are reviewed and compared, and improved versions are proposed. Mathematical models for peak shapes are reviewed and evaluated. A simulation model for MALDI-TOF is studied, expanded and developed into a simulator for 2D or 3D MALDI-TOF-IMS data. The simulation approach paves the way for statistical evaluation of algorithms for analysis of IMS data by providing a gold standard dataset. [...

    Development of a complete advanced computational workflow for high-resolution LDI-MS metabolomics imaging data processing and visualization

    Mass spectrometry imaging (MSI) maps the spatial distributions of molecules in a sample. This allows extracting spatially correlated metabolomics information from tissue sections. MSI is not widely used in spatial metabolomics due to several limitations related to MALDI matrices, including the generation of interfering ions in the low mass range and lateral compound delocalization. We developed a workflow to improve the acquisition of metabolites using a MALDI instrument. We sputter an Au nano-layer directly onto the tissue section, enabling the acquisition of metabolites with minimal interference from background signals and ultra-high spatial resolution. We developed an R package for image visualization and MSI data processing, which is optimized to manage datasets larger than the computer's memory using a multi-threaded implementation. Moreover, our software includes a label-free mass alignment algorithm for mass accuracy enhancement.

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the more narrow aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment. High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge and current methods capture only a small fraction of signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of acquired signal, enabling high precision quantification and further analytical tasks. Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. 
Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance

    Large-Scale and Pan-Cancer Multi-omic Analyses with Machine Learning

    Multi-omic data analysis has been foundational in many fields of molecular biology, including cancer research. Investigation of the relationship between different omic data types reveals patterns that cannot otherwise be found in a single data type alone. With recent technological advancements in mass spectrometry (MS), MS-based proteomics has enabled the quantification of thousands of proteins in hundreds of cell lines and human tissue samples. This thesis presents several machine learning-based methods that facilitate the integrative analysis of multi-omic data. First, we reviewed five existing multi-omic data integration methods and performed a benchmarking analysis, using a large-scale multi-omic cancer cell line dataset. We evaluated the performance of these machine learning methods for drug response prediction and cancer type classification. Our result provides recommendations to researchers regarding optimal machine learning method selection for their applications. Second, we generated a pan-cancer proteomic map of 949 cancer cell lines across 40 cancer types and developed a machine learning method DeeProM to analyse the multi-omic information of these lines. This pan-cancer proteomic map (ProCan-DepMapSanger) is now publicly available and represents a major resource for the scientific community, for biomarker discovery and for the study of fundamental aspects of protein regulation. Third, we focused on publicly available multi-omic datasets of both cancer cell lines and human tissue samples and developed a Transformer-based deep learning method, DeePathNet, which integrates human knowledge with machine intelligence. We applied DeePathNet on three evaluation tasks, namely drug response prediction, cancer type classification and breast cancer subtype classification. Taken together, our analyses and methods allowed more accurate cancer diagnosis and prognosis

    Deep Learning Techniques for Multi-Dimensional Medical Image Analysis

    Developing computational methods for fundamentals and metrology of mass spectrometry imaging

    MSI is a suite of powerful imaging tools that can be used to perform untargeted, unlabelled analysis of the distribution of a wide range of molecules from a variety of different sample types. Despite widespread use in numerous different research areas, many aspects of MSI fundamentals remain unknown. Not only are experimental aspects such as desorption and ionisation not always fully understood, but the success (or failure) of many of the computational methods used to mine these data cannot yet be easily evaluated. In this thesis, multivariate analysis methods are used to investigate fundamentals of laser parameters in raster mode MALDI imaging, and DF and CF variables in LESA coupled to FAIMS. Following this, novel methods to evaluate clustering algorithms are described, including multivariate normality testing for distance metric evaluation, and means to generate synthetic data based on multivariate normal distribution sampling. These synthetic data are then used to evaluate a variety of different clustering algorithms used previously in MSI and other fields, and a new, more efficient algorithm using graph-based clustering and a two-phase subset sampling approach is described. This is then demonstrated on large synthetic and real MSI datasets, producing extremely accurate and informative segmentation.

    Double Backpropagation with Applications to Robustness and Saliency Map Interpretability

    This thesis is concerned with works in connection to double backpropagation, which is a phenomenon that arises when first-order optimization methods are applied to a neural network's loss function, if this contains derivatives. Its connection to robustness and saliency map interpretability is explained