
    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergent complex behavior, and modeling it requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the narrower aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle, from improving quantification methods, to deriving protein features from primary structure, to predicting protein content solely from sequence data, and, finally, to developing theoretical protein engineering tools, leading back to experiment.
    High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge, and current methods capture only a small fraction of the signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of the acquired signal, enabling high-precision quantification and further analytical tasks.
    Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, in combination, yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.
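    The decomposition step itself is not spelled out in the abstract; below is a minimal single-node sketch of what factorizing a scan matrix into individual analyte signals can look like, assuming a non-negative matrix factorization formulation (the shapes, rank, and use of scikit-learn are illustrative assumptions, not the thesis pipeline, which is GPU-accelerated and distributed):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy scan matrix: rows = retention-time frames, columns = m/z bins.
scans = rng.random((500, 2000)).astype(np.float32)

# Rank = assumed number of analyte signals to extract from the scan.
model = NMF(n_components=50, init="nndsvda", max_iter=300)
elution = model.fit_transform(scans)  # (frames, analytes): elution profiles
spectra = model.components_           # (analytes, m/z bins): pure spectra

# Each analyte can then be quantified from its factor pair,
# e.g. by integrated profile intensity times total spectrum mass.
abundances = elution.sum(axis=0) * spectra.sum(axis=1)
```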

    Efficient Optimization Algorithms for Nonlinear Data Analysis

    Identification of low-dimensional structures and main sources of variation from multivariate data are fundamental tasks in data analysis. Many methods aimed at these tasks involve solving an optimization problem. The objective of this thesis is therefore to develop computationally efficient and theoretically justified methods for solving such problems. Most of the thesis is based on a statistical model in which ridges of the density estimated from the data are considered the relevant features. Finding ridges, which are generalized maxima, necessitates the development of advanced optimization methods. An efficient and convergent trust region Newton method for projecting a point onto a ridge of the underlying density is developed for this purpose. The method is utilized in a differential equation-based approach for tracing ridges and computing projection coordinates along them. The density estimation is done nonparametrically using Gaussian kernels, which allows the application of ridge-based methods with only mild assumptions on the underlying structure of the data. The statistical model and the ridge-finding methods are adapted to two different applications. The first is extraction of curvilinear structures from noisy data mixed with background clutter. The second is a novel nonlinear generalization of principal component analysis (PCA) and its extension to time series data. The methods have a wide range of potential applications where most of the earlier approaches are inadequate; examples include identification of faults from seismic data and identification of filaments from cosmological data. Applicability of the nonlinear PCA to climate analysis and reconstruction of periodic patterns from noisy time series data is also demonstrated. Other contributions of the thesis include the development of an efficient semidefinite optimization method for embedding graphs into Euclidean space. The method produces structure-preserving embeddings that maximize interpoint distances. It is primarily developed for dimensionality reduction, but also has potential applications in graph theory and various areas of physics, chemistry, and engineering. Finally, the asymptotic behaviour of ridges and maxima of Gaussian kernel densities is investigated as the kernel bandwidth approaches infinity. The results are applied to the nonlinear PCA and to finding significant maxima of such densities, which is a typical problem in visual object tracking.
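    The thesis develops a trust region Newton method for the ridge projection; a simpler, widely used relative of that step is subspace-constrained mean shift on a Gaussian kernel density, sketched below (the bandwidth, ridge dimension, and stopping rule are illustrative assumptions, not the thesis method):

```python
import numpy as np

def scms_project(x, data, h=0.5, ridge_dim=1, n_iter=200, tol=1e-8):
    """Project x onto a ridge_dim-dimensional ridge of the Gaussian
    kernel density over `data` by iterating mean-shift steps that are
    constrained to the subspace normal to the ridge."""
    d = data.shape[1]
    for _ in range(n_iter):
        diff = data - x                               # (n, d)
        w = np.exp(-0.5 * np.sum(diff**2, axis=1) / h**2)
        # Hessian of the (unnormalized) kernel density at x.
        H = (w[:, None, None]
             * (diff[:, :, None] * diff[:, None, :] / h**2
                - np.eye(d))).sum(axis=0) / h**2
        evals, evecs = np.linalg.eigh(H)              # ascending eigenvalues
        V = evecs[:, :d - ridge_dim]                  # ridge-normal subspace
        mean_shift = w @ diff / w.sum()               # plain mean-shift step
        step = V @ (V.T @ mean_shift)                 # constrained step
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x
```

    Projecting many sample points this way traces out the curvilinear structures that the density ridges represent.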

    Data processing for Life Sciences measurements with hyphenated Gas Chromatography-Ion Mobility Spectrometry

    Recent progress in analytical chemistry instrumentation has increased the amount of data available for analysis. This progress has been accompanied by computational improvements that have opened new possibilities for analyzing larger amounts of data. Together, these two factors have made it possible to analyze more complex samples in multiple life science fields, such as biology, medicine, pharmacology, and food science. One of the techniques that has benefited from these improvements is Gas Chromatography - Ion Mobility Spectrometry (GC-IMS), which is useful for the detection of Volatile Organic Compounds (VOCs) in complex samples. Ion Mobility Spectrometry is an analytical technique for characterizing chemical substances based on the velocity of gas-phase ions in an electric field. It can detect trace levels of volatile chemicals, reaching ppb concentrations for some analytes. While the instrument has moderate selectivity, it is very fast: an ion mobility spectrum can be acquired in tens of milliseconds. As it operates at ambient pressure, it is found not only in laboratories but also on-site for screening applications; for instance, it is often used in airports for the detection of drugs and explosives. To enhance the selectivity of IMS, especially for the analysis of complex samples, a gas chromatograph can be used for sample pre-separation, at the expense of a longer analysis.
    While there is better instrumentation and more computational power, better algorithms are still needed to exploit and extract all the information present in the samples. In particular, GC-IMS has not received much attention compared to other analytical techniques. In this work we address several of the data analysis issues for GC-IMS. With respect to pre-processing, we explore several baseline estimation methods and suggest a variation of Asymmetric Least Squares (ALS), a popular baseline estimation technique, that is able to cope with signals presenting large peaks or a large dynamic range. This baseline estimation method is used on Gas Chromatography - Mass Spectrometry (GC-MS) signals as well, as it suits both techniques. Furthermore, we characterize spectral misalignments in a study spanning several months and propose an alignment method based on monotonic cubic splines for their correction; based on the misalignment characterization, we also propose an optimal time span between consecutive calibrant samples. We then explore the use of Multivariate Curve Resolution (MCR) methods for the deconvolution of overlapped peaks and their extraction into pure components, proposing a sliding window along the retention time axis from which pure components are extracted and tracked across windows. This approach is able to extract analytes with lower response than plain MCR, i.e., compounds that have low variance in the overall matrix. Finally, we apply some of these developments to real-world applications: a GC-IMS dataset for fraud prevention and quality control in the classification of olive oils, and GC-MS data for prostate cancer biomarker discovery from the headspace of urine samples.
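    Since Asymmetric Least Squares is the named starting point for the proposed baseline variant, here is a minimal sketch of the classic AsLS smoother (second-difference penalty with asymmetric reweighting; the lam, p, and iteration settings are illustrative assumptions, and the thesis's actual variation for large peaks and dynamic range is not reproduced here):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e6, p=0.01, n_iter=10):
    """Classic asymmetric least squares baseline: minimize the weighted
    fit error plus lam times the sum of squared second differences,
    with weight p above the baseline and 1 - p below it."""
    n = len(y)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    DTD = lam * (D.T @ D)                      # smoothness penalty
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + DTD).tocsc(), w * y)  # penalized weighted LS
        w = np.where(y > z, p, 1 - p)          # asymmetric reweighting
    return z
```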

    Connected Attribute Filtering Based on Contour Smoothness

    A new attribute measuring the contour smoothness of 2-D objects is presented in the context of morphological attribute filtering. The attribute is based on the ratio of circularity to non-compactness, and has a maximum of 1 for a perfect circle; it decreases as the object boundary becomes irregular. Computation on hierarchical image representation structures relies on five auxiliary data members and is rapid. Contour smoothness is a suitable descriptor for detecting and discriminating man-made structures from other image features. An example is demonstrated on a very-high-resolution satellite image using connected pattern spectra and the switchboard platform.
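    A plausible reading of the attribute, sketched on a binary mask, is shown below; the exact definitions are assumptions, taking circularity as 4πA/P² and a moment-of-inertia-based non-compactness, both equal to 1 for a perfect disc:

```python
import numpy as np
from skimage.measure import perimeter

def contour_smoothness(mask):
    """Ratio of circularity to non-compactness for one binary object;
    peaks at 1 for a disc and drops as the boundary roughens.
    Values are approximate on discrete pixel grids."""
    area = mask.sum()
    P = perimeter(mask)                        # boundary length estimate
    circularity = 4 * np.pi * area / P**2      # 1 for a disc, < 1 otherwise
    ys, xs = np.nonzero(mask)
    # Moment of inertia about the centroid (sum of squared distances).
    inertia = ((xs - xs.mean())**2 + (ys - ys.mean())**2).sum()
    noncompactness = 2 * np.pi * inertia / area**2  # 1 for a disc, grows otherwise
    return circularity / noncompactness
```

    On a max-tree or similar hierarchy, the five auxiliary data members per node would plausibly be the area, perimeter, and the incremental moment sums needed for the inertia term, which is what makes the computation rapid.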

    Algorithms for internal validation clustering measures in the post genomic era

    Inferring cluster structure in microarray datasets is a fundamental task for the -omic sciences. A central question in Statistics, Data Analysis and Classification is the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been proposed recently, some of them specifically for microarray data. In this dissertation, a study of internal validation measures is given, paying particular attention to the stability-based ones. Indeed, this class of measures is particularly prominent and promising for obtaining a reliable estimate of the number of clusters in a dataset. For those measures, a new general algorithmic paradigm is proposed here that highlights the richness of measures in this class and accounts for the ones already available in the literature. Moreover, some of the most representative validation measures are also considered. Experiments on 12 benchmark datasets are performed in order to assess both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. The main result is a hierarchy of internal validation measures in terms of precision and speed, highlighting merits and limitations not reported before in the literature. This hierarchy shows that the faster the measure, the less accurate it is. In order to reduce the performance gap between the fastest and the most precise measures, the technique of designing fast approximation algorithms is systematically applied. The end result is a speed-up of many of the measures studied here, bringing the gap between the fastest and the most precise measures to within one order of magnitude in time, with no degradation in their prediction power. Prior to this work, the time gap was at least two orders of magnitude.
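    As a generic illustration of how a stability-based measure estimates the number of clusters (the clusterer, subsample fraction, and agreement index below are assumptions, not the dissertation's specific paradigm):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_pairs=20, frac=0.8, seed=0):
    """Cluster pairs of overlapping subsamples and compare the two
    labelings on their shared points; high average agreement suggests
    that k clusters are reproducibly supported by the data."""
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        shared = np.intersect1d(a, b)
        la = KMeans(n_clusters=k, n_init=10).fit_predict(X[a])
        lb = KMeans(n_clusters=k, n_init=10).fit_predict(X[b])
        pos_a = {idx: i for i, idx in enumerate(a)}  # original -> subsample
        pos_b = {idx: i for i, idx in enumerate(b)}
        scores.append(adjusted_rand_score(
            la[[pos_a[i] for i in shared]],
            lb[[pos_b[i] for i in shared]]))
    return float(np.mean(scores))

# Estimate the number of clusters as the most stable k:
# best_k = max(range(2, 11), key=lambda k: stability_score(X, k))
```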