128 research outputs found

    Optimal experimental designs for the exploration of reaction kinetic phase diagrams

    Bayesian Model-based Methods for the Analysis of DNA Microarrays with Survival, Genetic, and Sequence Data

    DNA microarrays measure the expression of thousands of genes or DNA fragments simultaneously, using probes that hybridize to specific complementary sequences. Gene expression or microarray data analysis problems have a prominent role in biostatistics, the biological sciences, and clinical medicine. The first paper proposes a method for finding associations between the survival time of the subjects and the gene expression of tumor microarrays. Measurement error is known to bias the estimates of survival regression coefficients, and this method minimizes that bias. The latent variable model is shown to detect associations between potentially important genes and survival in a breast cancer dataset that conventional models did not detect, and the method is demonstrated on simulated data to be robust to misspecification. The second paper considers the Expression Quantitative Trait Loci (eQTL) detection problem. An eQTL is a genetic locus that influences gene expression, and the major challenges with this type of data are multiple testing and computational cost. The proposed method extends the Mixture Over Marker (MOM) model to include a structured prior probability that accounts for the transcript location. The new technique exploits the fact that genetic markers are more likely to influence transcripts that share the same location on the genome. The third paper improves the analysis of chromatin immunoprecipitation (ChIP) microarray data. ChIP-chip data analysis estimates the motif of specific Transcription Factor Binding Sites (TFBSs) by comparing the immunoprecipitated (IP) DNA sample, which is enriched for the TFBS, with a control sample of general genomic DNA. The probes on the ChIP-chip array are uniformly spaced on the genome, and the probes that have relatively high intensity in the IP sample will have corresponding sequences that are likely to contain the TFBS motif. Present analytical methods use the array data to discover peaks or regions of IP enrichment, then analyze the sequences of these peaks in a separate procedure to discover the motif. The proposed model will integrate enrichment peak finding and motif discovery through a Hidden Markov Model (HMM). Performance comparisons are made between the proposed HMM and the previously developed methods.

    Novel chemometric proposals for advanced multivariate data analysis, processing and interpretation

    The present Ph.D. thesis, primarily conceived to support and reinforce the relation between academic and industrial worlds, was developed in collaboration with Shell Global Solutions (Amsterdam, The Netherlands) in the endeavour of applying and possibly extending well-established latent variable-based approaches (i.e. Principal Component Analysis - PCA - Partial Least Squares regression - PLS - or Partial Least Squares Discriminant Analysis - PLSDA) for complex problem solving not only in the fields of manufacturing troubleshooting and optimisation, but also in the wider environment of multivariate data analysis. To this end, novel efficient algorithmic solutions are proposed throughout all chapters to address very disparate tasks, from calibration transfer in spectroscopy to real-time modelling of streaming flows of data. The manuscript is divided into the following six parts, focused on various topics of interest: Part I - Preface, where an overview of this research work, its main aims and justification is given together with a brief introduction on PCA, PLS and PLSDA; Part II - On kernel-based extensions of PCA, PLS and PLSDA, where the potential of kernel techniques, possibly coupled to specific variants of the recently rediscovered pseudo-sample projection, formulated by the English statistician John C. 
Gower, is explored and their performance compared to that of more classical methodologies in four different application scenarios: segmentation of Red-Green-Blue (RGB) images, discrimination of on-/off-specification batch runs, monitoring of batch processes and analysis of mixture designs of experiments; Part III - On the selection of the number of factors in PCA by permutation testing, where an extensive guideline on how to accomplish the selection of PCA components by permutation testing is provided through the comprehensive illustration of an original algorithmic procedure implemented for such a purpose; Part IV - On modelling common and distinctive sources of variability in multi-set data analysis, where several practical aspects of two-block common and distinctive component analysis (carried out by methods like Simultaneous Component Analysis - SCA - DIStinctive and COmmon Simultaneous Component Analysis - DISCO-SCA - Adapted Generalised Singular Value Decomposition - Adapted GSVD - ECO-POWER, Canonical Correlation Analysis - CCA - and 2-block Orthogonal Projections to Latent Structures - O2PLS) are discussed, a new computational strategy for determining the number of common factors underlying two data matrices sharing the same row- or column-dimension is described, and two innovative approaches for calibration transfer between near-infrared spectrometers are presented; Part V - On the on-the-fly processing and modelling of continuous high-dimensional data streams, where a novel software system for rational handling of multi-channel measurements recorded in real time, the On-The-Fly Processing (OTFP) tool, is designed; Part VI - Epilogue, where final conclusions are drawn, future perspectives are delineated, and annexes are included.
    Vitale, R. (2017). Novel chemometric proposals for advanced multivariate data analysis, processing and interpretation [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/90442

    A framework for enhancing process understanding using multivariate tools on commercial batch process data

    EngD Thesis. Pharmaceutical companies invest considerable effort in the research and development of new pharmaceutical products and processes using the latest quality-by-design tools and process analytical technologies. Older pharmaceutical processes that were developed without the use of these tools are, however, somewhat neglected. Significant quantities of process data are routinely collected and stored, but the information contained within this data is not extracted. Extensive literature on multivariate statistical process monitoring and control exists for exploring both batch and continuous process data. However, these methodologies rely on data from processes that are relatively well understood or controlled. Many industrial processes show batch-to-batch variability, which may be tolerated as it is not detrimental to the quality of the product, and the impact of this variability is not fully understood. The thesis presents a framework for exploring historical batch process data to extract insights on where process control can be improved. The challenges presented by commercial process data are discussed. Multivariate tools such as multi-way principal component analysis are used to investigate variability in process data. The framework covers the pre-processing steps necessary with batch process data, followed by outlier detection, and finally multivariate modelling of the data to identify where the process could benefit from improved understanding and control. This framework is demonstrated through its application to commercial process data from the active pharmaceutical drug substance manufacturing process of spironolactone at Piramal Healthcare, Morpeth, UK. In this case study, the process exhibits variability in drying times which traditional univariate data analysis has not been able to resolve. The results demonstrated some of the challenges in the use of the available data from commercial processes. Although the results from the multivariate data analysis did not show a statistically significant difference between the batches with long and short drying times, small differences were observed between these two groups. Further analysis of the crystallization process using infrared spectroscopic techniques identified a potential root cause of the extended drying time. This EngD project was supported by the Engineering and Physical Sciences Research Council (EPSRC) and Piramal Healthcare, Morpeth.
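
As an illustration of the pre-processing that multi-way principal component analysis typically relies on, the sketch below unfolds a three-way batch data set (batch × time × variable) into one row per batch and mean-centers the columns. This is only a minimal sketch under stated assumptions: the abstract does not say which unfolding the thesis uses, the function names `unfold_batchwise` and `mean_center` are hypothetical, the data are invented, and all batches are assumed to have already been aligned to the same number of time points.

```python
from statistics import mean


def unfold_batchwise(batches):
    """Flatten each batch (a list of time points, each a list of variable
    readings) into a single row, giving one row per batch."""
    return [[value for time_point in batch for value in time_point]
            for batch in batches]


def mean_center(rows):
    """Subtract each column's mean, so every row describes the deviation
    of one batch from the average batch trajectory."""
    col_means = [mean(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, col_means)] for row in rows]


# Invented example: 3 batches, 2 time points, 2 variables per time point.
batches = [
    [[20.0, 1.0], [30.0, 1.2]],
    [[22.0, 1.1], [32.0, 1.3]],
    [[24.0, 0.9], [34.0, 1.1]],
]

unfolded = unfold_batchwise(batches)  # 3 rows, each with 2 * 2 = 4 columns
centered = mean_center(unfolded)      # input to an ordinary PCA
```

An ordinary PCA on `centered` would then describe batch-to-batch variation around the mean trajectory; scaling each column to unit variance is a common additional step that is omitted here.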

    Proceedings of the 2011 New York Workshop on Computer, Earth and Space Science

    The purpose of the New York Workshop on Computer, Earth and Space Sciences is to bring together the New York area's finest Astronomers, Statisticians, Computer Scientists, Space and Earth Scientists to explore potential synergies between their respective fields. The 2011 edition (CESS2011) was a great success, and we would like to thank all of the presenters and participants for attending. This year was also special as it included authors from the upcoming book titled "Advances in Machine Learning and Data Mining for Astronomy". Over two days, the latest advanced techniques used to analyze the vast amounts of information now available for the understanding of our universe and our planet were presented. These proceedings attempt to provide a small window into the current state of research in this vast interdisciplinary field, and we'd like to thank the speakers who spent the time to contribute to this volume. Comment: Author lists modified. 82 pages. Workshop Proceedings from CESS 2011 in New York City, Goddard Institute for Space Studies

    Multivariate Prediction Models for Bio-Analytical Data

    Quantitative bio-analytical techniques that enable parallel measurements of large numbers of biomolecules generate vast amounts of information for studying and characterising biological systems. These analytical methods are commonly referred to as omics technologies, and can be applied for measurements of e.g. mRNA transcript, protein or metabolite abundances in a biological sample. The work presented in this thesis focuses on the application of multivariate prediction models for modelling and analysis of biological data generated by omics technologies. Omics data commonly contain up to tens of thousands of variables, which are often both noisy and multicollinear. Multivariate statistical methods have previously been shown to be valuable for visualisation and predictive modelling of biological and chemical data with similar properties to omics data. In this thesis, currently available multivariate modelling methods are used in new applications, and new methods are developed to address some of the specific challenges associated with modelling of biological data. Three closely related areas of multivariate modelling of biological data are described and demonstrated in this thesis. First, a multivariate projection method is used in a novel application for predictive modelling between omics data sets, demonstrating how data from two analytical sources can be integrated and modelled together by exploring covariation patterns between the data sets. This approach is exemplified by modelling of data from two studies, the first containing proteomic and metabolic profiling data and the second containing transcriptomic and metabolic profiling data. Second, a method for piecewise multivariate modelling of short time-series data is developed and demonstrated by modelling of simulated data as well as metabolic profiling data from a toxicity study, providing a new method for characterisation of multivariate bio-analytical time-series data. Third, a kernel-based method is developed and applied for non-linear multivariate prediction modelling of omics data, addressing the specific challenge of modelling non-linear variation in biological data.

    Cell Type-specific Analysis of Human Interactome and Transcriptome

    Cells are the fundamental building block of complex tissues in higher-order organisms. These cells take different forms and shapes to perform a broad range of functions. What makes a cell uniquely eligible to perform a task, however, is not well-understood; neither is the defining characteristic that groups similar cells together to constitute a cell type. Even for known cell types, underlying pathways that mediate cell type-specific functionality are not readily available. These functions, in turn, contribute to cell type-specific susceptibility in various disorders

    Proceedings, MSVSCC 2017

    Proceedings of the 11th Annual Modeling, Simulation & Visualization Student Capstone Conference held on April 20, 2017 at VMASC in Suffolk, Virginia. 211 pp

    Iterative piecewise linear regression to accurately assess statistical significance in batch confounded differential expression analysis

    DOI: 10.1007/978-3-642-30191-9_15. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7292 LNBI, pp. 153-16