1,345 research outputs found

    Methodology and theory for partial least squares applied to functional data

    The partial least squares procedure was originally developed to estimate the slope parameter in multivariate parametric models. More recently it has gained popularity in the functional data literature. There, the partial least squares estimator of slope is used either to construct linear predictive models or as a tool to project the data onto a one-dimensional quantity that is employed for further statistical analysis. Although the partial least squares approach is often viewed as an attractive alternative to projections onto the principal component basis, its properties are less well known than those of the latter, mainly because of its iterative nature. We develop an explicit formulation of partial least squares for functional data, which leads to insightful results and motivates new theory, demonstrating consistency and establishing convergence rates. Comment: Published at http://dx.doi.org/10.1214/11-AOS958 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Improvements to PLSc: Remaining problems and simple solutions

    The recent article by Dijkstra and Henseler (2015b) presents a consistent partial least squares (PLSc) estimator that corrects for measurement error attenuation and provides evidence showing that, generally, PLSc performs comparably to a wide variety of more conventional estimators for structural equation models (SEM) with latent variables. However, PLSc does not adjust for other limitations of conventional PLS, namely: (1) bias in estimates of regression coefficients due to capitalization on chance; and (2) overestimation of composite reliability due to the proportionality relation between factor loadings and indicator weights. In this article, we illustrate these problems and then propose a simple solution: the use of unit-weighted composites, rather than those constructed from PLS results, combined with errors-in-variables regression (EIV) by using reliabilities obtained from factor analysis. Our simulations show that these two improvements perform as well as or better than PLSc. We also provide examples of how our proposed estimator can be easily implemented in various proprietary and open source software packages

    Data-Driven Fault Detection and Reasoning for Industrial Monitoring

    This open access book assesses the potential of data-driven methods in industrial process monitoring engineering. The process modeling, fault detection, classification, isolation, and reasoning are studied in detail. These methods can be used to improve the safety and reliability of industrial processes. Fault diagnosis, including fault detection and reasoning, has attracted engineers and scientists from various fields such as control, machinery, mathematics, and automation engineering. Combining the diagnosis algorithms and application cases, this book establishes a basic framework for this topic and implements various statistical analysis methods for process monitoring. This book is intended for senior undergraduate and graduate students who are interested in fault diagnosis technology, researchers investigating automation and industrial security, professional practitioners and engineers working on engineering modeling and data processing applications. This is an open access book

    (Q)SAR Modelling of Nanomaterial Toxicity - A Critical Review

    There is increasing recognition that nanomaterials pose a risk to human health, and that the novel engineered nanomaterials (ENMs) in the nanotechnology industry and their increasing industrial usage pose the most immediate problem for hazard assessment, as many of them remain untested. The large number of materials and their variants (different sizes and coatings, for instance) that require testing, together with ethical pressure towards non-animal testing, means that expensive animal bioassay is precluded, and the use of (quantitative) structure activity relationship ((Q)SAR) models as an alternative source of hazard information should be explored. (Q)SAR modelling can be applied to fill the critical knowledge gaps by making the best use of existing data, prioritize physicochemical parameters driving toxicity, and provide practical solutions to the risk assessment problems caused by the diversity of ENMs. This paper covers the core components required for successful application of (Q)SAR technologies to ENM toxicity prediction, summarizes the published nano-(Q)SAR studies, and outlines the challenges ahead for nano-(Q)SAR modelling. It provides a critical review of (1) the present availability of ENM characterization/toxicity data, (2) the characterization of nanostructures that meets the needs of (Q)SAR analysis, (3) the published nano-(Q)SAR studies and their limitations, (4) the in silico tools for (Q)SAR screening of nanotoxicity and (5) the prospective directions for the development of nano-(Q)SAR models

    Hyperspectral imaging: algorithmic advances in variable selection and applications to wood science

    According to Beer’s Law there is a linear dependence between the absorbance of a material and the concentration of an absorbing species in the material. Thus, if one is interested in modeling the concentration of an absorbing species, it should be possible to do so by utilizing a linear model to describe the concentration of the species from a measurement of the absorbance of the material. This thesis is concerned with developing such models from hyperspectral measurements taken in the visible (vis) and near infrared (NIR) region of the electromagnetic spectrum. When developing such models, it is frequently the case that a majority of the wavelengths within a measured spectrum are not absorbed by the species of interest and should therefore preferably be excluded from the developed model in order to optimize its performance. The process of identifying unnecessary wavelengths is often driven by trial and error; as such, it tends to be time consuming and computationally demanding. During the work leading up to Paper I we discovered a conceptually very simple technique which allows calculations to be recycled when developing partial least squares (PLS) models from different combinations of wavelengths. The technique can greatly reduce the computational cost of fitting multiple regression models with various combinations of included/excluded wavelengths to a dataset. In Paper II we incorporate the findings of Paper I into a genetic algorithm (GA) and demonstrate that the technique can also be used to simultaneously evaluate, in a computationally efficient manner, combinations of wavelengths which are preprocessed using different techniques. In Paper III and IV we develop models which solve wood science related issues. In Paper III samples of spruce (Picea abies) treated with a phosphorus-based flame retardant compound were scanned using a NIR hyperspectral camera. 
The resulting data was subsequently used to develop a PLS model which estimated the phosphorus content from the spectral signal. In Paper IV samples of thermally modified pine (Pinus sylvestris) were repeatedly scanned over time as they dried. The resulting time series sequences of hyperspectral NIR data were used to develop a regression model capable of estimating the moisture content of the pine from the spectra. In Paper V a generic method is developed for studying and summarizing hyperspectral time series sequences in terms of known and unknown variations. The main idea of the presented method is that spectral variations of known origin are removed from the data. The remaining residual data, containing variation of unknown origin, is then subjected to dimensionality reduction in order to identify new, previously unknown variations in the data; variations which in the case of hyperspectral time series data may exhibit temporal as well as spatial patterns of interest. The developed concept was experimentally evaluated in Paper V on a piece of unmodified spruce (Picea abies) which was monitored using a vis-NIR hyperspectral camera as it dried over the course of 21 hours
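
    The idea of a linear PLS calibration from absorbance spectra to concentration can be illustrated with a minimal sketch. It is not the calculation-recycling technique of Paper I, only a textbook PLS1 fit using the NIPALS recursion. The function name `pls1_fit`, the simulated absorption bands, the noise level and the number of components are all assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (PLS1) with the NIPALS recursion.

    Returns (coef, intercept) so that predictions are X @ coef + intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # centered residuals, deflated below
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                        # weight: covariance direction
        w /= np.linalg.norm(w)
        t = Xr @ w                           # scores
        tt = t @ t
        p = Xr.T @ t / tt                    # X loadings
        q_a = yr @ t / tt                    # y loading
        Xr = Xr - np.outer(t, p)             # deflate X
        yr = yr - q_a * t                    # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in original X space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Hypothetical data: two absorbing species with Gaussian bands (Beer's law: additive, linear).
rng = np.random.default_rng(0)
wavelengths = np.linspace(900, 1700, 50)                     # nm, assumed range
band = np.exp(-0.5 * ((wavelengths - 1200) / 40) ** 2)       # species of interest
interferent = np.exp(-0.5 * ((wavelengths - 1450) / 60) ** 2)
conc = rng.uniform(0, 1, 40)
other = rng.uniform(0, 1, 40)
spectra = np.outer(conc, band) + np.outer(other, interferent)
spectra += 0.01 * rng.normal(size=spectra.shape)

coef, intercept = pls1_fit(spectra, conc, n_components=2)
pred = spectra @ coef + intercept
print("in-sample RMSE:", np.sqrt(np.mean((pred - conc) ** 2)))
```

    Wavelength selection in this setting amounts to refitting on a column subset such as `spectra[:, mask]`; repeating that for many masks is the cost the thesis aims to reduce.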

    MALDI-ToF mass spectrometry biomarker profiling via multivariate data analysis application in the biopharmaceutical bioprocessing industry

    PhD Thesis. Matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-ToF MS) is a technique by which protein profiles can be rapidly produced from biological samples. Proteomic profiling and biomarker identification using MALDI-ToF MS have been utilised widely in microbiology for bacteria identification and in clinical proteomics for disease-related biomarker discovery. To date, the benefits of MALDI-ToF MS have not been realised in the area of mammalian cell culture during bioprocessing. This thesis explores the approach of ‘intact-cell’ MALDI-ToF MS (ICM-MS) combined with projection to latent structures – discriminant analysis (PLS-DA), to discriminate between mammalian cell lines during bioprocessing. Specifically, the industrial collaborator, Lonza Biologics, is interested in adopting this approach to discriminate between IgG monoclonal antibody producing Chinese hamster ovary (CHO) cell lines based on their productivities and to identify protein biomarkers which are associated with the cell line productivities. After classifying cell lines into two categories (high/low producers; Hs/Ls), it is hypothesised that Hs and Ls CHO cells exhibit different metabolic profiles and hence differences in phenotypic expression patterns will be observed. The protein expression patterns correlate with the productivities of the cell lines, and introduce between-class variability. The chemometric method of PLS-DA can use this variability to classify the cell lines as Hs or Ls. A number of differentially expressed proteins were matched and identified as biomarkers after a SwissProt/TrEMBL protein database search. The identified proteins revealed that proteins involved in biological processes such as protein biosynthesis, protein folding, glycolysis and cytoskeleton architecture were upregulated in Hs. 
This study demonstrates that ICM-MS combined with PLS-DA and a protein database search can be a rapid and valuable tool for biomarker discovery in the bioprocessing industry. It may help in providing clues to potential cell genetic engineering targets as well as a tool in process development in the bioprocessing industry. With the completion of the sequencing of the CHO genome, this study provides a foundation for rapid biomarker profiling of CHO cell lines in culture during recombinant protein manufacturing. Lonza Biologics

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data that is being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, the methods not only need to be effective with the underlying data, but also have to be scalable to handle larger data collections. This thesis focuses on developing scalable and effective methods targeted towards different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel based learning methods and develop a GPU based parallel framework for this class of problems. An improved numerical algorithm that utilizes the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations and factors like instrument malfunctioning lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging and, furthermore, the GPU's texture memory is better utilized for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech ("utterances"). This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We proposed a kernelized Renyi distance based similarity scoring for speaker recognition. 
While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation
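
    The analogy between kriging and Gaussian process regression mentioned above can be sketched on the CPU in a few lines. This is a minimal illustration, assuming a zero-mean process, a squared-exponential covariance with unit variance, and a known length scale (which corresponds to simple kriging); it is not the GPU framework of the thesis. The names `rbf_kernel` and `gp_interpolate` and all numeric values are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D location arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_interpolate(x_obs, y_obs, x_grid, length_scale=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_grid (simple kriging)."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_grid, x_obs, length_scale)
    L = np.linalg.cholesky(K)                      # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = K_s @ alpha                             # kriging predictor
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v**2, axis=0)               # prior variance k(x, x) = 1
    return mean, var

# Scattered observations interpolated onto a regular grid (values are made up).
x_obs = np.array([0.0, 1.0, 2.5, 4.0])
y_obs = np.sin(x_obs)
x_grid = np.linspace(0.0, 4.0, 9)
mean, var = gp_interpolate(x_obs, y_obs, x_grid)
print(np.round(mean, 3))
```

    The Cholesky solve costs O(n^3) in the number of observations, which is the expense that motivates the accelerations described in the abstract.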

    Monitoring wine fermentation using ATR-MIR spectroscopy and chemometric techniques.

    Wine is one of the most appreciated high added-value products in the world, and controlling wine production has therefore always been a priority for most wineries. Implementing at-line analyses such as Process Analytical Technologies (PAT) guidelines not only enables control of the final wine but also makes it possible to apply corrective measures throughout the process, thus avoiding a defective final product. In this doctoral thesis, we investigated the possibility of implementing different strategies to control and detect deviations during alcoholic fermentation of wine using fast, portable equipment: an Attenuated Total Reflectance Mid-Infrared (ATR-MIR) spectrometer, which provides, in a few seconds, a large amount of information about the fermentation process. This information was processed with different chemometric techniques. First, using the spectral data and Partial Least Squares Regression, different chemical parameters were predicted during alcoholic fermentation. Secondly, we compared the spectra from both Normal Operation Conditions and deviated fermentations using Partial Least Squares Discriminant Analysis. 
ANOVA–simultaneous component analysis was applied to study the influence of several factors on the variance of the spectra. Multivariate Curve Resolution Alternating Least Squares was used to model both alcoholic and malolactic fermentations. Finally, a PAT methodolog

    Inverse Problems in Geosciences: Modelling the Rock Properties of an Oil Reservoir
