61 research outputs found
On Experimental Designs for Derivative Random Fields
Second-order differentiable random fields are investigated, and proposals are made for the experimental design of observations of the derivative fields. From a certain standpoint, the following questions are answered: How much information do observations of derivatives provide for the prediction of the underlying stochastic field? How does an a priori choice of the covariance function affect the information ratio between different derivative fields with respect to prediction? The so-called "imse-update" for the best linear predictor is taken as the objective function. The central part is the study of experimental designs with (asymptotically) vanishing correlations. In particular, the influence of the Matérn class and J-Bessel classes of covariance functions is examined. Furthermore, the influence of simultaneous observation of different derivatives is investigated. Finally, some empirical studies are carried out, from which some practical advice is derived.
Sensor scheduling with time, energy and communication constraints
In this paper, we present new algorithms and analysis for the linear inverse sensor placement and scheduling problems over multiple time instances with power and communications constraints. The proposed algorithms, which deal directly with minimizing the mean squared error (MSE), are based on the convex relaxation approach to address the binary optimization scheduling problems that are formulated in sensor network scenarios. We propose to balance the energy and communications demands of operating a network of sensors over time while still guaranteeing a minimum level of estimation accuracy. We measure this accuracy by the MSE, for which we provide average-case and lower-bound analyses that hold in general, irrespective of the scheduling algorithm used. We show experimentally how the proposed algorithms perform against state-of-the-art methods previously described in the literature.
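The objective described above, choosing sensor subsets that minimize the estimation MSE of a linear inverse problem, can be illustrated with a toy scheduler. The paper's algorithms use convex relaxation; the sketch below instead applies a simple greedy rule to the same MSE (A-optimality) objective, so it shows the objective rather than the authors' method. All names and the small ridge term are illustrative assumptions.

```python
import numpy as np

def mse_proxy(A, sel, noise_var=1.0, ridge=1e-8):
    """Trace of the (ridge-regularized) error covariance of the
    least-squares estimate of x from y = A[sel] @ x + noise.
    This is the scheduling objective: smaller means better estimates."""
    m = A.shape[1]
    As = A[list(sel)]
    info = As.T @ As + ridge * np.eye(m)   # information matrix of the subset
    return noise_var * np.trace(np.linalg.inv(info))

def greedy_schedule(A, k, noise_var=1.0):
    """Pick k sensor rows of A one at a time, each time adding the
    row that most reduces the MSE proxy (a heuristic baseline, not
    the paper's convex-relaxation algorithm)."""
    sel = []
    for _ in range(k):
        rest = [i for i in range(A.shape[0]) if i not in sel]
        best = min(rest, key=lambda i: mse_proxy(A, sel + [i], noise_var))
        sel.append(best)
    return sorted(sel)

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 3))   # 12 candidate sensors, 3 unknowns
sel = greedy_schedule(A, k=4)
print(sel, mse_proxy(A, sel))
```

Activating more sensors can only add information, so the full 12-sensor MSE lower-bounds any 4-sensor schedule; the scheduling problem is to get close to that bound under the budget.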
Optimal transport representations and functional principal components for distribution-valued processes
We develop statistical models for samples of distribution-valued stochastic
processes through time-varying optimal transport process representations under
the Wasserstein metric when the values of the process are univariate
distributions. While functional data analysis provides a toolbox for the
analysis of samples of real- or vector-valued processes, there is at present no
coherent statistical methodology available for samples of distribution-valued
processes, which are increasingly encountered in data analysis. To address the
need for such methodology, we introduce a transport model for samples of
distribution-valued stochastic processes that implements an intrinsic approach
whereby distributions are represented by optimal transports. Substituting
transports for distributions addresses the challenge of centering
distribution-valued processes and leads to a useful and interpretable
representation of each realized process by an overall transport and a
real-valued trajectory, utilizing a scalar multiplication operation for
transports. This representation facilitates a connection to Gaussian processes
that proves useful, especially for the case where the distribution-valued
processes are only observed on a sparse grid of time points. We study the
convergence of the key components of the proposed representation to their
population targets and demonstrate the practical utility of the proposed
approach through simulations and application examples.
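For univariate distributions, as considered in this abstract, optimal transport has a closed form: the monotone transport map from a distribution with CDF F to one with quantile function Q is Q ∘ F, and the 2-Wasserstein distance is the L2 distance between quantile functions. A minimal numpy sketch of these two facts (function names are illustrative, not the authors' implementation):

```python
import numpy as np

def quantile_fn(sample, p):
    """Empirical quantile function of a sample, evaluated at probabilities p."""
    return np.quantile(np.asarray(sample), p)

def w2_distance(sample_a, sample_b, n_grid=1000):
    """2-Wasserstein distance between two univariate empirical
    distributions: the L2 distance between their quantile functions."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    qa, qb = quantile_fn(sample_a, p), quantile_fn(sample_b, p)
    return np.sqrt(np.mean((qa - qb) ** 2))

def transport_map(sample_a, sample_b, x):
    """Monotone optimal transport map T = Q_b o F_a pushing the
    distribution of sample_a to that of sample_b, evaluated at x."""
    a = np.sort(np.asarray(sample_a))
    F_a = np.searchsorted(a, x, side="right") / len(a)   # empirical CDF of a
    return quantile_fn(sample_b, np.clip(F_a, 0.0, 1.0))

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 5000)
b = rng.normal(2.0, 1.0, 5000)   # same shape, shifted by 2
print(w2_distance(a, b))         # close to 2: a pure location shift
print(transport_map(a, b, np.array([0.0])))  # close to 2: T(x) = x + 2
```

Representing each distribution by its transport (rather than its density) is what makes centering and scalar multiplication well defined in the model above.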
Variable selection and predictive models in Big Data environments
Mención Internacional en el título de doctor.
In recent years, advances in data collection technologies have posed a difficult
challenge by producing increasingly complex and larger datasets. Traditionally,
statistical methodologies dealt with datasets where the number of variables did
not exceed the number of observations. However, problems where the
number of variables is larger than the number of observations have become more and
more common, and can be seen in areas like economics, genetics, climate data, computer
vision, etc. This situation has required the development of new methodologies
suitable for a high dimensional framework.
Most statistical methodologies are limited to the study of averages. Least
squares regression, principal component analysis, partial least squares... All these
techniques provide mean-based estimations, and are built around the key idea that
the data is normally distributed. But this assumption is usually unverified
in real datasets, where skewness and outliers can easily be found. The estimation
of other metrics like the quantiles can help provide a more complete picture of the
data distribution.
This thesis is built around these two core ideas: the development of more robust,
quantile-based methodologies suitable for high dimensional problems. The thesis is
structured as a compendium of articles, divided into four chapters where each chapter
has independent content and structure but is nevertheless encompassed within
the main objective of the thesis.
First, Chapter 1 introduces basic concepts and results, assumed to be known
or referenced in the rest of the thesis. A possible solution when dealing with high
dimensional problems in the field of regression is the usage of variable selection techniques.
In this regard, sparse group lasso (SGL) has proven to be a very effective
alternative. However, the mathematical formulation of this estimator introduces
some bias in the model, which means that the variables selected by the model may not be the truly significant ones. Chapter 2 studies the formulation
of an adaptive sparse group lasso for quantile regression, a more flexible formulation
that makes use of the adaptive idea, that is, the usage of adaptive weights in
the penalization to help correct the bias, thereby improving variable selection
and prediction accuracy. An alternative solution to the high dimensional problem
is the usage of a dimension reduction technique like partial least squares. Partial
least squares (PLS) is a methodology initially proposed in the field of chemometrics
as an alternative to traditional least squares regression when the data is high dimensional
or suffers from collinearity. It works by projecting the independent data matrix
onto a subspace of uncorrelated variables that maximize the covariance with the response
matrix. However, being an iterative process based on least squares makes this
methodology extremely sensitive to the presence of outliers or heteroscedasticity.
Chapter 3 defines the fast partial quantile regression, a technique that performs
a projection into a subspace where a quantile covariance metric is maximized, effectively
extending partial least squares to the quantile regression framework. Another
field where it is common to find high dimensional data is functional data analysis,
where the observations are functions measured along time, instead of scalars.
A key technique in this field is functional principal component analysis (FPCA), a
methodology that provides an orthogonal set of basis functions that best explains
the variability in the data. However, FPCA fails to capture shifts in the scale of the
data that affect the quantiles.
Chapter 4 introduces the functional quantile factor model, a methodology that
extends the concept of FPCA to quantile regression, obtaining a model that can
explain the quantiles of the data conditional on a set of common functions.
In Chapter 5, asgl, a Python package that solves penalized least squares and
quantile regression models in low and high dimensional frameworks, is introduced,
filling a gap in the currently available implementations of these models.
Finally, Chapter 6 presents the final conclusions of this thesis, including possible
lines of research and future work.
I want to acknowledge the financial support received through research grants and projects PIPF UC3M, ECO2015-66593-P (Ministerio de Economía y Competitividad, Spain) and PID2020-113961GB-I00 (Agencia Estatal de Investigación, Spain). Programa de Doctorado en Ingeniería Matemática por la Universidad Carlos III de Madrid. Presidenta: María Luz Durban Reguera. Secretaria: María Ángeles Gil Álvarez. Vocal: Ying We
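As a concrete anchor for the quantile-regression machinery that runs through the chapters above: quantile regression fits coefficients by minimizing the pinball (check) loss, which can be written exactly as a linear program. The sketch below is a generic textbook formulation via scipy, not the asgl package's actual API.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau=0.5):
    """Fit linear quantile regression at level tau by solving the LP
        min  tau * 1'u_pos + (1 - tau) * 1'u_neg
        s.t. X beta + u_pos - u_neg = y,  u_pos, u_neg >= 0,
    where u_pos/u_neg split the residuals into positive/negative parts.
    Returns the coefficient vector beta."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.full(n, tau), np.full(n, 1 - tau)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])   # variables: [beta, u_pos, u_neg]
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# On noiseless linear data, median regression recovers the line exactly.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([1.0, 3.0])
print(quantile_regression(X, y, tau=0.5))
```

The penalized estimators in Chapters 2 and 5 add lasso/sparse-group-lasso terms on beta to this base problem; varying tau away from 0.5 yields the conditional quantiles discussed in the abstract.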
The Meta-Model Approach for Simulation-based Design Optimization.
The design of products and processes makes increasing use of computer simulations for the prediction of their performance. These computer simulations are considerably cheaper than their physical equivalents. Finding the optimal design has therefore become a possibility. One approach for finding the optimal design using computer simulations is the meta-model approach, which approximates the behaviour of the computer simulation outcome using a limited number of time-consuming computer simulations. This thesis contains four main contributions, which are illustrated by industrial cases. First, a method is presented for the construction of an experimental design for computer simulations when the design space is restricted by many (nonlinear) constraints. The second contribution is a new approach for the approximation of the simulation outcome. This approximation method is particularly useful when the simulation model outcome reacts highly nonlinearly to its inputs. Third, the meta-model based approach is extended to a robust optimization framework. Using this framework, many uncertainties can be taken into account, including uncertainty on the simulation model outcome. The fourth main contribution is the extension of the approach for use in integral design of many parts of complex systems.
Engineering-Driven Learning Approaches for Bio-Manufacturing and Personalized Medicine
Healthcare problems have tremendous impact on human life. The past two decades have witnessed various biomedical research advances and gains in clinical therapeutic effectiveness, including minimally invasive surgery, regenerative medicine, and immune therapy. However, the development of new treatment methods relies heavily on heuristic approaches and the experience of well-trained healthcare professionals. Therefore, it is often hindered by patient-specific genotypes and phenotypes, operator-dependent post-surgical outcomes, and exorbitant cost. Towards clinically effective and inexpensive treatments, this thesis develops analytics-based methodologies that integrate statistics, machine learning, and advanced manufacturing.
Chapter 1 of my thesis introduces a novel function-on-function surrogate model with application to tissue-mimicking of 3D-printed medical prototypes. Using synthetic metamaterials to mimic biological tissue, 3D-printed medical prototypes are becoming increasingly important in improving surgery success rates. Here, the objective is to model mechanical response curves via functional metamaterial structures, and then conduct a tissue-mimicking optimization to find the best metamaterial structure. The proposed function-on-function surrogate model utilizes a Gaussian process for efficient emulation and optimization. For functional inputs, we propose a spectral-distance correlation function, which captures important spectral differences between two functional inputs. Dependencies for functional outputs are then modeled via a co-kriging framework. We further adopt shrinkage priors to learn and incorporate important physics. Finally, we demonstrate the effectiveness of the proposed emulator in a real-world study on heart surgery.
Chapter 2 proposes an adaptive design method for experimentation under response censoring, often encountered in biomedical experiments. Censoring would result in a significant loss of information, and thereby a poor predictive model over an input domain. For such problems, experimental design is paramount for maximizing predictive power with a limited budget for expensive experimental runs. We propose an integrated censored mean-squared error (ICMSE) design method, which first estimates the posterior probability of a new observation being censored and then adaptively chooses design points that minimize predictive uncertainty under censoring. Adopting a Gaussian process model with product correlation functions, our ICMSE criterion has an easy-to-evaluate expression for efficient design optimization. We demonstrate the effectiveness of the ICMSE method in an application of medical device testing.
Chapter 3 develops an active image synthesis method for efficient labeling (AISEL) to improve learning performance in healthcare and medicine tasks, addressing the limited availability of data and the high cost of data collection, which are key challenges when applying deep neural networks to healthcare applications. Our AISEL can generate a complementary dataset, with labels actively acquired to incorporate underlying physical knowledge at hand. The AISEL framework first leverages a bidirectional generative invertible network (GIN) to extract interpretable features from training images and generate physically meaningful virtual ones. It then efficiently samples virtual images to exploit uncertain regions and explore the entire image space. We demonstrate the effectiveness of AISEL on a heart surgery study, where it lowers the labeling cost by 90% while achieving a 15% improvement in prediction accuracy.
Chapter 4 presents a calibration-free statistical framework for the promising chimeric antigen receptor T cell therapy in fighting cancers. The objective is to effectively recover critical quality attributes under the intrinsic patient-to-patient variability, and therefore lower the cost of cell therapy. Our calibration-free approach models the patient-to-patient variability via a patient-specific calibration parameter. We adopt multiple biosensors to construct a patient-invariance statistic and alleviate the effect of the calibration parameter. Using the patient-invariance statistic, we can then recover the critical quality attribute during cell culture, free from the calibration parameter. In a T cell therapy study, our method effectively recovers viable cell concentration for cell culture monitoring and scale-up. Ph.D.
Bayesian quadrature, energy minimization, and space-filling design
A standard objective in computer experiments is to approximate the behavior of an unknown function on a compact domain from a few evaluations inside the domain. When little is known about the function, space-filling design is advisable: typically, points of evaluation spread out across the available space are obtained by minimizing a geometrical criterion (for instance, covering radius) or a discrepancy criterion measuring distance to uniformity. The paper investigates connections between design for integration (quadrature design), construction of the (continuous) best linear unbiased estimator (BLUE) for the location model, space-filling design, and minimization of energy (kernel discrepancy) for signed measures. Integrally strictly positive definite kernels define strictly convex energy functionals, with an equivalence between the notions of potential and directional derivative, showing the strong relation between discrepancy minimization and more traditional design of optimal experiments. In particular, kernel herding algorithms, which are special instances of vertex-direction methods used in optimal design, can be applied to the construction of point sequences with suitable space-filling properties.
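Kernel herding, mentioned in the last sentence, greedily builds a point set whose empirical measure has small kernel discrepancy (MMD) to a target measure: each new point maximizes the target potential minus the average kernel to the points already chosen. A minimal sketch on a finite candidate grid with a Gaussian kernel (the bandwidth, grid, and point count are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

def gauss_kernel(X, Y, h=0.2):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h))

def kernel_herding(candidates, n_points, h=0.2):
    """Greedy kernel herding on a finite candidate set, targeting the
    uniform measure over the candidates: each step picks the point
    maximizing potential(x) - (sum of k(x, x_i) over chosen) / (t + 1)."""
    K = gauss_kernel(candidates, candidates, h)
    potential = K.mean(axis=1)   # E_y k(x, y) under the uniform target
    chosen = []
    for _ in range(n_points):
        if chosen:
            score = potential - K[:, chosen].sum(axis=1) / (len(chosen) + 1)
        else:
            score = potential.copy()
        chosen.append(int(np.argmax(score)))
    return candidates[chosen]

def mmd2(X, Y, h=0.2):
    """Squared kernel discrepancy (MMD^2) between empirical measures."""
    return (gauss_kernel(X, X, h).mean() - 2 * gauss_kernel(X, Y, h).mean()
            + gauss_kernel(Y, Y, h).mean())

# Space-filling points in the unit square via herding.
g = np.linspace(0, 1, 25)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
design = kernel_herding(grid, 20)
print(mmd2(design, grid))   # small: 20 points nearly matching the uniform grid
```

The greedy step is exactly a vertex-direction move on the convex energy functional, which is the connection to optimal-design algorithms that the abstract highlights.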