8 research outputs found

    Smooth generalized linear models for aggregated data

    International Mention in the doctoral degree.

    Aggregated data commonly appear in areas such as epidemiology, demography, and public health. Generally, data are aggregated to protect patient privacy, to facilitate compact presentation, or to make them comparable with other, coarser datasets. However, aggregation may hinder the visualization of the underlying distribution that the data follow. It also prevents the direct analysis of relationships between aggregated data and potential risk factors, which are commonly measured at a finer resolution. Therefore, it is of interest to develop statistical methodologies for disaggregating coarse health data to a finer scale. For example, in the spatial setting, it could be desirable to obtain estimates, from coarse areal data, on a fine spatial grid or on units less coarse than the original ones. These two cases are known as the area-to-point (ATP) and area-to-area (ATA) cases, respectively, and are illustrated in the first chapter of this thesis. Moreover, spatial data may be recorded at coarse units over time. In some cases, the temporal dimension is also aggregated, hindering the visualization of the evolution of the underlying process over time.

    In this thesis we propose a novel non-parametric method that we call the composite link mixed model or, more succinctly, CLMM. In our proposed model, we treat the observed data as indirect observations of an underlying process (defined at a finer resolution than the observed data), which we want to estimate. The mixed model formulation of our proposal allows us to include fine-scale population information, and complex structures as random effects, as parts of the modelling of the underlying trend. Since the CLMM is based on the approach of Eilers (2007), called the penalized composite link model (PCLM), we briefly review the PCLM approach in the first section of the second chapter of this thesis. Then, in the second section of this chapter, we introduce the CLMM approach in a univariate setting, which can be seen as a reformulation of the PCLM within a mixed model framework. This is achieved by following the mixed model reformulation of P-splines proposed in Currie and Durbán (2002) and Currie et al. (2006), which is also reviewed here. The parameters of the CLMM can then be estimated within the framework of mixed model theory. This offers an alternative for the estimation of the PCLM that avoids the use of information criteria for smoothing parameter selection. In the third section of the second chapter, we extend the CLMM approach to the multidimensional (array) case, where Kronecker products are involved in the extended model formulation. Illustrations for the univariate and the multidimensional array settings are presented throughout the second chapter, using mortality and fertility datasets.

    In the third chapter, we present a new methodology for the analysis of spatially aggregated data, by extending the CLMM approach developed in the second chapter to the spatial case. The spatial CLMM provides smoothed solutions for the ATP and ATA cases described in the first chapter, i.e., it gives a smoothed estimate of the underlying spatial trend, from aggregated data, at a finer resolution. The ATP and ATA cases are illustrated using several mortality (or morbidity) datasets, and simulation studies are carried out to compare the prediction performance of our approach with that of the area-to-point Poisson kriging of Goovaerts (2006). Also, in the third chapter we provide a methodology to deal with the overdispersion problem, which is based on the PRIDE (‘penalized regression with individual deviance effects’) approach of Perperoglou and Eilers (2010).

    In the fourth chapter, we generalize the methodology developed in the third chapter to the analysis of spatio-temporally aggregated data. Under this framework, we adapt the SAP (‘separation of anisotropic penalties’) algorithm of Rodríguez-Álvarez et al. (2015) and the GLAM (‘generalized linear array model’) algorithms given in Currie et al. (2006) and Eilers et al. (2006) to the CLMM context. The use of these efficient algorithms allows us to avoid possible storage problems and to speed up the computation of the model estimates. We illustrate the methodology presented in this chapter using a Q fever incidence dataset recorded in the Netherlands at municipality level and by months. Our aim, then, is to estimate smoothed incidences on a fine spatial grid over the study area throughout the 53 weeks of 2009. A simulation study is provided at the end of chapter four, in order to evaluate the prediction performance of our approach under three different aggregation scenarios, using a detailed (and confidential) Q fever incidence dataset. Finally, the fifth chapter summarizes the main contributions made in this thesis and outlines further work.

    The work presented in this thesis was supported by the Spanish Ministry of Economy and Competitiveness grants MTM2011-28285-C02-02 and MTM2014-52184-P.

    Official Doctoral Programme in Mathematical Engineering. Chair: Miguel Ángel Martínez Beneito. Secretary: Irene Albarrán Lozano. Member: Jutta Gamp

    Penalized composite link mixed models for two-dimensional count data

    Mortality data provide valuable information for the study of the spatial distribution of mortality risk, in disciplines such as spatial epidemiology, medical demography, and public health. However, they are often available in an aggregated form over irregular geographical units, hindering the visualization of the underlying mortality risk and the detection of meaningful patterns. Also, it could be of interest to obtain mortality risk estimates at a finer spatial resolution, so that they can be linked, in a subsequent correlation analysis, with potential risk factors that are usually measured at a different spatial resolution than mortality data. In this paper, we propose the use of the penalized composite link model and its representation as a mixed model to deal with these issues. This model takes into account the nature of mortality rates by incorporating the population size at the finest resolution, and allows the creation of mortality maps at a desired scale, reducing the visual bias resulting from the spatial aggregation within the original units. We illustrate our proposal with the analysis of several datasets related to deaths from respiratory diseases, cardiovascular diseases, and lung cancer.

    Acknowledgements: The first and second authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness grants MTM2011-28285-C02-02 and MTM2014-52184. The third author acknowledges financial support from the Basque Government through the BERC 2014-2017 program and from the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SEV-2013-0323.

    Modelling latent trends from spatio-temporally grouped data using composite link mixed models

    Epidemiological data are frequently recorded at coarse spatio-temporal resolutions. Data are aggregated for several reasons: to protect confidential patient information, to allow comparison with other datasets at a coarser resolution than the original, or to summarize data in a compact manner. However, the detailed patterns that the original data follow, which can be of interest to researchers and public health officials, are lost. In this paper we propose the use of the penalized composite link model (Eilers, 2007), together with its mixed model representation, to estimate the underlying trend behind grouped data at a finer spatio-temporal resolution. This model also allows fine-scale population information to be incorporated into the estimation procedure. We assume the underlying trend is smooth across space and time. The mixed model representation enables the use of sophisticated algorithms such as the SAP algorithm of Rodríguez-Álvarez et al. (2015) for fast estimation of the amount of smoothness. We illustrate our proposal with the analysis of data obtained during the largest outbreak of Q fever in the Netherlands.

    The first and the second authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness grants MTM2011-28285-C02-02 and MTM2014-52184-P. The third author acknowledges financial support from the Basque Government through the BERC 2014-2017 program and from the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SEV-2013-0323.

    Penalized composite link models for aggregated spatial count data: A mixed model approach

    Mortality data provide valuable information for the study of the spatial distribution of mortality risk, in disciplines such as spatial epidemiology and public health. However, they are frequently available in an aggregated form over irregular geographical units, hindering the visualization of the underlying mortality risk. Also, it can be of interest to obtain mortality risk estimates at a finer spatial resolution, so that they can be linked to potential risk factors that are usually measured at a different spatial resolution. In this paper, we propose the use of the penalized composite link model and its mixed model representation. This model accounts for the nature of mortality rates by incorporating the population size at the finest resolution, and allows the creation of mortality maps at a finer scale, thus reducing the visual bias resulting from the spatial aggregation within the original units. We also extend the model by considering individual random effects at the aggregated scale, in order to account for overdispersion. We illustrate our proposal using two datasets: female deaths from lung cancer in Indiana, USA, and male lip cancer incidence in Scottish counties. We also compare the performance of our proposal with the area-to-point Poisson kriging approach.

    We would like to thank two reviewers and an associate editor for their constructive comments and suggestions on the original manuscript. We also thank Dr. Pierre Goovaerts, who provided the high resolution population estimates described in Section 3.1. This research was supported by the Spanish Ministry of Economy and Competitiveness grants MTM2011-28285-C02-02 and MTM2014-52184-P. The research of Dae-Jin Lee was also supported by the Basque Government through the BERC 2014-2017 and ELKARTEK programs and by the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SEV-2013-0323. The research of Paul H. C. Eilers was also supported by the Universidad Carlos III de Madrid-Banco Santander Chair of Excellence program.

    Hyperarid soil microbial community response to simulated rainfall

    The exceptionally long and protracted aridity in the Atacama Desert (AD), Chile, provides an extreme, terrestrial ecosystem that is ideal for studying microbial community dynamics under hyperarid conditions. Our aim was to characterize the temporal response of hyperarid soil AD microbial communities to ex situ simulated rainfall (5% g water/g dry soil for 4 weeks) without nutrient amendment. We conducted replicated microcosm experiments with surface soils from two previously well-characterized AD hyperarid locations near Yungay at 1242 and 1609 masl (YUN1242 and YUN1609) with distinct microbial community compositions and average soil relative humidity levels of 21 and 17%, respectively. The bacterial and archaeal response to soil wetting was evaluated by 16S rRNA gene qPCR, and amplicon sequencing. Initial YUN1242 bacterial and archaeal 16S rRNA gene copy numbers were significantly higher than for YUN1609. Over the next 4 weeks, qPCR results showed significant increases in viable bacterial abundance, whereas archaeal abundance decreased. Both communities were dominated by 10 prokaryotic phyla (Actinobacteriota, Proteobacteria, Chloroflexota, Gemmatimonadota, Firmicutes, Bacteroidota, Planctomycetota, Nitrospirota, Cyanobacteriota, and Crenarchaeota) but there were significant site differences in the relative abundances of Gemmatimonadota and Chloroflexota, and specific actinobacterial orders. The response to simulated rainfall was distinct for the two communities. The actinobacterial taxa in the YUN1242 community showed rapid changes while the same taxa in the YUN1609 community remained relatively stable until day 30. Analysis of inferred function of the YUN1242 microbiome response implied an increase in the relative abundance of known spore-forming taxa with the capacity for mixotrophy at the expense of more oligotrophic taxa, whereas the YUN1609 community retained a stable profile of oligotrophic, facultative chemolithoautotrophic and mixotrophic taxa. 
These results indicate that bacterial communities in extreme hyperarid soils have the capacity for growth in response to simulated rainfall; however, historic variations in long-term hyperaridity exposure produce communities with distinct putative metabolic capacities.

    Modeling latent spatio-temporal disease incidence using penalized composite link models.

    Epidemiological data are frequently recorded at coarse spatio-temporal resolutions to protect confidential information or to summarize it in a compact manner. However, the detailed patterns followed by the source data, which may be of interest to researchers and public health officials, are then lost. We propose to use the penalized composite link model (Eilers, 2007), combined with spatio-temporal P-spline methodology (Lee and Durbán, 2011), to estimate the underlying trend within data that have been aggregated not only in space, but also in time. Model estimation is carried out within a generalized linear mixed model framework, and sophisticated algorithms are used to speed up computations that would otherwise be infeasible. The model is then used to analyze data obtained during the largest outbreak of Q fever in the Netherlands.

    Orbit-to-ground framework to decode and predict biosignature patterns in terrestrial analogues

    In the search for biosignatures on Mars, there is an abundance of data from orbiters and rovers to characterize global and regional habitability, but much less information is available at the scales and resolutions of microbial habitats and biosignatures. Understanding whether the distribution of terrestrial biosignatures is characterized by recognizable and predictable patterns could yield signposts to optimize search efforts for life on other terrestrial planets. We advance an adaptable framework that couples statistical ecology with deep learning to recognize and predict biosignature patterns at nested spatial scales in a polyextreme terrestrial environment. Drone flight imagery connected simulated HiRISE data to ground surveys, spectroscopy and biosignature mapping to reveal predictable distributions linked to environmental factors. Artificial intelligence–machine learning models successfully identified geologic features with high probabilities for containing biosignatures at spatial scales relevant to rover-based astrobiology exploration. Targeted approaches augmented by deep learning delivered 56.9–87.5% probabilities of biosignature detection versus <10% for random searches and reduced the physical search space by 85–97%. Libraries of biosignature distributions, detection probabilities, predictive models and search roadmaps for many terrestrial environments will standardize analogue science research, enabling agnostic comparisons at all scales