
    Smooth generalized linear models for aggregated data

    Aggregated data commonly appear in areas such as epidemiology, demography, and public health. Generally, the aggregation process is carried out to protect the privacy of patients, to facilitate compact presentation, or to make the data comparable with other, coarser datasets. However, this process may hinder the visualization of the underlying distribution that the data follow. It also prevents the direct analysis of relationships between aggregated data and potential risk factors, which are commonly measured at a finer resolution. Therefore, it is of interest to develop statistical methodologies that deal with the disaggregation of coarse health data to a finer scale. For example, in the spatial setting, it may be desirable to obtain estimates, from coarse areal data, on a fine spatial grid or on units less coarse than the original ones. These two cases are known as the area-to-point (ATP) and area-to-area (ATA) cases, respectively, and are illustrated in the first chapter of this thesis. Moreover, we can have spatial data recorded at coarse units over time. In some cases, the temporal dimension can also be in an aggregated form, hindering the visualization of the evolution of the underlying process over time. In this thesis we propose the use of a novel non-parametric method that we call the composite link mixed model or, more succinctly, CLMM. In our proposed model, we regard the observed data as indirect observations of an underlying process (defined at a finer resolution than the observed data), which we want to estimate. The mixed model formulation of our proposal allows us to include fine-scale population information and complex structures, as random effects, in the modelling of the underlying trend. Since the CLMM is based on the approach of Eilers (2007), called the penalized composite link model (PCLM), we briefly review the PCLM approach in the first section of the second chapter of this thesis. Then, in the second section of that chapter, we introduce the CLMM approach in a univariate setting, which can be seen as a reformulation of the PCLM within a mixed model framework. This is achieved by following the mixed model reformulation of P-splines proposed in Currie and Durbán (2002) and Currie et al. (2006), which is also reviewed here. The parameters of the CLMM can then be estimated within the framework of mixed model theory. This offers an alternative route for the estimation of the PCLM, avoiding the use of information criteria for smoothing parameter selection. In the third section of the second chapter, we extend the CLMM approach to the multidimensional (array) case, where Kronecker products are involved in the extended model formulation. Illustrations for the univariate and the multidimensional array settings are presented throughout the second chapter, using mortality and fertility datasets. In the third chapter, we present a new methodology for the analysis of spatially aggregated data, extending the CLMM approach developed in the second chapter to the spatial case. The spatial CLMM provides smoothed solutions for the ATP and ATA cases described in the first chapter, i.e., it gives a smoothed estimate of the underlying spatial trend, from aggregated data, at a finer resolution. The ATP and ATA cases are illustrated using several mortality (or morbidity) datasets, and simulation studies comparing the prediction performance of our approach with the area-to-point Poisson kriging of Goovaerts (2006) are carried out.
Also in the third chapter, we provide a methodology to deal with the overdispersion problem, based on the PRIDE (‘penalized regression with individual deviance effects’) approach of Perperoglou and Eilers (2010). In the fourth chapter, we generalize the methodology developed in the third chapter to the analysis of spatio-temporally aggregated data. Under this framework, we adapt the SAP (‘separation of anisotropic penalties’) algorithm of Rodríguez-Álvarez et al. (2015) and the GLAM (‘generalized linear array model’) algorithms given in Currie et al. (2006) and Eilers et al. (2006) to the CLMM context. The use of these efficient algorithms allows us to avoid possible storage problems and to speed up model estimation. We illustrate the methodology presented in this chapter using a Q fever incidence dataset recorded in the Netherlands at the municipality level and by month. Our aim is then to estimate smoothed incidences on a fine spatial grid over the study area throughout the 53 weeks of 2009. A simulation study is provided at the end of the fourth chapter, in order to evaluate the prediction performance of our approach under three different aggregation scenarios, using a detailed (and confidential) Q fever incidence dataset. Finally, the fifth chapter summarizes the main contributions of this thesis and outlines further work. The work presented in this thesis was supported by the Spanish Ministry of Economy and Competitiveness grants MTM2011-28285-C02-02 and MTM2014-52184-P.
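    The building block of this thesis, the penalized composite link model of Eilers (2007), is concise enough to sketch. Below is a minimal, hypothetical NumPy implementation of a univariate ATP-style ungrouping: coarse Poisson counts y are modelled as y ~ Poisson(C exp(eta)), where C is the known composition (aggregation) matrix and eta is the latent log-intensity on the fine grid, fitted by penalized IWLS with a difference penalty. All names are illustrative; this follows the general PCLM algorithm, not the thesis code, and uses an information-free fixed smoothing parameter rather than the mixed-model estimation the thesis proposes.

    ```python
    import numpy as np

    def pclm(y, C, lam=1.0, d=2, n_iter=50, tol=1e-8):
        """Penalized composite link model (Eilers, 2007), identity-basis variant.

        y   : (m,) coarse Poisson counts
        C   : (m, n) composition matrix mapping the fine grid to coarse bins
        lam : smoothing parameter for the order-d difference penalty
        Returns the estimated latent intensities on the fine grid.
        """
        m, n = C.shape
        D = np.diff(np.eye(n), n=d, axis=0)      # difference operator
        P = lam * D.T @ D                        # roughness penalty
        eta = np.log(np.full(n, y.sum() / n))    # flat starting values
        for _ in range(n_iter):
            gamma = np.exp(eta)                  # fine-grid intensities
            mu = C @ gamma                       # implied coarse means
            X = C * gamma / mu[:, None]          # composite-link working matrix
            W = np.diag(mu)                      # Poisson weights
            # one penalized IWLS / Newton step for eta
            lhs = X.T @ W @ X + P
            rhs = X.T @ W @ X @ eta + X.T @ (y - mu)
            eta_new = np.linalg.solve(lhs, rhs)
            if np.max(np.abs(eta_new - eta)) < tol:
                return np.exp(eta_new)
            eta = eta_new
        return np.exp(eta)

    # Example: ungroup counts observed in 5-unit bins onto a unit grid
    rng = np.random.default_rng(1)
    n_fine, width = 50, 5
    C = np.kron(np.eye(n_fine // width), np.ones((1, width)))
    true = 20 * np.exp(-0.5 * ((np.arange(n_fine) - 25) / 8) ** 2) + 2
    y = rng.poisson(C @ true)
    gamma_hat = pclm(y, C, lam=10.0)
    ```

    In the CLMM reformulation described above, lam would not be fixed by hand or by an information criterion: the penalty is re-expressed through random effects and the smoothing parameter is estimated as a variance component of the mixed model.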

    The Missing Globalization Puzzle: Another Explanation

    This study suggests another explanation of the "missing globalization puzzle" typically observed in empirical gravity models. In contrast to previous research, which focused on aggregated trade flows, we employ trade flows in manufacturing products broken down into 25 three-digit ISIC Rev.2 categories. We estimate the distance coefficient using the log-linear specification of the standard as well as the generalized gravity equation. Our dataset comprises trade flows for 22 OECD countries spanning the period from 1970 to 2000. We observe a substantial decline in the value of the distance elasticity in most manufacturing industries. Keywords: gravity model, missing globalization puzzle, distance coefficient.
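    For context, the log-linear gravity specification referred to here is typically estimated by OLS on logged flows, and the "puzzle" concerns whether the distance coefficient shrinks over time. A minimal sketch with statsmodels on synthetic data; the column names (trade, gdp_o, gdp_d, dist) and the data-generating parameters are illustrative assumptions, not the study's dataset:

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic bilateral trade panel (one row per country pair)
    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "gdp_o": rng.lognormal(10, 1, n),    # origin GDP
        "gdp_d": rng.lognormal(10, 1, n),    # destination GDP
        "dist":  rng.lognormal(7, 0.5, n),   # bilateral distance
    })
    # Trade flows generated with a true distance elasticity of -1
    df["trade"] = df.gdp_o * df.gdp_d / df.dist * rng.lognormal(0, 0.3, n)

    # Standard log-linear gravity equation:
    #   log T_ij = b0 + b1 log Y_i + b2 log Y_j + b3 log D_ij + e_ij
    fit = smf.ols("np.log(trade) ~ np.log(gdp_o) + np.log(gdp_d) + np.log(dist)",
                  data=df).fit()
    print(fit.params["np.log(dist)"])  # estimated distance elasticity b3
    ```

    In a study like the one abstracted here, this regression would be run separately by year and by ISIC industry, and the estimated b3 tracked over 1970-2000 to test whether it declines.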

    Calculation of solvency capital requirements for non-life underwriting risk using generalized linear models

    The paper presents various GLM models using individual rating factors to calculate the solvency capital requirements for non-life underwriting risk in insurance. First, we consider the potential heterogeneity of claim frequency and the occurrence of large claims in the models. Second, we analyse how the distributions of frequency and severity vary depending on the modelling approach and examine how they are projected into SCR estimates according to the Solvency II Directive. In addition, we show that neglecting large claims is as consequential as neglecting the heterogeneity of claim frequency. Claim frequency and severity are modelled using generalized linear models, namely negative-binomial and gamma regression. The differing individual probabilities of large claims are represented by a binomial model, and large claim severity is modelled using the generalized Pareto distribution. The results are obtained and compared using frequency-severity simulation of an actual insurance portfolio.
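    As a rough illustration of the frequency-severity simulation behind such SCR estimates, here is a hedged sketch: negative-binomial claim counts and gamma severities are simulated for a portfolio, and the SCR is approximated as the 99.5% quantile of the aggregate loss minus its expectation, in the spirit of the Solvency II one-year VaR. The parameters and this SCR proxy are illustrative assumptions, not the paper's calibration, and the large-claim (binomial/generalized Pareto) layer is omitted for brevity:

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    n_sims, n_policies = 10_000, 1_000

    # Frequency: negative binomial per policy (allows heterogeneity/overdispersion)
    mean_freq, size = 0.08, 0.5                  # per-policy mean count, NB size
    p = size / (size + mean_freq)                # NumPy's NB parameterization
    # Severity: gamma for attritional claims (illustrative parameters)
    shape, scale = 2.0, 1_500.0

    totals = np.empty(n_sims)
    for s in range(n_sims):
        n_claims = rng.negative_binomial(size, p, size=n_policies).sum()
        totals[s] = rng.gamma(shape, scale, size=n_claims).sum()

    # Solvency II style proxy: 99.5% VaR of aggregate loss over its expectation
    scr = np.quantile(totals, 0.995) - totals.mean()
    print(f"approximate SCR: {scr:,.0f}")
    ```

    In the paper's setup the frequency and severity parameters would come from fitted GLMs per rating cell rather than the constants used above, and a separate large-claim layer would be added to the simulated totals.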

    Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning

    Federated learning is a distributed framework for training machine learning models over the data residing at mobile devices, while protecting the privacy of individual users. A major bottleneck in scaling federated learning to a large number of users is the overhead of secure model aggregation across many users. In particular, the overhead of the state-of-the-art protocols for secure model aggregation grows quadratically with the number of users. In this paper, we propose the first secure aggregation framework, named Turbo-Aggregate, that in a network with $N$ users achieves a secure aggregation overhead of $O(N\log N)$, as opposed to $O(N^2)$, while tolerating a user dropout rate of up to $50\%$. Turbo-Aggregate employs a multi-group circular strategy for efficient model aggregation, and leverages additive secret sharing and novel coding techniques for injecting aggregation redundancy in order to handle user dropouts while guaranteeing user privacy. We experimentally demonstrate that Turbo-Aggregate achieves a total running time that grows almost linearly in the number of users, and provides up to a $40\times$ speedup over the state-of-the-art protocols with up to $N=200$ users. Our experiments also demonstrate the impact of model size and bandwidth on the performance of Turbo-Aggregate.
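    The additive secret sharing ingredient is easy to demonstrate in isolation. Below is a minimal NumPy sketch of that primitive only, not the Turbo-Aggregate protocol itself (which adds the multi-group circular structure and coding redundancy for dropouts): each user splits its quantized model update into random shares that sum to the update, so summing everyone's shares reveals only the aggregate.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    n_users, dim, q = 4, 6, 2**31 - 1            # toy sizes; arithmetic mod q

    # Each user's (quantized) model update, represented as field elements
    updates = rng.integers(0, q, size=(n_users, dim))

    def share(x, n, q, rng):
        """Split vector x into n additive shares summing to x mod q."""
        parts = rng.integers(0, q, size=(n - 1, len(x)))
        last = (x - parts.sum(axis=0)) % q       # forces shares to sum to x
        return np.vstack([parts, last])

    # all_shares[i, j] = share of user i's update held by user j
    all_shares = np.stack([share(u, n_users, q, rng) for u in updates])

    # Each holder j locally sums the shares it received and sends one vector;
    # an individual share is uniform, so it leaks nothing about any single update
    partials = all_shares.sum(axis=0) % q        # shape (n_users, dim)

    # The server adds the partial sums: it recovers only the aggregate
    aggregate = partials.sum(axis=0) % q
    assert np.array_equal(aggregate, updates.sum(axis=0) % q)
    ```

    The quadratic cost of classical protocols comes from every user exchanging masks or shares with every other user; Turbo-Aggregate's multi-group circular strategy is what cuts this exchange pattern down to $O(N\log N)$ overall.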

    CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets [version 3; peer review: 2 approved]

    High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype, changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks in which the HDCyto data are the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell counts or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).
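    The workflow itself is R/Bioconductor-based, but the core differential-abundance idea, regressing per-cluster cell counts on the experimental condition with an overdispersed count model and the sample's total cell count as an offset, can be sketched briefly. Here is a hypothetical Python analogue using a statsmodels negative binomial GLM; the data, column names, and fixed dispersion are illustrative, and the random effects (e.g., per patient) that the R workflow's mixed models include are omitted:

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Toy per-sample counts for one FlowSOM-style cluster (synthetic data)
    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "condition": ["ref"] * 8 + ["stim"] * 8,
        "total":     rng.integers(40_000, 60_000, 16),   # cells per sample
    })
    rate = np.where(df.condition == "stim", 0.06, 0.03)  # true abundances
    df["count"] = rng.poisson(rate * df.total * rng.lognormal(0, 0.3, 16))

    # Differential abundance: NB GLM of cluster counts on condition,
    # with log(total cells) as offset so we effectively model proportions
    fit = smf.glm("count ~ condition", data=df,
                  family=sm.families.NegativeBinomial(alpha=0.1),
                  offset=np.log(df["total"])).fit()
    print(fit.summary().tables[1])  # log fold-change for 'stim' vs 'ref'
    ```

    Fitting one such model per cluster, followed by multiple-testing correction across clusters, mirrors the abundance branch of the workflow; the signaling-marker branch instead models per-sample aggregated (e.g., median) marker intensities with linear mixed models.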