
    Nonparametric Density and Regression Estimation for Samples of Very Large Size

    Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01

    [Abstract] This dissertation deals mainly with the problem of bandwidth selection in the context of nonparametric density and regression estimation for samples of very large size. Some bandwidth selection methods have the disadvantage of high computational complexity: the number of operations required to compute the bandwidth grows very rapidly as the sample size increases, so the computational cost associated with these algorithms makes them unsuitable for samples of very large size. In the present thesis, this problem is addressed through the use of subagging, an ensemble method that combines bootstrap aggregating (bagging) with subsampling. Subsampling reduces the computational cost associated with the bandwidth selection process, while bagging aims to achieve significant reductions in the variability of the bandwidth selector. Thus, subagging versions are proposed for bandwidth selection methods based on widely known criteria such as cross-validation and the bootstrap. When subagging is applied to the cross-validation bandwidth selector, both for the Parzen-Rosenblatt estimator and the Nadaraya-Watson estimator, the proposed selectors are studied and their asymptotic properties derived. The empirical behavior of all the proposed bandwidth selectors is shown through various simulation studies and applications to real datasets.

    This research has been supported by MINECO Grant MTM2017-82724-R and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015 and ED431C-2020-14, Centro Singular de Investigación de Galicia ED431G/01, and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the ERDF (European Regional Development Fund). Additionally, this work has been partially carried out during a visit to Texas A&M University, College Station, financed by INDITEX, with reference INDITEX-UDC 2019. The author is grateful to the Centro de Coordinación de Alertas y Emergencias Sanitarias for kindly providing the COVID-19 hospitalization dataset.
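The subagging scheme described above can be sketched in a few lines. This is a hedged illustration, not the thesis's exact algorithm: the Gaussian kernel, the grid search over candidate bandwidths, and the `(m/n)^(1/5)` rescaling of the averaged subsample bandwidths are assumptions made for the sketch, shown here for the Nadaraya-Watson regression estimator.

```python
import numpy as np

def nw_loocv_score(h, x, y):
    """Leave-one-out least-squares CV score for the Nadaraya-Watson
    estimator with a Gaussian kernel and bandwidth h (the kernel's
    normalising constant cancels in the weight ratio)."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    np.fill_diagonal(w, 0.0)           # leave-one-out: drop own weight
    denom = w.sum(axis=1)
    yhat = (w @ y) / np.where(denom > 0, denom, 1.0)
    return np.mean((y - yhat) ** 2)

def subagged_cv_bandwidth(x, y, m, N, grid, rng):
    """Subagging sketch: minimise the CV score on N subsamples of size
    m < n, average the minimisers, then rescale to the full sample size
    n via the usual n^{-1/5} bandwidth rate."""
    n = len(x)
    hs = []
    for _ in range(N):
        idx = rng.choice(n, size=m, replace=False)
        scores = [nw_loocv_score(h, x[idx], y[idx]) for h in grid]
        hs.append(grid[int(np.argmin(scores))])
    return (m / n) ** (1 / 5) * float(np.mean(hs))
```

Each subsample's cross-validation costs O(m^2) rather than O(n^2), which is where the computational saving comes from; averaging over the N subsamples reduces the selector's variability.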

    Bagging cross-validated bandwidths with application to big data

    Final accepted version of: https://doi.org/10.1093/biomet/asaa092. This is a pre-copyedited, author-produced version of an article accepted for publication in Biometrika following peer review. The version of record, D. Barreiro-Ures, R. Cao, M. Francisco-Fernández, J. D. Hart, Bagging cross-validated bandwidths with application to big data, Biometrika, Volume 108, Issue 4, December 2021, Pages 981–988, published by Oxford University Press, is available online at: https://doi.org/10.1093/biomet/asaa092.

    Hall & Robinson (2009) proposed and analysed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise inherent in ordinary cross-validation, and hence leads to a more efficient bandwidth selector. The asymptotic theory of Hall & Robinson (2009) assumes that N, the number of bagged subsamples, is ∞. We expand upon their theoretical results by allowing N to be finite, as it is in practice. Our results indicate an important difference in the rate of convergence of the bagged cross-validation bandwidth for the cases N = ∞ and N < ∞. Simulations quantify the improvement in statistical efficiency and computational speed that can result from using bagged cross-validation as opposed to a binned implementation of ordinary cross-validation. The performance of the bagged bandwidth is also illustrated on a real, very large dataset. Finally, a byproduct of our study is the correction of errors appearing in the Hall & Robinson (2009) expression for the asymptotic mean squared error of the bagging selector.

    The authors thank Andrew Robinson, a referee, the editor and an associate editor for numerous useful comments that significantly improved this article. The authors are also grateful for the insight of Professor Anirban Bhattacharya. The first three authors were supported by the Spanish Ministry of Economy and Competitiveness (MTM2017-82724-R) and by the Xunta de Galicia (ED431C-2016-015, ED431C-2020-14 and ED431G 2019/01). The work of Barreiro-Ures was carried out during a visit to Texas A&M University, College Station, financed by Inditex.
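For the density case, bagged cross-validation admits a compact sketch because least-squares CV for the Parzen-Rosenblatt estimator with a Gaussian kernel has a closed-form criterion. This is a hedged sketch, not the article's implementation: the candidate grid and the `(m/n)^(1/5)` rescaling of the averaged subsample bandwidths are assumptions.

```python
import numpy as np

def kde_lscv_score(h, x):
    """Least-squares CV score for the Parzen-Rosenblatt estimator with a
    Gaussian kernel. The integral of fhat^2 is exact: the convolution of
    two N(0, 1) kernels is a N(0, 2) density."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    int_f2 = np.exp(-0.25 * d**2).sum() / (n**2 * h * np.sqrt(4 * np.pi))
    k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)           # leave-one-out term
    loo = k.sum() / (n * (n - 1) * h)
    return int_f2 - 2 * loo

def bagged_cv_bandwidth(x, m, N, grid, rng):
    """Minimise the LSCV score on each of N subsamples of size m,
    average the N minimisers, and rescale by (m/n)^{1/5} so the result
    matches the bandwidth rate for the full sample size n."""
    n = len(x)
    hs = []
    for _ in range(N):
        sub = x[rng.choice(n, size=m, replace=False)]
        scores = [kde_lscv_score(h, sub) for h in grid]
        hs.append(grid[int(np.argmin(scores))])
    return (m / n) ** 0.2 * float(np.mean(hs))
```

In this sketch N is finite, as in the article's setting; increasing N shrinks the Monte Carlo variability of the averaged bandwidth at extra computational cost.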

    Second-Order Inference for the Mean of a Variable Missing at Random

    We present a second-order estimator of the mean of a variable subject to missingness, under the missing at random assumption. The estimator improves upon existing methods by using an approximate second-order expansion of the parameter functional, in addition to the first-order expansion employed by standard doubly robust methods. This results in weaker assumptions about the convergence rates necessary to establish consistency, local efficiency, and asymptotic linearity. The general estimation strategy is developed under the targeted minimum loss-based estimation (TMLE) framework. We present a simulation comparing the sensitivity of the first- and second-order estimators to the convergence rate of the initial estimators of the outcome regression and missingness score. In our simulation, the second-order TMLE improved the coverage probability of a confidence interval by up to 85%. In addition, we present a first-order estimator inspired by a second-order expansion of the parameter functional. This estimator requires only one-dimensional smoothing, whereas implementation of the second-order TMLE generally requires kernel smoothing on the covariate space. The proposed first-order estimator is expected to have improved finite-sample performance compared to existing first-order estimators. In our simulations, the proposed first-order estimator improved the coverage probability by up to 90%. We provide an illustration of our methods using a publicly available dataset to determine the effect of an anticoagulant on health outcomes of patients undergoing percutaneous coronary intervention. We provide R code implementing the proposed estimator.
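For context, the standard first-order doubly robust baseline that the abstract's methods build on can be sketched as follows. This is the classical augmented inverse-probability-weighted (AIPW) estimator of the mean under missingness at random, not the paper's second-order TMLE; the truncation level for the missingness score is an assumption of the sketch.

```python
import numpy as np

def aipw_mean(y, a, outcome_pred, pscore):
    """First-order doubly robust (AIPW) estimator of E[Y] when Y is
    observed only if A == 1, under missingness at random.
    y: outcomes (arbitrary values where a == 0)
    a: observation indicator (1 = observed)
    outcome_pred: fitted outcome regression E[Y | X] at each X
    pscore: fitted missingness score P(A = 1 | X) at each X."""
    resid = np.where(a == 1, y - outcome_pred, 0.0)
    # augmentation term corrects the plug-in estimate using the
    # inverse-probability-weighted residuals of the observed outcomes
    return np.mean(outcome_pred + a * resid / np.clip(pscore, 1e-3, 1.0))
```

The estimator is consistent if either the outcome regression or the missingness score is estimated consistently; the second-order expansion discussed above weakens the rate conditions that both nuisance estimators must satisfy jointly.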

    Learning understandable classifier models.

    The topic of this dissertation is the automation of the process of extracting understandable patterns and rules from data. An unprecedented amount of data is available to anyone with a computer connected to the Internet. The disciplines of Data Mining and Machine Learning have emerged over the last two decades to face this challenge. This has led to the development of many tools and methods. These tools often produce models that make very accurate predictions about previously unseen data. However, models built by the most accurate methods are usually hard for humans to understand or interpret. In consequence, they deliver only decisions, without any explanations, and hence do not directly lead to the acquisition of new knowledge. This dissertation contributes to bridging the gap between accurate opaque models and those that are less accurate but more transparent to humans. The dissertation first defines the problem of learning from data. It surveys the state-of-the-art methods for supervised learning of both understandable and opaque models from data, as well as unsupervised methods that detect features present in the data. It describes popular methods of rule extraction that rewrite unintelligible models into an understandable form, and discusses the limitations of rule extraction. A novel definition of understandability, which ties computational complexity to learning, is provided to show that rule extraction is an NP-hard problem. Next, it discusses whether one can expect that even an accurate classifier has learned new knowledge. The survey ends with a presentation of two approaches to building understandable classifiers. On the one hand, understandable models must be able to accurately describe relations in the data. On the other hand, describing the output of a system in terms of its input often requires the introduction of intermediate concepts, called features. Therefore it is crucial to develop methods that describe the data with understandable features and are able to use those features to present the relation that describes the data. Novel contributions of this thesis follow the survey. Two families of rule extraction algorithms are considered. First, a method that can work with any opaque classifier is introduced: artificial training patterns are generated in a mathematically sound way and used to train more accurate understandable models. Subsequently, two novel algorithms that require the opaque model to be a neural network are presented. They rely on access to the network's weights and biases to induce rules encoded as Decision Diagrams. Finally, the topic of feature extraction is considered. The impact of imposing non-negativity constraints on the weights of a neural network is studied. It is proved that a three-layer network with non-negative weights can shatter any given set of points, and experiments are conducted to assess the accuracy and interpretability of such networks. Then, a novel path-following algorithm that finds robust sparse encodings of data is presented. In summary, this dissertation contributes to improved understandability of classifiers in several tangible and original ways. It introduces three distinct aspects of achieving this goal: infusion of additional patterns from the underlying pattern distribution into rule learners, the derivation of decision diagrams from neural networks, and achieving sparse coding with neural networks with non-negative weights.
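The first family of contributions, training an understandable model on artificial patterns labelled by an opaque classifier, can be illustrated with a deliberately minimal sketch. It is not the dissertation's algorithm: uniform sampling over a box and a one-level decision stump as the "understandable" model are simplifications chosen for brevity.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a one-level decision tree (stump): for each feature, try each
    midpoint between consecutive sorted values and keep the split with
    the fewest misclassifications. Returns (feature, threshold,
    left_label, right_label) for binary labels in {0, 1}."""
    best = None
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for t in (vals[:-1] + vals[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            for ll in (0, 1):
                err = (left != ll).sum() + (right != 1 - ll).sum()
                if best is None or err < best[0]:
                    best = (err, j, t, ll, 1 - ll)
    return best[1:]

def distill(black_box, low, high, n_samples, rng):
    """Draw artificial patterns uniformly over the box [low, high],
    label them with the opaque classifier, and fit the stump to the
    labelled set -- a minimal instance of rule extraction by
    generating artificial training patterns."""
    X = rng.uniform(low, high, size=(n_samples, len(low)))
    return fit_stump(X, black_box(X))
```

The extracted stump reads directly as a rule ("if feature j <= t then left_label else right_label"), which is the kind of transparency the opaque model lacks.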

    Nonparametric Inference for Regression Models with Spatially Correlated Errors

    Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 5017V01

    [Abstract] Regression estimation can be approached using nonparametric procedures, producing flexible estimators and avoiding misspecification problems. Alternatively, parametric methods may be preferable to nonparametric approaches if the regression function belongs to the assumed parametric family. However, a bad specification of this family can lead to wrong conclusions. Regression function misspecification problems can be somewhat tackled by applying a goodness-of-fit test. For data presenting some kind of complexity, for example, circular data, the approaches used in regression estimation or in goodness-of-fit tests have to be conveniently adapted. Moreover, the variables of interest may present a certain type of dependence. For example, they can be spatially correlated, where observations that are close in space tend to be more similar than observations that are far apart. The goal of this thesis is twofold. First, some inference problems for regression models with Euclidean response and covariates, and spatially correlated errors, are analyzed. More specifically, a testing procedure for parametric regression models in the presence of spatial correlation is proposed. The second aim is to design and study new approaches to deal with regression function estimation and goodness-of-fit tests for models with a circular response and an Rd-valued covariate. In this setting, nonparametric proposals to estimate the circular regression function are provided and studied, under the assumption of independence and also for spatially correlated errors. Moreover, goodness-of-fit tests for assessing a parametric regression model are presented in these two frameworks. Comprehensive simulation studies and applications of the different techniques to real datasets complete this dissertation.

    Keywords: goodness-of-fit test, circular statistics, nonparametric estimation, linear-circular regression, spatial dependence
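A common way to build a nonparametric estimator for a circular response, and a plausible reading of the setting above, is to smooth the sine and cosine of the angular response separately and recombine them with atan2 so the estimate stays on the circle. This is a hedged sketch of that generic idea with Nadaraya-Watson weights and a Gaussian kernel, not the thesis's specific estimator.

```python
import numpy as np

def circular_nw(x0, x, theta, h):
    """Kernel estimator of a circular regression function at points x0:
    apply Nadaraya-Watson weights to sin(theta) and cos(theta)
    separately, then recombine via arctan2. The weights need not be
    normalised, since the common positive factor cancels in arctan2."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    s = w @ np.sin(theta)       # smoothed sine component
    c = w @ np.cos(theta)       # smoothed cosine component
    return np.arctan2(s, c)     # mean direction, in (-pi, pi]
```

Averaging sin and cos rather than the angles themselves avoids the wrap-around problem: angles near 0 and near 2π are close on the circle but far apart as real numbers.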

    EDMON - Electronic Disease Surveillance and Monitoring Network: A Personalized Health Model-based Digital Infectious Disease Detection Mechanism using Self-Recorded Data from People with Type 1 Diabetes

    Through time, we as a society have been tested by infectious disease outbreaks of different magnitudes, which often pose major public health challenges. To mitigate these challenges, research endeavors have focused on early detection mechanisms: identifying potential data sources, modes of data collection and transmission, and case and outbreak detection methods. Driven by the ubiquitous nature of smartphones and wearables, the current endeavor targets individualizing the surveillance effort through a personalized health model, where case detection is realized by exploiting self-collected physiological data from wearables and smartphones. This dissertation aims to demonstrate the concept of a personalized health model as a case detector for outbreak detection by utilizing self-recorded data from people with type 1 diabetes. The results show that infection onset triggers substantial deviations, i.e. prolonged hyperglycemia despite higher insulin injections and reduced carbohydrate consumption. Per the findings, key parameters such as blood glucose level, insulin, carbohydrate, and insulin-to-carbohydrate ratio are found to carry high discriminative power. A personalized health model devised based on a one-class classifier and an unsupervised method using selected parameters achieved promising detection performance. Experimental results show the superior performance of the one-class classifiers; models such as the one-class support vector machine, k-nearest neighbors, and k-means achieved the best performance. Further, the results also revealed the effect of input parameters, data granularity, and sample size on model performance. The presented results have practical significance for understanding the effect of infection episodes among people with type 1 diabetes, and the potential of a personalized health model in outbreak detection settings. The added benefit of the personalized health model concept introduced in this dissertation lies in its usefulness beyond surveillance, i.e. in devising decision support tools and learning platforms for patients to manage infection-induced crises.
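The one-class detection idea mentioned above, training only on infection-free days and flagging days that look unlike them, can be sketched with a distance-based k-nearest-neighbor detector. This is an illustrative stand-in for the dissertation's models, not its implementation; the three-dimensional feature vector (e.g. daily glucose mean, insulin dose, carbohydrate intake) and the 95% threshold are assumptions of the sketch.

```python
import numpy as np

def knn_novelty_scores(train, test, k=5):
    """Score each test vector by its mean Euclidean distance to the k
    nearest training vectors: large scores flag days unlike the
    infection-free training period."""
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def detect(train, test, k=5, quantile=0.95):
    """Flag test points whose novelty score exceeds the given quantile
    of the training points' own leave-one-out kNN scores."""
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)        # leave-one-out within training data
    ref = np.sort(d, axis=1)[:, :k].mean(axis=1)
    thresh = np.quantile(ref, quantile)
    return knn_novelty_scores(train, test, k) > thresh
```

Because the model is fit per person on that person's own baseline data, it is personalized in the sense the abstract describes: the same deviation may be anomalous for one patient and routine for another.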

    STK/WST 795 Research Reports

    These documents contain the honours research reports for each year for the Department of Statistics. Honours Research Reports, University of Pretoria, 20XX. Statistics. BSc (Hons) Mathematical Statistics, BCom (Hons) Statistics, BCom (Hons) Mathematical Statistics.