Nonparametric Density and Regression Estimation for Samples of Very Large Size
Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01
[Abstract]
This dissertation mainly deals with the problem of bandwidth selection in the context
of nonparametric density and regression estimation for samples of very large
size. Some bandwidth selection methods have the disadvantage of high computational
complexity. This implies that the number of operations required to compute
the bandwidth grows very rapidly as the sample size increases, so that the computational
cost associated with these algorithms makes them unsuitable for samples of
very large size. In the present thesis, this problem is addressed through the use of
subagging, an ensemble method that combines bootstrap aggregating or bagging with
the use of subsampling. The latter reduces the computational cost associated with
the process of bandwidth selection, while the former is aimed at achieving significant
reductions in the variability of the bandwidth selector. Thus, subagging versions
are proposed for bandwidth selection methods based on widely known criteria such
as cross-validation or bootstrap. When applying subagging to the cross-validation
bandwidth selector, both for the Parzen–Rosenblatt estimator and the Nadaraya–Watson
estimator, the proposed selectors are studied and their asymptotic properties
derived. The empirical behavior of all the proposed bandwidth selectors is shown
through various simulation studies and applications to real datasets.
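Abstracting from the thesis, the subagging idea for cross-validation bandwidth selection can be sketched as follows. This is a minimal illustration, assuming a Gaussian-kernel Parzen–Rosenblatt estimator with least-squares cross-validation; the function names, search bounds, and the (m/n)^(1/5) rescaling convention are assumptions of the sketch, not the thesis's actual implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cv_bandwidth(x):
    """Least-squares cross-validation bandwidth for a Gaussian-kernel
    density estimator; each criterion evaluation costs O(n^2)."""
    n = len(x)
    d = x[:, None] - x[None, :]  # pairwise differences

    def lscv(h):
        # Integral of the squared estimate: uses the kernel convolution K*K,
        # which for a Gaussian kernel is a N(0, 2h^2) density.
        kk = np.exp(-d ** 2 / (4 * h ** 2)) / np.sqrt(4 * np.pi * h ** 2)
        # Leave-one-out term: plain Gaussian kernel without the diagonal.
        k = np.exp(-d ** 2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
        loo = (k.sum() - n * k[0, 0]) / (n * (n - 1))
        return kk.sum() / n ** 2 - 2 * loo

    scale = x.std() * n ** (-1 / 5)  # Silverman-type scale for the search interval
    return minimize_scalar(lscv, bounds=(0.05 * scale, 5 * scale),
                           method="bounded").x

def subagged_cv_bandwidth(x, m, N, seed=None):
    """Subagging: select the bandwidth on N subsamples of size m (cheap),
    rescale each by (m/n)^(1/5) to the full-sample rate, and average
    to reduce the variability of the selector."""
    rng = np.random.default_rng(seed)
    n = len(x)
    hs = [cv_bandwidth(rng.choice(x, size=m, replace=False)) for _ in range(N)]
    return (m / n) ** (1 / 5) * float(np.mean(hs))
```

Each subsample selection costs O(m^2) rather than O(n^2), so for m much smaller than n the N cheap selections replace one prohibitively expensive full-sample cross-validation.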
This research has been supported by MINECO Grant MTM2017-82724-R, and by the
Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2016-015, ED431C-
2020-14, Centro Singular de Investigación de Galicia ED431G/01 and Centro de
Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them
through the ERDF (European Regional Development Fund). Additionally, this work
has been partially carried out during a visit to the Texas A&M University, College
Station, financed by INDITEX, with reference INDITEX-UDC 2019.
The author is grateful to the Centro de Coordinación de Alertas y Emergencias
Sanitarias for kindly providing the COVID-19 hospitalization dataset.
Bagging cross-validated bandwidths with application to big data
Final accepted version of: https://doi.org/10.1093/biomet/asaa092
This is a pre-copyedited, author-produced version of an article accepted for publication in Biometrika following peer review. The version of record of: D Barreiro-Ures, R Cao, M Francisco-Fernández, J D Hart, Bagging cross-validated bandwidths with application to big data, Biometrika, Volume 108, Issue 4, December 2021, Pages 981–988, https://doi.org/10.1093/biomet/asaa092, published by Oxford University Press, is available online at: https://doi.org/10.1093/biomet/asaa092.
Hall & Robinson (2009) proposed and analysed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise inherent in ordinary cross-validation, and hence leads to a more efficient bandwidth selector. The asymptotic theory of Hall & Robinson (2009) assumes that N, the number of bagged subsamples, is infinite. We expand upon their theoretical results by allowing N to be finite, as it is in practice. Our results indicate an important difference
in the rate of convergence of the bagged cross-validation bandwidth for the cases N = ∞ and N < ∞.
Simulations quantify the improvement in statistical efficiency and computational speed that can result from
using bagged cross-validation as opposed to a binned implementation of ordinary cross-validation. The
performance of the bagged bandwidth is also illustrated on a real, very large, dataset. Finally, a byproduct of
our study is the correction of errors appearing in the Hall & Robinson (2009) expression for the asymptotic
mean squared error of the bagging selector.
The authors thank Andrew Robinson, a referee, the editor and an associate editor for numerous useful comments that significantly improved this article. The authors are also grateful for the insight of Professor Anirban Bhattacharya. The first three authors were supported by the Spanish Ministry of Economy and Competitiveness (MTM2017-82724-R) and by the Xunta de Galicia (ED431C-2016-015, ED431C-2020-14 and ED431G 2019/01). The work of Barreiro-Ures was carried out during a visit to Texas A&M University, College Station, financed by Inditex.
Second-Order Inference for the Mean of a Variable Missing at Random
We present a second-order estimator of the mean of a variable subject to
missingness, under the missing at random assumption. The estimator improves
upon existing methods by using an approximate second-order expansion of the
parameter functional, in addition to the first-order expansion employed by
standard doubly robust methods. This results in weaker assumptions about the
convergence rates necessary to establish consistency, local efficiency, and
asymptotic linearity. The general estimation strategy is developed under the
targeted minimum loss-based estimation (TMLE) framework. We present a
simulation comparing the sensitivity of the first and second order estimators
to the convergence rate of the initial estimators of the outcome regression and
missingness score. In our simulation, the second-order TMLE improved the
coverage probability of a confidence interval by up to 85%. In addition, we
present a first-order estimator inspired by a second-order expansion of the
parameter functional. This estimator only requires one-dimensional smoothing,
whereas implementation of the second-order TMLE generally requires kernel
smoothing on the covariate space. The first-order estimator proposed is
expected to have improved finite sample performance compared to existing
first-order estimators. In our simulations, the proposed first-order estimator
improved the coverage probability by up to 90%. We provide an illustration of
our methods using a publicly available dataset to determine the effect of an
anticoagulant on health outcomes of patients undergoing percutaneous coronary
intervention. We provide R code implementing the proposed estimator.
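For contrast with the second-order approach, the standard first-order doubly robust idea can be sketched as an augmented inverse-probability-weighted (AIPW) estimator of the mean under missingness at random. This is a generic illustration, not the paper's TMLE; the parametric working models and all names are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def aipw_mean(y, r, X):
    """First-order doubly robust (AIPW) estimate of E[Y] when Y is missing
    at random given X. r[i] = 1 if y[i] is observed; y may be NaN elsewhere."""
    Xd = np.column_stack([np.ones(len(X)), X])  # design matrix with intercept

    # Outcome regression Q(X) ~ E[Y | X, R = 1], least squares on observed rows.
    beta, *_ = np.linalg.lstsq(Xd[r == 1], y[r == 1], rcond=None)
    Q = Xd @ beta

    # Missingness score g(X) = P(R = 1 | X), logistic maximum likelihood.
    def nll(theta):
        p = np.clip(expit(Xd @ theta), 1e-10, 1 - 1e-10)
        return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))

    theta = minimize(nll, np.zeros(Xd.shape[1]), method="BFGS").x
    g = np.clip(expit(Xd @ theta), 1e-3, 1.0)

    # AIPW: regression prediction plus inverse-probability-weighted residual.
    resid = np.where(r == 1, np.nan_to_num(y) - Q, 0.0)
    return float(np.mean(Q + r * resid / g))
```

The estimator is consistent if either working model is correct; the second-order expansion discussed above weakens the convergence rates these plug-ins must attain.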
Essays on the Economics of Higher Education and Employment
This dissertation studies legal and institutional policies that help to reduce the barriers to educational attainment and employment. The first chapter examines the effect of juvenile record availability laws on educational attainment and employment, using state statute revisions after the passage of the federal Second Chance Act. The second chapter examines enrollment patterns of students who drop out from community colleges, identifying four typologies of college dropouts and important factors that contribute to college success. The third chapter estimates the impact of federal Pell Grant eligibility on financial aid packages, labor supply while in school, and academic outcomes for community college students. The three chapters together shed light on how federal, state, and institutional policies can help reduce the academic and employment barriers for marginalized populations in the United States.
Learning understandable classifier models.
The topic of this dissertation is the automation of the process of extracting understandable patterns and rules from data. An unprecedented amount of data is available to anyone with a computer connected to the Internet. The disciplines of Data Mining and Machine Learning have emerged over the last two decades to face this challenge. This has led to the development of many tools and methods. These tools often produce models that make very accurate predictions about previously unseen data. However, models built by the most accurate methods are usually hard to understand or interpret by humans. In consequence, they deliver only decisions, and are short of any explanations. Hence they do not directly lead to the acquisition of new knowledge. This dissertation contributes to bridging the gap between the accurate opaque models and those less accurate but more transparent for humans. This dissertation first defines the problem of learning from data. It surveys the state-of-the-art methods for supervised learning of both understandable and opaque models from data, as well as unsupervised methods that detect features present in the data. It describes popular methods of rule extraction that rewrite unintelligible models into an understandable form, and the limitations of rule extraction. A novel definition of understandability, which ties computational complexity and learning, is provided to show that rule extraction is an NP-hard problem. Next, it is discussed whether one can expect that even an accurate classifier has learned new knowledge. The survey ends with a presentation of two approaches to building understandable classifiers. On the one hand, understandable models must be able to accurately describe relations in the data. On the other hand, a description of the output of a system in terms of its input often requires the introduction of intermediate concepts, called features.
Therefore it is crucial to develop methods that describe the data with understandable features and are able to use those features to present the relation that describes the data. Novel contributions of this thesis follow the survey. Two families of rule extraction algorithms are considered. First, a method that can work with any opaque classifier is introduced. Artificial training patterns are generated in a mathematically sound way and used to train more accurate understandable models. Subsequently, two novel algorithms that require the opaque model to be a Neural Network are presented. They rely on access to the network's weights and biases to induce rules encoded as Decision Diagrams. Finally, the topic of feature extraction is considered. The impact of imposing non-negativity constraints on the weights of a neural network is considered. It is proved that a three-layer network with non-negative weights can shatter any given set of points, and experiments are conducted to assess the accuracy and interpretability of such networks. Then, a novel path-following algorithm that finds robust sparse encodings of data is presented. In summary, this dissertation contributes to improved understandability of classifiers in several tangible and original ways. It introduces three distinct aspects of achieving this goal: infusion of additional patterns from the underlying pattern distribution into rule learners, the derivation of decision diagrams from neural networks, and achieving sparse coding with neural networks with non-negative weights.
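The first family of rule-extraction algorithms described above (generate artificial training patterns, label them with the opaque model, and fit an understandable learner to the enlarged set) can be sketched with scikit-learn. The jitter scale, tree depth, and synthetic dataset are illustrative assumptions, not the dissertation's actual procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# An opaque but accurate teacher model.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Generate artificial patterns near the data and label them with the teacher.
rng = np.random.default_rng(0)
X_art = X[rng.integers(0, len(X), 2000)] + rng.normal(0.0, 0.3, (2000, X.shape[1]))
y_art = forest.predict(X_art)

# Fit a shallow, understandable student on the enlarged labelled set.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(np.vstack([X, X_art]), np.concatenate([y, y_art]))

rules = export_text(tree)  # human-readable if-then rules
fidelity = float((tree.predict(X) == forest.predict(X)).mean())
```

Here `fidelity` measures how often the transparent student reproduces the opaque teacher's decisions on the original data, the usual yardstick in rule extraction.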
Nonparametric Inference for Regression Models with Spatially Correlated Errors
Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 5017V01
[Abstract]
Regression estimation can be approached using nonparametric procedures, producing flexible estimators and avoiding misspecification problems. Alternatively, parametric methods may be preferable to nonparametric approaches if the regression function belongs to the assumed parametric family. However, a bad specification of this family can lead to wrong conclusions. Regression function misspecification problems can be somewhat tackled by applying a goodness-of-fit test. For data presenting some kind of complexity, for example, circular data, the approaches used in regression estimation or in goodness-of-fit tests have to be conveniently adapted. Moreover, it might occur that the variables of interest present a certain type of dependence. For example, they can be spatially correlated, where observations which are close in space tend to be more similar than observations that are far apart. The goal of this thesis is twofold. First, some inference problems for regression models with Euclidean response and covariates, and spatially correlated errors, are analyzed. More specifically, a testing procedure for parametric regression models in the presence of spatial correlation is proposed. The second aim is to design and study new approaches to deal with regression function estimation and goodness-of-fit tests for models with a circular response and an R^d-valued covariate. In this setting, nonparametric proposals to estimate the circular regression function are provided and studied, under the assumption of independence and also for spatially correlated errors. Moreover, goodness-of-fit tests for assessing a parametric regression model are presented in these two frameworks. Comprehensive simulation studies and application of the different techniques to real datasets complete this dissertation.
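In the spirit of the nonparametric proposals above, a Nadaraya–Watson type estimator for a circular response with a real covariate can be sketched as follows: smooth the sine and cosine of the response separately and recombine with atan2, so the estimate respects the periodicity of the response. The Gaussian kernel and all names are assumptions of the sketch.

```python
import numpy as np

def nw_circular(x0, x, theta, h):
    """Kernel estimate of a circular regression function m(x0) for angular
    responses theta (radians), using a Gaussian kernel with bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # kernel weights around x0
    s = np.sum(w * np.sin(theta))           # smoothed sine component
    c = np.sum(w * np.cos(theta))           # smoothed cosine component
    return np.arctan2(s, c)                 # circular (angular) mean
```

Averaging sines and cosines rather than the raw angles avoids the wrap-around artifacts a Euclidean smoother would produce near the boundary of [-pi, pi).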
Keywords: goodness-of-fit test, circular statistics, nonparametric estimation, linear-circular regression, spatial dependence
EDMON - Electronic Disease Surveillance and Monitoring Network: A Personalized Health Model-based Digital Infectious Disease Detection Mechanism using Self-Recorded Data from People with Type 1 Diabetes
Through time, we as a society have been tested with infectious disease outbreaks of different magnitudes, which often pose major public health challenges. To mitigate the challenges, research endeavors have focused on early detection mechanisms: identifying potential data sources, modes of data collection and transmission, and case and outbreak detection methods. Driven by the ubiquitous nature of smartphones and wearables, the current endeavor is targeted towards individualizing the surveillance effort through a personalized health model, where case detection is realized by exploiting self-collected physiological data from wearables and smartphones.
This dissertation aims to demonstrate the concept of a personalized health model as a case detector for outbreak detection by utilizing self-recorded data from people with type 1 diabetes. The results have shown that infection onset triggers substantial deviations, i.e. prolonged hyperglycemia despite higher insulin injections and lower carbohydrate consumption. Per the findings, key parameters such as blood glucose level, insulin, carbohydrate, and insulin-to-carbohydrate ratio are found to carry high discriminative power. A personalized health model devised based on a one-class classifier and unsupervised methods using the selected parameters achieved promising detection performance. Experimental results show the superior performance of the one-class classifier approach, with models such as the one-class support vector machine, k-nearest neighbor, and k-means achieving better performance. Further, the results also revealed the effect of input parameters, data granularity, and sample sizes on model performance.
The presented results have practical significance for understanding the effect of infection episodes amongst people with type 1 diabetes, and the potential of a personalized health model in outbreak detection settings. The added benefit of the personalized health model concept introduced in this dissertation lies in its usefulness beyond the surveillance purpose, i.e. to devise decision support tools and learning platforms for the patient to manage infection-induced crises.
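The one-class idea can be sketched as follows. The feature values below are fabricated for illustration only (they are not the EDMON data), and the one-class support vector machine stands in for the family of models compared in the dissertation.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical daily self-recorded features: mean blood glucose (mg/dL),
# total insulin dose (units), and carbohydrate intake (g) on regular days.
normal_days = np.column_stack([
    rng.normal(120, 10, 200),
    rng.normal(40, 4, 200),
    rng.normal(180, 20, 200),
])
# Infection-like days: prolonged hyperglycemia despite more insulin and
# lower carbohydrate intake, mimicking the deviations described above.
infection_days = np.column_stack([
    rng.normal(190, 15, 20),
    rng.normal(55, 5, 20),
    rng.normal(120, 20, 20),
])

# One-class setting: train only on regular days, after standardisation.
mu, sd = normal_days.mean(axis=0), normal_days.std(axis=0)
clf = OneClassSVM(nu=0.05, gamma="scale").fit((normal_days - mu) / sd)

flags = clf.predict((infection_days - mu) / sd)  # -1 flags an anomalous day
detection_rate = float(np.mean(flags == -1))
```

Because only regular self-management days are needed for training, the detector can be fitted per person, which is the core of the personalized health model concept.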
STK /WST 795 Research Reports
These documents contain the honours research reports for each year for the Department of Statistics.
Honours Research Reports - University of Pretoria 20XX
Statistics
BSc (Hons) Mathematical Statistics, BCom (Hons) Statistics, BCom (Hons) Mathematical Statistics
Unrestricted