An application of extreme value theory in medical sciences
Master's thesis in Biostatistics, presented to the Universidade de Lisboa, through the Faculdade de Ciências, in 2018.

High blood pressure values are considered a risk factor for cardiovascular diseases, see [Hajar, 2016], and these diseases are the leading cause of death in Portugal. With the aim of profiling the Portuguese population with respect to cardiovascular disease risk, a study was carried out in 2005 by the Associação Nacional de Farmácias (National Pharmacy Association) through its Department of Pharmaceutical Services. The main interest of the present study is to model high values of systolic blood pressure in individuals suffering from a particular category of hypertension; a similar study was previously developed to model high total cholesterol levels, see [de Zea Bermudez and Mendes, 2012]. This dissertation has two main aims: to study the geographical distribution of high systolic blood pressure values (in individuals with normal diastolic blood pressure) in Portugal, i.e., to fit extreme value models for each district of Portugal and the islands, and to analyse in particular the group at greatest risk, i.e., the elderly. For that purpose, the Peaks Over Threshold methodology was applied, which consists of fitting a model to the excesses (or exceedances) above a sufficiently high systolic blood pressure threshold. The resulting models can estimate high quantiles and tail probabilities of systolic blood pressure.

In this dissertation, individuals were divided into four distinct groups: those with normal values of both systolic and diastolic blood pressure, and those exceeding the values set by the medical authorities in one or both indices, see Table 6.1. Within this last set we consider the individuals suffering from isolated systolic hypertension, characterized by a systolic blood pressure of 140 mmHg or above together with a diastolic blood pressure below 90 mmHg; we study high systolic blood pressure values in this group. As a first step, a descriptive study of the individuals who attended the campaign and suffer from isolated systolic hypertension was carried out, in order to assess the effect of other variables of interest on systolic blood pressure levels. The variables considered in this preliminary analysis were age, whose relationship with high systolic blood pressure is well known, see [Pinto, 2007]; gender; tobacco consumption; body mass index; and district.

Extreme value analysis using the Peaks Over Threshold methodology proceeds in several stages. First, a sufficiently high threshold must be obtained so that a generalized Pareto distribution, with shape parameter k and scale parameter s, see expression (3.2), can be fitted to the excesses above it. This first stage is sometimes difficult, and the literature offers several methodologies for it. There are exploratory methods, such as the one described by [Coles, 2001], which uses the mean excess function to identify the desired threshold, and [DuMouchel, 1983] suggests using the empirical 0.9 quantile, c0.9, as the threshold. There are also methods that fit the model at several candidate thresholds and assess which produces the best fit, for example via the Cramér-von Mises and Anderson-Darling tests, see [Choulakian and Stephens, 2001]; within this group we also highlight a Bayesian method based on measures of surprise, see [Lee et al., 2015]. All of the methods mentioned above are used throughout the dissertation. Once this stage is complete, a generalized Pareto distribution is fitted to the excesses above the selected threshold; maximum likelihood is the usual fitting methodology, since the resulting parameter estimators enjoy desirable properties.

We first apply the Peaks Over Threshold methodology to the individuals suffering from isolated systolic hypertension in each district of mainland Portugal and the islands. Here the difficulties inherent in extreme value analysis are explored, along with some problems found in the data. These are examined further in the following chapter, where we analyse the systolic blood pressure values of elderly individuals (aged 55 or over), consider a method that handles the multiple testing problem for ordered hypotheses, which arise from applying the Cramér-von Mises and Anderson-Darling tests to different partitions of the sample, and consider jittering models to deal with the discretization of the data.
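A minimal Python sketch of this Peaks Over Threshold pipeline is given below, using synthetic data in place of the campaign measurements and the DuMouchel-style 0.9-quantile threshold rule; note that scipy's shape-parameter convention need not match the sign convention used for k in the thesis.

```python
import numpy as np
from scipy.stats import genpareto

# Illustrative systolic blood pressure sample (mmHg); stand-in for the
# 2005 campaign data, which are not reproduced here.
rng = np.random.default_rng(0)
sbp = 140 + rng.gamma(shape=2.0, scale=8.0, size=5000)

# DuMouchel-style choice: use the empirical 0.9 quantile as the threshold.
u = np.quantile(sbp, 0.9)
excesses = sbp[sbp > u] - u

# Fit a generalized Pareto distribution (shape, scale) to the excesses by
# maximum likelihood, with the location fixed at zero.
k, _, s = genpareto.fit(excesses, floc=0.0)

# Tail probability P(SBP > x) for x above the threshold, and a high quantile.
zeta_u = np.mean(sbp > u)                      # P(SBP > u)
x = 180.0
tail_prob = zeta_u * genpareto.sf(x - u, k, loc=0.0, scale=s)
q999 = u + genpareto.ppf(1 - 0.001 / zeta_u, k, loc=0.0, scale=s)
print(f"u={u:.1f} mmHg  k={k:.3f}  s={s:.3f}  "
      f"P(SBP>180)={tail_prob:.4f}  q_0.999={q999:.1f} mmHg")
```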
Evaluating the Differences of Gridding Techniques for Digital Elevation Models Generation and Their Influence on the Modeling of Stony Debris Flows Routing: A Case Study From Rovina di Cancia Basin (North-Eastern Italian Alps)
Debris flows are among the most hazardous phenomena in mountain areas. To cope with debris flow hazard, it is common to delineate the risk-prone areas through routing models. The most important input to debris flow routing models are the topographic data, usually in the form of Digital Elevation Models (DEMs). The quality of DEMs depends on the accuracy, density, and spatial distribution of the sampled points; on the characteristics of the surface; and on the applied gridding methodology. Therefore, the choice of the interpolation method affects the realistic representation of the channel and fan morphology, and thus potentially the debris flow routing modeling outcomes. In this paper, we initially investigate the performance of common interpolation methods (i.e., linear triangulation, natural neighbor, nearest neighbor, Inverse Distance to a Power, ANUDEM, Radial Basis Functions, and ordinary kriging) in building DEMs of the complex topography of a debris flow channel located in the Venetian Dolomites (North-eastern Italian Alps), using small-footprint full-waveform Light Detection And Ranging (LiDAR) data. The investigation is carried out through a combination of statistical analysis of vertical accuracy, algorithm robustness, spatial clustering of vertical errors, and multi-criteria shape reliability assessment. After that, we examine the influence of the tested interpolation algorithms on the performance of a Geographic Information System (GIS)-based cell model for simulating stony debris flow routing. In detail, we investigate both the correlation between the DEM height uncertainty resulting from the gridding procedure and the uncertainty of the corresponding simulated erosion/deposition depths, and the effect of the interpolation algorithms on simulated areas, erosion and deposition volumes, solid-liquid discharges, and channel morphology after the event. The comparison among the tested interpolation methods highlights that the ANUDEM and ordinary kriging algorithms are not suitable for building DEMs of complex topography. Conversely, linear triangulation, the natural neighbor algorithm, and the thin-plate spline plus tension and completely regularized spline functions ensure the best trade-off between accuracy and shape reliability. Nevertheless, the evaluation of the effects of gridding techniques on debris flow routing modeling reveals that the choice of the interpolation algorithm does not significantly affect the model outcomes.
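The following Python sketch illustrates the kind of vertical-accuracy comparison described above, using synthetic points as a stand-in for the LiDAR sample and scipy's griddata for two of the tested gridding families (TIN-based linear triangulation and nearest neighbor):

```python
import numpy as np
from scipy.interpolate import griddata

# Synthetic "LiDAR" ground points over an invented surface; used only to
# show the accuracy-assessment mechanics, not the real terrain.
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 100.0, size=(2000, 2))
z = 5.0 * np.sin(xy[:, 0] / 10.0) + 0.1 * xy[:, 1]

# Hold out checkpoints for the vertical-accuracy statistics.
xy_train, xy_check = xy[:1800], xy[1800:]
z_train, z_check = z[:1800], z[1800:]

# Compare two gridding approaches at the checkpoints.
for method in ("linear", "nearest"):   # linear triangulation vs nearest neighbor
    z_hat = griddata(xy_train, z_train, xy_check, method=method)
    err = z_hat - z_check              # NaN where a checkpoint falls outside the hull
    me = np.nanmean(err)               # mean (systematic) error
    rmse = np.sqrt(np.nanmean(err ** 2))  # overall vertical accuracy
    print(f"{method:8s}  ME={me:+.3f} m  RMSE={rmse:.3f} m")
```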
Contributions to Statistical Image Analysis for High Content Screening.
Images of cells incubated with fluorescent small molecule probes can be used to infer where the compounds distribute within cells. Identifying the spatial pattern of
compound localization within each cell is a very important problem for which adequate statistical methods do not yet exist.
First, we asked whether a classifier for subcellular localization categories can be developed based on a training set of manually classified cells. Because the images present challenges such as uneven field illumination, low resolution, high noise, variation in intensity and contrast, and cell-to-cell variability in probe distributions, we constructed texture features from contrast quantiles conditioned on intensity. Classification experiments on artificial cells with the same marginal distribution but different conditional distributions supported the view that this conditioning helps distinguish different localization distributions. Using these conditional features, we obtained satisfactory performance in image classification, and also performed dimension reduction and data visualization.
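A sketch of what intensity-conditioned contrast features might look like follows; the local-contrast measure and the decile binning are illustrative assumptions, not the exact construction used in the thesis:

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Stand-in fluorescence image; real inputs would be segmented cell images.
rng = np.random.default_rng(2)
img = rng.gamma(2.0, 30.0, size=(128, 128))

# Local contrast: absolute deviation from a 3x3 neighborhood mean.
contrast = np.abs(img - uniform_filter(img, size=3))

# Condition on intensity: split pixels into intensity deciles, then take
# contrast quantiles within each decile.
edges = np.quantile(img, np.linspace(0.1, 0.9, 9))
decile = np.digitize(img, edges)               # bin index 0..9 per pixel
features = []
for d in range(10):
    vals = contrast[decile == d]
    features.extend(np.quantile(vals, [0.25, 0.50, 0.75]))
print(f"{len(features)} conditional texture features per image")
```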
As high content images are subject to several major forms of artifacts, we are interested in the implications of measurement errors and artifacts for our ability to draw scientifically meaningful conclusions from high content images. Specifically, we considered three forms of artifacts: saturation, blurring, and additive noise. For each type of artifact, we artificially introduced increasing amounts of it and used the 'Simulation Extrapolation' (SIMEX) method to understand the resulting bias, applying it to the measurement errors in pairwise centroid distances, the degree of eccentricity of the class-specific distributions, and the angles between the dominant axes of variability for different categories.
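The SIMEX logic, re-contaminating the data with increasing amounts of artificial error and extrapolating the fitted trend back to the error-free setting, can be sketched in a few lines of Python; the data and the target statistic here are illustrative, not the thesis pipeline:

```python
import numpy as np

rng = np.random.default_rng(3)
truth = rng.normal(10.0, 2.0, size=1000)
sigma_u = 1.0                                   # assumed-known error scale
observed = truth + rng.normal(0.0, sigma_u, size=1000)

def statistic(x):
    # Variance is inflated by additive measurement error, so it is a
    # convenient target for bias correction.
    return float(np.var(x))

# Step 1: re-contaminate with extra noise at several multiples lambda.
lambdas = np.array([0.5, 1.0, 1.5, 2.0])
sim_means = [
    np.mean([statistic(observed + rng.normal(0.0, np.sqrt(lam) * sigma_u, 1000))
             for _ in range(200)])
    for lam in lambdas
]

# Step 2: fit a trend in lambda and extrapolate back to lambda = -1,
# the hypothetical error-free setting.
coef = np.polyfit(lambdas, sim_means, deg=2)
simex_estimate = np.polyval(coef, -1.0)
print(f"naive={statistic(observed):.2f}  SIMEX={simex_estimate:.2f}  "
      f"target close to {np.var(truth):.2f}")
```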
Finally, we briefly considered the analysis of time-point images, where small molecule studies will be more focused. Specifically, we consider the evolving patterns of subcellular staining from the moment a compound is introduced into the cell culture medium to the point that a steady-state distribution is reached. We use the degree to which the subcellular staining pattern is concentrated in or near the nucleus as the feature of the time-course data set, and aim to determine whether different compounds accumulate in different regions at different times, characterized in terms of their position in the cell relative to the nucleus.

Ph.D. Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/91460/1/liufy_1.pd
A Framework for the Estimation of Disaggregated Statistical Indicators Using Tree-Based Machine Learning Methods
The thesis combines four papers that introduce a coherent framework based on mixed effects random forests (MERFs) for the estimation of spatially disaggregated economic and inequality indicators and associated uncertainties. Chapter 1 focusses on flexible domain prediction using MERFs. We discuss characteristics of semi-parametric point and uncertainty estimates for domain-specific means. Extensive model- and design-based simulations highlight the advantages of MERFs in comparison to 'traditional' LMM-based small area estimation (SAE) methods. Chapter 2 introduces the use of MERFs under limited covariate information, since access to population-level micro-data for auxiliary information imposes barriers for researchers and practitioners. We introduce an approach that adaptively incorporates aggregated auxiliary information through calibration weights in the absence of unit-level auxiliary data. We apply the proposed method to German survey data and use aggregated covariate census information from the same year to estimate the average opportunity cost of care work for 96 planning regions in Germany. In Chapter 3, we discuss the estimation of non-linear poverty and inequality indicators. Our proposed method estimates domain-specific cumulative distribution functions from which the desired (non-linear) poverty estimators can be obtained. We evaluate the proposed point and uncertainty estimators in a design-based simulation and focus on a case study uncovering spatial patterns of poverty for the Mexican state of Veracruz. Additionally, Chapter 3 informs a methodological discussion on the differences and advantages between the use of predictive algorithms and (linear) statistical models in the context of SAE. The final Chapter 4 complements the previous research by implementing the discussed point and uncertainty estimation methods in the open-source R package SAEforest. The package facilitates the use of the discussed methods and accessibly adds MERFs to the existing toolbox for SAE and official statistics.
Overall, this work aims to synergize aspects from two statistical spheres (i.e., 'traditional' parametric models and nonparametric predictive algorithms) by critically discussing and adapting tree-based methods for applications in SAE. In this perspective, the thesis contributes to the existing literature along three dimensions: 1) the methodological development of alternative semi-parametric methods for the estimation of non-linear domain-specific indicators and means under unit-level and aggregated auxiliary covariates; 2) the proposition of a general framework that enables further discussion between 'traditional' and algorithmic approaches to SAE, as well as an extensive comparison between LMM-based methods and MERFs in applications and several model- and design-based simulations; 3) the provision of an open-source software package that facilitates the usability of the methods, making MERFs and general SAE methodology accessible for tailored research applications by statistical, institutional and political practitioners.
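As a rough illustration of the MERF idea (not the SAEforest implementation, and omitting variance-component estimation), the following Python sketch alternates between fitting a random forest for the fixed part and updating shrunken domain intercepts, assuming a single random intercept per domain:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated data following y = f(X) + b[domain] + e; all values illustrative.
rng = np.random.default_rng(4)
n, n_dom = 2000, 40
X = rng.normal(size=(n, 5))
dom = rng.integers(0, n_dom, size=n)
b_true = rng.normal(0.0, 1.0, size=n_dom)
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + b_true[dom] + rng.normal(0.0, 0.5, size=n)

b = np.zeros(n_dom)
for _ in range(5):                                   # EM-style sweeps
    forest = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   random_state=0)
    forest.fit(X, y - b[dom])                        # fixed part on y minus b
    resid = y - forest.oob_prediction_               # out-of-bag residuals
    for d in range(n_dom):
        r = resid[dom == d]
        b[d] = r.sum() / (len(r) + 1.0)              # crude shrunken intercept

print("estimated intercepts:", np.round(b[:3], 2),
      "true:", np.round(b_true[:3], 2))
```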
A CFD-informed model for subchannel resolution crud prediction
A physics-directed, statistically based surrogate model of the small-scale flow features that impact Chalk River unidentified deposit (crud) growth is presented in this work. The objective of the surrogate is to provide additional details of the rod surface temperature, heat flux, and near-wall turbulent kinetic energy fields which cannot be explicitly captured by a subchannel code.

Operating as a mapping from high fidelity computational fluid dynamics (CFD) data to the low fidelity subchannel grid (hi2lo), the model provides CFD-informed boundary conditions to the crud model executed on the subchannel pin surface mesh. The surface temperature, heat flux, and turbulent kinetic energy, henceforth referred to as the fields of interest (FOI), govern the growth rate of crud on the surface of the rod and the precipitation of boron in the porous crud layer. Therefore, the model predicts the behavior of the FOIs as a function of position in the core and local thermal-hydraulic (TH) conditions.
The subchannel code produces an estimate of all crud-relevant TH quantities at a coarse spatial resolution everywhere in the core and executes substantially faster than CFD. In the hi2lo approach, the solution provided by the subchannel code is augmented by a predicted stochastic component of the FOI, informed by CFD results, to provide a more detailed description of the target FOIs than subchannel analysis can provide alone. To this end, a novel method based on the marriage of copula and gradient boosting techniques is proposed. This methodology forgoes a spatial interpolation procedure in favor of a statistically driven approach, which predicts the fractional area of a rod's surface in excess of some critical temperature but not precisely where such maxima occur on the rod surface. The resulting model retains the ability to account for the presence of hot and cold spots on the rod surface induced by turbulent flow downstream of spacer grids when producing crud estimates. Sklar's theorem is leveraged to decompose multivariate probability densities of the FOI into independent copula and marginal models. The free parameters within the copula model are predicted using a combination of supervised regression and classification machine learning techniques, with training data sets supplied by a suite of precomputed CFD results spanning a typical pressurized water reactor TH envelope.
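A Python sketch of the Sklar-style decomposition follows, assuming a Gaussian copula and empirical marginals for two illustrative FOIs; in the actual method the copula parameters are predicted from local TH conditions rather than fitted to synthetic data:

```python
import numpy as np
from scipy import stats

# Stand-in samples for two FOIs, say wall temperature (K) and turbulent
# kinetic energy; values and dependence structure are illustrative.
rng = np.random.default_rng(5)
temp = 560.0 + rng.gamma(9.0, 2.0, size=5000)
tke = 0.01 * (temp - 560.0) + rng.lognormal(0.0, 0.3, 5000)

# Probability integral transform each margin via its empirical CDF.
u = stats.rankdata(temp) / (len(temp) + 1.0)
v = stats.rankdata(tke) / (len(tke) + 1.0)

# Fit the Gaussian copula correlation on the normal scores.
z = stats.norm.ppf(np.column_stack([u, v]))
rho = np.corrcoef(z, rowvar=False)[0, 1]

# Sample dependent pairs: draw from the copula, then invert the marginals.
zs = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=1000)
us = stats.norm.cdf(zs)
temp_s = np.quantile(temp, us[:, 0])
tke_s = np.quantile(tke, us[:, 1])

# Example use: fractional rod area above a critical temperature.
print(f"rho={rho:.3f}  frac(T > 585 K)={np.mean(temp_s > 585.0):.3f}")
```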
Results show that, compared to the standalone subchannel case, the hi2lo method more accurately preserves the influence of spacer grids on the crud growth rate; more precisely, the hi2lo method recovers key statistical properties of the FOI which impact crud growth. Compared to gold standard high fidelity coupled CFD/crud results in a single assembly test case, the hi2lo model produced a relative total crud mass difference of -8.9%, versus a standalone subchannel relative crud mass difference of 192.1%.

Mechanical Engineering
Application of Statistical Methods and Process Models for the Design and Analysis of Activated Sludge Wastewater Treatment Plants (WWTPs)
The purpose of this study is to investigate statistical procedures to quantify uncertainty and explicitly evaluate its impact on wastewater treatment plants (WWTPs). The goal is to develop a statistically based procedure for designing WWTPs that provide reliable protection of water quality, instead of making overly conservative assumptions and adopting empirical safety factors. An innovative Monte Carlo based procedure was developed to quantify the risk of violating effluent standards as a function of various design decisions. A simulation program called StatASPS was developed to conduct Monte Carlo simulations combined with the ASM1 model.
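The core Monte Carlo idea can be sketched as follows; the toy effluent model stands in for the ASM1 simulation that StatASPS actually drives, and the limit and distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

def effluent_bod(mu_max, influent_bod):
    # Toy steady-state response, decreasing in the growth rate mu_max;
    # StatASPS would solve the full ASM1 model numerically here.
    return influent_bod / (1.0 + 20.0 * mu_max)

limit = 20.0                                  # mg/L permit limit (illustrative)
n = 10_000
mu_max = rng.lognormal(np.log(0.6), 0.2, n)   # sensitive kinetic parameter
influent = rng.normal(220.0, 30.0, n)         # influent BOD variability

risk = np.mean(effluent_bod(mu_max, influent) > limit)
print(f"estimated risk of violating the effluent limit: {risk:.1%}")
```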
A random influent generator was developed to describe the statistical characteristics of the influent components of WWTPs. Prior to modeling, a two-directional exponential smoothing (TES) method was developed to replace data non-randomly missing during weekends and holidays. The best models were selected based on various statistics and their ability to forecast future values. The time series models were then used to generate random influent variables with the same statistical characteristics as the original data.
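The abstract does not define TES in detail; one plausible reading, sketched below under that assumption, is to run an exponential smoother forward and backward over the series and average the two passes to fill the weekend and holiday gaps:

```python
import numpy as np

def ewma_fill(x, alpha=0.3):
    # One-directional exponential smoothing that carries the last smoothed
    # value across missing (NaN) entries.
    s, out = None, np.empty_like(x)
    for i, v in enumerate(x):
        if not np.isnan(v):
            s = v if s is None else alpha * v + (1.0 - alpha) * s
        out[i] = np.nan if s is None else s
    return out

def tes_impute(x, alpha=0.3):
    # Assumed "two-directional" variant: average forward and backward passes,
    # keeping the observed values untouched.
    fwd = ewma_fill(x, alpha)
    bwd = ewma_fill(x[::-1], alpha)[::-1]
    both = np.nanmean(np.vstack([fwd, bwd]), axis=0)
    return np.where(np.isnan(x), both, x)

# Example: a daily influent series with a weekend gap.
y = np.array([210.0, 205.0, np.nan, np.nan, 220.0, 215.0, 225.0])
print(np.round(tes_impute(y), 1))
```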
The best Monte Carlo simulations were conducted using historical influent data and site-specific parameter distributions, as shown by the applications to both the Oak Ridge and Seneca WWTPs. This indicates that parameter uncertainty was more important for predicting uncertainty in plant performance than influent variability. The final simulations were conducted using one month's influent data, given the limitations of the available computing technology. Application of the method to the two plants demonstrated that it provides a reliable and reasonable estimate of the uncertainty of plant performance. The best predictions of plant uncertainty were obtained by determining the distribution of the most sensitive parameter and holding all other model parameters constant.
The StatASPS procedure proved to be a reliable and reasonable method to design cost-effective WWTPs. With further development, this procedure could provide engineers and regulators with a high degree of confidence that the plant will perform as required, without resorting to overly conservative assumptions or large safety factors.
ISBIS 2016: Meeting on Statistics in Business and Industry
This book includes the abstracts of the talks presented at the 2016 International Symposium on Business and Industrial Statistics, held in Barcelona, June 8-10, 2016, hosted by the Department of Statistics and Operations Research at the Universitat Politècnica de Catalunya - Barcelona TECH. The meeting took place in the ETSEIB building (Escola Tècnica Superior d'Enginyeria Industrial), Avda. Diagonal 647.
The meeting organizers celebrated the continued success of the ISBIS and ENBIS societies, and the meeting drew together the international community of statisticians, both academics and industry professionals, who share the goal of making statistics the foundation for decision making in business and related applications. The Scientific Program Committee consisted of:
David Banks, Duke University
Amílcar Oliveira, DCeT - Universidade Aberta and CEAUL
Teresa A. Oliveira, DCeT - Universidade Aberta and CEAUL
Nalini Ravishankar, University of Connecticut
Xavier Tort Martorell, Universitat Politècnica de Catalunya, Barcelona TECH
Martina Vandebroek, KU Leuven
Vincenzo Esposito Vinzi, ESSEC Business School