
    A Grouping Genetic Algorithm for Joint Stratification and Sample Allocation Designs

    Predicting the cheapest sample size for the optimal stratification in multivariate survey design is a hard problem when the population frame is large. An existing solution iteratively searches for the minimum sample size necessary to meet accuracy constraints over partitions of the atomic strata, created by the Cartesian product of auxiliary variables, into larger strata. The optimal stratification can be found by testing all possible partitions; however, the number of possible partitions grows exponentially with the number of initial strata. Among the alternative ways of modelling this problem, one of the most natural is to use genetic algorithms (GAs). These evolutionary algorithms use recombination, mutation and selection to search for optimal solutions, and they often converge on optimal or near-optimal solutions more quickly than exact methods. We propose a new GA approach to this problem that uses grouping genetic operators instead of traditional operators. The results show a significant improvement in solution quality for similar computational effort, corresponding to large monetary savings.
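
    The abstract describes the approach only at a high level; below is a minimal, hypothetical Python sketch of the grouping representation it refers to: a chromosome assigns each atomic stratum to a group (stratum), a group-oriented crossover inherits whole groups from one parent and repairs the rest from the other, and mutation moves an atomic stratum between existing groups. The function names, the repair rule and the parameter values are illustrative assumptions, not the authors' operators.

```python
import random

def random_partition(n_atomic, max_groups):
    """Assign each atomic stratum to a random group (stratum) label."""
    return [random.randrange(max_groups) for _ in range(n_atomic)]

def grouping_crossover(parent_a, parent_b):
    """Group-oriented crossover: the child inherits a random subset of
    whole groups from parent A, and the remaining atomic strata keep the
    grouping they had in parent B (relabelled to avoid collisions)."""
    groups_a = sorted(set(parent_a))
    inherited = set(random.sample(groups_a, k=max(1, len(groups_a) // 2)))
    offset = max(groups_a) + 1
    return [g if g in inherited else offset + parent_b[i]
            for i, g in enumerate(parent_a)]

def mutate(partition, p=0.05):
    """Move an atomic stratum to another existing group with probability p."""
    labels = sorted(set(partition))
    return [random.choice(labels) if random.random() < p else g
            for g in partition]

# Toy usage: 12 atomic strata, two random parents, one mutated child.
parent_a = random_partition(12, 4)
parent_b = random_partition(12, 4)
print(mutate(grouping_crossover(parent_a, parent_b)))
```

    In a full implementation, the fitness of a partition would be the minimum sample size (cost) that satisfies the accuracy constraints under optimal allocation, which is the quantity the evolutionary search tries to minimise.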

    Discussions

    Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/111979/1/j.1751-5823.2011.00145.x.pd

    The Weighting Process in the SHIW

    The design of a probability sample jointly determines the method used to select sampling units from the population and the estimator of the population parameter. If the sampling fraction is constant for all the units in the sample, then the unweighted sample mean is an unbiased estimator. In the Survey on Household Income and Wealth (SHIW), units included in the sample have unequal probabilities of selection, and each observation is weighted using the inverse of the proper sampling fraction (design weight), adjusted for the response mechanism (nonresponse weight) and for other factors such as imperfect coverage. In this paper we present the weighting scheme of the SHIW and assess its impact on the bias and variance of selected estimators. Empirical evidence shows that the increased variability induced by using weighted estimators is compensated for by the reduction in bias, even when performing analyses on sample domains. A set of longitudinal weights is also proposed to account for the selection process and the attrition of the SHIW panel component. Given their enhanced description of the “panel population”, these weights should be better suited to longitudinal analysis; nevertheless, their higher variance implies that they would not always be preferable in terms of mean squared error.
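
    As a hedged illustration of the weighting scheme sketched above (not the SHIW production code), the snippet below forms a final weight as the inverse of the sampling fraction, adjusts it by an estimated response probability, and plugs it into a Hájek-type weighted mean. The variable names, the toy data and the optional coverage factor are assumptions.

```python
import numpy as np

def final_weights(sampling_fraction, response_prob, coverage_factor=None):
    """Design weight = 1 / sampling fraction, adjusted for nonresponse
    and, optionally, for other factors such as imperfect coverage."""
    w = 1.0 / np.asarray(sampling_fraction, float)    # design weight
    w = w / np.asarray(response_prob, float)          # nonresponse adjustment
    if coverage_factor is not None:
        w = w * np.asarray(coverage_factor, float)    # e.g. coverage calibration
    return w

def weighted_mean(y, w):
    """Hajek-type weighted estimator of the population mean."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    return np.sum(w * y) / np.sum(w)

# Toy example: three households with unequal selection probabilities.
y = [30_000, 55_000, 120_000]      # household income
f = [0.001, 0.0005, 0.0001]        # sampling fractions
r = [0.9, 0.8, 0.6]                # estimated response probabilities
print(weighted_mean(y, final_weights(f, r)))
```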

    Stochastic-optimization of equipment productivity in multi-seam formations

    Short- and long-range planning and execution for multi-seam coal formations (MSFs) are challenging because of their complex extraction mechanisms. Stripping equipment selection and scheduling are functions of the physical dynamics of the mine and the operational mechanisms of its components, so equipment productivity depends on these parameters. Previous research did not incorporate quantitative relationships between equipment productivities and extraction dynamics in MSFs, and the intrinsic variability of excavation and spoiling dynamics must also form part of existing models. This research formulates quantitative relationships for equipment productivities using branch-and-bound algorithms and Lagrange parameterization approaches. The stochastic processes are resolved via Monte Carlo/Latin hypercube simulation techniques within the @RISK framework. The model was demonstrated on a bituminous coal mining case in the Appalachian field. The simulated results showed a 3.51% improvement in mining cost and a 0.19% increase in net present value. A 76.95 yd³ drop in productivity per unit change in cycle time was recorded for sub-optimal equipment schedules. The geologic variability and equipment operational parameters restricted any possible change in the cost function. A 50.3% chance of the mining cost rising above its current value was driven by the volume of material re-handled, with a regression coefficient of 0.52. The study advances the optimization process in mine planning and scheduling algorithms so as to capture efficiently the future uncertainties surrounding multivariate random functions. The main novelty is the application of stochastic-optimization procedures to improve equipment productivity in MSFs.
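
    The abstract mentions Monte Carlo/Latin hypercube simulation within the @RISK framework; the sketch below shows, purely as an assumed Python illustration rather than an @RISK model, how Latin hypercube draws over two uncertain equipment inputs can be propagated through a simple productivity relation. The input ranges, the bucket size and the productivity formula are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def latin_hypercube(n, bounds):
    """Latin hypercube draws: one point per equal-probability slice of each
    input variable, with the slices randomly paired across variables."""
    d = len(bounds)
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n   # one draw per slice
    for j in range(d):
        u[:, j] = rng.permutation(u[:, j])                  # decouple the columns
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)

# Hypothetical uncertain inputs: dragline cycle time (s) and bucket fill factor.
samples = latin_hypercube(5_000, [(55.0, 75.0), (0.75, 0.95)])
cycle_time, fill = samples[:, 0], samples[:, 1]

bucket_yd3 = 80.0                                           # nominal bucket size
productivity = bucket_yd3 * fill * 3600.0 / cycle_time      # yd3 per hour
print(f"mean productivity: {productivity.mean():.0f} yd3/h, "
      f"P(productivity < 3200 yd3/h) = {(productivity < 3200).mean():.2f}")
```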

    State space modelling of extreme values with particle filters

    State space models are a flexible class of Bayesian models that can be used to capture non-stationarity smoothly. Observations are assumed independent given a latent state process, so that their distribution can change gradually over time. Sequential Monte Carlo methods known as particle filters provide an approach to inference for such models in which observations are added to the fit sequentially. Though originally developed for on-line inference, particle filters, along with related particle smoothers, often provide the best approach for off-line inference. This thesis develops new results for particle filtering and, in particular, develops a new particle smoother whose computational complexity is linear in the number of Monte Carlo samples. This compares favourably with the quadratic complexity of most of its competitors, resulting in greater accuracy within a given time frame. The statistical analysis of extremes is important in many fields where the largest or smallest values have the biggest effect, and accurate assessments of the likelihood of extreme events are crucial to judging how severe they could be. While the extreme values of a stationary time series are well understood, datasets of extremes often contain varying degrees of non-stationarity, and how best to extend standard extreme value models to account for non-stationary series is a topic of ongoing research. The thesis develops inference methods for extreme values of univariate and multivariate non-stationary processes using state space models fitted with particle methods. Though this approach has been considered previously in the univariate case, we identify problems with the existing method and provide solutions and extensions to it. The application of the methodology is illustrated through the analysis of a series of world-class athletics running times, extreme temperatures at a site in the Antarctic, and sea-level extremes on the east coast of England.
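
    To make the propagate/weight/resample cycle behind particle filtering concrete, here is a minimal bootstrap particle filter for a toy linear-Gaussian local-level model. The model, noise levels and particle count are assumptions, and the thesis' linear-complexity particle smoother is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(y, n_particles=1000, state_sd=0.1, obs_sd=0.5):
    """Bootstrap particle filter for a local-level model:
    x_t = x_{t-1} + state noise,  y_t = x_t + observation noise."""
    particles = rng.normal(0.0, 1.0, n_particles)     # prior draws for x_0
    means = []
    for obs in y:
        particles = particles + rng.normal(0.0, state_sd, n_particles)  # propagate
        logw = -0.5 * ((obs - particles) / obs_sd) ** 2                  # weight
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * particles))                              # filtered mean
        idx = rng.choice(n_particles, size=n_particles, p=w)             # resample
        particles = particles[idx]
    return np.array(means)

# Simulate data from the same model and report the average filtering error.
x = np.cumsum(rng.normal(0.0, 0.1, 200))
y = x + rng.normal(0.0, 0.5, 200)
print(np.abs(bootstrap_filter(y) - x).mean())
```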

    Selection models for efficient two-phase design of family studies

    This is the peer-reviewed version of the following article: “Zhong Y and Cook RJ (2021), Selection models for efficient two-phase design of family studies, Statistics in Medicine, 40 (2): 254–270”, which has been published in final form at https://doi.org/10.1002/sim.8772.
    Family studies routinely employ biased sampling schemes in which individuals are randomly chosen from a disease registry and genetic and phenotypic data are obtained from their consenting relatives. We view this as a two-phase study and propose the use of an efficient selection model for the recruitment of families to form a phase II sample subject to budgetary constraints. Simple random sampling, balanced sampling and use of an approximately optimal selection model are considered, where the latter is chosen to minimize the variance of parameters of interest. We consider the setting where family members provide current status data with respect to the disease and use copula models to address within-family dependence. The efficiency gains from the use of an optimal selection model over simple random sampling and balanced sampling schemes are investigated, as is the robustness of optimal sampling to model misspecification. An application to a family study on psoriatic arthritis is given for illustration.
    Funding: National Natural Science Foundation of China, NSFC-11901376 (to YZ) || Shanghai Pujiang Program, 2019PJC051 (to YZ) || SUFE Innovation Funding, 2019110051 (to YZ) || Discovery Grant and Supplement Award from the Natural Sciences and Engineering Research Council of Canada, RGPIN 155849 and RGPIN 04207 (to RJC) || Canadian Institutes of Health Research, FRN 13887 (to RJC)
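
    As a rough sketch (not the authors' selection model), the snippet below contrasts simple random sampling of families with a balanced phase II design that splits the budget across strata formed from inexpensive phase I information; an approximately optimal design would instead choose the stratum-specific sampling rates to minimise the variance of the target parameters. The stratifying variable, sizes and budget are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def srs_phase2(n_families, budget):
    """Phase II selection by simple random sampling of families."""
    return rng.choice(n_families, size=budget, replace=False)

def balanced_phase2(stratum, budget):
    """Balanced design: split the phase II budget evenly across strata
    defined by inexpensive phase I information (e.g. proband phenotype)."""
    labels = np.unique(stratum)
    per_stratum = budget // len(labels)
    chosen = []
    for s in labels:
        members = np.flatnonzero(stratum == s)
        take = min(per_stratum, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(chosen)

# 500 registry families stratified by, say, proband disease severity (0/1/2).
severity = rng.integers(0, 3, size=500)
print(len(srs_phase2(500, 120)), len(balanced_phase2(severity, 120)))
```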

    Structure preserving estimators to update socio-economic indicators in small areas

    Official statistics are intended to support decision makers by providing reliable information on different population groups, identifying what their needs are and where they are located. This makes it possible, for example, to better guide public policies and to focus resources on the population most in need. To be useful for this purpose, statistical information must be reliable, up to date and disaggregated at different domain levels, e.g., geographically or by sociodemographic groups (Eurostat, 2017). Statistical data producers (e.g., national statistical offices) face great challenges in delivering statistics with these three characteristics, mainly due to a lack of resources. Population censuses collect data on demographic, economic and social aspects of all persons in a country, which makes information available for all domains of interest, but they quickly become outdated since they are carried out only every ten years, especially in developing countries. Furthermore, administrative data sources in many countries are not of sufficient quality to produce statistics that are reliable and comparable with other relevant sources. In contrast, national surveys are conducted more frequently than censuses and offer the possibility of studying more complex topics. Due to their sample sizes, however, direct estimates are only published for domains where the estimates reach a specific level of precision. These domains are called planned domains or large areas in this thesis; the domains in which direct estimates cannot be produced, because of insufficient sample size or low precision, are called small areas or small domains.

    Small area estimation (SAE) methods have been proposed as a solution for producing reliable estimates in small domains. By combining data from censuses and surveys, these methods improve the precision of direct estimates and provide reliable information in domains where the sample size is zero or where direct estimates cannot be obtained (Rao and Molina, 2015). The variables obtained from the two data sources are assumed to be highly correlated, but the census may in fact be outdated. In these cases, structure preserving estimation (SPREE) methods offer a solution when the target indicator is a categorical variable with at least two categories (for example, the labor market status of an individual can be categorised as ‘employed’, ‘unemployed’, or ‘out of the labor force’). The population counts are arranged in contingency tables, with rows given by the domains of interest and columns by the categories of the variable of interest (Purcell and Kish, 1980). These types of estimators are studied in Part I of this work.

    In Chapter 1, SPREE methods are applied to produce postcensal population counts for the indicators that make up the ‘health’ dimension of the multidimensional poverty index (MPI) defined by Costa Rica. This case study is also used to illustrate the functionalities of the R spree package, a user-friendly tool designed to produce updated point and uncertainty estimates based on three different approaches: SPREE (Purcell and Kish, 1980), generalised SPREE (GSPREE) (Zhang and Chambers, 2004), and multivariate SPREE (MSPREE) (Luna-Hernández, 2016). SPREE-type estimators update population counts by preserving the census structure while relying on new, updated totals that are usually provided by recent survey data.

    However, two scenarios can jeopardise the use of standard SPREE methods: a) the indicator of interest is not available in the census data, e.g., the income or expenditure information needed to estimate monetary poverty indicators, and b) the total margins are not reliable, for instance when changes in the population distribution between areas are not captured correctly by the surveys or when some domains are not selected in the sample. Chapters 2 and 3 offer solutions for these two cases, respectively. Chapter 2 presents a two-step procedure for obtaining reliable and updated estimates for small areas when the variable of interest is not available in the census. The first step is to obtain the population counts for the census year using a well-known small area estimation approach, the empirical best prediction (EBP) method (Molina and Rao, 2010). The result of this procedure is then used as input to the update for postcensal years, implemented with the MSPREE method (Luna-Hernández, 2016). This methodology is applied to local areas in Costa Rica, where the incidence of (income-based) poverty is estimated and updated for the postcensal years 2012-2017. Chapter 3 deals with the second scenario: the population totals for local areas provided by the survey data are strengthened by including satellite imagery as an auxiliary source, and these new margins are used as input to the SPREE procedure. In the case study in this paper, annual updates of the MPI for female-headed households in Senegal are produced.

    While the use of satellite imagery and other big data sources can improve the reliability of small-area estimates, access to survey data that can be matched with these novel sources is restricted for confidentiality reasons. Therefore, a data dissemination strategy for micro-level survey data is proposed in the paper presented in Part II. This strategy aims to help statistical data producers improve the trade-off between the privacy risk and the utility of the data they release for research purposes.
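
    SPREE-type updating can be read as adjusting the census area-by-category contingency table so that it reproduces newer row and column totals (typically from a recent survey) while preserving the census interaction structure. The sketch below is a generic iterative proportional fitting pass in Python under that reading, not the R spree package; the toy table, the margins and the stopping rule are assumptions.

```python
import numpy as np

def spree_update(census_table, new_row_totals, new_col_totals,
                 tol=1e-8, max_iter=1000):
    """Adjust the census area-by-category table to match updated margins
    while keeping the census interaction structure (IPF-style)."""
    x = census_table.astype(float).copy()
    for _ in range(max_iter):
        x *= (new_row_totals / x.sum(axis=1))[:, None]   # match area totals
        x *= (new_col_totals / x.sum(axis=0))[None, :]   # match category totals
        if np.allclose(x.sum(axis=1), new_row_totals, rtol=tol):
            break
    return x

# Toy example: 3 areas x 2 categories (e.g. deprived / not deprived).
census = np.array([[80.0, 20.0], [50.0, 50.0], [30.0, 70.0]])
updated = spree_update(census,
                       new_row_totals=np.array([110.0, 95.0, 115.0]),
                       new_col_totals=np.array([150.0, 170.0]))
print(updated.round(1))
```

    The GSPREE and MSPREE approaches cited in the abstract relax this scheme by modelling, rather than simply carrying over, the relationship between the census and current interaction structures.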

    Statistical Methods for Incomplete Covariates and Two-Phase Designs

    Incomplete data is a pervasive problem in health research, and as a result statistical methods enabling inference based on partial information play a critical role. This thesis explores the estimation of regression coefficients, and associated inferences, when variables are incompletely observed. In the later chapters we focus primarily on settings where covariate data are incomplete by design, as in studies with two-phase sampling schemes, as opposed to incomplete data that arise from events beyond the control of the investigator. We consider the problem in which "inexpensive" auxiliary information can be used to inform the selection of individuals for collection of data on the "expensive" covariate, and in particular we explore how parameter estimation relates to the choice of sampling scheme. Efficient sampling designs are defined by choosing the optimal sampling criteria within a particular class of selection models under a two-phase framework, and the efficiency of these optimal designs is compared to that of simple random sampling and balanced sampling designs under a variety of frameworks for inference.

    As a prelude to the work on two-phase designs, we first review and study issues related to incomplete data arising due to chance. In Chapter 2, we discuss several models by which missing data can arise, with an emphasis on issues in clinical trials. The likelihood function is used as a basis for discussing different missing data mechanisms for incomplete responses in short-term and longitudinal studies, as well as for missing covariates. We briefly discuss common ad hoc strategies for dealing with incomplete data, such as complete-case analyses and naive methods of imputation, and we review more broadly appropriate approaches in terms of their asymptotic and empirical frequency properties. These methods include the EM algorithm, multiple imputation, and inverse probability weighted estimating equations. Simulation studies are reported which demonstrate how to implement these procedures and examine their performance empirically. We further explore the asymptotic bias of these estimators when the nature of the missing data mechanism is misspecified, considering specific types of model misspecification in methods designed to account for the missingness and comparing the limiting values of the resulting estimators.

    In Chapter 3, we focus on methods for two-phase studies in which covariates are incomplete by design. In the second phase of a two-phase study, subject to correct specification of key models, optimal sub-sampling probabilities can be chosen to minimise the asymptotic variance of the resulting estimator. These optimal phase-II sampling designs are derived, and the empirical and asymptotic relative efficiencies resulting from them are compared to those from simple random sampling and balanced sampling designs. We further examine the effect on efficiency of using external pilot data to estimate the parameters needed to derive the optimal designs, and we explore the sensitivity of these designs to misspecification of the preliminary parameter estimates and of the covariate model at the design stage. Designs which are optimal for analyses based on inverse probability weighted estimating equations are shown to yield efficiency gains for several different methods of analysis and to be relatively robust to misspecification of the parameters or models used to derive them. Furthermore, these optimal designs for inverse probability weighted estimating equations are shown to be well behaved when the necessary design parameters are estimated using relatively small external pilot studies. We also consider efficient two-phase designs explicitly in the context of studies involving clustered and longitudinal responses, with model-based methods discussed for estimation and inference. Asymptotic results are used to derive optimal sampling designs, and the relative efficiencies of these designs are again compared with simple random sampling and balanced sampling designs. In this more complex setting, balanced sampling designs are shown to be inefficient, and it is not obvious when balanced sampling will offer greater efficiency than a simple random sampling design. We explore the relative efficiency of phase-II sampling designs based on increasing amounts of information in the longitudinal responses and show that the balanced design may become less efficient when more data is available at the design stage. In contrast, the optimal design is able to exploit additional information to increase efficiency whenever more data is available at phase I.

    In Chapter 4, we consider an innovative adaptive two-phase design which breaks the phase-II sampling into a phase-IIa sample obtained by a balanced or proportional sampling strategy, and a phase-IIb sample collected according to an optimal sampling design based on the data from phases I and IIa. This approach exploits the previously established robustness of optimal inverse probability weighted designs to overcome the difficulty that derivations of optimal designs require a priori knowledge of parameters. The efficiency of this hybrid design is compared to those of the proportional and balanced sampling designs, and to the efficiency of the true optimal design, in a variety of settings. The efficiency gains of the adaptive two-phase design are particularly apparent in the setting involving clustered response data, and it is natural to consider this approach in settings with complex models for which it is difficult even to speculate on suitable parameter values at the design stage.
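
    As an assumed illustration of inverse probability weighted estimating equations in a two-phase setting (not the thesis' code), the sketch below fits a logistic regression in which the expensive covariate effectively enters only for phase-II subjects, each weighted by the inverse of a known, outcome-dependent selection probability. The data-generating model, the selection probabilities and the sample size are invented.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)

# Phase I: outcome y and cheap covariate z for everyone; the expensive
# covariate x is, in practice, observed only for the phase-II subsample.
n = 2000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)                 # expensive covariate
eta = -0.5 + 1.0 * x + 0.5 * z
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

pi = np.where(y == 1, 0.8, 0.2)                  # outcome-dependent phase-II rates
R = rng.binomial(1, pi)                          # phase-II selection indicator
w = R / pi                                       # inverse probability weights

X = np.column_stack([np.ones(n), x, z])

def ipw_score(beta):
    """Weighted logistic score: sum_i w_i * X_i * (y_i - expit(X_i beta)).
    Unselected subjects have weight zero, so their x never enters."""
    mu = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (w * (y - mu))

beta_hat = root(ipw_score, x0=np.zeros(3)).x
print(beta_hat.round(2))                         # roughly (-0.5, 1.0, 0.5)
```

    The optimal designs discussed in the abstract would choose the selection probabilities pi, within the design constraints, to minimise the asymptotic variance of the resulting estimator.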