    On sample selection models and skew distributions

    This thesis is concerned with methods for dealing with missing data in non-random samples and recurrent events data. The first part of the thesis is motivated by scores arising from questionnaires, which often follow asymmetric distributions on a fixed range. This can be due to scores clustering at one end of the scale or to selective reporting. Sometimes the scores are further subjected to sample selection, resulting in partial observability. Thus, methods based on complete cases for skewed data are inadequate for the analysis of such data, and a general sample selection model is required. Heckman proposed a full maximum likelihood estimation method under the normality assumption for sample selection problems, and parametric and non-parametric extensions have been proposed. A general selection distribution for a vector $Y \in \mathbb{R}^p$ has a PDF $f_Y$ given by $f_Y(y) = f_{Y^*}(y)\, P(S^* \in C \mid Y^* = y) / P(S^* \in C)$, where $S^* \in \mathbb{R}^q$ and $Y^* \in \mathbb{R}^p$ are two random vectors and $C$ is a measurable subset of $\mathbb{R}^q$. We use this generalization to develop a sample selection model with an underlying skew-normal distribution. A link is established between the continuous component of our model's log-likelihood function and an extended version of a generalized skew-normal distribution. This link is used to derive the expected value of the model, which extends Heckman's two-step method. The general selection distribution is also used to establish the closed skew-normal distribution as the continuous component of the usual multilevel sample selection models. Finite-sample performance of the maximum likelihood estimators of the models is studied via Monte Carlo simulation. The model parameters are more precisely estimated under the new models than under the Heckman selection model, even in the presence of moderate to extreme skewness. An application to data from a study of neck injuries, where the responses are substantially skewed, successfully discriminates between selection and inherent skewness, and the multilevel model is used to jointly analyze unit and item non-response. We also discuss computational and identification issues, and provide an extension of the model using copula-based sample selection models with truncated marginals.

    The second part of this thesis is motivated by studies that seek to analyze processes that generate events repeatedly over time. We consider the number of events per subject within a specified study period as the primary outcome of interest. One considerable challenge in the analysis of this type of data is the large proportion of patients who might discontinue before the end of the study, leading to partially observed data. Sophisticated sensitivity analysis tools are therefore necessary for the analysis of such data. We propose the use of two frequentist imputation methods for dealing with missing data in the recurrent events framework. The recurrent events are modeled as over-dispersed Poisson data with a constant rate function. Different assumptions about the future behavior of dropouts, depending on the reasons for dropout and the treatment received, are made and evaluated in a simulation study. We illustrate our approach with a clinical trial in patients who suffer from bladder cancer.
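
    The selection-distribution formula above is easy to verify by simulation. The Python sketch below (a toy check, not code from the thesis) takes the classical case p = q = 1 with C = (0, ∞) and (Y*, S*) bivariate standard normal with an assumed correlation of 0.6, and compares the empirical density of the selected outcomes against the formula; under normality the selected outcome follows a skew-normal density.

        # Toy check of f_Y(y) = f_{Y*}(y) P(S* in C | Y* = y) / P(S* in C)
        # for p = q = 1, C = (0, inf), (Y*, S*) bivariate normal, rho assumed.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        rho = 0.6
        draws = rng.multivariate_normal([0.0, 0.0], [[1, rho], [rho, 1]], size=200_000)
        y_star, s_star = draws[:, 0], draws[:, 1]
        y_obs = y_star[s_star > 0]            # Y observed only when S* lands in C

        # Theoretical selection density on a grid: S* | Y* = y ~ N(rho*y, 1 - rho^2)
        grid = np.linspace(-4, 4, 201)
        cond = stats.norm.sf(0, loc=rho * grid, scale=np.sqrt(1 - rho**2))
        f_sel = stats.norm.pdf(grid) * cond / stats.norm.sf(0)

        # The empirical density of the selected sample should track f_sel closely
        hist, edges = np.histogram(y_obs, bins=60, range=(-4, 4), density=True)
        centers = 0.5 * (edges[:-1] + edges[1:])
        print(np.abs(hist - np.interp(centers, grid, f_sel)).max())  # near zero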

    On the extended two-parameter generalized skew-normal distribution

    We propose a three-parameter skew-normal distribution, obtained by using hidden truncation on a skew-normal random variable. The hidden truncation framework permits direct interpretation of the model parameters. A link is established between the model and the closed skew-normal distribution.
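
    As a quick Python illustration of the hidden-truncation mechanism: this is the textbook construction of the basic skew-normal (the paper's extended version instead applies hidden truncation to a skew-normal variable, which is not reproduced here; the slant value is an arbitrary choice).

        # Hidden truncation: keep U1 only when a latent U0 falls below alpha*U1;
        # the retained sample follows the skew-normal density 2 phi(z) Phi(alpha z).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        alpha = 3.0                            # slant parameter (arbitrary choice)
        u0, u1 = rng.standard_normal((2, 500_000))
        z = u1[u0 < alpha * u1]

        grid = np.linspace(-3, 3, 121)
        sn_pdf = 2 * stats.norm.pdf(grid) * stats.norm.cdf(alpha * grid)
        hist, edges = np.histogram(z, bins=60, range=(-3, 3), density=True)
        centers = 0.5 * (edges[:-1] + edges[1:])
        print(np.abs(hist - np.interp(centers, grid, sn_pdf)).max())  # near zero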

    Predictive performance of penalized beta regression model for continuous bounded outcomes

    Prediction models for continuous bounded outcomes are often developed by fitting ordinary least-squares regression. However, predicted values from such a method may lie outside the admissible range, because the outcome is bounded within a fixed range and its expectation is nonlinear due to ceiling and floor effects at the bounds. Thus, regular regression models such as normal linear or nonlinear models are inadequate for predicting a bounded response variable, and the use of distributions that can model different shapes is essential. Beta regression, apart from modeling different shapes and constraining predictions to an admissible range, has been shown to be superior to alternative methods for data fitting, but not for prediction purposes. We take data structures into account and compare various penalized beta regression methods on predictive accuracy for bounded outcome variables using optimism-corrected measures. Contrary to results obtained in many regression contexts, the classical maximum likelihood method produced good predictive accuracy in terms of R2 and RMSE. The ridge-penalized beta regression performed better in terms of the g-index, a measure of the performance of the methods in external data sets. We restricted attention to prespecified models throughout; as such, variable selection methods are not evaluated.
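
    A minimal Python sketch of what a ridge-penalized beta regression can look like, assuming a logit link for the mean and a single fixed penalty weight (the paper's data structures, tuning, and optimism-corrected validation are not reproduced here):

        # Ridge-penalized beta regression sketch: mean via logit link, precision phi,
        # L2 penalty on the coefficients; data and lambda are illustrative only.
        import numpy as np
        from scipy import optimize, special, stats

        rng = np.random.default_rng(2)
        n, p = 200, 5
        X = rng.standard_normal((n, p))
        beta_true = np.array([0.8, -0.5, 0.0, 0.0, 0.3])
        mu = special.expit(X @ beta_true)
        y = rng.beta(mu * 20.0, (1 - mu) * 20.0)        # bounded outcome in (0, 1)

        def neg_pen_loglik(params, lam):
            beta, phi = params[:p], np.exp(params[p])   # log-parameterized precision
            m = special.expit(X @ beta)
            ll = stats.beta.logpdf(y, m * phi, (1 - m) * phi).sum()
            return -ll + lam * np.sum(beta ** 2)        # ridge penalty

        fit = optimize.minimize(neg_pen_loglik, np.zeros(p + 1), args=(1.0,))
        print(np.round(fit.x[:p], 2))                   # shrunken coefficients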

    A Sample Selection Model with Skew-normal Distribution

    Non-random sampling is a source of bias in empirical research. It is common for the outcomes of interest (e.g. the wage distribution) to be skewed in the source population. Sometimes the outcomes are further subjected to sample selection, a type of missing data, resulting in partial observability. Thus, methods based on complete cases for skewed data are inadequate for the analysis of such data, and a general sample selection model is required. Heckman proposed a full maximum likelihood estimation method under the normality assumption for sample selection problems, and parametric and non-parametric extensions have been proposed. We generalize the Heckman selection model to allow for underlying skew-normal distributions. Finite-sample performance of the maximum likelihood estimator of the model is studied via simulation. Applications illustrate the strength of the model in capturing spurious skewness in bounded scores, and in modelling data where a logarithm transformation could not mitigate the effect of inherent skewness in the outcome variable.
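
    For reference, a compact Python sketch of the classical two-step estimator that the skew-normal model generalizes (simulated data under joint normality; not the paper's code):

        # Heckman two-step on simulated data: probit for selection, then OLS of the
        # observed outcome on x plus the inverse Mills ratio. True IMR coef = 0.5.
        import numpy as np
        from scipy import optimize, stats

        rng = np.random.default_rng(3)
        n = 5_000
        w = rng.standard_normal(n)       # selection covariate (exclusion restriction)
        x = rng.standard_normal(n)
        u, e = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n).T
        s = 0.5 + w + u > 0              # selection indicator
        y = 1.0 + 2.0 * x + e            # outcome, observed only where s is True

        def probit_nll(g):               # step 1: probit of s on w
            idx = g[0] + g[1] * w
            return -(s * stats.norm.logcdf(idx) + ~s * stats.norm.logcdf(-idx)).sum()

        g = optimize.minimize(probit_nll, np.zeros(2)).x
        idx = g[0] + g[1] * w
        imr = stats.norm.pdf(idx) / stats.norm.cdf(idx)   # inverse Mills ratio

        Z = np.column_stack([np.ones(s.sum()), x[s], imr[s]])
        coef, *_ = np.linalg.lstsq(Z, y[s], rcond=None)   # step 2: corrected OLS
        print(np.round(coef, 2))         # approx [1.0, 2.0, 0.5]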

    Study on microstructure and mechanical properties of 304 stainless steel joints by TIG-MIG hybrid welding

    Abstract: Stainless steel is a family of Fe-based alloys with excellent resistance to corrosion, and as such it has been used extensively for kitchen utensils, transportation, building construction and much more. This paper presents work conducted on the material characterization of a TIG-MIG hybrid welded joint of type 304 austenitic stainless steel. The welding was conducted in three phases: MIG welding at a current of 170 A, TIG welding at a current of 190 A, and hybrid TIG-MIG welding at currents of 190/170 A respectively. The MIG, TIG, and hybrid TIG-MIG weldments were characterized by incomplete penetration, full penetration, and excess weld penetration. Intergranular austenite formed towards the transition zone and the HAZ. The delta ferrite (δ-Fe) formed in the microstructure of the TIG weld is thicker than that formed in the microstructures of the MIG and hybrid TIG-MIG welds. The TIG-MIG hybrid specimen welded at currents of 190/170 A has the highest UTS and percentage elongation, at 397.72 MPa and 35.7% respectively. TIG-MIG hybrid welding can be recommended for high-tech industrial applications such as the nuclear, aircraft, food processing, and automobile industries.

    On Lasso and adaptive Lasso for non-random sample in credit scoring

    Prediction models in credit scoring are often formulated using available data on accepted applicants at the loan application stage. The use of these data to estimate the probability of default (PD) may lead to bias due to non-random selection from the population of applicants. That is, the PD in the general population of applicants may not be the same as the PD in the subpopulation of accepted applicants. A prominent model for the reduction of bias in this framework is the sample selection model, but there is no consensus on its utility yet. It is unclear whether the bias-variance trade-off of regularization techniques can improve predictions of PD in the non-random sample selection setting. To address this, we propose the use of the Lasso and adaptive Lasso for variable selection and optimal predictive accuracy. By appealing to the least squares approximation of the likelihood function of the sample selection model, we optimize the resulting function subject to L1 and adaptively weighted L1 penalties using an efficient algorithm. We evaluate the performance of the proposed approach and competing alternatives in a simulation study and apply it to the well-known American Express credit card dataset.
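
    The adaptive Lasso itself reduces to an ordinary Lasso after rescaling, which the Python sketch below illustrates on simulated linear data (the paper applies the penalty to a least squares approximation of the selection likelihood, which is not reproduced here; data and penalty level are illustrative):

        # Adaptive Lasso via column rescaling: a weighted L1 penalty with weights
        # w_j = 1/|pilot_j| equals a plain Lasso on X with columns divided by w_j.
        import numpy as np
        from sklearn.linear_model import Lasso, LinearRegression

        rng = np.random.default_rng(4)
        X = rng.standard_normal((300, 8))
        beta = np.array([1.5, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
        y = X @ beta + rng.standard_normal(300)

        pilot = LinearRegression().fit(X, y).coef_   # consistent pilot estimate
        w = 1.0 / np.abs(pilot)                      # adaptive weights

        lasso = Lasso(alpha=0.05).fit(X / w, y)      # solve the rescaled problem
        print(np.round(lasso.coef_ / w, 2))          # back-transformed; zeros on noise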

    Regularization and variable selection in Heckman selection model

    Sample selection arises when the outcome of interest is partially observed in a study. A common challenge is the requirement for exclusion restrictions. That is, some of the covariates affecting the missingness mechanism do not affect the outcome. The drive to establish this requirement often leads to the inclusion of irrelevant variables in the model. A suboptimal solution is the use of classical variable selection criteria such as AIC and BIC, and traditional variable selection procedures such as stepwise selection. These methods are unstable when there is limited expert knowledge about which variables to include in the model. To address this, we propose the use of the adaptive Lasso for simultaneous variable selection and parameter estimation in the selection and outcome submodels in the absence of exclusion restrictions. Using the maximum likelihood estimator of the sample selection model, we construct a loss function equal, up to a constant, to a least squares regression problem, and minimize its penalized version with an efficient algorithm. We show that the estimator, with a proper choice of regularization parameter, is consistent and possesses the oracle properties. The method is compared to the Lasso and the adaptively weighted L1-penalized two-step method. We applied the methods to the well-known Ambulatory Expenditure Data.
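
    In symbols, a penalized objective of this kind plausibly takes the least-squares-approximation form below, where θ̂ is the unpenalized maximum likelihood estimate of the selection model and Σ̂ an estimate of its covariance matrix (notation assumed for illustration, not taken from the paper); in LaTeX:

        % Sketch of a penalized least-squares approximation of the selection
        % likelihood (assumed form, following the abstract's description)
        \min_{\theta}\; (\theta - \hat\theta)^{\top} \hat\Sigma^{-1} (\theta - \hat\theta)
          + \lambda_n \sum_{j} \hat{w}_j \lvert \theta_j \rvert,
        \qquad \hat{w}_j = \lvert \hat\theta_j \rvert^{-\gamma}, \; \gamma > 0 .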

    Prediction of default probability by using statistical models for rare events

    Prediction models in credit scoring usually involve data sets with a highly imbalanced distribution of the event of interest (default). Logistic regression, which is widely used to estimate the probability of default (PD), often suffers from separation when the event of interest is rare, and consequently from poor predictive performance for the minority class in small samples. A common solution is to discard majority class examples, to duplicate minority class examples, or to use a combination of both to balance the data. These methods may overfit the data. It is unclear how penalized regression models such as Firth's estimator, which reduces bias and mean-square error relative to classical logistic regression, perform in modelling PD. We review some methods for class-imbalanced data and compare them in a simulation study using the Taiwan credit card data. We emphasize the effect of events per variable on developing an accurate model, an often neglected concept in PD modelling. The data balancing techniques considered are the random oversampling examples and synthetic minority oversampling technique methods. The results indicate that the synthetic minority oversampling technique improved the predictive accuracy of PD regardless of sample size. Among the penalized regression models analysed, the log-F prior and ridge regression methods are preferred.
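
    A minimal Python sketch of this kind of pipeline, using SMOTE to balance the training data and ridge-type shrinkage via the C parameter of scikit-learn's logistic regression (Firth and log-F prior fits need specialized code and are not shown; the data are synthetic, not the Taiwan data):

        # Compare plain vs ridge-shrunk logistic regression on SMOTE-balanced data.
        import numpy as np
        from imblearn.over_sampling import SMOTE
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2_000, weights=[0.97, 0.03],
                                   random_state=0)            # ~3% events
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

        for C in (1e6, 1.0):             # C = 1e6 ~ unpenalized, C = 1 ~ ridge
            model = LogisticRegression(C=C, max_iter=1_000).fit(X_bal, y_bal)
            auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            print(f"C={C:g}: test AUC = {auc:.3f}")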

    A robust imputation method for missing responses and covariates in sample selection models

    Sample selection arises when the outcome of interest is partially observed in a study. Although sophisticated statistical methods in the parametric and non-parametric framework have been proposed to solve this problem, it is still unclear how to deal with selectively missing covariate data using simple multiple imputation techniques, especially in the absence of exclusion restrictions and under deviations from normality. Motivated by the 2003–2004 NHANES data, where previous authors have studied the effect of socio-economic status on blood pressure with missing data on the income variable, we propose the use of a robust imputation technique based on the selection-t sample selection model. The imputation method, developed within the frequentist framework, is compared with competing alternatives in a simulation study. The results indicate that the robust alternative is not susceptible to the absence of exclusion restrictions – a property inherited from the parent selection-t model – and performs better than models based on the normality assumption even when the data are generated from a normal distribution. Applications to missing outcome and covariate data further corroborate the robustness properties of the proposed method. We implemented the proposed approach within the MICE environment in R Statistical Software.
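
    As a toy Python illustration of frequentist multiple imputation with heavier-tailed draws, a crude stand-in for the selection-t imputer: missingness here is completely at random rather than selective, and a proper implementation would also re-draw the regression parameters across imputations.

        # Multiple imputation sketch: fit on observed cases, impute missing y from
        # a t predictive distribution, pool estimates over M completed data sets.
        import numpy as np

        rng = np.random.default_rng(5)
        n = 1_000
        x = rng.standard_normal(n)
        y = 1.0 + 0.5 * x + rng.standard_t(4, size=n)   # heavy-tailed errors
        miss = rng.random(n) < 0.3                      # 30% of y missing (MCAR)
        obs = ~miss

        slopes = []
        for _ in range(20):                             # M = 20 imputations
            coef = np.polyfit(x[obs], y[obs], 1)        # [slope, intercept], observed cases
            scale = (y[obs] - np.polyval(coef, x[obs])).std(ddof=2)
            y_imp = y.copy()
            # robust draws; proper MI would also perturb coef (omitted for brevity)
            y_imp[miss] = (np.polyval(coef, x[miss])
                           + scale * rng.standard_t(4, size=miss.sum()))
            slopes.append(np.polyfit(x, y_imp, 1)[0])

        print(np.mean(slopes), np.std(slopes))  # pooled slope, between-imputation spread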

    A unified approach to multilevel sample selection models

    We propose a unified approach to multilevel sample selection models using a generalized result on skew distributions arising from selection. If the underlying distributional assumption is normal, then the resulting density for the outcome is the continuous component of the sample selection density and has links with the closed skew-normal (CSN) distribution. The CSN distribution provides a framework which simplifies the derivation of the conditional expectation of the observed data. This generalizes Heckman's two-step method to a multilevel sample selection model. Finite-sample performance of the maximum likelihood estimator of this model is studied through a Monte Carlo simulation.
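
    For orientation, the single-level special case of this conditional expectation is Heckman's familiar correction, which the CSN result extends to multiple levels of selection; in LaTeX:

        % Single-level case generalized by the CSN framework: lambda = phi/Phi is
        % the inverse Mills ratio evaluated at the probit selection index w'gamma.
        \mathbb{E}\,[\,Y \mid S^{*} > 0\,]
          = x^{\top}\beta + \rho\,\sigma\,\lambda(w^{\top}\gamma),
        \qquad \lambda(t) = \phi(t)/\Phi(t) .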