47 research outputs found

    CIRA annual report FY 2017/2018

    Get PDF
    Reporting period: April 1, 2017 to March 31, 2018

    Data Analysis of Medical Images: CT, MRI, Phase Contrast X-ray and PET

    Get PDF

    Statistical methods for handling incomplete longitudinal data with emphasis on discrete outcomes with application.

    Get PDF
    Doctor of Philosophy in Statistics, University of KwaZulu-Natal, Pietermaritzburg, 2017.

    In longitudinal studies, measurements are taken repeatedly over time on the same experimental unit, so these measurements are correlated. The variances of repeated measures change with respect to time, and these variations, together with the potential correlation patterns, produce a complicated variance structure for the measures. Standard regression and analysis of variance techniques may result in invalid inference because they entail mathematical assumptions that do not hold for repeated measures data. Coupled with the repeated nature of the measurements, these datasets are often imbalanced due to missing data. The methods used should be capable of handling the incomplete nature of the data, with the ability to capture the reasons for missingness in the analysis. This thesis seeks to investigate and compare analysis methods for incomplete correlated data, with primary emphasis on discrete longitudinal data. The thesis adopts the general taxonomy of longitudinal models, including marginal, random effects, and transitional models.

    Although the objective is to deal with discrete data, the thesis starts with one continuous data case. Chapter 2 presents a comparative analysis of how to handle longitudinal continuous outcomes with dropouts missing at random; inverse probability weighted generalized estimating equations (GEEs) and multiple imputation (MI) are compared. In Chapter 3, the weighted GEE is compared to GEE after MI (MI-GEE) in the analysis of correlated count outcome data in a simulation study. Chapter 4 deals with MI in the handling of ordinal longitudinal data with dropouts on the outcome; the MI strategies of multivariate normal imputation (MNI) and fully conditional specification (FCS) are compared both in a simulation study and in a real data application. In Chapter 5, still focusing on ordinal outcomes, the thesis presents a simulation and a real data application to compare complete case analysis with advanced methods: direct likelihood analysis, MNI, FCS, and an ordinal imputation method. Finally, in Chapter 6, cumulative logit ordinal transition models are used to investigate the dependence of current, possibly incomplete, responses on past responses; transitions from one response state to another over time are of interest.
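    The inverse probability weighted GEE of Chapter 2 builds on the standard weighted estimating equation of Robins and colleagues. The display below is a generic textbook sketch with illustrative notation, not the thesis's own, and assumes the dropout probabilities are estimated from a separate missingness model such as a logistic regression on observed history.

    \[
      \sum_{i=1}^{N} D_i^\top V_i^{-1} W_i \bigl( y_i - \mu_i(\beta) \bigr) = 0,
      \qquad
      W_i = \operatorname{diag}\!\left( \frac{R_{i1}}{\hat{\pi}_{i1}}, \ldots, \frac{R_{i n_i}}{\hat{\pi}_{i n_i}} \right),
    \]

    where D_i is the matrix of mean derivatives, V_i the working covariance, R_{ij} indicates that measurement j on subject i is observed, and \hat{\pi}_{ij} is its estimated probability of being observed. Weighting the residuals in this way removes the bias that an unweighted GEE incurs under dropout missing at random.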

    Vol. 7, No. 2 (Full Issue)

    Get PDF

    Research theme reports from April 1, 2019 - March 31, 2020

    Get PDF

    Assessing Predictive Performance: From Precipitation Forecasts over the Tropics to Receiver Operating Characteristic Curves and Back

    Get PDF
    Educated decision making involves two major ingredients: probabilistic forecasts for future events or quantities, and an assessment of predictive performance. This thesis focuses on the latter topic and illustrates its importance and implications from both theoretical and applied perspectives. Receiver operating characteristic (ROC) curves are key tools for the assessment of predictions for binary events. Despite their popularity and ubiquitous use, the mathematical understanding of ROC curves is still incomplete. We establish the equivalence between ROC curves and cumulative distribution functions (CDFs) on the unit interval and elucidate the crucial role of concavity in interpreting and modeling ROC curves. Under this essential requirement, the classical binormal ROC model is strongly inhibited in its flexibility, and we propose the novel beta ROC model as an alternative. For a class of models that includes the binormal and the beta model, we derive the large sample distribution of the minimum distance estimator. This allows for uncertainty quantification and statistical tests of goodness of fit or equal predictive ability. Turning to empirical examples, we analyze the suitability of both models and find empirical evidence for the increased flexibility of the beta model. A freely available software package called betaROC is currently being prepared for release for the statistical programming language R.

    Throughout the tropics, probabilistic forecasts for accumulated precipitation are of economic importance. However, it is largely unknown how skillful current numerical weather prediction (NWP) models are at timescales of one to a few days. For the first time, we systematically assess the quality of nine global operational NWP ensembles for three regions in northern tropical Africa, verifying against station and satellite-based observations for the monsoon seasons 2007-2014. All examined NWP models are uncalibrated and unreliable, in particular for high probabilities of precipitation, and underperform in the prediction of amount and occurrence of precipitation when compared to a climatological reference forecast. Statistical postprocessing corrects systematic deficiencies and realizes the full potential of ensemble forecasts. Postprocessed forecasts are calibrated and reliable and outperform raw ensemble forecasts in all regions and monsoon seasons. Disappointingly, however, their predictive performance is only equal to the climatological reference. This assessment is robust and holds for all examined NWP models, all monsoon seasons, accumulation periods of 1 to 5 days, and station and spatially aggregated satellite-based observations. Arguably, it implies that current NWP ensembles cannot translate information about the atmospheric state into useful information regarding the occurrence or amount of precipitation. We suspect convective parameterization as the likely cause of the poor performance of NWP ensemble forecasts, as it has been shown to be a first-order error source for the realistic representation of organized convection in NWP models.

    One may ask whether the poor performance of NWP ensembles is exclusively confined to northern tropical Africa or whether it applies to the tropics in general. In a comprehensive study, we assess the quality of two major NWP ensemble prediction systems (EPSs) for 1 to 5-day accumulated precipitation for ten climatic regions in the tropics and the period 2009-2017. In particular, we investigate their skill regarding the occurrence and amount of precipitation as well as the occurrence of extreme events. Both ensembles exhibit clear calibration problems and are unreliable and overconfident. Nevertheless, they are (slightly) skillful for most climates when compared to the climatological reference, except tropical and northern arid Africa and alpine climates. Statistical postprocessing corrects for the lack of calibration and reliability, and improves forecast quality. Postprocessed ensemble forecasts are skillful for most regions except those mentioned above.

    The lack of NWP forecast skill in tropical and northern arid Africa and alpine climates calls for alternative approaches to the prediction of precipitation. In a pilot study for northern tropical Africa, we investigate whether it is possible to construct skillful statistical models that rely on information about recent rainfall events. We focus on the prediction of the probability of precipitation and find clear evidence for its modulation by recent precipitation events. The spatio-temporal correlation of rainfall coincides with meteorological assumptions, is reasonably pronounced and stable, and allows us to construct meaningful statistical forecasts. We construct logistic regression based forecasts that are reliable, have a higher resolution than the climatological reference forecast, and yield an average improvement of 20% for northern tropical Africa and the period 1998-2014.
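    For orientation, the two ROC families discussed above are commonly written as follows; this is a generic sketch with illustrative parameter names, and the concavity conditions stated are the standard ones rather than results quoted from the thesis or the betaROC package.

    \[
      \mathrm{ROC}_{\mathrm{binormal}}(p) = \Phi\!\bigl( a + b\,\Phi^{-1}(p) \bigr),
      \qquad
      \mathrm{ROC}_{\mathrm{beta}}(p) = F_{\alpha,\beta}(p),
      \qquad p \in [0,1],
    \]

    where \Phi is the standard normal CDF and F_{\alpha,\beta} the CDF of a Beta(\alpha,\beta) distribution. The binormal curve is concave on the whole unit interval only in the one-parameter case b = 1, whereas the beta curve is concave whenever \alpha \le 1 \le \beta, which is the sense in which the concavity requirement inhibits the binormal family far more than the beta family.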

    Approaches for Outlier Detection in Sparse High-Dimensional Regression Models

    Get PDF
    Modern regression studies often encompass a very large number of potential predictors, possibly larger than the sample size, and sometimes growing with the sample size itself. This increases the chances that a substantial portion of the predictors is redundant, as well as the risk of data contamination. Tackling these problems is of utmost importance to facilitate scientific discoveries, since model estimates are highly sensitive both to the choice of predictors and to the presence of outliers. In this thesis, we contribute to this area by considering the problem of robust model selection in a variety of settings where outliers may arise both in the response and in the predictors. Our proposals simplify model interpretation, guarantee predictive performance, and allow us to study and control the influence of outlying cases on the fit.

    First, we consider the co-occurrence of multiple mean-shift and variance-inflation outliers in low-dimensional linear models. We rely on robust estimation techniques to identify outliers of each type, exclude mean-shift outliers, and use restricted maximum likelihood estimation to down-weight and accommodate variance-inflation outliers in the model fit.

    Second, we extend our setting to high-dimensional linear models. We show that mean-shift and variance-inflation outliers can be modeled as additional fixed and random components, respectively, and evaluated independently. Specifically, we perform feature selection and mean-shift outlier detection through a robust class of nonconcave penalization methods, and variance-inflation outlier detection through penalization of the restricted posterior mode. The resulting approach satisfies a robust oracle property for feature selection in the presence of data contamination, which allows the number of features to increase exponentially with the sample size, and detects truly outlying cases of each type with asymptotic probability one. This provides an optimal trade-off between a high breakdown point and efficiency.

    Third, focusing on high-dimensional linear models affected by mean-shift outliers, we develop a general framework in which L0-constraints coupled with mixed-integer programming techniques are used to perform simultaneous feature selection and outlier detection with provably optimal guarantees. In particular, we provide necessary and sufficient conditions for a robustly strong oracle property, where again the number of features can increase exponentially with the sample size, and prove optimality for parameter estimation and the resulting breakdown point.

    Finally, we consider generalized linear models and rely on logistic slippage to perform outlier detection and removal in binary classification. Here we use L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem of feature selection and outlier detection, and the framework again allows us to pursue optimality guarantees.

    For all the proposed approaches, we also provide computationally lean heuristic algorithms, tuning procedures, and diagnostic tools which help to guide the analysis. We consider several real-world applications, including the study of the relationships between childhood obesity and the human microbiome, and of the main drivers of honey bee loss. All methods developed and data used, as well as the source code to replicate our analyses, are publicly available.
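    The mean-shift formulation underlying the third contribution can be sketched as an L0-constrained program; the symbols below, including the budgets k_P and k_N, are illustrative rather than the thesis's own notation.

    \[
      \min_{\beta \in \mathbb{R}^p,\; \phi \in \mathbb{R}^n}
      \; \tfrac{1}{2} \lVert y - X\beta - \phi \rVert_2^2
      \quad \text{subject to} \quad
      \lVert \beta \rVert_0 \le k_P, \qquad \lVert \phi \rVert_0 \le k_N,
    \]

    where a nonzero entry \phi_i shifts the mean of case i and thereby flags it as an outlier. The two cardinality constraints make this a double combinatorial search over features and cases, which is exactly what the mixed-integer programming formulation solves with provable optimality guarantees.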

    On aspects of robustness and sensitivity in missing data methods

    Get PDF
    Missing data are common wherever statistical methods are applied in practice. They present a problem by demanding that additional untestable assumptions be made about the mechanism leading to the incompleteness of the data. Minimising the strength of these assumptions and assessing the sensitivity of conclusions to their possible violation constitute two important aspects of current research in this area. One attractive approach is the doubly robust (DR) weighting-based method proposed by Robins and colleagues: because it incorporates two models for the missing data process, inferences are valid when at least one model is correctly specified. The balance between robustness, efficiency and analytical complexity is a difficult one to strike, resulting in a split between the likelihood and multiple imputation (MI) school on the one hand and the weighting and DR school on the other.

    We propose a new method, doubly robust multiple imputation (DRMI), combining the convenience of MI with the robustness of the DR approach, and explore the use of our new estimator for non-monotone missing at random data, a setting in which, hitherto, estimators with the DR property have not been implemented. We apply the method to data from a clinical trial comparing type II diabetes drugs, where we also use MI as a tool to explore sensitivity to the missing at random assumption. Finally, we study DRMI in the longitudinal binary data setting and find that it compares favourably with existing methods.
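    For background on the DR property invoked above, a textbook augmented inverse probability weighted estimator of an outcome mean under missing-at-random data takes the following form; this is a generic sketch for illustration, not the DRMI estimator itself, and the model symbols are assumptions of the sketch.

    \[
      \hat{\mu}_{\mathrm{DR}}
      = \frac{1}{n} \sum_{i=1}^{n}
        \left[
          \frac{R_i Y_i}{\hat{\pi}(X_i)}
          - \frac{R_i - \hat{\pi}(X_i)}{\hat{\pi}(X_i)} \, \hat{m}(X_i)
        \right],
    \]

    where R_i indicates that Y_i is observed, \hat{\pi}(X_i) is an estimated model for the probability of observation, and \hat{m}(X_i) an estimated model for E[Y \mid X]. The estimator is consistent if either of the two working models is correctly specified, which is the double robustness that DRMI aims to carry over to the multiple imputation framework.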

    CIRA annual report FY 2016/2017

    Get PDF
    Reporting period: April 1, 2016 to March 31, 2017