
    Optimal Subsampling Designs Under Measurement Constraints

    We consider the problem of optimal subsample selection in an experimental setting where observing, or utilising, the full dataset for statistical analysis is practically infeasible. This may be due to, e.g., computational, economic, or even ethical cost constraints. As a result, statistical analyses must be restricted to a subset of the data, and choosing this subset in a manner that captures as much information as possible is essential. In this thesis we present a theory and framework for optimal design in general subsampling problems. The methodology is applicable to a wide range of settings and inference problems, including regression modelling, parametric density estimation, and finite population inference. We discuss the use of auxiliary information and sequential optimal design for the implementation of optimal subsampling methods in practice, and study the asymptotic properties of the resulting estimators. The proposed methods are illustrated and evaluated on three problem areas: subsample selection for optimal prediction in active machine learning (Paper I), optimal control sampling in the analysis of safety-critical events in naturalistic driving studies (Paper II), and optimal subsampling in a scenario-generation context for virtual safety assessment of an advanced driver assistance system (Paper III). In Paper IV we present a unified theory that encompasses and generalises the methods of Papers I-III and introduce a class of expected-distance-minimising designs with good theoretical and practical properties. In Papers I-III we demonstrate a sample size reduction of 10-50% with the proposed methods, compared to simple random sampling and traditional importance sampling methods, for the same level of performance. We also propose a novel class of invariant linear optimality criteria, which in Paper IV are shown to reach 90-99% D-efficiency with 90-95% lower computational demand.
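
    A minimal sketch of the kind of pipeline the thesis studies: draw a subsample under a Poisson design with unequal probabilities, then estimate with inverse-probability weighting. The residual-magnitude score from a pilot fit is an illustrative assumption, not the thesis's actual optimality criterion.

```python
# Sketch: unequal-probability Poisson subsampling with IPW estimation.
# The residual-based score is an assumed, illustrative design criterion.
import numpy as np

def poisson_sampling_probs(scores, n_expected):
    """Scale non-negative scores to probabilities summing to n_expected, capping at 1."""
    p = scores / scores.sum() * n_expected
    while (p > 1).any():
        capped = p >= 1
        p[capped] = 1.0
        remaining = n_expected - capped.sum()
        p[~capped] = scores[~capped] / scores[~capped].sum() * remaining
    return p

rng = np.random.default_rng(0)
N = 10_000
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)

# A small uniform pilot sample gives rough residual-based scores.
pilot = rng.choice(N, 200, replace=False)
beta_pilot = np.polyfit(x[pilot], y[pilot], 1)
scores = np.abs(y - np.polyval(beta_pilot, x)) + 1e-8

p = poisson_sampling_probs(scores, n_expected=500)
subsample = rng.random(N) < p                  # Poisson sampling design
# Inverse-probability weighting; np.polyfit expects sqrt-scale weights.
w = np.sqrt(1.0 / p[subsample])
beta_hat = np.polyfit(x[subsample], y[subsample], 1, w=w)
print(beta_hat)                                # close to the full-data fit
```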

    Unequal Probability Sampling in Active Learning and Traffic Safety

    This thesis addresses a problem arising in large and expensive experiments where incomplete data come in abundance but statistical analyses require the collection of additional, costly information. Out of practical and economic considerations, it is necessary to restrict the analysis to a subset of the original database, which inevitably causes a loss of valuable information; choosing this subset in a manner that captures as much of the available information as possible is therefore essential. Using finite population sampling methodology, we address the issue of appropriate subset selection. We show how sample selection may be optimised to maximise precision in estimating various parameters and quantities of interest, and extend the existing finite population sampling methodology to an adaptive, sequential sampling framework in which the information required for sampling-scheme optimisation may be updated iteratively as more data are collected. The implications of model misspecification are discussed, and the robustness of the finite population sampling methodology against model misspecification is highlighted. The proposed methods are illustrated and evaluated on two problems: subset selection for optimal prediction in active learning (Paper I), and optimal control sampling for the analysis of safety-critical events in naturalistic driving studies (Paper II). It is demonstrated that optimised sample selection may reduce the number of records for which complete information needs to be collected by as much as 50%, compared to conventional methods and uniform random sampling.
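
    The finite population machinery the thesis builds on can be summarised by the Horvitz-Thompson estimator: each sampled unit is weighted by the inverse of its inclusion probability. A hedged sketch, assuming size-proportional probabilities derived from an auxiliary variable:

```python
# Sketch: Horvitz-Thompson estimation under Poisson sampling.
# The size-proportional design is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
N = 5_000
aux = rng.lognormal(size=N)            # auxiliary variable, known for all units
y = 3.0 * aux + rng.normal(size=N)     # study variable, observed only if sampled

n_expected = 300
p = np.minimum(1.0, n_expected * aux / aux.sum())   # inclusion probabilities
sampled = rng.random(N) < p                         # Poisson sampling

# Weighting sampled units by 1/p gives an unbiased estimate of the total.
t_hat = np.sum(y[sampled] / p[sampled])
print(t_hat, y.sum())
```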

    Optimal subsampling designs

    Subsampling is commonly used to overcome computational and economic bottlenecks in the analysis of finite populations and massive datasets. Existing methods are often limited in scope and use optimality criteria (e.g., A-optimality) with well-known deficiencies, such as a lack of invariance to the measurement scale of the data and the parameterisation of the model. A unified theory of optimal subsampling design is still lacking. We present a theory of optimal design for general data subsampling problems, including finite population inference, parametric density estimation, and regression modelling. Our theory encompasses and generalises most existing methods in the field of optimal subdata selection based on unequal probability sampling and inverse probability weighting. We derive optimality conditions for a general class of optimality criteria and present corresponding algorithms for finding optimal sampling schemes under Poisson and multinomial sampling designs. We present a novel class of transformation- and parameterisation-invariant linear optimality criteria that enjoy the best of two worlds: the computational tractability of A-optimality and invariance properties similar to those of D-optimality. The methodology is illustrated on an application in the traffic safety domain. In our experiments, the proposed invariant linear optimality criteria achieve 92-99% D-efficiency with 90-95% lower computational demand. In contrast, the A-optimality criterion attains only 46% and 60% D-efficiency on two of the examples.
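
    For concreteness, here is a hedged sketch of one well-known member of the family the paper generalises: A-optimality-motivated subsampling for logistic regression under inverse probability weighting, with scores |y - mu| * ||M^-1 x|| (an OSMAC-style rule from the subdata-selection literature). The two-step pilot recipe is an assumption for illustration; the paper's invariant linear criteria are not reproduced here.

```python
# Sketch: A-optimality-driven subsampling scores for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson for logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = sigmoid(X @ beta)
        W = mu * (1 - mu) + 1e-9
        H = (X * W[:, None]).T @ X                 # Hessian
        beta += np.linalg.solve(H, X.T @ (y - mu))
    return beta

rng = np.random.default_rng(2)
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = (rng.random(N) < sigmoid(X @ np.array([0.5, 1.0, -1.0]))).astype(float)

# Step 1: pilot estimate from a uniform subsample.
pilot = rng.choice(N, 500, replace=False)
beta_pilot = fit_logistic(X[pilot], y[pilot])

# Step 2: scores |y_i - mu_i| * ||M^-1 x_i||, then Poisson probabilities.
mu = sigmoid(X @ beta_pilot)
M = (X * (mu * (1 - mu))[:, None]).T @ X / N       # information matrix
scores = np.abs(y - mu) * np.linalg.norm(X @ np.linalg.inv(M), axis=1)
p = np.minimum(1.0, 1_000 * scores / scores.sum())
print(p.max(), p.sum())                            # expected subsample size ~1,000
```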

    Optimal sampling in unbiased active learning

    A common belief in unbiased active learning is that, in order to capture the most informative instances, the sampling probabilities should be proportional to the uncertainty of the class labels. We argue that this produces suboptimal predictions and present sampling schemes for unbiased pool-based active learning that minimise the actual prediction error, demonstrating better predictive performance than competing methods on a number of benchmark datasets. In contrast, both probabilistic and deterministic uncertainty sampling performed worse than simple random sampling on some of the datasets.
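
    The mechanics of unbiased pool-based active learning can be sketched as follows: whatever scheme generates the sampling probabilities, weighting the loss by their inverses keeps the fit targeted at the full-pool prediction error. The uncertainty-proportional probabilities below are the baseline scheme the paper argues against, shown only to illustrate the weighting; the dataset and model are assumptions.

```python
# Sketch: unbiased pool-based active learning via IPW-weighted fitting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
N = 10_000
X = rng.normal(size=(N, 2))                                # unlabelled pool
y = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(int)   # labels (hidden until queried)

# Pilot model from a small uniformly labelled seed set.
seed = rng.choice(N, 100, replace=False)
pilot = LogisticRegression().fit(X[seed], y[seed])

# Uncertainty-proportional probabilities: the baseline scheme.
mu = pilot.predict_proba(X)[:, 1]
scores = mu * (1 - mu) + 1e-8
p = np.minimum(1.0, 500 * scores / scores.sum())
labelled = rng.random(N) < p

# Inverse-probability weights keep the empirical loss unbiased for the
# full-pool loss, regardless of which scheme produced p.
model = LogisticRegression().fit(X[labelled], y[labelled],
                                 sample_weight=1.0 / p[labelled])
print(model.coef_)
```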

    Optimization of Two-Phase Sampling Designs with Application to Naturalistic Driving Studies

    Naturalistic driving studies (NDS) generate tremendous amounts of traffic data and constitute an important component of modern traffic safety research. However, analysis of an entire NDS database is rarely feasible, as it often requires expensive and time-consuming annotation of video sequences. We describe how automatic measurements, readily available in an NDS database, may be utilised to select the time segments for annotation that are most informative with regard to detecting potential associations between driving behaviour and a subsequent safety-critical event. The methodology is illustrated and evaluated on data from a large naturalistic driving study, showing that optimised instance selection may reduce the number of segments that need to be annotated by as much as 50%, compared to simple random sampling.
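
    A toy sketch of the two-phase idea, assuming a hypothetical automatic measurement ('kinematic_score') available for every segment: phase-one measurements determine the annotation probabilities, and phase-two analyses weight annotated segments by their inverses.

```python
# Sketch: two-phase selection of segments for annotation.
# 'kinematic_score' and the probability rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
N = 50_000                                   # candidate video segments
kinematic_score = rng.gamma(2.0, 1.0, N)     # phase 1: automatic measurement

# Phase 2: annotate an expected 1,000 segments, oversampling those whose
# automatic measurements suggest higher information content.
p = np.minimum(1.0, 1_000 * kinematic_score / kinematic_score.sum())
annotate = rng.random(N) < p

# Downstream analyses weight each annotated segment by 1/p so that
# estimates refer to the full segment database.
weights = 1.0 / p[annotate]
print(annotate.sum(), weights[:5])
```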

    Variables associated with insulin production in persons with type 2 diabetes treated with multiple daily insulin injections

    Using data from the MDI-liraglutide study, we evaluated variables associated with endogenous insulin production in persons with type 2 diabetes treated with multiple daily insulin injections, by relating C-peptide, proinsulin, and the proinsulin/C-peptide ratio at baseline to other baseline variables. Lower insulin production was related to longer diabetes duration, shorter abdominal sagittal diameter, and greater glycaemic variability.

    Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples

    Data subsampling has become widely recognised as a tool for overcoming computational and economic bottlenecks in the analysis of massive datasets and measurement-constrained experiments. However, traditional subsampling methods often suffer from a lack of information at the design stage. We propose an active sampling strategy that iterates between estimation and data collection with optimal subsamples, guided by machine learning predictions on yet unseen data. The method is illustrated on virtual simulation-based safety assessment of advanced driver assistance systems. Substantial performance improvements were observed compared to traditional sampling methods.
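
    A condensed sketch of such an active sampling loop, alternating between a machine-learning surrogate fitted to the scenarios simulated so far and a new optimally weighted batch. The random-forest surrogate, the |prediction|-based criterion, and the batch sizes are illustrative assumptions, and the final weighted estimate is simplified relative to the paper's sequential estimator.

```python
# Sketch: active sampling loop with an ML surrogate (assumptions noted above).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
N = 5_000
X = rng.uniform(size=(N, 4))                 # scenario parameters

def run_simulation(idx):
    """Expensive simulator stub; returns the safety outcome per scenario."""
    return np.sin(3 * X[idx, 0]) + 0.1 * rng.normal(size=idx.size)

labelled = np.zeros(N, dtype=bool)
y = np.full(N, np.nan)
w = np.zeros(N)                              # inverse-probability weights

idx = rng.choice(N, 100, replace=False)      # initial uniform wave
y[idx] = run_simulation(idx)
labelled[idx] = True
w[idx] = N / 100

for wave in range(5):
    surrogate = RandomForestRegressor(n_estimators=50, random_state=wave)
    surrogate.fit(X[labelled], y[labelled])
    scores = np.abs(surrogate.predict(X)) + 1e-6   # illustrative criterion
    scores[labelled] = 0.0                         # never resample a scenario
    p = np.minimum(1.0, 100 * scores / scores.sum())
    new = (rng.random(N) < p) & ~labelled
    y[new] = run_simulation(np.where(new)[0])
    labelled[new] = True
    w[new] = 1.0 / p[new]

# A simplified weighted estimate of the mean outcome; the paper's
# estimator treats the sequential dependence between waves rigorously.
print(np.sum(w[labelled] * y[labelled]) / N)
```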

    Chest X-rays are less sensitive than multiple breath washout examinations when it comes to detecting early cystic fibrosis lung disease

    Aim: Annual chest X-rays are recommended as routine surveillance to track cystic fibrosis (CF) lung disease. The aim of this study was to investigate the clinical utility of chest X-rays for tracking CF lung disease. Methods: Children at Gothenburg's CF centre who underwent chest X-ray, multiple breath washout and chest computed tomography examinations between 1996 and 2016 were included in the study. Chest X-rays were interpreted with the Northern Score (NS). We compared NS to the lung clearance index (LCI) and to structural lung damage measured by computed tomography, using a logistic regression model. Results: A total of 75 children were included over a median period of 13 years (range: 3.0-18.0 years). The proportion of children with an abnormal NS was significantly lower than the proportion with an abnormal LCI up to the age of 4 years (p < 0.05). A normal NS and a normal LCI at age 6 years were associated with a median (10th-90th percentile) total airway disease of 1.8% (0.4-4.7%) and bronchiectasis of 0.2% (0.0-1.5%). Conclusion: Chest X-rays were less sensitive than multiple breath washout examinations in detecting early CF lung disease. The combined results from both methods can be used as an indicator to perform chest computed tomography less frequently.

    Methodological approach for measuring the effects of organisational-level interventions on employee withdrawal behaviour

    Background: Theoretical frameworks have recommended organisational-level interventions to decrease employee withdrawal behaviours such as sickness absence and employee turnover. However, evaluations of such interventions have produced inconclusive results. The aim of this study was to investigate whether mixed-effects models, in combination with time series analysis, process evaluation, and reference group comparisons, could be used to evaluate the effects of an organisational-level intervention on employee withdrawal behaviour. Methods: Monthly data on employee withdrawal behaviours (sickness absence, employee turnover, employment rate, and unpaid leave) were collected for 58 consecutive months (before and after the intervention) for intervention and reference groups. In total, eight intervention groups with 1,600 employees participated in the intervention. Process evaluation data were collected by process facilitators from the intervention team. Overall intervention effects were assessed using mixed-effects models with an AR(1) covariance structure for the repeated measurements and time as a fixed effect. Intervention effects for each intervention group were assessed using time series analysis. Finally, the results were compared descriptively with data from the process evaluation and the reference groups to disentangle the organisational-level intervention effects from other simultaneous effects. Results: All measures of employee withdrawal behaviour exhibited statistically significant time trends and seasonal variability. Applying these methods to an organisational-level intervention revealed an overall decrease in employee withdrawal behaviour. Meanwhile, the intervention effects varied greatly between intervention groups, highlighting the need to perform analyses at multiple levels to obtain a full understanding. The results also indicated that possible delayed intervention effects must be considered, and that data from the process evaluation and reference group comparisons were vital for disentangling the intervention effects from other simultaneous effects. Conclusions: When analysing the effects of an intervention, time trends, seasonal variability, and other changes in the work environment must be considered. The use of mixed-effects models in combination with time series analysis, process evaluation, and reference groups is a promising way to improve the evaluation of organisational-level interventions and can easily be adopted by others.
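
    Models of this kind (random group levels, AR(1) serial correlation, time and season as fixed effects) are commonly fitted in R with nlme::lme and corAR1(). As a rough Python stand-in, the sketch below uses GEE with an autoregressive working correlation on simulated monthly data; the column names, effect sizes, and the GEE substitution are all assumptions.

```python
# Sketch: AR(1)-correlated monthly withdrawal data, analysed with GEE as a
# stand-in for a mixed-effects model with AR(1) residual covariance.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.cov_struct import Autoregressive

rng = np.random.default_rng(6)
rows = []
for g in range(8):                            # eight intervention groups
    level = rng.normal(5.0, 0.5)              # group-specific baseline
    e = 0.0
    for t in range(58):                       # 58 consecutive months
        e = 0.6 * e + rng.normal(0.0, 0.3)    # AR(1) disturbance
        rows.append({
            "group": g,
            "time": t,
            "post": int(t >= 29),             # before/after the intervention
            "month": t % 12,                  # seasonal factor
            "absence": level + 0.02 * t - 0.4 * (t >= 29) + e,
        })
df = pd.DataFrame(rows)

model = sm.GEE.from_formula(
    "absence ~ time + post + C(month)",
    groups="group", data=df,
    time=df["time"].to_numpy(), cov_struct=Autoregressive(),
)
print(model.fit().summary())
```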

    Cerebral biomarkers in neurological complications of preeclampsia

    BACKGROUND AND OBJECTIVES: There are no tools to accurately predict who is at risk of developing neurological complications of preeclampsia, and no objective methods to determine disease severity. We assessed whether plasma levels of the cerebral biomarkers neurofilament light (NfL), tau, and glial fibrillary acidic protein (GFAP) could reflect disease severity in various phenotypes of preeclampsia, and compared them to the angiogenic biomarkers soluble fms-like tyrosine kinase 1 (sFlt-1), placental growth factor (PlGF), and soluble endoglin (sEng). STUDY DESIGN: In this observational study, we included women from the South African PROVE biobank. Plasma samples taken at diagnosis (preeclampsia cases) or at admission for delivery (normotensive controls) were analyzed for concentrations of NfL, tau, GFAP, PlGF, sFlt-1, and sEng. Cerebrospinal fluid concentrations of inflammatory markers and albumin were analyzed in a subgroup of 15 women. Analyses were adjusted for gestational age, time from seizures and delivery to sampling, maternal age, and parity. RESULTS: Compared to 28 normotensive pregnancies, 146 women with preeclampsia demonstrated 2.18-fold higher plasma concentrations of NfL (95% CI 1.64-2.88), 2.17-fold higher tau (1.49-3.16), and 2.77-fold higher GFAP (2.06-3.72). In total, 72 women with neurological complications (eclampsia, cortical blindness, and stroke) demonstrated increased plasma concentrations of tau (2.99-fold higher, 95% CI 1.92-4.65) and GFAP (3.22-fold higher, 95% CI 2.06-5.02) compared to women with preeclampsia without pulmonary edema, HELLP, or neurological complications (n=31); angiogenic markers were also higher, but to a lesser extent. Women with hemolysis, elevated liver enzymes and low platelets (HELLP) syndrome (n=20) demonstrated increased plasma concentrations of NfL (1.64-fold higher, 95% CI 1.06-2.55), tau (4.44-fold higher, 95% CI 1.85-10.66), and GFAP (1.82-fold higher, 95% CI 1.32-2.50) compared to the same comparison group; no difference was seen in the angiogenic biomarkers. There was no difference between the 23 women with preeclampsia complicated by pulmonary edema and women with preeclampsia without pulmonary edema, HELLP, or neurological complications for any of the biomarkers. Plasma concentrations of tau and GFAP were increased in women with several neurological complications compared to eclampsia only. CONCLUSIONS: Plasma NfL, GFAP, and tau are candidate biomarkers for the diagnosis, and possibly prediction, of cerebral complications of preeclampsia.