263 research outputs found

    A comparison of strategies for missing values in data on machine learning classification algorithms

    Abstract: Dealing with missing values in data is an important feature engineering task in data science, because missing values can degrade the predictive accuracy of machine learning classification models. However, it is often unclear what the underlying cause of the missing values in real-life data is, that is, which missing data mechanism is causing the missingness. Thus, it becomes necessary to evaluate several missing data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values in data, namely listwise deletion, mean, mode, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations. The comparison is performed on two real-world datasets, using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristics, and the F1 score. Most classifiers performed well across the missing data strategies. However, based on the results obtained, the support vector classifier overall performed marginally better for the numerical data and the naïve Bayes classifier for the categorical data when compared to the other evaluated methods.
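
    The three simplest strategies named in this abstract (listwise deletion, mean imputation, mode imputation) can be sketched in a few lines. This is a minimal illustration and not the paper's code: the helper names and the toy rows are hypothetical, and missing entries are assumed to be encoded as None.

```python
from collections import Counter
from statistics import mean


def listwise_deletion(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column(values, strategy):
    """Fill None entries with the column mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "mode":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical toy data: (age, colour), with None marking a missing entry.
rows = [(20, "red"), (None, "blue"), (40, None), (30, "red")]

complete = listwise_deletion(rows)                       # keeps only fully observed rows
ages = impute_column([r[0] for r in rows], "mean")       # numeric column
colours = impute_column([r[1] for r in rows], "mode")    # categorical column
```

    Listwise deletion shrinks the dataset, while mean and mode imputation keep every row but pull the filled values toward the centre of the observed distribution.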

    Missing value estimation using clustering and deep learning within multiple imputation framework

    Missing values in tabular data restrict the use and performance of machine learning, requiring the imputation of missing values. Arguably the most popular imputation algorithm is multiple imputation by chained equations (MICE), which estimates missing values from linear conditioning on observed values. This paper proposes methods to improve both the imputation accuracy of MICE and the classification accuracy of imputed data by replacing MICE’s linear regressors with ensemble learning and deep neural networks (DNN). The imputation accuracy is further improved by characterizing individual samples with cluster labels (CISCL) obtained from the training data. Our extensive analyses of six tabular data sets with up to 80% missing values and three missingness types (missing completely at random, missing at random, missing not at random) reveal that ensemble or deep learning within MICE is superior to the baseline MICE (b-MICE), both of which are consistently outperformed by CISCL. Results show that CISCL + b-MICE outperforms b-MICE for all percentages and types of missing values. In most experimental cases, our proposed DNN-based MICE and gradient boosting MICE plus CISCL (GB-MICE-CISCL) outperform seven state-of-the-art imputation algorithms. The classification accuracy of GB-MICE-imputed data is further improved by our proposed GB-MICE-CISCL imputation method across all percentages of missing values. Results also reveal a shortcoming of the MICE framework at high percentages of missing values (50%) and when the missing type is not random. This paper provides a generalized approach to identifying the best imputation model for a tabular data set based on the percentage and type of missing values.
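
    The chained-equations idea behind the baseline described above can be sketched for two numeric columns: start from mean imputation, then repeatedly regress each column on the other and overwrite only the originally missing cells. This is a simplified, deterministic sketch under stated assumptions and not the paper's method: the function names are hypothetical, a single linear regressor stands in for the per-column model, and full MICE additionally draws from a predictive distribution and produces several imputed datasets.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def chained_imputation(col_a, col_b, sweeps=10):
    """Impute None entries in two columns by alternating linear regressions."""
    cols = [list(col_a), list(col_b)]
    missing = [[v is None for v in col] for col in cols]

    # Step 1: initialise every missing cell with its column mean.
    for col in cols:
        fill = mean(v for v in col if v is not None)
        col[:] = [fill if v is None else v for v in col]

    # Step 2: sweep over the columns, refitting on rows where the target is observed.
    for _ in range(sweeps):
        for target in (0, 1):
            other = 1 - target
            observed = [i for i, m in enumerate(missing[target]) if not m]
            a, b = fit_line([cols[other][i] for i in observed],
                            [cols[target][i] for i in observed])
            for i, is_missing in enumerate(missing[target]):
                if is_missing:
                    cols[target][i] = a + b * cols[other][i]
    return cols


# Hypothetical toy columns that follow y = 2x where both values are observed.
x = [1, 2, 3, 4, None]
y = [2, 4, 6, None, 10]
filled_a, filled_b = chained_imputation(x, y)
```

    Swapping `fit_line` for a gradient-boosting or neural regressor is the kind of substitution the abstract describes; the sweep structure stays the same.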

    Imputation, modelling and optimal sampling design for digital camera data in recreational fisheries monitoring

    Digital camera monitoring has evolved as an active application-oriented scheme to help address questions in areas such as fisheries, ecology, computer vision, artificial intelligence, and criminology. In recreational fisheries research, digital camera monitoring has become a viable option for probability-based survey methods, and is also used for corroborative and validation purposes. In comparison to onsite surveys (e.g. boat ramp surveys), digital cameras provide a cost-effective method of monitoring boating activity and fishing effort, including night-time fishing activities. However, there are challenges in the use of digital camera monitoring that need to be resolved. Notably, missing data problems and the cost of data interpretation are among the most pertinent. This study provides relevant statistical support to address these challenges of digital camera monitoring of boating effort, to improve its utility to enhance recreational fisheries management in Western Australia and elsewhere, with capacity to extend to other areas of application. Digital cameras can provide continuous recordings of boating and other recreational fishing activities; however, interruptions of camera operations can lead to significant gaps within the data. To fill these gaps, some climatic and other temporal classification variables were considered as predictors of boating effort (defined as number of powerboat launches and retrievals). A generalized linear mixed effect model built on a fully conditional specification multiple imputation framework was considered to fill in the gaps in the camera dataset. Specifically, the zero-inflated Poisson model was found to satisfactorily impute plausible values for missing observations for varied durations of outages in the digital camera monitoring data of recreational boating effort.
    Additional modelling options were explored to guide both short- and long-term forecasting of boating activity and to support management decisions in monitoring recreational fisheries. Autoregressive conditional Poisson (ACP) and integer-valued autoregressive (INAR) models were identified as useful time series models for predicting short-term behaviour of such data. In Western Australia, digital camera monitoring data that coincide with 12-month state-wide boat-based surveys (now conducted on a triennial basis) have been read, but the periods between the surveys have not been read. A Bayesian regression framework was applied to describe the temporal distribution of recreational boating effort using climatic and temporally classified variables to help construct data for such missing periods. This can potentially provide a useful cost-saving alternative for obtaining continuous time series data on boating effort. Finally, data from digital camera monitoring are often manually interpreted and the associated cost can be substantial, especially if multiple sites are involved. Empirical support for low-level monitoring schemes for digital cameras has been provided. It was found that manual interpretation of camera footage for 40% of the days within a year can be deemed an adequate level of sampling effort to obtain unbiased, precise and accurate estimates to meet broad management objectives. A well-balanced low-level monitoring scheme will ultimately reduce the cost of manual interpretation and produce unbiased estimates of recreational fishing indexes from digital camera surveys.
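
    The zero-inflated Poisson distribution mentioned in this abstract mixes a point mass at zero with an ordinary Poisson count. The sketch below shows its probability mass function and a simple sampler; it is an illustration only. The names `lam` (Poisson rate) and `pi` (extra-zero probability) are assumptions, and in the study these quantities would depend on covariates through the mixed model rather than being fixed constants.

```python
import math
import random


def zip_pmf(k, lam, pi):
    """P(Y = k): with probability pi the count is a structural zero,
    otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return (pi if k == 0 else 0.0) + (1 - pi) * poisson


def zip_sample(lam, pi, rng):
    """Draw one zero-inflated Poisson count using the generator rng."""
    if rng.random() < pi:
        return 0  # structural zero, e.g. a day with no launches at all
    # Knuth's multiplication method for a Poisson(lam) draw.
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


# Hypothetical parameters: rate of 2 launches per period, 30% extra zeros.
rng = random.Random(0)
draws = [zip_sample(2.0, 0.3, rng) for _ in range(1000)]
```

    Imputing a camera outage under such a model amounts to drawing counts like `draws` for the missing periods, conditional on the fitted rate and zero-inflation probability.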

    Cycling area can be a confounder and effect modifier of the association between helmet use and cyclists’ risk of death after a crash

    The effect of helmet use on reducing the risk of death in cyclists appears to be distorted by some variables (potential confounders, effect modifiers, or both). Our aim was to provide evidence for or against the hypothesis that cycling area may act as a confounder and effect modifier of the association between helmet use and risk of death of cyclists involved in road crashes. Data were analysed for 24,605 cyclists involved in road crashes in Spain. A multiple imputation procedure was used to mitigate the effect of missing values. We used multilevel Poisson regression with province as the group level to estimate the crude association between helmet use and risk of death, and also three adjusted analyses: (1) for cycling area only, (2) for the remaining variables which may act as confounders, and (3) for all variables. Incidence–density ratios (IDR) and their 95% confidence intervals were calculated. Crude IDR was 1.10, but stratifying by cycling area disclosed a protective, differential effect of helmet use: IDR = 0.67 in urban areas, IDR = 0.34 on open roads. Adjusting for all variables except cycling area yielded similar results in both strata, albeit with a smaller difference between them. Adjusting for cycling area only yielded a strong association (IDR = 0.42), which was slightly lower in the adjusted analysis for all variables (IDR = 0.45). Cycling area can act as a confounder and also appears to act as an effect modifier (albeit to a lesser extent) of the risk of cyclists’ death after a crash.
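
    The pattern reported here, a crude ratio above 1 alongside protective ratios within each stratum, is what confounding by a stratifying variable looks like. The sketch below reproduces that pattern with plain risk ratios. The counts are made up for illustration and are not the study's data, and simple risk ratios stand in for the multilevel Poisson incidence-density ratios used in the paper.

```python
def risk_ratio(deaths_exposed, n_exposed, deaths_unexposed, n_unexposed):
    """Risk of death among the exposed divided by risk among the unexposed."""
    return (deaths_exposed / n_exposed) / (deaths_unexposed / n_unexposed)


# Made-up counts as (deaths, cyclists). Helmet wearers are concentrated
# on open roads, where the baseline risk of death is higher.
strata = {
    "urban": {"helmet": (2, 100), "no_helmet": (30, 1000)},
    "open_road": {"helmet": (50, 1000), "no_helmet": (10, 100)},
}

# Stratum-specific ratios: helmet use looks protective in both areas.
stratum_rr = {
    area: risk_ratio(*groups["helmet"], *groups["no_helmet"])
    for area, groups in strata.items()
}

# Crude ratio: pool the strata first, ignoring cycling area.
pooled = {
    group: tuple(sum(strata[area][group][i] for area in strata) for i in range(2))
    for group in ("helmet", "no_helmet")
}
crude_rr = risk_ratio(*pooled["helmet"], *pooled["no_helmet"])
```

    With these counts the crude ratio exceeds 1 even though both stratum ratios are below 1, because cycling area is associated with both helmet use and the risk of death. The two stratum ratios also differ from each other, which is the signature of effect modification.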

    Rehabilitation and outcomes after complicated vs uncomplicated mild TBI: results from the CENTER-TBI study
