7 research outputs found

    Mind the large gap : novel algorithm using seasonal decomposition and elastic net regression to impute large intervals of missing data in air quality data

    Air quality data sets are widely used in numerous analyses. Missing values are ubiquitous in air quality data sets as the data are collected through sensors. Recovery of missing data is a challenging task in the data preprocessing stage. This task becomes more challenging in time series data as time is an implicit variable that cannot be ignored. Existing methods for handling missing data in time series perform well when the percentage of missing values is relatively low and the gap size is small, but their performance degrades for large gaps. This paper presents a novel algorithm based on seasonal decomposition and elastic net regression to impute large gaps in time series data when correlated variables exist. In imputing large gaps, this method outperforms several existing univariate approaches, namely Kalman smoothing on ARIMA models, Kalman smoothing on structural time series models, linear interpolation, and mean imputation. However, it is applicable only when one or more variables are correlated with the time series containing the large gaps.
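
    The idea in the abstract above (remove seasonality, then predict the gap from correlated variables with elastic net regression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the seasonal component is approximated by per-phase means instead of a full seasonal decomposition, the elastic net is fitted with a small hand-written coordinate descent, and all function names, the `period` argument and the default `alpha` and `l1_ratio` values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Soft-thresholding operator that arises from the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw - b||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    Returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred predictors
    yc = [v - y_mean for v in y]                                # centred response
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for i in range(n):
                # Residual with predictor j left out of the current fit.
                partial = yc[i] - sum(Xc[i][k] * w[k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
                z += Xc[i][j] ** 2
            denom = z / n + alpha * (1 - l1_ratio)
            w[j] = soft_threshold(rho / n, alpha * l1_ratio) / denom if denom > 0 else 0.0
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept


def seasonal_means(values, observed_idx, period):
    """Mean of `values` at each seasonal phase, using only the observed indices.
    Assumption: every phase has at least one observed value."""
    sums, counts = [0.0] * period, [0] * period
    for t in observed_idx:
        sums[t % period] += values[t]
        counts[t % period] += 1
    return [s / c for s, c in zip(sums, counts)]


def impute_large_gap(target, covariates, period, alpha=0.01, l1_ratio=0.5):
    """Fill None entries of `target` using fully observed, correlated `covariates`."""
    observed = [t for t, v in enumerate(target) if v is not None]
    missing = [t for t, v in enumerate(target) if v is None]
    y_season = seasonal_means(target, observed, period)
    x_seasons = [seasonal_means(c, observed, period) for c in covariates]

    def features(t):
        # Deseasonalized covariate values at time t.
        return [c[t] - s[t % period] for c, s in zip(covariates, x_seasons)]

    X = [features(t) for t in observed]
    y = [target[t] - y_season[t % period] for t in observed]
    w, b = fit_elastic_net(X, y, alpha, l1_ratio)

    filled = list(target)
    for t in missing:
        prediction = b + sum(wj * xj for wj, xj in zip(w, features(t)))
        filled[t] = y_season[t % period] + prediction  # add seasonality back
    return filled
```

    As the abstract notes, this kind of approach depends on having at least one correlated, observed variable across the gap; with no covariates there is nothing to regress on.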

    Imputing large gaps of high-resolution environment temperature

    Weather data are widely used in climatology and other environmental studies. One of the key challenges in preprocessing these data is dealing with missing values. Since the measurements are recorded through sensors, missing values are ubiquitous in weather variables such as environmental temperature. Even though there are well-established methods to impute missing values in univariate time series data, the need to develop improved methods for imputing large gaps persists. This paper compares the performance of ten existing methods in imputing missing values of hourly temperature data. Among the methods considered, Kalman smoothing on Auto-Regressive Integrated Moving Average (ARIMA) models and Kalman smoothing on structural time series models are the best methods for imputing missing values under the MCAR (Missing Completely at Random) mechanism with exponentially distributed missing values. Moreover, this paper proposes a novel method to impute large gaps of hourly temperature data using regularized regression models on deseasonalized data. This method outperforms all the other considered methods in imputing large gaps.

    Modelling environmental impact on public health using machine learning : case study on asthma

    Environmental conditions such as weather and pollution have direct links with public health. It is estimated that the global burden of disease attributed to environmental factors is 24%. A plethora of research has been carried out to investigate the links between the environment and public health. Most of them are clinical or experimental studies. In addition to the investigations of causal effects, it is always useful to study associations of weather and pollution with diseases to manage and mitigate the burden of diseases as well as other environmental issues holistically. Environmental conditions could be used to provide an alarm of a future episode of a disease such as asthma so that at-risk individuals can take precautions to minimize the risk. This case study on asthma applies several machine learning techniques to build a classification model that predicts the risk of future asthma episodes based on weather and pollution conditions. Support Vector Machine, Artificial Neural Network (ANN), Decision Tree and Random Forest models were considered for the classification. The Random Forest model produced the best performance compared to the other models, with an accuracy of 77%. The Decision Tree model exhibited the highest sensitivity, at 70%. Even though the ANN gave the lowest accuracy, at 59%, its learning curve shows a good fit.
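
    The accuracy and sensitivity figures quoted above can be read as the quantities computed below. This is a minimal sketch, assuming binary labels where 1 marks an asthma episode; the function name and the label encoding are illustrative and do not come from the paper.

```python
def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity) for binary class labels.

    accuracy    = correct predictions / all predictions
    sensitivity = true positives / (true positives + false negatives)
    """
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = correct / len(y_true)
    sensitivity = true_pos / (true_pos + false_neg)  # assumes at least one positive case
    return accuracy, sensitivity
```

    The two measures can rank models differently, which is consistent with the abstract reporting the best accuracy for one model and the highest sensitivity for another.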

    Comparison of imputation methods for missing values in air pollution data : case study on Sydney air quality index

    Missing values in air quality data may lead to a substantial amount of bias and inefficiency in modeling. In this paper, we discuss six methods for dealing with missing values in univariate time series and compare their performance. The methods we discuss here are Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman Smoothing on Autoregressive Integrated Moving Average (ARIMA) models. The performance of these methods was compared using three different measures: Mean Squared Error, Coefficient of Determination and the Index of Agreement. Kalman Smoothing on Structural Time Series is the best of the methods considered for imputing missing values in air quality data under the Missing Completely at Random (MCAR) mechanism. Kalman Smoothing on ARIMA and Exponentially Weighted Moving Average also perform considerably well. The performance of Spline Interpolation decreases drastically as the percentage of missing values increases. Mean Imputation performs reasonably well for smaller percentages of missing values; however, all the other methods outperform Mean Imputation regardless of the number of missing values.
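
    The three performance measures named above can be sketched as follows. The formulas are assumptions about the definitions used: the coefficient of determination is taken as 1 - SS_res/SS_tot (some studies use the squared correlation instead), and the index of agreement follows Willmott's original form. Function names are illustrative.

```python
def mean_squared_error(observed, imputed):
    """Average squared difference between true and imputed values."""
    return sum((o - p) ** 2 for o, p in zip(observed, imputed)) / len(observed)


def coefficient_of_determination(observed, imputed):
    """R^2 = 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, imputed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def index_of_agreement(observed, imputed):
    """Willmott's d = 1 - sum((P - O)^2) / sum((|P - mean(O)| + |O - mean(O)|)^2)."""
    mean_obs = sum(observed) / len(observed)
    numerator = sum((p - o) ** 2 for o, p in zip(observed, imputed))
    denominator = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in zip(observed, imputed)
    )
    return 1.0 - numerator / denominator
```

    In an evaluation of this kind, `observed` would hold the true values that were artificially removed and `imputed` the values a method filled in at the same positions.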

    Data exploration and pre-processing techniques on air pollution and meteorological data in Sydney region

    Data preparation typically consumes 80-90% of the total time taken to complete a data mining project. It is a crucial step, as the performance of any model depends heavily on the quality of its input data ("garbage in, garbage out"). A major issue in the pre-processing stage of a large dataset is missing values. Air pollution and meteorological data typically contain many missing values. Proper imputation should be carried out to avoid any bias caused by missing values. The main objective of this study was to propose suitable techniques for preprocessing air pollution and meteorological data in the Sydney region, Australia. The dataset consists of hourly measurements of air pollution and meteorological variables from 1994-01-01 01:00:00 AEST (Australian Eastern Standard Time) to 2018-12-31 24:00:00 AEST recorded at each station in the Sydney region. The preprocessed data can be used in spatiotemporal analysis to assess the impact of climate change on different health aspects. Principal Component Analysis (PCA) was used to analyze the relationships between variables. Highly positively correlated variable groups were [CO, NO, NO2], [O3, temperature, wind speed], [Visibility, PM2.5, PM10] and [wind direction, humidity]. Humidity was highly negatively correlated with O3 and temperature. Further, 82% of the total variation is explained by the first five principal components. Six well-established techniques for imputing missing values in time series data were compared: Mean Imputation, Spline Interpolation, Simple Moving Average, Exponentially Weighted Moving Average, Kalman Smoothing on Structural Time Series Models and Kalman Smoothing on Autoregressive Integrated Moving Average (ARIMA) models. The imputation method based on Kalman Smoothing on Structural Time Series models performed better than the other methods for missing values under the Missing Completely at Random (MCAR) mechanism for the data obtained in the Sydney area.

    Air quality data pre-processing : a novel algorithm to impute missing values in univariate time series

    Missing values are ubiquitous in air pollution datasets as the data are collected through sensors. Preprocessing these data plays a vital role in obtaining accurate results in the downstream analyses. This task becomes even more challenging as time is an implicit variable that cannot be ignored. Existing methods that deal with missing data in time series perform reasonably well in situations where the percentage of missing values is relatively low and the gap size is small. However, robust methods are still needed, particularly for large gaps. This paper proposes a novel algorithm (FBReg) to impute univariate air pollution variables by applying a bi-directional method based on regularized regression models. The performance of the method is evaluated against two baseline models, Mean imputation and Last observation carried forward (LOCF), as well as two well-established methods, Kalman smoothing on structural time series models and Kalman smoothing on ARIMA (Auto-Regressive Integrated Moving Average) models. The proposed algorithm outperforms the considered methods and exhibits consistent performance with exponentially distributed missing values under the MCAR (Missing Completely at Random) mechanism, as well as with large gaps.
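
    The two baseline models mentioned above are simple enough to sketch directly. This illustrates only the baselines, not FBReg itself, whose details the abstract does not give. It assumes missing values are encoded as `None`; the function names are illustrative.

```python
def mean_impute(series):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in series]


def locf_impute(series):
    """Last observation carried forward: reuse the most recent observed value.

    Missing values before the first observation have nothing to carry
    forward and are left as None.
    """
    filled, last_seen = [], None
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled
```

    Both baselines ignore the shape of the series inside a gap: mean imputation fills a flat line at the overall mean and LOCF a flat line at the last observed level, so neither can follow a series that keeps changing through a long gap.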

    Space and time data exploration of air quality based on PM10 sensor data in Greater Sydney 2015-2021

    Exposure to polluted air is associated with numerous adverse health effects for the general population. Therefore, it is important to monitor ambient air pollution, which plays a key role in measuring the quality of the air we breathe. Particulate matter in the air with a diameter of 10 μm or less (PM10) is one of the important measurements of air quality. This paper presents a comprehensive space-time data exploration of daily PM10 measurements collected through sensors in the Greater Sydney region from 1 January 2015 to 31 December 2021, and a clustering of air pollution monitoring sites based on Dynamic Time Warping (DTW) distance. According to the results, air quality was good on most days at all the sites considered. The modes of the daily PM10 levels varied spatially. Oakdale recorded the lowest mode in all the years considered. During the study period, daily PM10 levels exceeded the national air quality standards mostly in the autumn season. After 2020, the number of exceedances was reduced for all the monitoring sites except Campbelltown West and Liverpool. Further examination is needed to identify the reasons behind these exceedances. Clustering indicates four possible groups of sites according to the behaviour of the PM10 sensor data. The four clusters are Randwick-Chullora-Earlwood, Liverpool-Prospect, Bringelly and Richmond-Campbelltown West-Camden-Bargo-Oakdale.
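
    The DTW distance used for the site clustering above can be sketched with the classic dynamic-programming recurrence. This is a minimal version, assuming an absolute-difference local cost and no warping-window constraint; the abstract does not state which variant or which clustering algorithm was used, and the function name is illustrative.

```python
def dtw_distance(series_a, series_b):
    """Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first i
    points of series_a with the first j points of series_b.
    """
    inf = float("inf")
    n, m = len(series_a), len(series_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a point of series_b is matched again
                cost[i][j - 1],      # a point of series_a is matched again
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]
```

    Because the alignment may stretch or compress time, two sites whose PM10 series have the same shape but are slightly shifted get a small distance. A pairwise distance matrix built this way could then be passed to a clustering method of choice.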