20 research outputs found

    Multi-objective evolutionary GAN for tabular data synthesis

    Synthetic data has a key role to play in data sharing by statistical agencies and other generators of statistical data products. Generative Adversarial Networks (GANs), typically applied to image synthesis, are also a promising method for tabular data synthesis. However, tabular data poses unique challenges compared to images: it may contain both continuous and discrete variables, it may require conditional sampling, and, critically, it should possess high utility and low disclosure risk (the risk of re-identifying a population unit or learning something new about them), providing an opportunity for multi-objective (MO) optimization. Inspired by MO GANs for images, this paper proposes a smart MO evolutionary conditional tabular GAN (SMOE-CTGAN). This approach models conditional synthetic data by applying conditional vectors in training, and uses concepts from MO optimization to balance disclosure risk against utility. Our results indicate that SMOE-CTGAN is able to discover synthetic datasets with different risk and utility levels for multiple national census datasets. Using an Improvement Score, we also find a sweet spot in the early stage of training where competitive utility and extremely low risk are achieved. The full code can be downloaded from https://github.com/HuskyNian/SMO_EGAN_pytorch
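The trade-off between disclosure risk and utility described in this abstract can be illustrated with a non-dominated (Pareto) filter. The following is a minimal sketch, not the SMOE-CTGAN implementation: the function names `dominates` and `pareto_front` and the (risk, utility) values are hypothetical, and it assumes each candidate synthetic dataset has already been scored with one risk value (lower is better) and one utility value (higher is better).

```python
from typing import List, Tuple

# Each candidate is a hypothetical (disclosure_risk, utility) pair.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up scores for illustration only.
candidates = [(0.10, 0.60), (0.05, 0.55), (0.20, 0.80), (0.20, 0.70), (0.30, 0.75)]
front = pareto_front(candidates)
print(front)
```

Here the last two candidates are discarded because (0.20, 0.80) offers equal or lower risk with higher utility; the remaining three represent different risk and utility levels that none of the others improves on in both objectives.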

    Revisiting social vulnerability analysis in Indonesia data

    This paper presents a dataset on social vulnerability in Indonesia. The dataset contains several dimensions drawn from previous studies. The data were compiled mainly from the 2017 National Socioeconomic Survey (SUSENAS) conducted by BPS-Statistics Indonesia. We use the survey weights to obtain estimates that account for the multistage sampling design. We also obtained additional information on population size and population growth from BPS-Statistics Indonesia's 2017 population projection. Furthermore, we provide the distance matrix and the population counts as supplementary information for performing Fuzzy Geographically Weighted Clustering (FGWC). The data can be used for further analysis of social vulnerability to support disaster management. The data can be accessed at https://raw.githubusercontent.com/bmlmcmc/naspaclust/main/data/sovi_data.csv

    Using Harris hawk optimization towards support vector regression to ozone prediction

    As an area experiencing air pollution, especially ozone concentrations that often exceed the threshold or reach unhealthy levels, JABODETABEK (Jakarta, Bogor, Depok, Tangerang, and Bekasi) seeks to prevent and control pollution as well as restore air quality. This study therefore aims to build a predictive model of ozone concentration using Harris hawks optimization-support vector regression (HHO-SVR) in 14 sub-districts in JABODETABEK. Data on ozone concentration (the response variable) and meteorological factors (the predictor variables) were collected from the website that provides the data. Other predictor variables, such as time and the significant lags of ozone concentration detected with the partial autocorrelation function, were also used. The variables are then selected using recursive feature elimination-support vector regression (RFE-SVR) to obtain the significant predictor variables that affect ozone concentration. The prediction model is then built using the HHO-SVR method, that is, support vector regression (SVR) whose parameter values are optimized with the Harris hawks optimization (HHO) algorithm. Once the model is built, the evaluation metrics used to determine the best model include mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), coefficient of determination (R²), variance ratio (VR), and the Diebold–Mariano test. The results indicate that lag 1, lag 2, air temperature, humidity, and UV index are the significant predictor variables selected by RFE-SVR for most sub-districts. In general, the HHO process takes longer than other metaheuristic algorithms. On average, 7 of the 14 sub-districts using the HHO-SVR model yielded the best predictions, with MAE below 10, RMSE and MAPE below 20, R² around 0.97, and VR around 0.98. The results of the Diebold–Mariano test also show that the accuracy of the predictions and the stability of the performance of the HHO-SVR model are better, especially for the Ciputat and South Bekasi sub-districts. This suggests that HHO-SVR is particularly well suited to predicting ozone concentrations in these two sub-districts.
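Four of the evaluation metrics named in this abstract (MAE, RMSE, MAPE, and the coefficient of determination) can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed here rather than taken from the paper; the variance ratio and the Diebold–Mariano test are omitted because the abstract does not define them. The observed and predicted values are made up for illustration.

```python
import math
from typing import Sequence


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def r_squared(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted ozone concentrations.
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]

print("MAE: ", mae(observed, predicted))
print("RMSE:", rmse(observed, predicted))
print("MAPE:", mape(observed, predicted))
print("R2:  ", r_squared(observed, predicted))
```

MAE and RMSE are in the units of the response variable, while MAPE is a percentage, so thresholds such as those quoted in the abstract are not directly comparable across the three.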