A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data
Crash data are often highly imbalanced: the majority of crashes are
non-fatal, while fatal crashes, being rare, make up only a small fraction of
the records. This imbalance poses a challenge for crash severity modeling,
because models struggle to fit and interpret fatal crash outcomes from very
limited samples. Such imbalance is usually addressed with data resampling
methods, such as under-sampling and over-sampling techniques. However, most
traditional and deep learning-based resampling methods, such as the synthetic
minority over-sampling technique (SMOTE) and generative adversarial networks
(GANs), are designed to handle continuous variables. Although some
resampling methods have been extended to handle both continuous and discrete
variables, they may still struggle with the mode collapse issue
associated with sparse discrete risk factors. Moreover, there is a lack of
comprehensive studies that compare the performance of various resampling
methods in crash severity modeling. To address the aforementioned issues, the
current study proposes a crash data generation method based on the Conditional
Tabular GAN (CTGAN). After data balancing, a crash severity model is employed
to evaluate classification and interpretation performance. A comparative
study is conducted to assess classification accuracy and distribution
consistency of the proposed generation method using a 4-year imbalanced crash
dataset collected in Washington State, U.S. Additionally, Monte Carlo
simulation is employed to evaluate the performance of parameter and probability
estimation in both two- and three-class imbalance scenarios. The results
indicate that using synthetic data generated by CTGAN-RU for crash severity
modeling outperforms using the original data or synthetic data generated by
other resampling methods.
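
To illustrate the data balancing step described above, the following minimal sketch pairs CTGAN-based over-sampling of the fatal class with random under-sampling of the non-fatal class, assuming that CTGAN-RU denotes this combination. It relies on the open-source ctgan package; the file name, risk factor columns, and class labels are hypothetical placeholders rather than the study's actual variables.

import pandas as pd
from ctgan import CTGAN

# Hypothetical imbalanced crash table: "severity" plus discrete risk factors.
crashes = pd.read_csv("crashes.csv")
discrete_cols = ["severity", "weather", "road_surface", "lighting"]

# Fit the conditional tabular GAN on the full imbalanced dataset.
gan = CTGAN(epochs=300)
gan.fit(crashes, discrete_columns=discrete_cols)

# Choose a common target size between the fatal and non-fatal class counts.
fatal = crashes[crashes["severity"] == "fatal"]
non_fatal = crashes[crashes["severity"] == "non-fatal"]
target = (len(fatal) + len(non_fatal)) // 2

# Over-sample the fatal class with records generated under the condition
# severity == "fatal" (ctgan conditions the generator on this value) ...
synthetic_fatal = gan.sample(
    target - len(fatal), condition_column="severity", condition_value="fatal"
)

# ... and randomly under-sample the non-fatal class to the same target size.
non_fatal_down = non_fatal.sample(n=target, random_state=0)

balanced = pd.concat([fatal, synthetic_fatal, non_fatal_down], ignore_index=True)

The balanced table would then be passed to the crash severity model used for classification and interpretation.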