2,778 research outputs found

    High-resolution estimation of O3, NO2, and CO concentrations in Korea from 2002 to 2020 using machine learning models

    Thesis (Master's) -- Seoul National University Graduate School : Graduate School of Public Health, Department of Public Health, 2023. 2. Kim Ho.
    Background : Long-term exposure to ozone (O3), nitrogen dioxide (NO2), and carbon monoxide (CO) is known to cause various diseases and increase mortality. For that reason, estimating ground-level O3, NO2, and CO concentrations with a high spatial resolution is crucial for assessing the health effects associated with these air pollutants. However, related studies are limited in South Korea. This study aimed to develop machine learning-based models to predict monthly O3 (average of daily 8-hour maximums), NO2, and CO concentrations at a spatial resolution of 1 km × 1 km across South Korea from 2002 to 2020.
    Methods : Approximately 80% of the monitoring stations were used to train the three machine learning models (random forest, light gradient boosting, and neural network) with 10-fold cross-validation, and the remaining 20% of the monitoring stations were used to test model performance. The author also applied ensemble models to integrate the variation in predictions among the models. The predictors included satellite-based remote sensing data, inverse distance weighted ground-level air pollutant concentrations, land use variables, reanalysis datasets for meteorological variables, and regional socioeconomic variables collected from various databases.
    Results : For O3, the overall R2 of the ensemble model was 0.841 during the entire study period. Urban areas showed better model performance (R2 = 0.845) than rural areas (R2 = 0.762). For NO2, the highest overall R2 was 0.756, with the best fit in autumn (R2 = 0.768). For CO, the overall R2 value was 0.506. This study provides high spatial resolution monthly average O3 and NO2 estimates with excellent performance (R2 > 0.75).
    Conclusion : The author's predictions can be used to analyze spatial patterns in pollutants in relation to population characteristics, and in studies of the health effects of long-term exposure to air pollution that use geocode-based health information and local health data.
    Contents : Chapter 1. Introduction; Chapter 2. Materials and Methods (2.1. Study area; 2.2. Air pollution monitoring data; 2.3. Satellite-based remote sensing data; 2.3.1. Meteorological data; 2.3.2. Land-use data; 2.3.3. Surface reflectance; 2.4. Regional socioeconomic predictors; 2.5. Modeling procedures; 2.5.1. Data Preprocessing; 2.5.2. Machine learning-based model; 2.5.3. Ensemble Model; 2.5.4. Model Prediction); Chapter 3. Results; Chapter 4. Discussion; Chapter 5. Conclusion; Supplementary materials; Abstract in Korean.
    Tables :
    Table 1. Model performance for O3, NO2, and CO overall and in three- and four-year periods
    Table S1. Detailed information about data sources
    Table S2. Variables sorted by % missing values
    Table S3. Results of parameter grid search using 10-fold cross-validation for O3, NO2 and CO
    Table S4. Yearly ensemble (GAM) performance for O3, NO2, and CO
    Table S5. Model performances for O3, NO2, and CO by season and urbanity
    Table S6. Number of monitoring stations by year for O3, NO2 and CO in urban and rural areas
    Figures :
    Fig. 1. Flowchart of the modeling process. GEE: Google Earth Engine, SEDAC: Socioeconomic Data and Applications Center, RSD: Regional Socioeconomic Database from Korea Disease Control and Prevention Agency
    Fig. 2. Density scatter plot for monthly averages of the monitored and predicted concentrations of O3, NO2, and CO
    Fig. 3. Maps of monitored and predicted O3, NO2 and CO during 2002~2020
    Fig. 4. Percentage decrease in R2 when excluding grouped variables from each machine learning model of O3, NO2, and CO. The closer the color is to red, the greater the effect of the variables on the model performance
    Fig. S1. Urban/Rural and Metropolitan (Metro) area for entire contiguous regions of South Korea
    Fig. S2. Distribution maps of predicted O3 (ppb) by year and season for contiguous South Korea
    Fig. S3. Distribution maps of predicted NO2 (ppb) by year and season for contiguous South Korea
    Fig. S4. Distribution maps of predicted CO (ppm) by year and season for contiguous South Korea
    Fig. S5. Monthly fluctuations in the number of monitoring stations for O3, NO2, and CO between 2002 and 2020
    Fig. S6. Density scatter plot for monthly averages of the monitored and predicted concentrations of O3, NO2, and CO with seasonal discrimination
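The station-level 80/20 split with 10-fold cross-validation described in this abstract can be sketched as follows. This is a minimal illustration and not the thesis code: the function names, the station IDs, the random seed, and the round-robin fold assignment are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_stations(station_ids, test_fraction=0.2, n_folds=10, seed=0):
    """Split stations into a held-out test set and cross-validation folds.

    Splitting by station (not by individual monthly record) keeps every
    record from one station on the same side of the split.
    """
    stations = sorted(set(station_ids))
    rng = random.Random(seed)
    rng.shuffle(stations)

    n_test = round(len(stations) * test_fraction)
    test_stations = set(stations[:n_test])
    train_stations = stations[n_test:]

    # Deal the training stations round-robin into n_folds groups.
    folds = defaultdict(list)
    for i, station in enumerate(train_stations):
        folds[i % n_folds].append(station)
    return test_stations, [folds[k] for k in range(n_folds)]


def cv_splits(folds):
    """Yield (training stations, validation stations) for each fold."""
    for k, validation in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield training, validation


# Hypothetical station IDs, for illustration only.
stations = [f"ST{n:03d}" for n in range(100)]
test_stations, folds = split_stations(stations)
```

Grouping by station is one common way to avoid leaking information between training and evaluation data when each station contributes many monthly records; whether the thesis assigned folds this way is not stated in the abstract.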

    Multi-Source-Data-Oriented Ensemble Learning Based PM 2.5 Concentration Prediction in Shenyang

    Shenyang, which is surrounded by smokestack industries and depends on coal heating in winter, is a typical city in northeastern China that has suffered from serious air pollution, especially PM2.5. Existing machine learning research, based on historical air-monitoring data and meteorological data, neither forecasts PM2.5 accurately nor identifies its key contributing pollutants. This paper presents a multi-source-data-oriented ensemble learning framework for predicting PM2.5 concentration. The proposed framework incorporates not only air quality data and weather data but also industrial emission data, especially those of winter heating enterprises, in Shenyang and nearby cities; the model also takes into account the location and emission frequency of pollution sources. All these data are fed into an ensemble learning model based on Extreme Gradient Boosting (XGBoost) to predict PM2.5 concentration, which not only improves prediction accuracy but also provides a contribution analysis of different pollutants. Experimental results show that the top two factors affecting PM2.5 concentration are (1) air pollutant emission quantities and (2) distance from pollution sources to air-monitoring stations. Based on the importance of these two factors, we refine feature selection and re-train the ensemble learning model, and find that the new model performs better on 72% of evaluation indexes.
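A distance-to-source feature of the kind named in this abstract could be computed along the following lines. This is only a sketch: the abstract does not say how distance was measured, so the haversine (great-circle) formula, the function names, and the coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_source_km(station, sources):
    """Distance from one station to its closest pollution source."""
    return min(haversine_km(*station, *source) for source in sources)


# Hypothetical (latitude, longitude) pairs, for illustration only.
station = (41.80, 123.40)
sources = [(41.90, 123.50), (42.30, 124.00)]
feature = nearest_source_km(station, sources)
```

The resulting value can be attached to each station record as one input column alongside emission quantities and weather variables.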

    Accident prediction using machine learning: analyzing weather conditions, and model performance

    Abstract. The primary focus of this study was to investigate the impact of weather and road conditions on the severity of accidents and to determine the feasibility of machine learning models in accurately predicting the likelihood of such incidents. The research was centered on two key research questions. Firstly, the study examined the influence of weather and road conditions on accident severity and identified the most related factors contributing to accidents. We utilized an open-source accident dataset, which was preprocessed using techniques like variable selection, missing data elimination, and data balancing through the Synthetic Minority Over-sampling Technique (SMOTE). Chi-square statistical analysis was performed, suggesting that all weather-related variables are more or less associated with the severity of accidents. Visibility and temperature were found to be the most critical factors affecting the severity of road accidents. Hence, appropriate measures such as implementing effective fog dispersal systems, heatwave alerts, or improved road maintenance during extreme temperatures could help reduce accident severity. Secondly, the research evaluated the ability of machine learning models including decision trees, random forests, naive bayes, extreme gradient boost, and neural networks to predict accident likelihood. The models' performance was gauged using metrics like accuracy, precision, recall, and F1 score. The Random Forest model emerged as the most reliable and accurate model for predicting accidents, with an overall accuracy of 98.53%. The Decision Tree model also showed high overall accuracy (95.33%), indicating its reliability. However, the Naive Bayes model showed the lowest accuracy (63.31%) and was deemed less reliable in this context. It is concluded that machine learning models can be effectively used to predict the likelihood of accidents, with models like Random Forest and Decision Tree proving the most effective.
    However, the effectiveness of each model may vary depending on the dataset and context, necessitating further testing and validation for real-world implementation. These findings not only provide insight into the factors affecting accident severity but also open a promising avenue in employing machine learning techniques for proactive accident prediction and mitigation. Future studies can aim to refine the models further and potentially integrate them into traffic management systems to enhance road safety.
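The balancing step named in this abstract, SMOTE, creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python. It is not the study's preprocessing code; the function name, the parameter defaults, and the toy points are assumptions for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority-class feature vectors, for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6, k=2)
```

In general practice, oversampling of this kind is applied to the training portion only, so that evaluation data contain no synthetic samples.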

    A comparison of statistical and machine-learning approaches for spatiotemporal modeling of nitrogen dioxide across Switzerland

    Land use regression modeling has commonly been used to model ambient air pollutant concentrations in environmental epidemiological studies. Recently, other statistical and machine-learning methods have also been applied to model air pollution, but their relative strengths and limitations have not been extensively investigated. In this study, we developed and compared land-use statistical and machine-learning models at annual, monthly and daily scales estimating ground-level NO2 concentrations across Switzerland (at high spatial resolution 100 × 100 m). Our study showed that the best model type varies with context, particularly with temporal resolution and training data size. Linear-regression-based models were useful in predicting long-term (annual, monthly) spatial distribution of NO2 and outperformed machine-learning models. However, linear-regression-based models were limited in representing short-term temporal variation even when predictor variables with temporal variability were provided. Machine-learning models showed high capability in predicting short-term temporal variation and outperformed linear-regression-based models for modeling NO2 variation at high temporal resolution (daily). However, the best performing models, XGBoost and LightGBM, constantly overfit on training data and may result in erratic patterns in the model-estimated concentration surfaces. Therefore, the temporal and spatial scale of the study is an important factor on which the choice of the suitable model type should be based and validation is required whatever approach is used.

    A Machine Learning Approach to Safer Airplane Landings: Predicting Runway Conditions using Weather and Flight Data

    The presence of snow and ice on runway surfaces reduces the available tire-pavement friction needed for retardation and directional control and causes potential economic and safety threats for the aviation industry during the winter seasons. To activate appropriate safety procedures, pilots need accurate and timely information on the actual runway surface conditions. In this study, XGBoost is used to create a combined runway assessment system, which includes a classification model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained on weather data and data from runway reports. The runway surface conditions are represented by the tire-pavement friction coefficient, which is estimated from flight sensor data from landing aircraft. To evaluate the performance of the models, they are compared to several state-of-the-art runway assessment methods. The XGBoost models identify slippery runway conditions with a ROC AUC of 0.95, predict the friction coefficient with an MAE of 0.0254, and outperform all the previous methods. The results show the strong ability of machine learning methods to model complex physical phenomena with good accuracy when domain knowledge is used in the variable extraction. The XGBoost models are combined with SHAP (SHapley Additive exPlanations) approximations to provide a comprehensible decision support system for airport operators and pilots, which can contribute to safer and more economic operations of airport runways.

    Analysis, Characterization, Prediction and Attribution of Extreme Atmospheric Events with Machine Learning: a Review

    Atmospheric Extreme Events (EEs) cause severe damage to human societies and ecosystems. The frequency and intensity of EEs and other associated events are increasing under current climate change and global warming. The accurate prediction, characterization, and attribution of atmospheric EEs is therefore a key research field, in which many groups are currently working by applying different methodologies and computational tools. Machine Learning (ML) methods have emerged in recent years as powerful techniques to tackle many of the problems related to atmospheric EEs. This paper reviews the ML algorithms applied to the analysis, characterization, prediction, and attribution of the most important atmospheric EEs. A summary of the most used ML techniques in this area, and a comprehensive critical review of literature related to ML in EEs, are provided. A number of examples are discussed, and perspectives and outlooks on the field are drawn.
    Comment: 93 pages, 18 figures, under review

    Estimation of hourly near surface air temperature across Israel using an ensemble model

    Mapping of near-surface air temperature (Ta) at high spatio-temporal resolution is essential for unbiased assessment of human health exposure to temperature extremes, not least given the observed trend of urbanization and global climate change. Data constraints have led previous studies to focus merely on daily Ta metrics, rather than hourly ones, making them insufficient for intra-day assessment of health exposure. In this study, we present a three-stage machine learning-based ensemble model to estimate hourly Ta at a high spatial resolution of 1 × 1 km2, incorporating remotely sensed surface skin temperature (Ts) from geostationary satellites, reanalysis synoptic variables, and observations from weather stations, as well as auxiliary geospatial variables, which account for spatio-temporal variability of Ta. The Stage 1 model gap-fills hourly Ts at 4 × 4 km2 from the Spinning Enhanced Visible and InfraRed Imager (SEVIRI), which are subsequently fed into the Stage 2 model to estimate hourly Ta at the same spatio-temporal resolution. The Stage 3 model downscales the residuals between estimated and measured Ta to a grid of 1 × 1 km2, taking into account additionally the monthly diurnal pattern of Ts derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) data. In each stage, the ensemble model synergizes estimates from the constituent base learners—random forest (RF) and extreme gradient boosting (XGBoost)—by applying a geographically weighted generalized additive model (GAM), which allows the weights of results from individual models to vary over space and time. Demonstrated for Israel for the period 2004–2017, the proposed ensemble model outperformed each of the two base learners. It also attained excellent five-fold cross-validated performance, with overall root mean square error (RMSE) of 0.8 and 0.9 °C, mean absolute error (MAE) of 0.6 and 0.7 °C, and R2 of 0.95 and 0.98 in Stage 1 and Stage 2, respectively. 
    The Stage 3 model for downscaling Ta residuals to 1 km MODIS grids achieved overall RMSE of 0.3 °C, MAE of 0.5 °C, and R2 of 0.63. The generated hourly 1 × 1 km2 Ta thus serves as a foundation for monitoring and assessing human health exposure to temperature extremes at a larger geographical scale, helping to further minimize exposure misclassification in epidemiological studies.
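The three performance measures reported in these abstracts (RMSE, MAE, and R2) can be computed from paired observed and predicted values as in the sketch below. The function name and the toy values are assumptions for illustration. R2 is taken here as the coefficient of determination, 1 - SS_res / SS_tot; some studies instead report the squared Pearson correlation, and the abstracts do not say which definition was used.

```python
import math


def regression_metrics(observed, predicted):
    """Return (RMSE, MAE, R2) for paired observations and predictions."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]

    ss_res = sum(r * r for r in residuals)   # residual sum of squares
    rmse = math.sqrt(ss_res / n)             # root mean square error
    mae = sum(abs(r) for r in residuals) / n  # mean absolute error

    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Toy values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
```

For the same set of residuals, MAE never exceeds RMSE, which gives a quick sanity check on computed or reported values.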

    Effect of traffic dataset on various machine-learning algorithms when forecasting air quality

    © Emerald Publishing Limited. This is the accepted manuscript version of an article which has been published in final form at https://10.1108/JEDT-10-2021-0554
    Purpose: Road traffic emissions are generally believed to contribute immensely to air pollution, but the effect of road traffic datasets on air quality predictions has not been clearly investigated. This research investigates the effects traffic datasets have on the performance of Machine Learning (ML) predictive models in air quality prediction.
    Design/methodology/approach: To achieve this, we set up an experiment in which the control dataset contains only the Air Quality (AQ) dataset and the Meteorological (Met) dataset, while the experimental dataset is made up of the AQ dataset, the Met dataset and the Traffic dataset. Several ML models (such as Extra Trees Regressor, eXtreme Gradient Boosting Regressor, Random Forest Regressor, K-Neighbors Regressor, and five others) were trained, tested, and compared on these combinations of datasets to predict the volume of PM2.5, PM10, NO2, and O3 in the atmosphere at various times of the day.
    Findings: The results showed that the ML algorithms react differently to the traffic dataset, although it generally improved the performance of all the ML algorithms considered in this study by at least 20% and reduced error by at least 18.97%.
    Research limitations/implications: This research is limited in terms of the study area, and the result cannot be generalized outside of the UK as many conditions may not be similar elsewhere. Additionally, only the ML algorithms commonly used in the literature are considered in this research, leaving out a few other ML algorithms.
    Practical implications: This study reinforces the belief that the traffic dataset has a significant effect on improving the performance of air pollution ML prediction models. Hence, there is an indication that ML algorithms behave differently when trained with a form of traffic dataset in the development of an air quality prediction model. This implies that developers and researchers in air quality prediction need to identify the ML algorithms that behave in their best interest before implementation.
    Originality/value: This will enable researchers to focus more on algorithms of benefit when using traffic datasets in air quality prediction.
    Peer reviewed