924 research outputs found

    Optimization of Home Mortgage Mover Predictive Model Applying Geo-Spatial Analysis and Machine Learning Techniques

    In the last decade, digital innovations and online banking services have significantly changed customers' banking preferences and behaviour. The banking industry is going through changes and developments in the provision of banking services that are affecting the structure and organization of the bank network. However, private home loans, referred to as Home Mortgages hereinafter, remain among the products that customers prefer to discuss in person with professional advisors before deciding to apply for a loan with a financial institution.

    GIS-Based landslide susceptibility modeling: a comparison between best-first decision tree and its two ensembles (BagBFT and RFBFT)

    This study aimed to explore and compare the application of current state-of-the-art machine learning techniques, including bagging (Bag) and rotation forest (RF), to assess landslide susceptibility with the base classifier best-first decision tree (BFT). The proposed two novel ensemble frameworks, BagBFT and RFBFT, and the base model BFT, were used to model landslide susceptibility in Zhashui County (China), which suffers from landslides. Firstly, we identified 169 landslides through field surveys and image interpretation. Then, a landslide inventory map was built. These 169 historical landslides were randomly classified into two groups: 70% for training data and 30% for validation data. Then, 15 landslide conditioning factors were considered for mapping landslide susceptibility. The three ensemble outputs were estimated with a receiver operating characteristic (ROC) curve and statistical tests, as well as a new approach, the improved frequency ratio accuracy. The areas under the ROC curve (AUCs) for the training data (success rate) of the three algorithms were 0.722 for BFT, 0.869 for BagBFT, and 0.895 for RFBFT. The AUCs for the validating groups (prediction rates) were 0.718, 0.834, and 0.872, respectively. The frequency ratio accuracy of the three models was 0.76163 for the BFT model, 0.92220 for the BagBFT model, and 0.92224 for the RFBFT model. Both BagBFT and RFBFT ensembles can improve the accuracy of the BFT base model, and RFBFT was relatively better. Therefore, the RFBFT model is the most effective approach for the accurate modeling of landslide susceptibility mapping (LSM). All three models can improve the identification of landslide-prone areas, enhance risk management ability, and afford more detailed information for land-use planning and policy setting.
    Funding: National Natural Science Foundation of China | Ref. 41977228; Key Research Program of Shaanxi | Ref. 2022SF-33
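The AUC figures quoted in this abstract measure how often a model scores a landslide location above a non-landslide location. The following is a minimal sketch of that calculation, assuming hypothetical labels and scores (the `labels` and `scores` lists are illustrative and are not data from the study); it is not the authors' implementation.

```python
def roc_auc(labels, scores):
    """Return the AUC as the share of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical susceptibility scores: 1 = landslide, 0 = no landslide.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for an inventory of a few hundred points; a rank-based formula would be preferable for larger datasets.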

    Ensemble decision tree models using RUSBoost for estimating risk of iron failure in drinking water distribution systems

    Safe, trusted drinking water is fundamental to society. Discolouration is a key aesthetic indicator visible to customers. Investigations to understand discolouration and iron failures in water supply systems require assessment of large quantities of disparate, inconsistent, multidimensional data from multiple corporate systems. A comprehensive data matrix was assembled for a seven year period across the whole of a UK water company (serving three million people). From this a novel data driven tool for assessment of iron risk was developed based on a yearly update and ranking procedure, for a subset of the best quality data. To avoid a ‘black box’ output, and provide an element of explanatory (human readable) interpretation, classification decision trees were utilised. Due to the very limited number of iron failures, results from many weak learners were melded into one high-quality ensemble predictor using the RUSBoost algorithm which is designed for class imbalance. Results, exploring simplicity vs predictive power, indicate enough discrimination between variable relationships in the matrix to produce ensemble decision tree classification models with good accuracy for iron failure estimation at District Management Area (DMA) scale. Two model variants were explored: ‘Nowcast’ (situation at end of calendar year) and ‘Futurecast’ (predict end of next year situation from this year’s data). The Nowcast 2014 model achieved 100% True Positive Rate (TPR) and 95.3% True Negative Rate (TNR), with 3.3% of DMAs classified High Risk for un-sampled instances. The Futurecast 2014 achieved 60.5% TPR and 75.9% TNR, with 25.7% of DMAs classified High Risk for un-sampled instances. The output can be used to focus preventive measures to improve iron compliance
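The True Positive Rate and True Negative Rate reported in this abstract can be derived from a confusion matrix. The sketch below assumes hypothetical binary labels (1 = iron failure, 0 = no failure); the `y_true` and `y_pred` lists are illustrative and are not taken from the study.

```python
def tpr_tnr(y_true, y_pred):
    """Return (TPR, TNR) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: two failures, four non-failures, one false alarm.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tpr, tnr = tpr_tnr(y_true, y_pred)
print(tpr, tnr)
```

Reporting both rates matters for imbalanced data such as rare iron failures, where overall accuracy alone can look high even if no failures are detected.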

    Predicting sustainable arsenic mitigation using machine learning techniques.

    This study evaluates state-of-the-art machine learning models in predicting the most sustainable arsenic mitigation preference. A Gaussian distribution-based Naïve Bayes (NB) classifier scored the highest Area Under the Curve (AUC) of the Receiver Operating Characteristic curve (0.82), followed by Nu Support Vector Classification (0.80), and K-Neighbors (0.79). Ensemble classifiers scored higher than 70% AUC, with Random Forest being the top performer (0.77), and Decision Tree model ranked fourth with an AUC of 0.77. The multilayer perceptron model also achieved high performance (AUC=0.75). Most linear classifiers underperformed, with the Ridge classifier at the top (AUC=0.73) and perceptron at the bottom (AUC=0.57). A Bernoulli distribution-based Naïve Bayes classifier was the poorest model (AUC=0.50). The Gaussian NB was also the most robust ML model with the slightest variation of Kappa score on training (0.58) and test data (0.64). The results suggest that nonlinear or ensemble classifiers could more accurately understand the complex relationships of socio-environmental data and help develop accurate and robust prediction models of sustainable arsenic mitigation. Furthermore, Gaussian NB is the best option when data is scarce

    Gis-based gully erosion susceptibility mapping: a comparison of computational ensemble data mining models

    Gully erosion destroys agricultural and domestic grazing land in many countries, especially those with arid and semi-arid climates and easily eroded rocks and soils. It also generates large amounts of sediment that can adversely impact downstream river channels. The main objective of this research is to accurately detect and predict areas prone to gully erosion. In this paper, we couple hybrid models of a commonly used base classifier (reduced error pruning tree, REPTree) with AdaBoost (AB), bagging (Bag), and random subspace (RS) algorithms to create gully erosion susceptibility maps for a sub-basin of the Shoor River watershed in northwestern Iran. We compare the performance of these models in terms of their ability to predict gully erosion and discuss their potential use in other arid and semi-arid areas. Our database comprises 242 gully erosion locations, which we randomly divided into training and testing sets with a ratio of 70/30. Based on expert knowledge and analysis of aerial photographs and satellite images, we selected 12 conditioning factors for gully erosion. We used multi-collinearity statistical techniques in the modeling process, and checked model performance using statistical indexes including precision, recall, F-measure, Matthews correlation coefficient (MCC), receiver operating characteristic curve (ROC), precision-recall graph (PRC), Kappa, root mean square error (RMSE), relative absolute error (PRSE), mean absolute error (MAE), and relative absolute error (RAE). Results show that rainfall, elevation, and river density are the most important factors for gully erosion susceptibility mapping in the study area. All three hybrid models that we tested significantly improved the predictive power of REPTree (AUC = 0.800), but the RS-REPTree (AUC = 0.860) ensemble model outperformed the Bag-REPTree (AUC = 0.841) and the AB-REPTree (AUC = 0.805) models. We suggest that decision makers, planners, and environmental engineers employ the RS-REPTree hybrid model to better manage gully erosion-prone areas in Iran.

    Novel GIS based machine learning algorithms for shallow landslide susceptibility mapping

    © 2018 by the authors. Licensee MDPI, Basel, Switzerland. The main objective of this research was to introduce a novel machine learning algorithm of alternating decision tree (ADTree) based on the multiboost (MB), bagging (BA), rotation forest (RF) and random subspace (RS) ensemble algorithms under two scenarios of different sample sizes and raster resolutions for spatial prediction of shallow landslides around Bijar City, Kurdistan Province, Iran. The modeling process was evaluated using several statistical measures and the area under the receiver operating characteristic curve (AUROC). Results show that the RS model obtained a high goodness-of-fit and prediction accuracy for sample sizes of 60%/40% and 70%/30% with a raster resolution of 10 m, while the MB model did so for 80%/20% and 90%/10% with a raster resolution of 20 m. The RS-ADTree and MB-ADTree ensemble models outperformed the ADTree model in two scenarios. Overall, MB-ADTree in sample size of 80%/20% with a resolution of 20 m (area under the curve (AUC) = 0.942) and sample size of 60%/40% with a resolution of 10 m (AUC = 0.845) had the highest and lowest prediction accuracy, respectively. The findings confirm that the newly proposed models are very promising alternative tools to assist planners and decision makers in the task of managing landslide-prone areas.

    Analysis of incidence of air quality on human health: a case study on the relationship between pollutant concentrations and respiratory diseases in Kennedy, Bogotá

    Thousands of deaths associated with air pollution each year could be prevented by forecasting the behavior of factors that pose risks to people's health and their geographical distribution. Proximity to pollution sources, degree of urbanization, and population density are some of the factors whose spatial distribution enables the identification of possible influence on the presence of respiratory diseases (RD). Currently, Bogota is among the cities with the poorest air quality in Latin America. Specifically, the locality of Kennedy is one of the zones in the city with the highest recorded concentration levels of local pollutants over the last 10 years. From 2009 to 2016, there were 8619 deaths associated with respiratory and cardiovascular diseases in the locality. Given these characteristics, this study set out to identify and analyze the areas in which the primary socioeconomic and environmental conditions contribute to the presence of symptoms associated with RD. To this end, information collected in field by performing georeferenced surveys was analyzed through geostatistical and machine learning tools which carried out cluster and pattern analyses. Random forests and AdaBoost were applied to establish hot spots where RD could occur, given the conjugation of predictor variables in the micro-territory. It was found that random forests outperformed AdaBoost with 0.63 AUC. In particular, this study's approach applies to densely populated municipalities with high levels of air pollution. In using these tools, municipalities can anticipate environmental health situations and reduce the cost of respiratory disease treatments.
    Many thanks to the members of the Intelligence and Territorial Analysis Group of the Universidad Santo Tomás for their collaboration in conducting the fieldwork.
    Molina-Gomez, NI.; Calderón-Rivera, DS.; Sierra-Parada, R.; Díaz Arévalo, JL.; López Jiménez, PA. (2021). Analysis of incidence of air quality on human health: a case study on the relationship between pollutant concentrations and respiratory diseases in Kennedy, Bogotá. International Journal of Biometeorology. 65(1):119-132. https://doi.org/10.1007/s00484-020-01955-4

    Land subsidence susceptibility mapping in South Korea using machine learning algorithms

    © 2018 by the authors. Licensee MDPI, Basel, Switzerland. In this study, land subsidence susceptibility was assessed for a study area in South Korea by using four machine learning models: Bayesian Logistic Regression (BLR), Support Vector Machine (SVM), Logistic Model Tree (LMT) and Alternate Decision Tree (ADTree). Eight conditioning factors identified as the most important factors affecting land subsidence in the Jeong-am area were applied in the modelling: slope angle, distance to drift, drift density, geology, distance to lineament, lineament density, land use and rock-mass rating (RMR). About 24 previous land subsidence occurrences were surveyed and used as the training dataset (70% of data) and validation dataset (30% of data) in the modelling process. Each studied model generated a land subsidence susceptibility map (LSSM). The maps were verified using several appropriate tools including statistical indices, the area under the receiver operating characteristic (AUROC) and success rate (SR) and prediction rate (PR) curves. The results of this study indicated that the BLR model produced an LSSM with higher accuracy and reliability than the other applied models, even though the other models also had reasonable results.

    A review of machine learning applications in wildfire science and management

    Artificial intelligence has been applied in wildfire science and management since the 1990s, with early applications including neural networks and expert systems. Since then the field has rapidly progressed congruently with the wide adoption of machine learning (ML) in the environmental sciences. Here, we present a scoping review of ML in wildfire science and management. Our objective is to improve awareness of ML among wildfire scientists and managers, as well as illustrate the challenging range of problems in wildfire science available to data scientists. We first present an overview of popular ML approaches used in wildfire science to date, and then review their use in wildfire science within six problem domains: 1) fuels characterization, fire detection, and mapping; 2) fire weather and climate change; 3) fire occurrence, susceptibility, and risk; 4) fire behavior prediction; 5) fire effects; and 6) fire management. We also discuss the advantages and limitations of various ML approaches and identify opportunities for future advances in wildfire science and management within a data science context. We identified 298 relevant publications, where the most frequently used ML methods included random forests, MaxEnt, artificial neural networks, decision trees, support vector machines, and genetic algorithms. There exist opportunities to apply more current ML methods (e.g., deep learning and agent based learning) in wildfire science. However, despite the ability of ML models to learn on their own, expertise in wildfire science is necessary to ensure realistic modelling of fire processes across multiple scales, while the complexity of some ML methods requires sophisticated knowledge for their application. Finally, we stress that the wildfire research and management community plays an active role in providing relevant, high quality data for use by practitioners of ML methods.
    Comment: 83 pages, 4 figures, 3 tables