
    Comparison of Various Machine Learning Models for Estimating Construction Projects Sales Valuation Using Economic Variables and Indices

    The capability of various machine learning techniques to predict construction project profit in residential buildings using a combination of economic variables and indices (EV&Is) and physical and financial variables (P&F) as input variables remains uncertain. Although recent studies have primarily focused on identifying the factors influencing the sales of construction projects, given their significant short-term impact on a country's economy, predicting these parameters is crucial for ensuring project sustainability. While techniques such as regression and artificial neural networks have been used to estimate construction project sales, limited research has been conducted in this area. The application of machine learning techniques offers several advantages over conventional methods, including reductions in cost, time, and effort. Therefore, this study aims to predict the sales valuation of construction projects using various machine learning approaches, incorporating different EV&Is and P&F as input features for these models and generating the sales valuation as the output. The research undertakes a comparative analysis to investigate the efficiency of the different machine learning models and identify the most effective approach for estimating the sales valuation of construction projects. By leveraging machine learning techniques, it is anticipated that the accuracy of sales valuation predictions will be enhanced, ultimately resulting in more sustainable and successful construction projects. Overall, the findings reveal that the extremely randomized trees model delivers the best performance, while the decision tree model performs worst in predicting the sales valuation of construction projects.
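    A model comparison of this kind can be reproduced with standard tooling. The sketch below is a minimal, hypothetical setup using scikit-learn, assuming a tabular dataset whose columns mix EV&I and P&F features and whose target is the recorded sales valuation; the CSV path and column names are placeholders, not details from the study.

    # Hypothetical comparison of regressors for sales valuation, in the spirit of the study.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

    # Placeholder dataset: EV&I and P&F columns plus a "sales_valuation" target column.
    df = pd.read_csv("construction_projects.csv")
    X = df.drop(columns=["sales_valuation"])
    y = df["sales_valuation"]

    models = {
        "decision_tree": DecisionTreeRegressor(random_state=0),
        "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
        "extra_trees": ExtraTreesRegressor(n_estimators=300, random_state=0),
    }

    # 5-fold cross-validated R^2 for each model; higher is better.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="r2")
        print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")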

    NEMISA Digital Skills Conference (Colloquium) 2023

    The purpose of the colloquium and its associated events centred on the role that data plays today as a desirable commodity, one that must become an important part of massifying digital skilling efforts. Governments amass ever more critical data that, if leveraged, could change the way public services are delivered and even change the social and economic fortunes of any country. Therefore, smart governments and organisations increasingly require data skills to gain insights and foresight, to secure themselves, and to improve decision making and efficiency. However, data skills are scarce, and even more challenging is the inconsistency of the associated training programs, most of which are curated for the Science, Technology, Engineering, and Mathematics (STEM) disciplines. Nonetheless, the interdisciplinary yet domain-agnostic nature of data means that there is an opportunity to expand data skills into the non-STEM disciplines as well.

    Exploring uplift modeling with high class imbalance

    Uplift modeling refers to individual-level causal inference. Existing research on the topic ignores one prevalent and important aspect: high class imbalance. For instance, in online environments uplift modeling is used to optimally target ads and discounts, but very few users ever end up clicking an ad or buying. One common approach to dealing with imbalance in classification is undersampling the dataset. In this work, we show how undersampling can be extended to uplift modeling and propose four undersampling methods for it. We compare the proposed methods empirically and show when some methods tend to break down. One key observation is that accounting for the imbalance is particularly important for uplift random forests, which explains the poor performance of the model in earlier works. Undersampling is also crucial for class-variable-transformation-based models.
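    The abstract does not spell out the four proposed undersampling methods, so the sketch below only illustrates the general idea under stated assumptions: keep all rare positives, downsample negatives separately within the treatment and control arms, and fit a class-variable-transformation uplift model on the balanced sample. The data, the 1:1 ratio, and the model choice are illustrative, not the paper's.

    # Illustrative sketch: undersampling the majority (non-converting) class per arm
    # before fitting a class-variable-transformation uplift model.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def undersample(X, y, t, rng):
        """Keep all positives; downsample negatives to match, within each treatment arm."""
        keep = []
        for arm in (0, 1):
            pos = np.where((t == arm) & (y == 1))[0]
            neg = np.where((t == arm) & (y == 0))[0]
            neg = rng.choice(neg, size=min(len(neg), len(pos)), replace=False)
            keep.extend(pos)
            keep.extend(neg)
        keep = np.array(keep)
        return X[keep], y[keep], t[keep]

    rng = np.random.default_rng(0)
    # Synthetic placeholder data: features X, rare conversions y, random treatment flag t.
    X = rng.normal(size=(10000, 5))
    t = rng.integers(0, 2, size=10000)
    y = (rng.random(10000) < 0.02 + 0.02 * t).astype(int)

    Xb, yb, tb = undersample(X, y, t, rng)
    # Class-variable transformation: z = 1 when (treated and converted) or (control and not).
    z = (yb == tb).astype(int)
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xb, z)
    # 2*P(z=1|x) - 1 estimates uplift under balanced arms; after undersampling the
    # scores would additionally need a correction, which is omitted here.
    uplift = 2 * model.predict_proba(X)[:, 1] - 1
    print(uplift[:5])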

    Machine Learning Approaches for Heart Disease Detection: A Comprehensive Review

    This paper presents a comprehensive review of the application of machine learning algorithms to the early detection of heart disease. Heart disease remains a leading global health concern, necessitating efficient and accurate diagnostic methods. Machine learning has emerged as a promising approach, offering the potential to enhance diagnostic accuracy and reduce the time required for assessments. This review begins by elucidating the fundamentals of machine learning and provides concise explanations of the most prevalent algorithms employed in heart disease detection. It subsequently examines noteworthy research efforts that have harnessed machine learning techniques for heart disease diagnosis. A detailed tabular comparison of these studies is also presented, highlighting the strengths and weaknesses of various algorithms and methodologies. This survey underscores the significant strides made in leveraging machine learning for early heart disease detection and emphasizes the ongoing need for further research to enhance its clinical applicability and efficacy.

    Deep Neural Networks and Tabular Data: Inference, Generation, and Explainability

    Over the last decade, deep neural networks have enabled remarkable technological advancements, potentially transforming a wide range of aspects of our lives in the future. It is becoming increasingly common for deep-learning models to be used in a variety of situations in modern life, ranging from search and recommendations to financial and healthcare solutions, and the number of applications utilizing deep neural networks is still on the rise. However, much recent research effort in deep learning has focused primarily on neural networks and the domains in which they excel, namely computer vision, audio processing, and natural language processing. Data in these areas tend to be homogeneous, whereas heterogeneous tabular datasets have received relatively scant attention despite being extremely prevalent. In fact, more than half of the datasets on the Google dataset platform are structured and can be represented in tabular form. The first aim of this study is to provide a thoughtful and comprehensive analysis of the application of deep neural networks to modeling and generating tabular data. In addition, an open-source performance benchmark on tabular data is presented, in which we thoroughly compare over twenty machine and deep learning models on heterogeneous tabular datasets. The second contribution relates to synthetic tabular data generation. Inspired by their success in other, homogeneous data modalities, deep generative models such as variational autoencoders and generative adversarial networks are commonly applied to tabular data generation. However, the use of Transformer-based large language models (which are also generative) for tabular data generation has received scant research attention. Our contribution to this literature is a novel method for generating tabular data based on this family of autoregressive generative models that, on multiple challenging benchmarks, outperforms the current state-of-the-art methods for tabular data generation. Another crucial aspect of a deep-learning data system is that it needs to be reliable and trustworthy to gain broader acceptance in practice, especially in life-critical fields. One possible way to bring trust into a data-driven system is to use explainable machine-learning methods. Despite this, current explanation methods often fail to provide robust explanations due to their high sensitivity to hyperparameter selection or even changes of the random seed. Furthermore, most of these methods are based on feature-wise importance, ignoring the crucial relationships between variables in a sample. The third aim of this work is to address both of these issues by offering more robust and stable explanations and by taking the relationships between variables into account using a graph structure. In summary, this thesis makes contributions that touch many areas related to deep neural networks and heterogeneous tabular data, as well as the use of explainable machine-learning methods.
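    As a point of reference for the language-model-based generation described above, the snippet below sketches the general idea of serializing table rows as text so that an autoregressive language model can be fine-tuned on the resulting strings and later sampled to produce new rows; it is an assumption-level illustration, not the thesis's exact method, and the columns are invented.

    # Serialize tabular rows as text for an autoregressive language model (illustrative).
    import pandas as pd

    def row_to_text(row: pd.Series) -> str:
        # e.g. "age is 42, income is 58000, default is 0"
        return ", ".join(f"{col} is {val}" for col, val in row.items())

    def text_to_row(text: str) -> dict:
        # Inverse mapping used to parse sampled strings back into table rows.
        items = (part.split(" is ", 1) for part in text.split(", "))
        return {col: val for col, val in items}

    df = pd.DataFrame({"age": [42, 31], "income": [58000, 43000], "default": [0, 1]})
    encoded = [row_to_text(row) for _, row in df.iterrows()]
    print(encoded[0])               # training string for the language model
    print(text_to_row(encoded[0]))  # decoding a sampled string back into a row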

    Task-specific experimental design for treatment effect estimation

    Understanding causality should be a core requirement of any attempt to build real impact through AI. Due to the inherent unobservability of counterfactuals, large randomised controlled trials (RCTs) are the standard for causal inference. But large experiments are generically expensive, and randomisation carries its own costs, e.g. when suboptimal decisions are trialled. Recent work has proposed more sample-efficient alternatives to RCTs, but these are not adaptable to the downstream application for which the causal effect is sought. In this work, we develop a task-specific approach to experimental design and derive sampling strategies customised to particular downstream applications. Across a range of important tasks, real-world datasets, and sample sizes, our method outperforms other benchmarks, e.g. requiring an order of magnitude less data to match RCT performance on targeted marketing tasks. To appear in ICML 2023.

    Data Balancing Techniques for Predicting Student Dropout Using Machine Learning

    This research article was published by MDPI in 2023. Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to a model that does not generalize. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining satisfactory overall classification performance. Random Over-Sampling, Random Under-Sampling, Synthetic Minority Oversampling (SMOTE), SMOTE with Edited Nearest Neighbor and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset), using the confusion matrix as the evaluation metric. The application of these models allows for the precise prediction of at-risk students and the reduction of dropout rates.
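    The balancing-plus-classifier combinations described above can be assembled with the imbalanced-learn library. The sketch below is a minimal illustration pairing SMOTE with Edited Nearest Neighbor and Logistic Regression on synthetic data; it stands in for, and does not reproduce, the Uwezo and India experiments.

    # Illustrative pipeline: SMOTE-ENN resampling followed by Logistic Regression,
    # evaluated with metrics that remain informative under class imbalance.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, confusion_matrix
    from imblearn.combine import SMOTEENN
    from imblearn.pipeline import Pipeline

    # Synthetic imbalanced data standing in for a dropout dataset (about 5% dropouts).
    X, y = make_classification(n_samples=20000, n_features=12, weights=[0.95, 0.05],
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    pipe = Pipeline([
        ("balance", SMOTEENN(random_state=42)),   # resampling is applied to training data only
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)

    y_pred = pipe.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=3))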

    Black spots identification on rural roads based on extreme learning machine

    Accident black spots are usually defined as road locations with a high risk of fatal accidents. A thorough analysis of these areas is essential to determine the real causes of mortality due to these accidents and can thus help anticipate the decisions needed to mitigate their effects. In this context, this study aims to develop a model for the identification, classification and analysis of black spots on roads in Morocco. These areas are first identified using the extreme learning machine (ELM) algorithm, and the infrastructure factors are then analyzed by ordinal regression. The XGBoost model is adopted to generate a weighted severity index (WSI), which in turn yields the severity scores assigned to individual road segments. The segments are then classified into four classes (high, medium, low and safe) using a categorization approach. Finally, a bagging extreme learning machine is used to classify the severity of road segments according to infrastructure and environmental factors. Simulation results show that the proposed framework accurately and efficiently identified the black spots and outperformed reputable competing models, notably achieving an accuracy of 98.6%. In conclusion, the ordinal analysis revealed that pavement width, road curve type, shoulder width and position were the significant factors contributing to accidents on rural roads.
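    An extreme learning machine is a single-hidden-layer network whose hidden weights are drawn at random and whose output weights are solved in closed form by least squares. The NumPy sketch below illustrates that idea on toy data; it is not the study's implementation, and the four-class labels are invented stand-ins for the severity classes.

    # Minimal extreme learning machine (ELM) classifier: random hidden layer,
    # output weights fitted with a closed-form least-squares solve.
    import numpy as np

    class SimpleELM:
        def __init__(self, n_hidden=200, seed=0):
            self.n_hidden = n_hidden
            self.rng = np.random.default_rng(seed)

        def _hidden(self, X):
            # Random projection followed by a sigmoid activation.
            return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

        def fit(self, X, y):
            n_classes = int(y.max()) + 1
            T = np.eye(n_classes)[y]          # one-hot targets
            self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
            self.b = self.rng.normal(size=self.n_hidden)
            H = self._hidden(X)
            self.beta, *_ = np.linalg.lstsq(H, T, rcond=None)
            return self

        def predict(self, X):
            return np.argmax(self._hidden(X) @ self.beta, axis=1)

    # Toy usage on synthetic "road segment" features with four severity classes (0-3).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 6))
    y = (X[:, 0] + X[:, 1] > 0).astype(int) + 2 * (X[:, 2] > 0).astype(int)
    model = SimpleELM().fit(X[:800], y[:800])
    print("holdout accuracy:", (model.predict(X[800:]) == y[800:]).mean())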

    Development of frameworks for credit risk modeling using classification algorithms

    Granting credit is a vital activity in the financial industry. For the success of financial institutions, as well as the equilibrium of the credit system as a whole, it is important that credit risk management systems efficiently evaluate the probability of default of potential debtors based on their historical data. Classification algorithms are an interesting approach to this problem in the form of Credit Scoring models. Since the emergence of quantitative analytical methods for this purpose, statistical models have persisted as the most commonly chosen method, given their easier implementation and inherent interpretability. However, advances in Machine Learning (ML) have produced new and more complex algorithms capable of handling larger amounts of data, often with an increase in predictive power. These new approaches, although not always readily transferable to practical applications in the financial industry, present an opportunity for the development of credit risk modeling and have piqued the interest of researchers in the field. Nonetheless, the literature tends to focus on model performance without establishing guidelines to optimize the modeling process or considering the current regulation governing model implementation. This dissertation therefore establishes frameworks for consumer credit risk modeling based on classification algorithms, guided by a systematic literature review on the topic. The proposed frameworks incorporate ML techniques, data preprocessing and balancing, feature selection (FS), and hyperparameter optimization (HPO). In addition to the bibliographic research, which introduces the main classification algorithms and the appropriate modeling steps, the development of the frameworks is also based on experiments with hundreds of models for credit risk classification, using Logistic Regression (LR), Decision Trees (DT), Support Vector Machines (SVM), Random Forest (RF), and boosting and stacking ensembles, to efficiently guide the construction of robust and parsimonious models for credit risk analysis in consumer lending.
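    A framework of the kind described above can be outlined with scikit-learn. The example below is a hypothetical illustration, on synthetic data and with an arbitrary search space, of combining preprocessing, feature selection (FS) and hyperparameter optimization (HPO) around a Random Forest credit-scoring model; it is a sketch of the workflow, not the dissertation's frameworks.

    # Hypothetical credit-scoring pipeline: scaling, feature selection, and
    # hyperparameter optimization around a Random Forest classifier.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in for a consumer-lending dataset (label 1 = default).
    X, y = make_classification(n_samples=5000, n_features=30, weights=[0.9, 0.1],
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("fs", SelectKBest(f_classif)),
        ("clf", RandomForestClassifier(random_state=0, class_weight="balanced")),
    ])

    param_distributions = {
        "fs__k": [10, 15, 20, 30],
        "clf__n_estimators": [100, 300, 500],
        "clf__max_depth": [None, 5, 10, 20],
    }
    search = RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5,
                                scoring="roc_auc", random_state=0)
    search.fit(X_train, y_train)

    print("best params:", search.best_params_)
    proba = search.best_estimator_.predict_proba(X_test)[:, 1]
    print("test AUC:", roc_auc_score(y_test, proba))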