1,499 research outputs found

    Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks

    Categorical variables often appear in datasets for classification and regression tasks, and they must be encoded as numerical values before training. Because many encoders have been developed and the choice can significantly affect performance, selecting an appropriate encoder for a task is a time-consuming yet important practical issue. This study broadly classifies machine learning models into three categories: 1) ATI models, which implicitly perform affine transformations on their inputs, such as multi-layer perceptron neural networks; 2) tree-based models, which are built on decision trees, such as random forests; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoder by learning suitable weights from the data. We also explain why the target encoder and its variants are the most suitable encoders for tree-based models. We conducted comprehensive computational experiments to evaluate 14 encoders, including one-hot and target encoders, with eight common machine learning models on 28 datasets. The computational results agree with our theoretical analysis. The findings shed light on how data scientists in fields such as fraud detection and disease diagnosis can select a suitable encoder.
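
The claim that a one-hot encoder can mimic any other encoder under an affine transformation can be illustrated with a small sketch. The categories, encoding values, weight, and bias below are invented for illustration and do not come from the paper; the point is only that choosing one weight per category equal to `w * e(c)` reproduces the output of an affine unit applied to the encoded value `e(c)`.

```python
# Hypothetical categories and a hypothetical scalar encoding e(c),
# e.g. the kind of values a target encoder might produce.
categories = ["red", "green", "blue"]
scalar_encoding = {"red": 0.2, "green": 0.7, "blue": 0.5}

def one_hot(category):
    """Return the one-hot vector of `category` over `categories`."""
    return [1.0 if category == c else 0.0 for c in categories]

# An affine unit applied to the scalar encoding: z = w * e(c) + b.
w, b = 3.0, -1.0

def affine_on_scalar(category):
    return w * scalar_encoding[category] + b

# The same unit on one-hot inputs: one weight per category, set to w * e(c).
# In practice these weights would be learned; here they are set by hand.
one_hot_weights = [w * scalar_encoding[c] for c in categories]

def affine_on_one_hot(category):
    x = one_hot(category)
    return sum(wi * xi for wi, xi in zip(one_hot_weights, x)) + b

for c in categories:
    print(c, affine_on_scalar(c), affine_on_one_hot(c))
```

Both functions return the same pre-activation for every category, which is the sense in which the one-hot input loses nothing relative to the scalar encoding for such a model.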

    Reduction of emergency department returns after discharge from hospital: Machine learning model to predict emergency department returns 30 days post hospital discharge for medical patients

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. Returns to the emergency department (ED) after hospital discharge are associated with reduced efficiency of ED utilisation and lower quality of healthcare. These returns are often related to the nature of the disease and/or inadequate care. This thesis aims to develop a machine learning model that predicts ED returns within 30 days of inpatient discharge from Portuguese public hospitals. Different binary classification models were trained and evaluated with a particular focus on sensitivity (the ability to identify the critical class of returning patients). The selected model was the Extreme Gradient Boosting classifier, which showed the best performance on recall and on the other performance metrics considered. A cohort of 93 449 medical hospitalisations of adult patients discharged between January 1st, 2018, and December 31st, 2019, was assembled, with diagnosis details, for use in this study. Given the requirements of the problem, recall was the performance metric to be maximised. Performance optimisation methods were therefore considered, and the final model achieved a recall of 84.38%, a precision of 84.35%, an F1 score of 84.36% and an accuracy of 84.10%. Future deployment and integration of this ED return prediction into the inpatient care workflow may help identify patients who require targeted care interventions that reduce overall healthcare expense and improve health outcomes.
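
For reference, the four metrics reported above can be computed from the confusion-matrix counts of a binary classifier. The sketch below is a generic illustration; the function name and the example counts are hypothetical and are not taken from the thesis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, F1 and accuracy from confusion-matrix counts.

    tp: returning patients correctly flagged    fp: non-returning patients flagged
    fn: returning patients missed               tn: non-returning patients not flagged
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}

# Invented counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Maximising recall, as the thesis does, means minimising `fn`, the returning patients the model fails to flag.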

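    Automatic analysis of high-dimensional categorical variables in medical databases for the prediction of hospital bacteremia (Análisis automático de variables categóricas de alta dimensionalidad en bases de datos médicas para la predicción de bacteriemias hospitalarias)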

    Bachelor's thesis (Trabajo de Fin de Grado) in Computer Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2020/2021. This project aims to continue and consolidate the study of bacteremia detection and diagnosis carried out by colleagues from the faculty last year. A first pass based on the analysis of numerical variables gave a deeper understanding of the problem and outlined an approach for a rapid detection model. Categorical variables now become relevant as well, in order to achieve better results in the classifier models. Categorical variables have been included in classifier models for at least five years thanks to the increase in computational capacity, and the resulting benefit to the classifiers is clear. Yet, because language is complex and abstract, classifiers struggle when the data to be predicted contain slang or abbreviations, even when the linguistic register is heavily bounded, i.e. when the data are strictly medical. Throughout the study we apply text cleaning and text processing methods to prepare the variables for use, since their format is heterogeneous and unsuitable for processing by machine learning tools. We also apply a string similarity method to identify the classes that can help the classification process, and we assess the most suitable types of encoding for these variables. Finally, we apply the Random Forest machine learning algorithm to the dataset, using techniques that avoid learning bias, and we assess the results in terms of success rates and the relevance of the variables in the algorithm's decision-making process.
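
The text cleaning and string similarity steps described above can be sketched as follows. The abstract does not say which similarity measure was used, so this example substitutes the ratio from Python's standard `difflib.SequenceMatcher`; the function names, the 0.8 threshold, and the sample labels are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(label):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", label.lower())
    return " ".join(cleaned.split())

def group_similar_labels(labels, threshold=0.8):
    """Map each raw label to a canonical label using a greedy similarity match.

    A label joins the first existing canonical label whose similarity ratio
    is at least `threshold`; otherwise it becomes a new canonical label.
    """
    canonical = []  # canonical labels, in order of first appearance
    mapping = {}
    for raw in labels:
        norm = normalize(raw)
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, norm, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(norm)
            match = norm
        mapping[raw] = match
    return mapping

# Invented free-text entries, for illustration only.
labels = ["Pneumonia", "pneumonia ", "pneumonai", "Sepsis"]
print(group_similar_labels(labels))
```

Greedy matching depends on the order of the labels, so a production version would probably need a more careful clustering step and manual review of the merged classes.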

    Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

    Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditional, widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
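
As an illustration of what a regularized target encoding can look like, the sketch below shrinks each level's mean target toward the global mean (additive smoothing). This is one common form of regularization and is not necessarily the scheme benchmarked in the paper; the function names and the `smoothing` parameter are hypothetical.

```python
from collections import defaultdict

def fit_smoothed_target_encoder(levels, targets, smoothing=10.0):
    """Learn a level -> encoded value mapping from training data.

    Each level's value is (sum of targets + smoothing * global_mean)
    divided by (count + smoothing), so rare levels are pulled toward
    the global mean and frequent levels stay near their own mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    mapping = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return mapping, global_mean

def transform(levels, mapping, global_mean):
    """Encode levels; levels unseen during fitting fall back to the global mean."""
    return [mapping.get(level, global_mean) for level in levels]

# Invented training data, for illustration only.
train_levels = ["a", "a", "b"]
train_targets = [1.0, 1.0, 0.0]
mapping, global_mean = fit_smoothed_target_encoder(train_levels, train_targets, smoothing=1.0)
print(transform(["a", "b", "z"], mapping, global_mean))
```

The mapping should be fitted on training data only, since encoding a row with a statistic that includes its own target leaks label information into the feature.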

    A benchmark of categorical encoders for binary classification

    Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper is the most comprehensive benchmark of categorical encoders to date, including an extensive evaluation of 32 configurations of encoders from diverse families, with 36 combinations of experimental factors, and on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions, aspects disregarded in previous encoder benchmarks. Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
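
The sensitivity to aggregation strategy mentioned above can be seen in a toy case. In the sketch below, the scores, encoder names, and the two aggregation functions are invented for illustration and do not reproduce the paper's data or its exact strategies; they only show that averaging raw scores and averaging per-dataset ranks can name different winners.

```python
from statistics import mean

# Invented scores (higher is better) for two hypothetical encoders on three datasets.
scores = {
    "encoder_A": [0.90, 0.60, 0.61],
    "encoder_B": [0.70, 0.65, 0.66],
}

def winner_by_mean_score(scores):
    """Pick the encoder with the highest average score across datasets."""
    return max(scores, key=lambda name: mean(scores[name]))

def winner_by_mean_rank(scores):
    """Pick the encoder with the best (lowest) average per-dataset rank.

    Ties within a dataset are not handled specially in this sketch.
    """
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    ranks = {name: [] for name in names}
    for i in range(n_datasets):
        ordered = sorted(names, key=lambda name: scores[name][i], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    return min(ranks, key=lambda name: mean(ranks[name]))

print(winner_by_mean_score(scores), winner_by_mean_rank(scores))
```

Here one large margin on a single dataset dominates the mean score, while the rank-based view rewards the encoder that wins on more datasets.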