Automatic analysis of high-dimensionality categorical variables in medical databases for the prediction of hospital-acquired bacteremia
Bachelor's thesis in Computer Engineering, Faculty of Informatics, UCM, Department of Computer Architecture and Automation, academic year 2020/2021. This project aims to continue and consolidate the study of the bacteremia detection and diagnosis process carried out by fellow students last year. An initial pass through the numerical variables provided a deeper understanding and outlined an approach for a quick detection model. Categorical variables now become relevant as well, in order to achieve better results in the classifier models.
Categorical variables have been used in classifier models for at least five years, enabled by the increase in computational capacity, and their direct benefit to classifier performance is clear. Yet language is complex and abstract, and classifiers are known to struggle when data containing slang or abbreviations appears at prediction time, even when the linguistic register is tightly bounded, as it is with strictly medical data.
Throughout the study we will apply text cleaning and text processing methods to prepare the variables for use, since their format is heterogeneous and unsuitable for processing by Machine Learning tools.
We will also apply string similarity methods to identify the classes that can help the classification algorithm, and we will assess which types of encoding are most suitable for working with these variables.
Finally, we will apply the Random Forest Machine Learning algorithm to the dataset, using techniques that avoid learning bias, and we will assess the results in terms of success rates and the relevance of each variable in the algorithm's decision-making process.
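The cleaning-plus-similarity step this abstract describes can be illustrated with Python's standard-library difflib; the microorganism names and the 0.85 threshold below are hypothetical choices for illustration, not the thesis's actual data or parameters:

```python
from difflib import SequenceMatcher

def canonicalize(values, threshold=0.85):
    """Map each raw string to the first previously seen spelling whose
    similarity ratio exceeds the threshold, merging near-duplicates
    such as abbreviations and punctuation variants."""
    canon = []     # representative cleaned spellings seen so far
    mapping = {}
    for v in values:
        cleaned = v.strip().lower()
        for c in canon:
            if SequenceMatcher(None, cleaned, c).ratio() >= threshold:
                mapping[v] = c
                break
        else:
            canon.append(cleaned)
            mapping[v] = cleaned
    return mapping

raw = ["E. coli", "e.coli", "E coli", "S. aureus", "s aureus"]
print(canonicalize(raw))  # five raw spellings collapse to two classes
```

Merging near-duplicate levels this way reduces cardinality before encoding, which is exactly what helps the downstream classifier.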
A benchmark of categorical encoders for binary classification
Categorical encoders transform categorical features into numerical
representations that are indispensable for a wide range of machine learning
models. Existing encoder benchmark studies lack generalizability because of
their limited choice of (1) encoders, (2) experimental factors, and (3)
datasets. Additionally, inconsistencies arise from the adoption of varying
aggregation strategies. This paper is the most comprehensive benchmark of
categorical encoders to date, including an extensive evaluation of 32
configurations of encoders from diverse families, with 36 combinations of
experimental factors, and on 50 datasets. The study shows the profound
influence of dataset selection, experimental factors, and aggregation
strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks.
Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
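The abstract's point about aggregation strategies can be made concrete with a toy example (the scores below are invented): averaging raw scores across datasets and averaging per-dataset ranks can crown different winners.

```python
from statistics import mean

# Hypothetical accuracies of two encoders on three datasets
scores = {"one-hot": [0.99, 0.60, 0.60],
          "target":  [0.70, 0.65, 0.65]}

# Strategy 1: average the raw score across datasets
by_mean = max(scores, key=lambda e: mean(scores[e]))

# Strategy 2: average the per-dataset rank (1 = best)
def mean_rank(enc):
    ranks = []
    for i in range(3):
        order = sorted(scores, key=lambda e: -scores[e][i])
        ranks.append(order.index(enc) + 1)
    return mean(ranks)

by_rank = min(scores, key=mean_rank)
print(by_mean, by_rank)  # the two strategies disagree on the winner
```

One outlier dataset dominates the mean-score aggregation, while the rank-based aggregation rewards consistency; this is why the choice of aggregation strategy changes a benchmark's conclusions.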
Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks
Categorical variables often appear in datasets for classification and
regression tasks, and they need to be encoded into numerical values before
training. Since many encoders have been developed and can significantly impact
performance, choosing the appropriate encoder for a task becomes a
time-consuming yet important practical issue. This study broadly classifies
machine learning models into three categories: 1) ATI models that implicitly
perform affine transformations on inputs, such as multi-layer perceptron neural
networks; 2) tree-based models built on decision trees, such as random
forests; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot
encoder is the best choice for ATI models in the sense that it can mimic any
other encoders by learning suitable weights from the data. We also explain why
the target encoder and its variants are the most suitable encoders for
tree-based models. This study conducted comprehensive computational experiments
to evaluate 14 encoders, including one-hot and target encoders, along with
eight common machine-learning models on 28 datasets. The computational results
agree with our theoretical analysis. The findings in this study shed light on how data scientists in fields such as fraud detection and disease diagnosis can select a suitable encoder.
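The one-hot result has a simple constructive intuition: an affine model can recover any scalar encoder by storing the encoder's value for each category in the matching weight. A minimal sketch with invented categories and encoding values:

```python
# Invented categories and an arbitrary scalar encoding of them
cats = ["a", "b", "c"]
target_enc = {"a": 0.2, "b": 0.9, "c": 0.5}

def one_hot(c):
    """Indicator vector for category c over the known categories."""
    return [1.0 if c == k else 0.0 for k in cats]

# Choose the weights to equal the encoder's values; bias = 0
w = [target_enc[k] for k in cats]

def affine(x):
    """Affine map w.x (the first layer of an ATI model)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# The affine model on one-hot inputs reproduces the encoder exactly
for c in cats:
    assert affine(one_hot(c)) == target_enc[c]
print("one-hot + learned weights mimics the target encoder")
```

Since an ATI model can learn such weights from data, one-hot inputs let it mimic any other encoder, which is the sense in which one-hot is optimal for that model class.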
Neural Nearest Neighbors Networks
Non-local methods exploiting the self-similarity of natural signals have been
well studied, for example in image analysis and restoration. Existing
approaches, however, rely on k-nearest neighbors (KNN) matching in a fixed
feature space. The main hurdle in optimizing this feature space w.r.t.
application performance is the non-differentiability of the KNN selection rule.
To overcome this, we propose a continuous deterministic relaxation of KNN
selection that maintains differentiability w.r.t. pairwise distances, but
retains the original KNN as the limit of a temperature parameter approaching
zero. To exploit our relaxation, we propose the neural nearest neighbors block
(N3 block), a novel non-local processing layer that leverages the principle of
self-similarity and can be used as building block in modern neural network
architectures. We show its effectiveness for the set reasoning task of
correspondence classification as well as for image restoration, including image
denoising and single image super-resolution, where we outperform strong
convolutional neural network (CNN) baselines and recent non-local models that
rely on KNN selection in hand-chosen feature spaces.
Comment: to appear at NIPS 2018; code available at https://github.com/visinf/n3net
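The relaxation described above can be sketched as a softmax over negative pairwise distances with a temperature parameter; this toy version (invented distances, k = 1) shows the weights hardening into the nearest neighbor as the temperature approaches zero.

```python
import math

def soft_knn_weights(dists, t):
    """Continuous relaxation of 1-NN selection: a softmax over negative
    distances with temperature t. As t -> 0 the weights approach a
    one-hot vector on the nearest neighbour, recovering hard KNN."""
    logits = [-d / t for d in dists]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

dists = [0.3, 1.0, 2.5]
warm = soft_knn_weights(dists, t=1.0)   # smooth, differentiable weights
cold = soft_knn_weights(dists, t=0.01)  # nearly hard 1-NN selection
print([round(x, 3) for x in warm], [round(x, 3) for x in cold])
```

Because the warm weights vary smoothly with the pairwise distances, gradients can flow through them into the feature space, which is precisely what the hard KNN selection rule prevents.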
Data Preparation in the Big Data Era
Preparing and cleaning data is notoriously expensive, prone to error, and time consuming: the process accounts for roughly 80% of the total time spent on analysis. As this O’Reilly report points out, enterprises have already invested billions of dollars in big data analytics, so there’s great incentive to modernize methods for cleaning, combining, and transforming data.
Author Federico Castanedo, Chief Data Scientist at WiseAthena.com, details best practices for reducing the time it takes to convert raw data into actionable insights. With these tools and techniques in mind, your organization will be well positioned to translate big data into big decisions.
• Explore the problems organizations face today with traditional prep and integration
• Define the business questions you want to address before selecting, prepping, and analyzing data
• Learn new methods for preparing raw data, including date-time and string data
• Understand how some cleaning actions (like replacing missing values) affect your analysis
• Examine data curation products: modern approaches that scale
• Consider your business audience when choosing ways to deliver your analysis
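The point about cleaning actions affecting analysis is easy to demonstrate: mean imputation preserves the mean but shrinks the variance, which can bias downstream results. A small standard-library sketch with invented measurements:

```python
from statistics import mean, pvariance

# Hypothetical measurements with missing entries (None)
raw = [2.0, 4.0, None, 6.0, None, 8.0]

observed = [x for x in raw if x is not None]
imputed = [x if x is not None else mean(observed) for x in raw]

# The mean is preserved, but the variance shrinks: the imputed points
# sit exactly on the mean and contribute zero spread.
print(mean(observed), mean(imputed))            # same centre
print(pvariance(observed), pvariance(imputed))  # variance drops
```

Any analysis that depends on spread (confidence intervals, correlations, standardized features) inherits this distortion, which is why imputation choices deserve scrutiny before modeling.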
Federico Castanedo is the Chief Data Scientist at WiseAthena.com. Involved in projects related to data analysis in academia and industry for more than a decade, he has published several scientific papers about data fusion techniques, visual sensor networks, and machine learning.
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features
Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance, and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary-, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or that reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
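The regularized target encoding described above can be sketched in a few lines: a leave-one-out variant in plain Python, on invented toy data, where each row is encoded with the mean target of the other rows in its category (falling back to the global prior for singletons).

```python
from statistics import mean

# Toy rows of (category, binary target); purely illustrative data
rows = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
prior = mean(t for _, t in rows)   # global target mean

def loo_target_encode(rows):
    """Leave-one-out target encoding: each row's code is the mean
    target of the *other* rows in its category. Keeping the row's own
    label out of its feature is the regularization that limits target
    leakage; singleton categories fall back to the global prior."""
    codes = []
    for i, (c, _) in enumerate(rows):
        others = [t for j, (cj, t) in enumerate(rows) if j != i and cj == c]
        codes.append(mean(others) if others else prior)
    return codes

print(loo_target_encode(rows))  # [0.5, 1.0, 0.5, 0.5, 0.5, 0.0]
```

An unregularized target encoder would instead give every "a" row the code 2/3 computed from all rows, including each row's own label; on high-cardinality features with few rows per level, that self-inclusion is what causes the overfitting the regularized variants avoid.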