Automatic analysis of high-dimensionality categorical variables in medical databases for the prediction of hospital-acquired bacteremia
Bachelor's thesis in Computer Engineering, Faculty of Informatics, UCM, Department of Computer Architecture and Automation, academic year 2020/2021. This project aims to continue and consolidate the study of the bacteremia detection and diagnosis process carried out by fellow students last year. An initial pass through the numerical variables provided a deeper understanding and outlined an approach for a quick detection model. Categorical variables now become relevant as well, in order to achieve better results in the classifier models.
Categorical variables have been used in classifier models for at least five years, enabled by the increase in computational capacity, and their direct benefit to classifier performance is clear. Yet language is complex and abstract, and classifiers are known to struggle when data containing slang or abbreviations appears at prediction time, even when the linguistic register is tightly bounded, as it is with strictly medical data.
Throughout the study we will apply text cleaning and text processing methods to prepare the variables for use, since their format is heterogeneous and unsuitable for processing by Machine Learning tools.
We will also apply string similarity methods to identify the classes that can help the classification algorithm, and we will assess which types of encoding are most suitable for working with these variables.
Finally, we will apply the Random Forest Machine Learning algorithm to the dataset, using techniques that avoid learning bias, and we will assess the results in terms of success rates and the relevance of each variable in the algorithm's decision-making process.
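The cleaning-plus-similarity step this abstract describes can be illustrated with Python's standard-library difflib; the microorganism names and the 0.85 threshold below are hypothetical choices for illustration, not the thesis's actual data or parameters:

```python
from difflib import SequenceMatcher

def canonicalize(values, threshold=0.85):
    """Map each raw string to the first previously seen spelling whose
    similarity ratio exceeds the threshold, merging near-duplicates
    such as abbreviations and punctuation variants."""
    canon = []     # representative cleaned spellings seen so far
    mapping = {}
    for v in values:
        cleaned = v.strip().lower()
        for c in canon:
            if SequenceMatcher(None, cleaned, c).ratio() >= threshold:
                mapping[v] = c
                break
        else:
            canon.append(cleaned)
            mapping[v] = cleaned
    return mapping

raw = ["E. coli", "e.coli", "E coli", "S. aureus", "s aureus"]
print(canonicalize(raw))  # five raw spellings collapse to two classes
```

Merging near-duplicate levels this way reduces cardinality before encoding, which is exactly what helps the downstream classifier.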
A benchmark of categorical encoders for binary classification
Categorical encoders transform categorical features into numerical
representations that are indispensable for a wide range of machine learning
models. Existing encoder benchmark studies lack generalizability because of
their limited choice of (1) encoders, (2) experimental factors, and (3)
datasets. Additionally, inconsistencies arise from the adoption of varying
aggregation strategies. This paper is the most comprehensive benchmark of
categorical encoders to date, including an extensive evaluation of 32
configurations of encoders from diverse families, with 36 combinations of
experimental factors, and on 50 datasets. The study shows the profound
influence of dataset selection, experimental factors, and aggregation
strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks.
Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
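The abstract's point about aggregation strategies can be made concrete with a toy example (the scores below are invented): averaging raw scores across datasets and averaging per-dataset ranks can crown different winners.

```python
from statistics import mean

# Hypothetical accuracies of two encoders on three datasets
scores = {"one-hot": [0.99, 0.60, 0.60],
          "target":  [0.70, 0.65, 0.65]}

# Strategy 1: average the raw score across datasets
by_mean = max(scores, key=lambda e: mean(scores[e]))

# Strategy 2: average the per-dataset rank (1 = best)
def mean_rank(enc):
    ranks = []
    for i in range(3):
        order = sorted(scores, key=lambda e: -scores[e][i])
        ranks.append(order.index(enc) + 1)
    return mean(ranks)

by_rank = min(scores, key=mean_rank)
print(by_mean, by_rank)  # the two strategies disagree on the winner
```

One outlier dataset dominates the mean-score aggregation, while the rank-based aggregation rewards consistency; this is why the choice of aggregation strategy changes a benchmark's conclusions.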
Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks
Categorical variables often appear in datasets for classification and
regression tasks, and they need to be encoded into numerical values before
training. Since many encoders have been developed and can significantly impact
performance, choosing the appropriate encoder for a task becomes a
time-consuming yet important practical issue. This study broadly classifies
machine learning models into three categories: 1) ATI models that implicitly
perform affine transformations on inputs, such as multi-layer perceptron neural
networks; 2) tree-based models built on decision trees, such as random
forests; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot
encoder is the best choice for ATI models in the sense that it can mimic any
other encoders by learning suitable weights from the data. We also explain why
the target encoder and its variants are the most suitable encoders for
tree-based models. This study conducted comprehensive computational experiments
to evaluate 14 encoders, including one-hot and target encoders, along with
eight common machine-learning models on 28 datasets. The computational results
agree with our theoretical analysis. The findings in this study shed light on how data scientists in fields such as fraud detection and disease diagnosis can select a suitable encoder.
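The one-hot result has a simple constructive intuition: an affine model can recover any scalar encoder by storing the encoder's value for each category in the matching weight. A minimal sketch with invented categories and encoding values:

```python
# Invented categories and an arbitrary scalar encoding of them
cats = ["a", "b", "c"]
target_enc = {"a": 0.2, "b": 0.9, "c": 0.5}

def one_hot(c):
    """Indicator vector for category c over the known categories."""
    return [1.0 if c == k else 0.0 for k in cats]

# Choose the weights to equal the encoder's values; bias = 0
w = [target_enc[k] for k in cats]

def affine(x):
    """Affine map w.x (the first layer of an ATI model)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# The affine model on one-hot inputs reproduces the encoder exactly
for c in cats:
    assert affine(one_hot(c)) == target_enc[c]
print("one-hot + learned weights mimics the target encoder")
```

Since an ATI model can learn such weights from data, one-hot inputs let it mimic any other encoder, which is the sense in which one-hot is optimal for that model class.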
Neural Nearest Neighbors Networks
Non-local methods exploiting the self-similarity of natural signals have been
well studied, for example in image analysis and restoration. Existing
approaches, however, rely on k-nearest neighbors (KNN) matching in a fixed
feature space. The main hurdle in optimizing this feature space w.r.t.
application performance is the non-differentiability of the KNN selection rule.
To overcome this, we propose a continuous deterministic relaxation of KNN
selection that maintains differentiability w.r.t. pairwise distances, but
retains the original KNN as the limit of a temperature parameter approaching
zero. To exploit our relaxation, we propose the neural nearest neighbors block
(N3 block), a novel non-local processing layer that leverages the principle of
self-similarity and can be used as building block in modern neural network
architectures. We show its effectiveness for the set reasoning task of
correspondence classification as well as for image restoration, including image
denoising and single image super-resolution, where we outperform strong
convolutional neural network (CNN) baselines and recent non-local models that
rely on KNN selection in hand-chosen feature spaces.
Comment: to appear at NIPS 2018; code available at https://github.com/visinf/n3net
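The relaxation described above can be sketched as a softmax over negative pairwise distances with a temperature parameter; this toy version (invented distances, k = 1) shows the weights hardening into the nearest neighbor as the temperature approaches zero.

```python
import math

def soft_knn_weights(dists, t):
    """Continuous relaxation of 1-NN selection: a softmax over negative
    distances with temperature t. As t -> 0 the weights approach a
    one-hot vector on the nearest neighbour, recovering hard KNN."""
    logits = [-d / t for d in dists]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

dists = [0.3, 1.0, 2.5]
warm = soft_knn_weights(dists, t=1.0)   # smooth, differentiable weights
cold = soft_knn_weights(dists, t=0.01)  # nearly hard 1-NN selection
print([round(x, 3) for x in warm], [round(x, 3) for x in cold])
```

Because the warm weights vary smoothly with the pairwise distances, gradients can flow through them into the feature space, which is precisely what the hard KNN selection rule prevents.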
Data Preparation in the Big Data Era
Preparing and cleaning data is notoriously expensive, prone to error, and time consuming: the process accounts for roughly 80% of the total time spent on analysis. As this O’Reilly report points out, enterprises have already invested billions of dollars in big data analytics, so there’s great incentive to modernize methods for cleaning, combining, and transforming data.
Author Federico Castanedo, Chief Data Scientist at WiseAthena.com, details best practices for reducing the time it takes to convert raw data into actionable insights. With these tools and techniques in mind, your organization will be well positioned to translate big data into big decisions.
• Explore the problems organizations face today with traditional prep and integration
• Define the business questions you want to address before selecting, prepping, and analyzing data
• Learn new methods for preparing raw data, including date-time and string data
• Understand how some cleaning actions (like replacing missing values) affect your analysis
• Examine data curation products: modern approaches that scale
• Consider your business audience when choosing ways to deliver your analysis
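The point about cleaning actions affecting analysis is easy to demonstrate: mean imputation preserves the mean but shrinks the variance, which can bias downstream results. A small standard-library sketch with invented measurements:

```python
from statistics import mean, pvariance

# Hypothetical measurements with missing entries (None)
raw = [2.0, 4.0, None, 6.0, None, 8.0]

observed = [x for x in raw if x is not None]
imputed = [x if x is not None else mean(observed) for x in raw]

# The mean is preserved, but the variance shrinks: the imputed points
# sit exactly on the mean and contribute zero spread.
print(mean(observed), mean(imputed))            # same centre
print(pvariance(observed), pvariance(imputed))  # variance drops
```

Any analysis that depends on spread (confidence intervals, correlations, standardized features) inherits this distortion, which is why imputation choices deserve scrutiny before modeling.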
Federico Castanedo is the Chief Data Scientist at WiseAthena.com. Involved in projects related to data analysis in academia and industry for more than a decade, he has published several scientific papers about data fusion techniques, visual sensor networks, and machine learning.
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features
Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance, and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary-, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or that reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
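The regularized target encoding described above can be sketched in a few lines: a leave-one-out variant in plain Python, on invented toy data, where each row is encoded with the mean target of the other rows in its category (falling back to the global prior for singletons).

```python
from statistics import mean

# Toy rows of (category, binary target); purely illustrative data
rows = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
prior = mean(t for _, t in rows)   # global target mean

def loo_target_encode(rows):
    """Leave-one-out target encoding: each row's code is the mean
    target of the *other* rows in its category. Keeping the row's own
    label out of its feature is the regularization that limits target
    leakage; singleton categories fall back to the global prior."""
    codes = []
    for i, (c, _) in enumerate(rows):
        others = [t for j, (cj, t) in enumerate(rows) if j != i and cj == c]
        codes.append(mean(others) if others else prior)
    return codes

print(loo_target_encode(rows))  # [0.5, 1.0, 0.5, 0.5, 0.5, 0.0]
```

An unregularized target encoder would instead give every "a" row the code 2/3 computed from all rows, including each row's own label; on high-cardinality features with few rows per level, that self-inclusion is what causes the overfitting the regularized variants avoid.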