105 research outputs found
Benchmarking machine learning models on multi-centre eICU critical care dataset
Progress of machine learning in critical care has been difficult to track, in
part due to the absence of public benchmarks. Other fields of research (such as
computer vision and natural language processing) have established various
competitions and public benchmarks. Recent availability of large clinical
datasets has enabled the possibility of establishing public benchmarks. Taking
advantage of this opportunity, we propose a public benchmark suite to address
four areas of critical care, namely mortality prediction, estimation of length
of stay, patient phenotyping and risk of decompensation. We define each task
and compare the performance of clinical models, baseline models, and deep
learning models using the eICU critical care dataset of around 73,000
patients. This is the first public benchmark on a multi-centre critical care
dataset comparing the performance of the clinical gold standard with our
predictive models. We also investigate the impact of numerical variables, as
well as the handling of categorical variables, on each of the defined tasks.
The source code, detailing our methods and experiments, is publicly available
so that anyone can replicate our results and build upon our work.
Comment: Source code to replicate the results:
https://github.com/mostafaalishahi/eICU_Benchmar
GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features
In the era of Big Data, effective data reduction through feature selection is of paramount importance for machine learning. This paper presents GLEm-Net (Grouped Lasso with Embeddings Network), a novel neural framework that seamlessly processes both categorical and numerical features to reduce the dimensionality of data while retaining as much information as possible. By integrating embedding layers, GLEm-Net effectively manages categorical features with high cardinality and compresses their information into a lower-dimensional space.
By using a grouped Lasso penalty function in its architecture, GLEm-Net simultaneously processes categorical and numerical
data, efficiently reducing high-dimensional data while preserving the essential information. We test GLEm-Net with a real-world
application in an industrial environment where 6 million records exist and each is described by a mixture of 19 numerical and 7
categorical features with a strong class imbalance. A comparative analysis using state-of-the-art methods shows that despite the
difficulty of building a high-performance model, GLEm-Net outperforms the other methods in both feature selection and classification, with a better balance in the selection of both numerical and categorical features.
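The grouped Lasso idea described above can be sketched in a few lines: the penalty sums the L2 norms of whole weight groups, so all embedding weights of one categorical feature shrink to zero together, dropping the feature as a unit. This is a minimal illustrative sketch, not GLEm-Net's actual architecture or API; the function and variable names are hypothetical.

```python
import math

def group_lasso_penalty(groups, lam=0.1):
    """Grouped Lasso: lam * sum over groups of sqrt(group size) * L2 norm.
    Because the norm is taken per group, either the whole group stays
    or the whole group is driven to zero, which selects or drops an
    entire feature (e.g. one embedding matrix) at once."""
    total = 0.0
    for w in groups:
        norm = math.sqrt(sum(x * x for x in w))
        total += math.sqrt(len(w)) * norm
    return lam * total

# One group per feature: a numerical feature contributes one weight,
# a categorical feature contributes all weights of its embedding.
numeric_w = [0.8]                 # weight for one numerical feature
embed_w = [0.0, 0.0, 0.0, 0.0]    # zeroed embedding -> feature dropped
penalty = group_lasso_penalty([numeric_w, embed_w], lam=0.1)
```

Scaling each group's norm by the square root of its size is the common convention that keeps large embedding groups from being penalized merely for having more weights.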
Categorical Encoding for Machine Learning
Abstract: In recent years, interest has grown in addressing the problem of encoding categorical variables, especially in deep learning applied to big data. However, the current proposals are not entirely satisfactory. The aim of this work is to show the logic and advantages of a new encoding method that takes its cue from recent word embedding proposals and which we have called Categorical Embedding. Both a supervised and an unsupervised approach will be considered.
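The core mechanism shared by such embedding-based encodings can be sketched simply: instead of a sparse one-hot vector of length |vocabulary|, each category is mapped through a (trainable) lookup table to a short dense vector. The sketch below is illustrative only, with hypothetical names; in a real model the table entries would be learned rather than randomly initialized and frozen.

```python
import random

random.seed(0)
EMB_DIM = 3
categories = ["red", "green", "blue", "violet"]

# A lookup table replaces the 4-dimensional one-hot vector with a
# dense 3-dimensional code per category; in training these vectors
# would be updated by gradient descent like any other weights.
embedding = {c: [random.uniform(-1, 1) for _ in range(EMB_DIM)]
             for c in categories}

def encode(value):
    return embedding[value]  # dense code instead of a one-hot vector

vec = encode("green")
```

With high-cardinality variables the saving is dramatic: a feature with tens of thousands of levels collapses from a one-hot vector of that length to a handful of dense dimensions.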
A comparison of neural and non-neural machine learning models for food safety risk prediction with European Union RASFF data.
The European Union launched the RASFF portal in 1977 to ensure cross-border monitoring and a quick reaction when public health risks are detected in the food chain. There are not enough resources available to guarantee a comprehensive inspection policy, but RASFF data has enormous potential as a preventive tool. However, there are few studies predicting food and feed risk issues, and none using RASFF data. Although deep learning models are good prediction systems, whether they outperform other machine learning techniques in this field remains to be confirmed. The encoding of categorical variables as input to numerical models deserves particular study. Results in this paper show that deep learning with entity embedding is the best combination, with accuracies of 86.81%, 82.31%, and 88.94% in each of the three stages of the simplified RASFF process in which the tests were carried out. However, random forest models with one-hot encoding offer only slightly worse results, suggesting that the encoding carries more weight for result quality than the prediction technique. Our work also demonstrates that probabilistic predictions (an advantage of neural models) can be used to optimize the number of inspections that can be carried out.
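The last point, using probabilistic outputs to optimize inspections, amounts to ranking notifications by predicted risk and spending the limited inspection budget on the riskiest ones. A minimal sketch under that assumption (function name and numbers are illustrative, not from the paper):

```python
def prioritize_inspections(risk_probs, budget):
    """Rank items by predicted risk probability (a benefit of
    probabilistic classifiers over hard labels) and return the
    indices of the top `budget` items to inspect."""
    ranked = sorted(range(len(risk_probs)),
                    key=lambda i: risk_probs[i], reverse=True)
    return ranked[:budget]

probs = [0.12, 0.91, 0.45, 0.88, 0.05]  # model outputs, illustrative
to_inspect = prioritize_inspections(probs, budget=2)
```

A hard classifier would flag every item above a threshold regardless of capacity; ranking by probability lets the inspection count be fixed in advance while still covering the highest-risk notifications first.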
LambdaOpt: Learn to Regularize Recommender Models in Finer Levels
Recommendation models mainly deal with categorical variables, such as
user/item ID and attributes. Besides the high-cardinality issue, the
interactions among such categorical variables are usually long-tailed, with the
head made up of highly frequent values and a long tail of rare ones. This
phenomenon results in the data sparsity issue, making it essential to
regularize the models to ensure generalization. The common practice is to
employ grid search to manually tune regularization hyperparameters based on the
validation data. However, it requires non-trivial efforts and large computation
resources to search the whole candidate space; even then, it may not find the
optimal choice, since different parameters may call for different
regularization strengths. In this paper, we propose a hyperparameter
optimization method, LambdaOpt, which automatically and adaptively enforces
regularization during training. Specifically, it updates the regularization
coefficients based on performance on the validation data. With LambdaOpt, the
notorious tuning of regularization hyperparameters can be avoided; more
importantly, it allows fine-grained regularization (i.e. each parameter can
have an individualized regularization coefficient), leading to better
generalized models. We show how to employ LambdaOpt on matrix factorization, a
classical model that is representative of a large family of recommender models.
Extensive experiments on two public benchmarks demonstrate the superiority of
our method in boosting the performance of top-K recommendation.
Comment: Accepted by KDD 201
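The fine-grained regularization above means every parameter carries its own coefficient, adapted from validation feedback instead of grid search. The toy update rule below only illustrates that per-parameter structure; LambdaOpt's actual validation-gradient derivation differs, and all names here are hypothetical.

```python
def update_lambdas(lambdas, val_grads, params, lr=0.01):
    """Toy adaptive update: each parameter keeps its own regularization
    coefficient. When the validation gradient and the parameter agree in
    sign, the parameter looks overfit, so its coefficient is increased;
    otherwise it is relaxed. Coefficients are clipped at zero."""
    new = []
    for lam, g, w in zip(lambdas, val_grads, params):
        lam = lam + lr * g * w      # illustrative rule, not the paper's
        new.append(max(lam, 0.0))   # keep coefficients non-negative
    return new

lams = update_lambdas([0.1, 0.1],
                      val_grads=[0.5, -0.2],
                      params=[2.0, 1.0])
```

Contrast this with grid search, where one global coefficient is tried per run: here the per-parameter coefficients evolve during a single training run, which is what makes the fine-grained setting tractable at all.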
Midwifery Learning and Forecasting: Predicting Content Demand with User-Generated Logs
Every day, 800 women and 6,700 newborns die from complications related to
pregnancy or childbirth. A well-trained midwife can prevent most of these
maternal and newborn deaths. Data science models together with logs generated
by users of online learning applications for midwives can help to improve their
learning competencies. The goal is to use these rich behavioral data to push
digital learning towards personalized content and to provide an adaptive
learning journey. In this work, we evaluate various forecasting methods to
determine future users' interest in the different kinds of content
available in the app, broken down by profession and region.
A Supervised Embedding and Clustering Anomaly Detection method for classification of Mobile Network Faults
The paper introduces Supervised Embedding and Clustering Anomaly Detection
(SEMC-AD), a method designed to efficiently identify faulty alarm logs in a
mobile network and alleviate the challenges of manual monitoring caused by the
growing volume of alarm logs. SEMC-AD employs a supervised embedding approach
based on deep neural networks, utilizing historical alarm logs and their labels
to extract numerical representations for each log, effectively addressing the
issue of imbalanced classification due to a small proportion of anomalies in
the dataset without employing one-hot encoding. The robustness of the embedding
is evaluated by plotting the two most significant principal components of the
embedded alarm logs, revealing that anomalies form distinct clusters with
similar embeddings. Multivariate normal Gaussian clustering is then applied to
these components, identifying clusters with a high ratio of anomalies to normal
alarms (above 90%) and labeling them as the anomaly group. To classify new
alarm logs, we check if their embedded vectors' two most significant principal
components fall within the anomaly-labeled clusters. If so, the log is
classified as an anomaly. Performance evaluation demonstrates that SEMC-AD
outperforms conventional random forest and gradient boosting methods without
embedding. SEMC-AD achieves 99% anomaly detection, whereas random forest and
XGBoost only detect 86% and 81% of anomalies, respectively. While supervised
classification methods may excel in labeled datasets, the results demonstrate
that SEMC-AD is more efficient in classifying anomalies in datasets with
numerous categorical features, significantly enhancing anomaly detection,
reducing operator burden, and improving network maintenance.
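The cluster-labeling step described above reduces to a simple rule: compute each cluster's ratio of anomalous logs, and mark clusters above the 90% threshold as the anomaly group; new logs are then classified by which cluster their projection falls into. A minimal sketch of that labeling rule (names and data are illustrative, and the clustering itself is assumed already done):

```python
def label_anomaly_clusters(assignments, labels, threshold=0.9):
    """Mark as the 'anomaly group' every cluster whose ratio of
    anomalous logs (label 1) to all logs exceeds `threshold`,
    mirroring the above-90% rule described in the abstract."""
    counts, anomalies = {}, {}
    for c, y in zip(assignments, labels):
        counts[c] = counts.get(c, 0) + 1
        anomalies[c] = anomalies.get(c, 0) + (1 if y == 1 else 0)
    return {c for c in counts if anomalies[c] / counts[c] > threshold}

# cluster 0: 1 of 3 logs anomalous; cluster 1: 2 of 2 anomalous
clusters = [0, 0, 0, 1, 1]
labels = [0, 0, 1, 1, 1]
anomaly_group = label_anomaly_clusters(clusters, labels)
```

In the full method the cluster assignments come from Gaussian clustering on the top two principal components of the embedded logs; the rule above is only the final labeling step applied to those assignments.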