
    Benchmarking machine learning models on multi-centre eICU critical care dataset

    Progress in machine learning for critical care has been difficult to track, in part due to the absence of public benchmarks. Other fields of research (such as computer vision and natural language processing) have established various competitions and public benchmarks, and the recent availability of large clinical datasets makes it possible to do the same here. Taking advantage of this opportunity, we propose a public benchmark suite addressing four areas of critical care: mortality prediction, length-of-stay estimation, patient phenotyping, and risk of decompensation. We define each task and compare the performance of clinical models, baseline models, and deep learning models on the eICU critical care dataset of around 73,000 patients. This is the first public benchmark on a multi-centre critical care dataset comparing the performance of the clinical gold standard with our predictive models. We also investigate the impact of numerical variables, as well as the handling of categorical variables, on each of the defined tasks. The source code detailing our methods and experiments is publicly available so that anyone can replicate our results and build upon our work.
    Comment: Source code to replicate the results https://github.com/mostafaalishahi/eICU_Benchmar
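    As an illustration of how one of the four benchmark tasks can be framed, the sketch below labels hourly ICU records for decompensation. The 24-hour window and the function name are assumptions for this example; the benchmark itself fixes the exact task definition.

```python
def decompensation_labels(n_hours, death_hour, window=24):
    """Toy labeling sketch: one label per observed hour, 1 if the
    patient dies within `window` hours of that time point, else 0.
    `death_hour` is None for patients who survive the stay."""
    labels = []
    for t in range(n_hours):
        if death_hour is not None and t < death_hour <= t + window:
            labels.append(1)
        else:
            labels.append(0)
    return labels

# Patient observed for 48 hours who dies at hour 30: the positive
# labels begin once death falls inside the 24-hour horizon (hour 6).
print(decompensation_labels(48, 30)[:8])  # [0, 0, 0, 0, 0, 0, 1, 1]
```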

    GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features

    In the era of Big Data, effective data reduction through feature selection is of paramount importance for machine learning. This paper presents GLEm-Net (Grouped Lasso with Embeddings Network), a novel neural framework that seamlessly processes both categorical and numerical features to reduce the dimensionality of data while retaining as much information as possible. By integrating embedding layers, GLEm-Net effectively manages categorical features with high cardinality, compressing their information into a lower-dimensional space. By using a grouped Lasso penalty function in its architecture, GLEm-Net processes categorical and numerical data simultaneously, efficiently reducing high-dimensional data while preserving the essential information. We test GLEm-Net on a real-world application in an industrial environment with 6 million records, each described by a mixture of 19 numerical and 7 categorical features with a strong class imbalance. A comparative analysis against state-of-the-art methods shows that, despite the difficulty of building a high-performance model, GLEm-Net outperforms the other methods in both feature selection and classification, with a better balance in the selection of numerical and categorical features.
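    The grouped Lasso penalty at the heart of such an architecture can be sketched in a few lines. This uses the standard formulation with a sqrt(group size) weighting; whether GLEm-Net applies exactly this weighting is an assumption of the sketch, not a claim about the paper.

```python
import math

def group_lasso_penalty(groups, lam=1.0):
    """Grouped Lasso penalty: lam * sum_g sqrt(|g|) * ||w_g||_2.
    `groups` is a list of weight vectors, one group per feature
    (e.g. the flattened embedding weights of a categorical feature,
    or a single weight for a numerical one). The sqrt(|g|) factor
    keeps large embedding groups from being under-penalized per
    weight; zeroing a whole group drops that feature."""
    return lam * sum(math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g))
                     for g in groups)

# One numerical feature (1 weight) and one categorical embedding (4 weights):
penalty = group_lasso_penalty([[3.0], [1.0, 1.0, 1.0, 1.0]])
print(penalty)  # sqrt(1)*3 + sqrt(4)*2 = 7.0
```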

    Categorical Encoding for Machine Learning

    Abstract: In recent years, interest has grown in the problem of encoding categorical variables, especially in deep learning applied to big data. However, current proposals are not entirely satisfactory. The aim of this work is to show the logic and advantages of a new encoding method that takes its cue from recent word embedding proposals and which we have called Categorical Embedding. Both a supervised and an unsupervised approach are considered.
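    The core idea borrowed from word embeddings is to replace a sparse one-hot column per category with a small dense vector per category. A minimal sketch (all names hypothetical; the random initialization stands in for vectors that would in practice be learned, supervised, or fit from co-occurrence statistics, unsupervised):

```python
import random

def build_embedding(categories, dim, seed=0):
    """Map each category to a dense vector of size `dim`.
    Random initialization here is only a placeholder for training."""
    rng = random.Random(seed)
    return {c: [rng.gauss(0, 0.1) for _ in range(dim)] for c in categories}

cities = ["rome", "milan", "turin", "naples"]
emb = build_embedding(cities, dim=2)

# One-hot encoding would need len(cities) columns and grows with
# cardinality; the embedding stays at a fixed, small dimension:
print(len(emb["rome"]))  # 2
```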

    A comparison of neural and non-neural machine learning models for food safety risk prediction with European Union RASFF data.

    The European Union launched the RASFF portal in 1977 to ensure cross-border monitoring and a quick reaction when public health risks are detected in the food chain. There are not enough resources available to guarantee a comprehensive inspection policy, but RASFF data has enormous potential as a preventive tool. However, there are few studies of food and feed risk prediction, and none with RASFF data. Although deep learning models are good prediction systems, it remains to be confirmed whether they behave better in this field than other machine learning techniques. The encoding of categorical variables as input for numerical models deserves particular study. Results in this paper show that deep learning with entity embedding is the best combination, with accuracies of 86.81%, 82.31%, and 88.94% in each of the three stages of the simplified RASFF process in which the tests were carried out. However, random forest models with one-hot encoding offer only slightly worse results, so it seems that the encoding carries more weight in the quality of the results than the prediction technique. Our work also demonstrates that probabilistic predictions (an advantage of neural models) can be used to optimize the number of inspections that can be carried out.
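    The last point, using probabilistic outputs to target scarce inspection resources, reduces to ranking notifications by predicted risk and inspecting only within budget. A minimal sketch (function name and data are hypothetical):

```python
def select_inspections(predictions, budget):
    """Given (notification_id, predicted_risk_probability) pairs,
    return the `budget` highest-risk notifications to inspect.
    Ties are broken by input order via Python's stable sort."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return [nid for nid, _ in ranked[:budget]]

preds = [("a", 0.15), ("b", 0.92), ("c", 0.55), ("d", 0.08)]
print(select_inspections(preds, budget=2))  # ['b', 'c']
```

    A binary classifier without calibrated probabilities would only split notifications into risky/not-risky, offering no way to prioritize within the risky set when the budget is smaller than it.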

    LambdaOpt: Learn to Regularize Recommender Models in Finer Levels

    Recommendation models mainly deal with categorical variables, such as user/item IDs and attributes. Besides the high-cardinality issue, the interactions among such categorical variables are usually long-tailed, with the head made up of highly frequent values and a long tail of rare ones. This phenomenon results in data sparsity, making it essential to regularize the models to ensure generalization. The common practice is to employ grid search to manually tune regularization hyperparameters based on validation data. However, searching the whole candidate space requires non-trivial effort and large computational resources; even so, it may not lead to the optimal choice, since different parameters should have different regularization strengths. In this paper, we propose a hyperparameter optimization method, LambdaOpt, which automatically and adaptively enforces regularization during training. Specifically, it updates the regularization coefficients based on performance on validation data. With LambdaOpt, the notorious tuning of regularization hyperparameters can be avoided; more importantly, it allows fine-grained regularization (i.e., each parameter can have an individualized regularization coefficient), leading to better-generalized models. We show how to employ LambdaOpt on matrix factorization, a classical model representative of a large family of recommender models. Extensive experiments on two public benchmarks demonstrate the superiority of our method in boosting the performance of top-K recommendation.
    Comment: Accepted by KDD 201
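    The flavor of validation-driven coefficient updates can be caricatured with a crude multiplicative rule: strengthen a parameter group's regularization when validation loss worsens, relax it when it improves. LambdaOpt itself derives its update from validation gradients; this sketch (all names and factors assumed) is only an illustration of the adaptive idea.

```python
def update_lambdas(lambdas, val_loss, prev_val_loss, up=1.1, down=0.9):
    """Toy per-group schedule: multiply every regularization
    coefficient by `up` if validation loss increased, by `down`
    if it decreased. A real method would use per-coefficient
    signals rather than one global factor."""
    factor = up if val_loss > prev_val_loss else down
    return {name: lam * factor for name, lam in lambdas.items()}

lams = {"user_emb": 0.01, "item_emb": 0.02}
lams = update_lambdas(lams, val_loss=0.74, prev_val_loss=0.70)
print(round(lams["user_emb"], 4))  # 0.011 -- loss rose, so regularize harder
```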

    Midwifery Learning and Forecasting: Predicting Content Demand with User-Generated Logs

    Every day, 800 women and 6,700 newborns die from complications related to pregnancy or childbirth. A well-trained midwife can prevent most of these maternal and newborn deaths. Data science models, together with the logs generated by users of online learning applications for midwives, can help to improve midwives' learning competencies. The goal is to use these rich behavioral data to push digital learning towards personalized content and to provide an adaptive learning journey. In this work, we evaluate various forecasting methods to determine the interest of future users in the different kinds of content available in the app, broken down by profession and region.
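    Any such forecasting comparison needs simple baselines. A moving-average forecast of per-content demand, sketched below with hypothetical data, is one of the usual candidates:

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value of a demand series as the mean of
    the last `window` observations -- a standard naive baseline
    against which fancier forecasting methods are compared."""
    recent = series[-window:]
    return sum(recent) / len(recent)

views = [120, 130, 125, 140, 150]  # e.g. weekly views of one content type
print(moving_average_forecast(views))  # mean of 125, 140, 150
```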

    A Supervised Embedding and Clustering Anomaly Detection method for classification of Mobile Network Faults

    The paper introduces Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), a method designed to efficiently identify faulty alarm logs in a mobile network and alleviate the challenges of manual monitoring caused by the growing volume of alarm logs. SEMC-AD employs a supervised embedding approach based on deep neural networks, utilizing historical alarm logs and their labels to extract numerical representations for each log, effectively addressing the imbalanced classification caused by the small proportion of anomalies in the dataset without resorting to one-hot encoding. The robustness of the embedding is evaluated by plotting the two most significant principal components of the embedded alarm logs, revealing that anomalies form distinct clusters with similar embeddings. Multivariate Gaussian clustering is then applied to these components, identifying clusters with a high ratio of anomalies to normal alarms (above 90%) and labeling them as the anomaly group. To classify a new alarm log, we check whether the two most significant principal components of its embedded vector fall within an anomaly-labeled cluster; if so, the log is classified as an anomaly. Performance evaluation demonstrates that SEMC-AD outperforms conventional random forest and gradient boosting methods without embedding. SEMC-AD achieves 99% anomaly detection, whereas random forest and XGBoost detect only 86% and 81% of anomalies, respectively. While supervised classification methods may excel on labeled datasets, the results demonstrate that SEMC-AD is more efficient at classifying anomalies in datasets with numerous categorical features, significantly enhancing anomaly detection, reducing operator burden, and improving network maintenance.
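    The classification rule, testing whether a point in the 2-D principal-component space falls within a Gaussian anomaly cluster, can be sketched with a Mahalanobis-distance gate. The threshold and all names below are assumptions of this sketch, not the paper's.

```python
def mahalanobis_sq_2d(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a Gaussian
    cluster with mean `mean` and 2x2 covariance `cov`."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    a, b, c, d = cov[0][0], cov[0][1], cov[1][0], cov[1][1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of 2x2 cov
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def is_anomaly(pc_point, anomaly_clusters, threshold=9.0):
    """Flag a log as an anomaly if its two leading principal
    components lie within `threshold` squared Mahalanobis distance
    of any anomaly-labeled cluster (an assumed membership gate)."""
    return any(mahalanobis_sq_2d(pc_point, m, c) <= threshold
               for m, c in anomaly_clusters)

# One anomaly cluster at (5, 5) with identity covariance:
clusters = [([5.0, 5.0], [[1.0, 0.0], [0.0, 1.0]])]
print(is_anomaly([5.5, 4.8], clusters))  # True  -- inside the cluster
print(is_anomaly([0.0, 0.0], clusters))  # False -- far from it
```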