New PCA-based Category Encoder for Cybersecurity and Processing Data in IoT Devices
Increasing the cardinality of categorical variables might decrease the
overall performance of machine learning (ML) algorithms. This paper presents a
novel computational preprocessing method to convert categorical to numerical
variables for ML algorithms. It uses a supervised binary classifier to extract
additional context-related features from the categorical values. Up to two
numerical variables per categorical variable are created, depending on the
compression achieved by the Principal Component Analysis (PCA). The method
requires two hyperparameters: a threshold related to the distribution of
categories in the variables and the PCA representativeness. This paper applies
the proposed approach to the well-known cybersecurity NSLKDD dataset to select
and convert three categorical features to numerical features. After choosing
the threshold parameter, we use conditional probabilities to convert the three
categorical variables into six new numerical variables. After that, we feed
these numerical variables to the PCA algorithm and select all or a subset of
the Principal Components (PCs). Finally, by applying binary
classification with ten different classifiers, we measure the performance of
the new encoder and compare it with the other 17 well-known category encoders.
The new technique achieves the highest performance related to accuracy and Area
Under the Curve (AUC) on high-cardinality categorical variables. We also
define harmonic-average metrics to find the best trade-off between train and
test performances and to prevent underfitting and overfitting. Ultimately, the
number of newly created numerical variables is minimal. This data reduction
improves computational processing time in Internet of things (IoT) devices in
future telecommunication networks.Comment: 6 pages, 4 figures, 5 table
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features
Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e., unordered categorical predictor variables with a large number of levels. We study techniques that yield numeric representations of categorical variables, which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment comparing different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) on datasets from regression, binary-classification, and multiclass-classification settings. In our study, regularized versions of target encoding (i.e., using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g., integer encoding) or that reduce the number of levels (possibly based on target information, e.g., leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
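One common way to regularize target encoding is to shrink each level's target mean toward the global mean, so rare levels do not memorize their few labels. The sketch below shows that smoothing variant; the parameter names and toy data are illustrative assumptions, not the benchmark's actual code (which also uses cross-validated out-of-fold encoding).

```python
# Minimal sketch of smoothed (regularized) target encoding.
# Names and the smoothing formula variant are assumptions for illustration.
from collections import defaultdict

def target_encode(train_cats, train_y, smoothing=10.0):
    """Map each level to a shrunken target mean:
    (sum_of_targets + smoothing * global_mean) / (count + smoothing).
    Unseen levels fall back to the global mean."""
    global_mean = sum(train_y) / len(train_y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(train_cats, train_y):
        sums[c] += y
        counts[c] += 1
    enc = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return lambda c: enc.get(c, global_mean)

encode = target_encode(
    ["red", "red", "blue", "blue", "blue"], [1, 1, 0, 1, 0], smoothing=2.0
)
print(round(encode("red"), 3), round(encode("green"), 3))  # → 0.8 0.6
```

With smoothing=2.0 and a global mean of 0.6, the "red" level (two positive labels) is pulled from 1.0 down to 0.8, and the unseen level "green" receives the global mean, which is the regularization behavior the study found to outperform integer, leaf, and one-hot encodings on high-cardinality features.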
Predictive modeling for occupational safety outcomes and days away from work analysis in mining operations
Mining is known to be one of the most hazardous occupations in the world. Many serious accidents have occurred worldwide over the years in mining. Although there have been efforts to create a safer work environment for miners, the number of accidents occurring at the mining sites is still significant. Machine learning techniques and predictive analytics are becoming one of the leading resources to create safer work environments in the manufacturing and construction industries. These techniques are leveraged to generate actionable insights to improve decision-making. A large amount of mining safety-related data are available, and machine learning algorithms can be used to analyze the data. The use of machine learning techniques can significantly benefit the mining industry. Decision tree, random forest, and artificial neural networks were implemented to analyze the outcomes of mining accidents. These machine learning models were also used to predict days away from work. An accidents dataset provided by the Mine Safety and Health Administration was used to train the models. The models were trained separately on tabular data and narratives. The use of a synthetic data augmentation technique using word embedding was also investigated to tackle the data imbalance problem. Performance of all the models was compared with the performance of the traditional logistic regression model. The results show that models trained on narratives performed better than the models trained on structured/tabular data in predicting the outcome of the accident. The higher predictive power of the models trained on narratives led to the conclusion that the narratives have additional information relevant to the outcome of injury compared to the tabular entries. The models trained on tabular data had a lower mean squared error compared to the models trained on narratives while predicting the days away from work. 
The results highlight the importance of predictors such as shift start time, accident time, and mining experience in predicting days away from work. The F1 score of all but one of the underrepresented classes improved after the use of the data augmentation technique. This approach gave greater insight into the factors influencing the outcome of an accident and days away from work.
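The study's finding that free-text narratives outperform tabular fields rests on turning narratives into features a classifier can use. The toy sketch below illustrates that general idea with a bag-of-words naive Bayes; it is not the study's pipeline (which used decision trees, random forests, and neural networks on MSHA data), and the example narratives and labels are invented for illustration.

```python
# Illustrative sketch: classifying accident-outcome labels from short
# narratives with a multinomial naive Bayes (toy data, not MSHA records).
from collections import Counter
import math

def train_nb(narratives, labels):
    """Fit a naive Bayes on whitespace-tokenized narratives (labels 0/1)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, y in zip(narratives, labels):
        word_counts[y].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])

    def predict(text):
        scores = {}
        for y in (0, 1):
            total = sum(word_counts[y].values())
            score = math.log(class_counts[y] / len(labels))  # log prior
            for w in text.lower().split():
                # Laplace smoothing so unseen words get nonzero probability
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

    return predict

predict = train_nb(
    ["miner slipped on wet ladder", "rock fell from roof",
     "slipped on wet surface", "roof bolt failed"],
    [0, 1, 0, 1],
)
print(predict("wet ladder"), predict("roof collapse"))  # → 0 1
```

Even this crude text model picks up narrative cues ("wet", "roof") that have no counterpart in coded tabular fields, which is consistent with the study's conclusion that narratives carry additional outcome-relevant information.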
Machine Learning Algorithms for Smart Data Analysis in Internet of Things Environment: Taxonomies and Research Trends
Machine learning techniques will contribute to making Internet of Things (IoT)
symmetric applications among the most significant sources of new data in the future. In this context,
network systems are endowed with the capacity to access varieties of experimental symmetric data
across a plethora of network devices, study the data, obtain knowledge, and make
informed decisions based on the datasets at their disposal. This study is limited to supervised and
unsupervised machine learning (ML) techniques, regarded as the bedrock of smart IoT data
analysis. It reviews and discusses substantial issues related to supervised
and unsupervised machine learning techniques, highlights the advantages and limitations of each
algorithm, and discusses research trends and recommendations for further study.