532 research outputs found
Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction
Recently, spam on online social networks has attracted attention in the
research and business world. Twitter has become the preferred medium to spread
spam content. Many research efforts have attempted to counter social network
spam. Twitter poses extra challenges, namely the size of the feature space
and imbalanced data distributions. Related work usually addresses only
part of these challenges or produces black-box models. In this paper, we
propose a modified genetic algorithm for simultaneous dimensionality reduction
and hyperparameter optimization over imbalanced datasets. The algorithm
initializes an eXtreme Gradient Boosting classifier and reduces the feature
space of a tweets dataset to generate a spam prediction model. The model is
validated using 50-times repeated 10-fold stratified cross-validation and
analyzed using nonparametric statistical tests. The resulting prediction model
attains on average 82.32% and 92.67% in terms of geometric mean and accuracy,
respectively, utilizing less than 10% of the total feature space. The
empirical results show that the modified genetic algorithm outperforms
and feature selection methods. In addition, eXtreme Gradient Boosting
outperforms many machine learning algorithms, including a BERT-based deep
learning model, in spam prediction. Furthermore, the proposed approach is
applied to SMS spam modeling and compared to related works.
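The abstract above reports both geometric mean and accuracy, which matters on imbalanced data. The following is a minimal sketch, assuming the geometric mean is the usual square root of sensitivity times specificity (the abstract does not spell out its definition). The function name and the confusion counts are hypothetical and chosen only for illustration.

```python
import math

def g_mean_and_accuracy(tp, fn, tn, fp):
    """Return (geometric mean, accuracy) from binary confusion counts.

    Assumes G-mean = sqrt(sensitivity * specificity); this definition is an
    assumption, not taken from the paper.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the spam class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on the non-spam class
    g_mean = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return g_mean, accuracy

# Hypothetical imbalanced test set: 100 spam and 900 non-spam tweets.
# A model that labels everything "non-spam" is 90% accurate but has G-mean 0.
print(g_mean_and_accuracy(tp=0, fn=100, tn=900, fp=0))

# A model that catches 70 of the 100 spam tweets at the cost of 45 false alarms.
print(g_mean_and_accuracy(tp=70, fn=30, tn=855, fp=45))
```

The first call shows why accuracy alone can be misleading when one class dominates: the score stays high even though no spam is detected, while the geometric mean drops to zero.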
Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings
Social networks facilitate communication between people from all over the world.
Unfortunately, the excessive use of social networks leads to the rise of antisocial behaviors such as the
spread of online offensive language, cyberbullying (CB), and hate speech (HS). Therefore, abusive/offensive
language and hate speech detection has become a crucial part of combating cyberharassment. Manual detection of cyberharassment is
cumbersome, slow, and not even feasible for rapidly growing data. In this study, we address the challenges
of automatically detecting offensive tweets in the Arabic language. The main contribution of this study is
to design and implement an intelligent prediction system encompassing a two-stage optimization approach
to distinguish offensive from non-offensive text. In the first stage, the proposed approach
fine-tunes the pre-trained word embedding models by training them for several epochs on the training dataset.
The embeddings of the vocabulary in the new dataset are trained and added to the old embeddings.
In the second stage, it employs a hybrid approach of two classifiers, namely XGBoost and SVM, and a
genetic algorithm (GA) to mitigate the classifiers' difficulty in finding the optimal hyperparameter
values. We tested the proposed approach on the Arabic Cyberbullying Corpus
(ArCybC), which contains tweets collected from four Twitter domains: gaming, sports, news, and celebrities.
The ArCybC dataset has four categories: sexual, racial, intelligence, and appearance. The proposed approach
produced superior results, in which the SVM algorithm with the Aravec SkipGram word embedding model
achieved an accuracy rate of 88.2% and an F1-score rate of 87.8%.
Ministerio Espanol de Ciencia e Innovacion (DemocratAI::UGR) PID2020-115570GB-C2
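The second stage described above uses a genetic algorithm to search for classifier hyperparameters. The following is a minimal sketch of that general idea, not the authors' implementation. The search space `BOUNDS`, the stand-in objective `toy_score` (which replaces an actual classifier's validation score), and all operator settings are assumptions made for illustration.

```python
import random

# Hypothetical search space: (low, high) bounds per hyperparameter.
BOUNDS = {"C": (0.01, 100.0), "gamma": (0.0001, 1.0)}

def toy_score(params):
    """Stand-in for a validation score; peaks at C=10, gamma=0.1 (an assumption)."""
    return -((params["C"] - 10.0) / 100.0) ** 2 - (params["gamma"] - 0.1) ** 2

def random_individual(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}

def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}

def mutate(ind, rng, rate=0.2):
    child = dict(ind)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            # Gaussian perturbation scaled to the range, clipped to the bounds.
            child[k] = min(hi, max(lo, child[k] + rng.gauss(0, 0.1 * (hi - lo))))
    return child

def tournament(pop, fitness, rng, k=3):
    # Pick k random individuals and keep the fittest.
    return max(rng.sample(pop, k), key=fitness)

def genetic_search(fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation, plus the final one
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = genetic_search(toy_score)
print(best, history[-1])
```

In a real setting, `toy_score` would be replaced by a cross-validated score of the classifier trained with the candidate hyperparameters, which is far more expensive per evaluation. Because of elitism, the best fitness in `history` never decreases from one generation to the next.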
Topic Classification for Short Texts
In the context of TV and social media surveillance, constructing models to automate topic identification of short texts is a key task. This paper formalizes topic classification as a top-K multinomial classification problem and constructs models worth considering for practical usage. We describe the full data processing pipeline, discussing dataset selection, text preprocessing, feature extraction, and model selection and learning, including hyperparameter optimization. When computing time and resources are limited, we show that a classical model like SVM performs as well as an advanced deep neural network, but with shorter model training time.
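One common way to evaluate a top-K formulation is top-K accuracy, where a prediction counts as correct if the true topic is among the K highest-scoring classes. The following is a minimal sketch under that assumption; the abstract does not state which metric the paper uses, and the topic names and scores are hypothetical.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of texts whose true topic is among the k highest-scoring topics.

    scores: one dict per text, mapping topic name to model score.
    """
    hits = 0
    for class_scores, truth in zip(scores, true_labels):
        # Topics sorted from highest to lowest score.
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical scores for three short texts over three topics.
scores = [
    {"sports": 0.6, "politics": 0.3, "weather": 0.1},
    {"sports": 0.2, "politics": 0.5, "weather": 0.3},
    {"sports": 0.4, "politics": 0.35, "weather": 0.25},
]
truth = ["sports", "weather", "weather"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, truth, k))
```

On this toy data the accuracy rises from one third at K=1 to two thirds at K=2 and reaches 1.0 at K=3, which illustrates how relaxing K trades precision of the answer for coverage.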
FAKE NEWS DETECTION ON THE WEB: A DEEP LEARNING BASED APPROACH
The acceptance and popularity of social media platforms for the dispersion and proliferation of news articles have led to the spread of questionable and untrusted information, in part due to the ease with which misleading content can be created and shared among communities. Prior research has attempted to automatically classify news articles and tweets as credible or non-credible. This work complements such research by proposing an approach that combines Natural Language Processing (NLP) and Deep Learning techniques such as Long Short-Term Memory (LSTM).
Moreover, in the Information Systems paradigm, design science research methodology (DSRM) has become the major stream that focuses on building and evaluating an artifact to solve emerging problems. Hence, DSRM can accommodate deep learning-based models given the availability of adequate datasets. Two publicly available datasets that contain labeled news articles and tweets have been used to validate the proposed model's effectiveness. This work presents two distinct experiments, and the results demonstrate that the proposed model works well for both long-sequence news articles and short-sequence texts such as tweets. Finally, the findings suggest that sentiment, tagging, linguistic, syntactic, and text-embedding features have the potential to foster fake news detection when the proposed model is trained at various dimensionalities to learn the contextual meaning of the news content.
An Improved Bees Algorithm for Training Deep Recurrent Networks for Sentiment Classification
Recurrent neural networks (RNNs) are powerful tools for learning information from
temporal sequences. Designing an optimum deep RNN is difficult due to configuration and training
issues, such as vanishing and exploding gradients. In this paper, a novel metaheuristic optimisation
approach is proposed for training deep RNNs for the sentiment classification task. The approach
employs an enhanced Ternary Bees Algorithm (BA-3+), which handles large-dataset classification
problems by considering only three individual solutions in each iteration. BA-3+ combines the
collaborative search of three bees to find the optimal set of trainable parameters of the proposed deep
recurrent learning architecture. Local learning with exploitative search utilises the greedy selection
strategy. Stochastic gradient descent (SGD) learning with singular value decomposition (SVD) aims to
handle vanishing and exploding gradients of the decision parameters with the stabilisation strategy
of SVD. Global learning with explorative search achieves faster convergence without getting trapped
at local optima to find the optimal set of trainable parameters of the proposed deep recurrent learning
architecture. BA-3+ has been tested on the sentiment classification task using datasets with symmetric and
asymmetric distributions from different domains, including Twitter, product reviews,
and movie reviews. Comparative results have been obtained for advanced deep language models and
Differential Evolution (DE) and Particle Swarm Optimization (PSO) algorithms. BA-3+ converged
to the global minimum faster than the DE and PSO algorithms, and it outperformed the SGD, DE,
and PSO algorithms for the Turkish and English datasets. The accuracy and F1 measure
improved by at least 30–40% over the standard SGD algorithm for all classification
datasets. Accuracy rates of the RNN model trained with BA-3+ ranged from 80% to 90%, while the
RNN trained with SGD achieved between 50% and 60% for most datasets. The performance
of the RNN model with BA-3+ was as good as that of the Tree-LSTM and Recursive Neural Tensor Network
(RNTN) language models, which achieved accuracy results of up to 90% for some datasets. The
improved accuracy and convergence results show that BA-3+ is an efficient, stable algorithm for the
complex classification task, and it can handle the vanishing and exploding gradients problem of
deep RNNs.
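The abstract describes a search that keeps only three candidate solutions per iteration: one from local exploitative search, one from gradient-based learning, and one from global explorative search. The following is a loose sketch of that three-candidate idea on a toy two-parameter objective. It is not BA-3+ itself: the SVD stabilisation step is omitted, and the objective, step sizes, and function names are all assumptions.

```python
import random

def loss(w):
    # Toy convex objective with its minimum at (3, -2); stands in for a training loss.
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    # Analytic gradient of the toy objective.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)]

def three_candidate_search(iterations=200, seed=0, lr=0.1, radius=0.5, bound=10.0):
    rng = random.Random(seed)
    best = [rng.uniform(-bound, bound) for _ in range(2)]
    for _ in range(iterations):
        # Candidate 1: local exploitative search, a small random move around the best.
        local = [x + rng.uniform(-radius, radius) for x in best]
        # Candidate 2: one gradient descent step from the best.
        stepped = [x - lr * g for x, g in zip(best, grad(best))]
        # Candidate 3: global explorative search, a fresh random point.
        scout = [rng.uniform(-bound, bound) for _ in range(2)]
        # Greedy selection: keep whichever point has the lowest loss.
        best = min([best, local, stepped, scout], key=loss)
    return best

w = three_candidate_search()
print(w, loss(w))
```

Because the current best is always among the points compared, the loss never increases between iterations. On a real recurrent network the parameter vector would be far larger and each loss evaluation would require a pass over training data.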
Automated design of the deep neural network pipeline
Dissertation (MSc (Computer Science))--University of Pretoria, 2021.
Deep neural networks have been shown to be very effective for image processing
and text processing. However, the big challenge is designing the deep
neural network pipeline, as it is time consuming and requires machine learning
expertise. More and more non-experts are using deep neural networks in their
day-to-day lives, but do not have the expertise to parameter tune and construct
optimal deep neural network pipelines. AutoML has mainly focused on neural
architecture design and parameter tuning, but little attention has been given
to optimal design of the deep neural network pipeline and all of its constituent
parts. In this work a single point hyper heuristic (SPHH) was used to automate
the design of the deep neural network pipeline. The SPHH constructed a deep
neural network pipeline design by selecting techniques to use at the various stages
of the pipeline, namely: the preprocessing stage, the feature engineering stage,
the augmentation stage as well as selecting a deep neural network architecture
and relevant hyper-parameters. This work also investigated transfer learning by
using a design that was created for one dataset as a starting point for the design
process for a different dataset and the effect thereof was evaluated. The reusability
of the designs themselves was also tested. The SPHH designed pipelines for
both the image processing and text processing domain. The image processing
domain covered maize disease detection and oral lesion detection specifically
and text processing used sentiment analysis and spam detection, with multiple
datasets being used for all the aforementioned tasks. The pipeline designs created
by means of automated design were compared to manually derived pipelines
from the literature for the given datasets. This research showed that automated
design of a deep neural network pipeline using a single point hyper-heuristic is
effective. Deep neural network pipelines designed by the SPHH are either better
than or just as good as manually derived pipeline designs in terms of performance
and application time. The results showed that the pipeline designs created by
the SPHH are not reusable as they do not provide comparable performance to
the results achieved when specifically creating a design for a dataset. Transfer
learning using the designed pipelines is found to produce results comparable
to or better than the results achieved when using the SPHH without transfer
learning. Transfer learning is only effective when the correct target and source
are chosen; for some target datasets, negative transfer occurs when using certain
datasets as the transfer learning source. Future work will include applying the
automated design approach to more domains and making designs reusable. The
transfer learning process will also be automated in future work to ensure positive transfer occurs. The last recommendation for future work is to construct a
pipeline for unsupervised deep neural network techniques instead of supervised
deep neural network techniques.
The work presented in this thesis is supported by the National Research
Foundation of South Africa (Grant Number 46712). Opinions expressed and
conclusions arrived at are those of the author and are not necessarily to be
attributed to the NRF.
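The dissertation abstract describes a single point hyper-heuristic that builds a pipeline design by choosing a technique for each stage. The following is a minimal sketch of a single-point search of that general kind: one current design is repeatedly perturbed by a randomly chosen low-level heuristic and kept if it scores at least as well. The stage options, the heuristics, and the stand-in objective `toy_score` are all hypothetical and are not taken from the dissertation.

```python
import random

# Hypothetical options per pipeline stage.
STAGES = {
    "preprocessing": ["none", "lowercase", "stopword_removal"],
    "feature_engineering": ["bag_of_words", "tfidf", "embeddings"],
    "augmentation": ["none", "synonym_swap"],
    "architecture": ["cnn", "lstm", "mlp"],
}

def toy_score(design):
    """Stand-in for validation performance; the preferences are arbitrary assumptions."""
    prefs = {"lowercase": 1, "tfidf": 1, "embeddings": 2,
             "synonym_swap": 1, "lstm": 2, "cnn": 1}
    return sum(prefs.get(choice, 0) for choice in design.values())

def change_one_stage(design, rng):
    # Low-level heuristic: re-pick the technique for one random stage.
    stage = rng.choice(list(STAGES))
    new = dict(design)
    new[stage] = rng.choice(STAGES[stage])
    return new

def change_two_stages(design, rng):
    # Low-level heuristic: apply the single-stage change twice.
    return change_one_stage(change_one_stage(design, rng), rng)

LOW_LEVEL_HEURISTICS = [change_one_stage, change_two_stages]

def single_point_search(score, iterations=300, seed=0):
    rng = random.Random(seed)
    current = {stage: rng.choice(options) for stage, options in STAGES.items()}
    history = [score(current)]
    for _ in range(iterations):
        heuristic = rng.choice(LOW_LEVEL_HEURISTICS)  # heuristic selection
        candidate = heuristic(current, rng)
        if score(candidate) >= score(current):        # move acceptance
            current = candidate
        history.append(score(current))
    return current, history

design, history = single_point_search(toy_score)
print(design, history[-1])
```

In practice the score would come from training and validating the network that the design describes, so each evaluation is costly and the iteration budget would be much smaller than in this toy run.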
Text Classification Using Long Short-Term Memory With GloVe Features
Traditional text classification algorithms often suffer from high feature dimensionality and data sparseness. Classifying text with traditional machine learning algorithms is efficient and stable. However, these algorithms have certain limitations with regard to training on large-scale datasets. Deep Learning has been proposed as a method for solving these problems in text classification. By tuning the parameters and comparing the eight proposed Long Short-Term Memory (LSTM) models on a large-scale dataset, we show that LSTM with GloVe features can achieve good performance in text classification. The results show that text classification using LSTM with GloVe obtains its highest accuracy, 95.17, with the sixth model; the average precision, recall, and F1-score are 9
- …