532 research outputs found

    Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

    Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts have attempted to counter social network spam. Twitter brings extra challenges, namely the size of the feature space and imbalanced data distributions. Usually, related research works focus on only some of these challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using 50-times-repeated 10-fold stratified cross-validation and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32% and 92.67% in terms of geometric mean and accuracy respectively, utilizing less than 10% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the Chi^2 and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works
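The abstract above reports both geometric mean and accuracy. A minimal sketch of why the two can diverge on imbalanced data follows, assuming the usual binary definition of geometric mean as the square root of sensitivity times specificity. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
import math

def gmean_and_accuracy(tp, fn, tn, fp):
    """Return (geometric mean, accuracy) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall on the positive (spam) class
    specificity = tn / (tn + fp)   # recall on the negative (non-spam) class
    gmean = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return gmean, accuracy

# Hypothetical counts: 100 spam and 900 non-spam examples.
# A classifier that labels everything as non-spam:
trivial = gmean_and_accuracy(tp=0, fn=100, tn=900, fp=0)
# A classifier that catches 70 of the 100 spam examples with 45 false alarms:
useful = gmean_and_accuracy(tp=70, fn=30, tn=855, fp=45)
```

With these counts, the trivial classifier reaches 90% accuracy but a geometric mean of zero, which is why a class-balanced measure is informative alongside accuracy on imbalanced datasets.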

    Offensive Language Detection in Arabic Social Networks Using Evolutionary-Based Classifiers Learned From Fine-Tuned Embeddings

    Social networks facilitate communication between people from all over the world. Unfortunately, the excessive use of social networks leads to the rise of antisocial behaviors such as the spread of online offensive language, cyberbullying (CB), and hate speech (HS). Therefore, detecting abusive/offensive language and hate speech has become a crucial part of combating cyberharassment. Manual detection of cyberharassment is cumbersome, slow, and not even feasible for rapidly growing data. In this study, we addressed the challenges of automatic detection of offensive tweets in the Arabic language. The main contribution of this study is to design and implement an intelligent prediction system encompassing a two-stage optimization approach to identify and classify offensive and non-offensive text. In the first stage, the proposed approach fine-tuned the pre-trained word embedding models by training them for several epochs on the training dataset. The embeddings of the vocabulary in the new dataset are trained and added to the old embeddings. In the second stage, it employed a hybrid approach of two classifiers, namely XGBoost and SVM, and a genetic algorithm (GA) to mitigate the drawback of the classifiers in finding the optimal hyperparameter values to run the proposed approach. We tested the proposed approach on the Arabic Cyberbullying Corpus (ArCybC), which contains tweets collected from four Twitter domains: gaming, sports, news, and celebrities. The ArCybC dataset has four categories: sexual, racial, intelligence, and appearance. The proposed approach produced superior results, in which the SVM algorithm with the Aravec SkipGram word embedding model achieved an accuracy rate of 88.2% and an F1-score rate of 87.8%. Ministerio Espanol de Ciencia e Innovacion (DemocratAI::UGR) PID2020-115570GB-C2
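The second stage described above uses a genetic algorithm to search for classifier hyperparameters. The following is a minimal, generic sketch of that idea and not the authors' implementation. The parameter names `C` and `gamma`, their ranges, the operators, and the `toy_fitness` function (a stand-in for a cross-validated classifier score) are all assumptions made for illustration.

```python
import random

# Hypothetical search ranges; the paper's actual ranges are not given here.
BOUNDS = {"C": (0.1, 100.0), "gamma": (0.0001, 1.0)}

def toy_fitness(params):
    """Stand-in for a cross-validated classifier score (an assumption).
    Peaks at C=10, gamma=0.1."""
    return -((params["C"] - 10.0) / 100.0) ** 2 - (params["gamma"] - 0.1) ** 2

def random_individual(rng):
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}

def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {name: (a if rng.random() < 0.5 else b)[name] for name in BOUNDS}

def mutate(ind, rng, rate=0.2):
    # With probability `rate`, resample a gene uniformly within its bounds.
    return {
        name: rng.uniform(*BOUNDS[name]) if rng.random() < rate else value
        for name, value in ind.items()
    }

def tournament(population, fitness, rng, size=3):
    # Pick `size` individuals at random and keep the fittest.
    return max(rng.sample(population, size), key=fitness)

def run_ga(fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        # Elitism: carry the best individual over unchanged.
        children = [best]
        while len(children) < pop_size:
            a = tournament(population, fitness, rng)
            b = tournament(population, fitness, rng)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best_params, history = run_ga(toy_fitness)
```

In a real setting, `toy_fitness` would be replaced by a function that trains the classifier with the candidate hyperparameters and returns a validation score. Because of elitism, the best fitness recorded in `history` never decreases from one generation to the next.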

    Topic Classification for Short Texts

    In the context of TV and social media surveillance, constructing models to automate topic identification of short texts is a key task. This paper formalizes topic classification as a top-K multinomial classification problem and constructs models worth considering for practical usage. We describe the full data processing pipeline, discussing dataset selection, text preprocessing, feature extraction, model selection and learning, including hyperparameter optimization. When computing time and resources are limited, we show that a classical model like SVM performs as well as an advanced deep neural network, but with shorter model training time
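The abstract does not spell out its top-K formulation. One common reading, assumed here, is that a prediction counts as correct when the true topic is among the K highest-scoring classes. The sketch below illustrates that metric; the function name, topic names, and scores are hypothetical.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        # Rank class names by descending score and keep the first k.
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Hypothetical per-class scores for three short texts.
scores = [
    {"sports": 0.6, "politics": 0.3, "weather": 0.1},
    {"sports": 0.2, "politics": 0.35, "weather": 0.45},
    {"sports": 0.5, "politics": 0.1, "weather": 0.4},
]
labels = ["sports", "politics", "weather"]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Here only the first text is classified correctly at K=1, while all three true topics fall within the top two scores, so relaxing K from 1 to 2 raises the measured accuracy from one third to 1.0.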

    FAKE NEWS DETECTION ON THE WEB: A DEEP LEARNING BASED APPROACH

    The acceptance and popularity of social media platforms for the dispersion and proliferation of news articles have led to the spread of questionable and untrusted information, in part due to the ease with which misleading content can be created and shared among communities. While prior research has attempted to automatically classify news articles and tweets as credible or non-credible, this work complements such research by proposing an approach that combines Natural Language Processing (NLP) and Deep Learning techniques such as Long Short-Term Memory (LSTM). Moreover, in the Information Systems paradigm, design science research methodology (DSRM) has become the major stream that focuses on building and evaluating an artifact to solve emerging problems. Hence, DSRM can accommodate deep learning-based models given the availability of adequate datasets. Two publicly available datasets that contain labeled news articles and tweets have been used to validate the proposed model’s effectiveness. This work presents two distinct experiments, and the results demonstrate that the proposed model works well for both long-sequence news articles and short-sequence texts such as tweets. Finally, the findings suggest that sentiment, tagging, linguistic, syntactic, and text embedding features have the potential to foster fake news detection by training the proposed model on various dimensionalities to learn the contextual meaning of the news content

    An Improved Bees Algorithm for Training Deep Recurrent Networks for Sentiment Classification

    Recurrent neural networks (RNNs) are powerful tools for learning information from temporal sequences. Designing an optimum deep RNN is difficult due to configuration and training issues, such as vanishing and exploding gradients. In this paper, a novel metaheuristic optimisation approach is proposed for training deep RNNs for the sentiment classification task. The approach employs an enhanced Ternary Bees Algorithm (BA-3+), which handles large dataset classification problems by considering only three individual solutions in each iteration. BA-3+ combines the collaborative search of three bees to find the optimal set of trainable parameters of the proposed deep recurrent learning architecture. Local learning with exploitative search utilises the greedy selection strategy. Stochastic gradient descent (SGD) learning with singular value decomposition (SVD) aims to handle vanishing and exploding gradients of the decision parameters with the stabilisation strategy of SVD. Global learning with explorative search achieves faster convergence without getting trapped at local optima to find the optimal set of trainable parameters of the proposed deep recurrent learning architecture. BA-3+ has been tested on the sentiment classification task to classify symmetric and asymmetric distributions of datasets from different domains, including Twitter, product reviews, and movie reviews. Comparative results have been obtained for advanced deep language models and the Differential Evolution (DE) and Particle Swarm Optimization (PSO) algorithms. BA-3+ converged to the global minimum faster than the DE and PSO algorithms, and it outperformed the SGD, DE, and PSO algorithms for the Turkish and English datasets. Accuracy and F1 measure improved by at least 30–40% over the standard SGD algorithm for all classification datasets. 
Accuracy rates in the RNN model trained with BA-3+ ranged from 80% to 90%, while the RNN trained with SGD was able to achieve between 50% and 60% for most datasets. The performance of the RNN model with BA-3+ is as good as that of the Tree-LSTM and Recursive Neural Tensor Network (RNTN) language models, which achieved accuracy results of up to 90% for some datasets. The improved accuracy and convergence results show that BA-3+ is an efficient, stable algorithm for the complex classification task, and it can handle the vanishing and exploding gradients problem of deep RNNs

    Automated design of the deep neural network pipeline

    Dissertation (MSc (Computer Science))--University of Pretoria, 2021. Deep neural networks have been shown to be very effective for image processing and text processing. However, the big challenge is designing the deep neural network pipeline, as it is time consuming and requires machine learning expertise. More and more non-experts are using deep neural networks in their day-to-day lives, but do not have the expertise to tune parameters and construct optimal deep neural network pipelines. AutoML has mainly focused on neural architecture design and parameter tuning, but little attention has been given to optimal design of the deep neural network pipeline and all of its constituent parts. In this work a single point hyper-heuristic (SPHH) was used to automate the design of the deep neural network pipeline. The SPHH constructed a deep neural network pipeline design by selecting techniques to use at the various stages of the pipeline, namely the preprocessing stage, the feature engineering stage, and the augmentation stage, as well as selecting a deep neural network architecture and relevant hyper-parameters. This work also investigated transfer learning by using a design that was created for one dataset as a starting point for the design process for a different dataset, and the effect thereof was evaluated. The reusability of the designs themselves was also tested. The SPHH designed pipelines for both the image processing and text processing domains. The image processing domain covered maize disease detection and oral lesion detection specifically, and text processing used sentiment analysis and spam detection, with multiple datasets being used for all the aforementioned tasks. The pipeline designs created by means of automated design were compared to manually derived pipelines from the literature for the given datasets. This research showed that automated design of a deep neural network pipeline using a single point hyper-heuristic is effective. 
Deep neural network pipelines designed by the SPHH are either better than or just as good as manually derived pipeline designs in terms of performance and application time. The results showed that the pipeline designs created by the SPHH are not reusable, as they do not provide performance comparable to the results achieved when specifically creating a design for a dataset. Transfer learning using the designed pipelines is found to produce results comparable to or better than the results achieved when using the SPHH without transfer learning. Transfer learning is only effective when the correct target and source are chosen; for some target datasets, negative transfer occurs when using certain datasets as the transfer learning source. Future work will include applying the automated design approach to more domains and making designs reusable. The transfer learning process will also be automated in future work to ensure positive transfer occurs. The last recommendation for future work is to construct a pipeline for unsupervised deep neural network techniques instead of supervised deep neural network techniques. The work presented in this thesis is supported by the National Research Foundation of South Africa (Grant Numbers 46712). Opinions expressed and conclusions arrived at are those of the author and are not necessarily to be attributed to the NRF. Computer Science. MSc (Computer Science). Unrestricte

    Text Classification Using Long Short-Term Memory With GloVe Features

    With traditional classification algorithms, problems of high feature dimensionality and data sparseness often occur when classifying text. Classifying text with traditional machine learning algorithms has high efficiency and stability characteristics. However, it has certain limitations with regard to large-scale dataset training. Deep Learning is a proposed method for solving problems in text classification techniques. By tuning the parameters and comparing the eight proposed Long Short-Term Memory (LSTM) models on a large-scale dataset, we show that LSTM with GloVe features can achieve good performance in text classification. The results show that text classification using LSTM with GloVe obtains its highest accuracy in the sixth model, at 95.17; the average precision, recall, and F1-score are 9