6 research outputs found

    Universal Spam Detection using Transfer Learning of BERT Model

    Several machine learning and deep learning algorithms have been limited to a single dataset of spam emails or texts, which wastes valuable resources on individual models. This research applied efficient classification of ham or spam emails in real-time scenarios. Deep learning transformer models, which are trained on text data using self-attention mechanisms, have become important. This manuscript demonstrates a novel universal spam detection model using Google's pre-trained Bidirectional Encoder Representations from Transformers (BERT) base uncased model with multiple spam datasets. Different methods were used to train models individually on the Enron, SpamAssassin, Ling-Spam, and SpamText message classification datasets. The combined model was then fine-tuned with the hyperparameters of each individual model. When each model was trained on its corresponding dataset, the F1-score was 0.9. The "universal model" was trained on all four datasets and leveraged the hyperparameters from each model. It reached an overall accuracy of 97%, with an F1-score of 0.96, across all four datasets combined.

    A Comparative Study of Open-Domain and Specific-Domain Word Sense Disambiguation Based on Quranic Information Retrieval

    Information retrieval is the process of analysing a typed query and retrieving the documents relevant to it. Several issues can significantly affect the effectiveness of information retrieval. One common issue is word ambiguity, where a single word can have several meanings. The process of identifying the exact sense of a word is called Word Sense Disambiguation (WSD). The Quran is the holy book for nearly 1.5 billion Muslims around the world. In particular, the Quran contains numerous words that can take multiple meanings. There is therefore a vital demand to apply WSD to the Quran in order to improve information retrieval. Several WSD approaches have been proposed for Quranic retrieval. These approaches fall into two main categories: open-domain WSD and domain-specific WSD. Open-domain WSD uses an open-domain dictionary such as WordNet to provide the exact sense, whereas domain-specific WSD uses restricted training data containing senses specific to the domain of the Quran. This study therefore compares the two WSD categories, domain-specific and open-domain. For the domain-specific approach, a predefined set of examples was collected to train the Yarowsky algorithm, a semi-supervised machine learning technique. Based on this training, the algorithm can classify the exact sense of a word. In contrast, WordNet, an open-domain dictionary, was used with semantic distances to measure the similarity between the query word and the concepts returned by WordNet. The dataset used in this study is a Quranic translation. The experimental results showed mixed superiority between the Yarowsky algorithm and the WordNet WSD approach.

    A Comparative Study of Word Embedding Techniques for SMS Spam Detection

    The file attached to this record is the author's final peer-reviewed version. The Publisher's final version can be found by following the DOI link. E-mail and SMS are the most popular communication tools used by businesses, organizations and educational institutions. Every day, people receive hundreds of messages, which could be either spam or ham. Spam is any form of unsolicited, unwanted digital communication, usually sent out in bulk. Spam emails and SMS waste resources by unnecessarily flooding network lines and consuming storage space. It is therefore important to develop high-accuracy spam detection models that effectively block spam messages, so as to optimize resources and protect users. Various word-embedding techniques such as Bag of Words (BOW), N-grams, TF-IDF, Word2Vec and Doc2Vec have been widely applied to NLP problems; however, a comparative study of these techniques for SMS spam detection is currently lacking. Hence, in this paper, we provide a comparative analysis of these popular word-embedding techniques for SMS spam detection by evaluating their performance on a publicly available ham and spam dataset. We investigate the performance of the word-embedding techniques using five different machine learning classifiers, namely Multinomial Naive Bayes (MNB), KNN, SVM, Random Forest and Extra Trees. On the dataset employed in the study, N-gram, BOW and TF-IDF with oversampling recorded the highest F1 scores: 0.99 for ham and 0.94 for spam.
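As a rough illustration of three of the feature representations named in this abstract (Bag of Words, N-grams and TF-IDF), the following is a minimal pure-Python sketch. The whitespace tokenizer, the function names, the toy messages and the unsmoothed `tf * log(N / df)` weighting are all assumptions made for illustration; they are not taken from the paper, and library implementations often use smoothed variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def ngrams(tokens, n):
    # Contiguous word n-grams, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, n=1):
    # Count how often each n-gram occurs in one message.
    return Counter(ngrams(tokenize(text), n))


def tfidf(docs):
    # Unsmoothed TF-IDF: (count / message length) * log(N / document frequency).
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per message.
    df = Counter(term for bow in bows for term in bow)
    vectors = []
    for bow in bows:
        total = sum(bow.values())
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()}
        )
    return vectors


# Hypothetical toy messages, not drawn from any dataset in the study.
MESSAGES = ["win a free prize now", "are we meeting now"]
vectors = tfidf(MESSAGES)
```

In this sketch a term that occurs in every message (here `now`) receives a weight of zero, which is the usual motivation for TF-IDF over raw counts.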

    Estudo comparativo de modelos de aprendizado de máquina para detecção de email spam (Comparative study of machine learning models for email spam detection)

    Undergraduate final project (Trabalho de conclusão de curso), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2020. Because of the high volume of spam email traffic and its undesirable or even, in some cases, harmful nature, this work examines the progress of supervised machine learning models that classify emails as legitimate or spam as the vectorization of the textual features is varied. The experiments are split into phases of increasing complexity in order to explore the learning ability of the models, evaluating their development numerically and graphically and seeking the best results in the last phase. The main measure used to validate the solution is the F1-score, together with analysis of the learning curves. The algorithms used were SVM, Naive Bayes and KNN. The SVM models responded best as the training complexity increased, obtaining the highest results on all datasets, whereas the other two algorithms showed greater sensitivity and uncertainty regarding the measures taken at each stage. Possible extensions of this work include expanding the datasets used, especially to verify the progress of polynomial-kernel SVM models; implementing new features extracted from misclassified texts; and using regression techniques to better evaluate the learning curves.
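Since the F1-score is the main evaluation measure in this abstract, a minimal sketch of how it is computed from true and predicted labels may be useful. The function name, the choice of `"spam"` as the positive class and the example labels are assumptions for illustration only, not data or code from the thesis.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    # Count true positives, false positives and false negatives
    # for the class treated as positive.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for five emails.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Passing `positive="ham"` gives the per-class score for legitimate mail instead, which is how separate ham and spam F1 values are typically reported.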