48 research outputs found

    SMS Spam Detection in a Real-World Platform using Machine Learning

    Spam detection techniques have made our lives easier by unclogging our inboxes and keeping unsafe messages from being opened. With the automation of text messaging solutions and the increase in telecommunication companies and message providers, the volume of text messages has been on the rise. With this growth came malicious traffic over which users had little control. In this thesis, we present an implementation of a spam detection system in a real-world text messaging platform. Using well-established machine learning algorithms, we analyze in depth the performance of the models on two different datasets: one publicly available (N=5,574) and the other gathered from actual traffic of the platform (N=1,477). Based on the empirical results, we outline the models and hyperparameters that can be used in the platform and the scenarios in which they produce optimal performance. The results indicate that our dataset poses a great challenge for accurate classification, most likely due to its small sample size and class imbalance, along with nuances in the data. Nevertheless, some models were found to have good all-around performance, and they can be trained and used in the platform.

    Artificial Intelligence Approaches for Filtering of Spams

    This thesis focuses on e-mail classification and describes the basic methods of spam filtering. Bayesian spam classifiers and artificial immune systems are analyzed and applied. Existing applications and evaluation metrics are also described. The aim of the thesis is to design and implement an algorithm for spam filtering. Finally, the results are compared with selected known methods.

    Sentiment analysis in context: Investigating the use of BERT and other techniques for ChatBot improvement

    In an increasingly digitized world, where large amounts of data are generated daily, efficient analysis of that data has become more and more necessary. Natural Language Processing (NLP) offers a solution by exploiting the power of artificial intelligence to process texts, understand their content, and perform specific tasks. The thesis is based on an internship at Pat Srl, a company devoted to creating solutions that support digital innovation, process automation, and service quality, with the ultimate goal of improving leadership and customer satisfaction. The primary objective of this thesis is to develop a sentiment analysis model in order to improve the customer experience for clients using the ChatBot system created by the company itself. This task has gained significant attention in recent years, as it can be applied to different fields, including social media monitoring, market research, brand monitoring, and customer experience and feedback analysis. Following a careful analysis of the available data, a comprehensive evaluation of various models was conducted. Notably, BERT, a large language model that has provided promising results in several NLP tasks, stood out among them. Different approaches using the BERT models were explored, varying the fine-tuning modality and the architectural structure. Moreover, some preprocessing steps were emphasized and studied, due to the particular nature of the sentiment analysis task. During the internship, the dataset underwent revisions aimed at mitigating the problem of inaccurate predictions. Additionally, techniques for data balancing were tested and evaluated, enhancing the overall quality of the analysis. Another important aspect of this project involved the deployment of the model. In a business environment, it is essential to carefully consider and balance resources before transitioning to production. The model was distributed using specific tools, such as Docker and Kubernetes. These specialized technologies played a pivotal role in ensuring efficient and seamless deployment.

    Applied Machine Learning for Cybersecurity in Spam Filtering and Malware Detection

    Machine learning is one of the fastest-growing fields, and its application to cybersecurity is increasing. To protect people from malicious attacks, several machine learning algorithms have been used to predict such attacks. This research emphasizes two vulnerable areas of cybersecurity that could be easily exploited. First, we show that spam filtering is a well-known problem that has been addressed by many authors, yet it still has vulnerabilities. Second, with the increase of malware threats, many companies use AutoAI to help protect their systems. Nonetheless, AutoAI is not perfect, and data scientists can still design better models. In this thesis I show that although there are efficient mechanisms to prevent malicious attacks, there are still vulnerabilities that could be easily exploited. In the visual spoofing experiment, we show that using a classifier trained on data in the Latin alphabet to classify a message containing a combination of Latin and Cyrillic letters leads to much lower classification accuracy. In the malware prediction experiment, our model was able to predict malware attacks on Microsoft computers and achieved higher accuracy than any well-known AutoAI.
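
    The visual spoofing setting described above, where Cyrillic letters are mixed into otherwise Latin text, can be illustrated with a small sketch. This is not the thesis's method; it is a hypothetical pre-check that flags tokens mixing scripts, using only the Python standard library. The function names and the sample message are assumptions made for illustration.

```python
import unicodedata


def scripts_in(token):
    """Return the script labels of the letters in a token.

    The label is taken from the first word of each character's Unicode
    name (for example "LATIN" or "CYRILLIC"); this is a rough heuristic,
    not a full Unicode script lookup.
    """
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_tokens(message):
    """Return the whitespace-separated tokens that mix more than one script."""
    return [tok for tok in message.split() if len(scripts_in(tok)) > 1]


# Hypothetical spoofed message: the two "e" letters in "free" are
# CYRILLIC SMALL LETTER IE (U+0435), which looks like Latin "e".
spoofed = "Claim your fr\u0435\u0435 prize"
print(mixed_script_tokens(spoofed))
```

    A token flagged this way looks identical to a reader but yields different features for a classifier trained only on Latin text, which is one plausible reason for the accuracy drop the abstract reports.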

    A Context-Dependent Supervised Learning Approach to Sentiment Detection in Large Textual Databases

    Sentiment detection automatically identifies emotions in textual data. The increasing amount of emotive documents available in corporate databases and on the World Wide Web calls for automated methods to process this important source of knowledge. Sentiment detection draws attention from researchers and practitioners alike, for example to enrich business intelligence applications or to measure the impact of customer reviews on purchasing decisions. Most sentiment detection approaches do not consider language ambiguity, despite the fact that one and the same sentiment term might differ in polarity depending on the context in which a statement is made. To address this shortcoming, this paper introduces a novel method that uses Naïve Bayes to identify ambiguous terms. A contextualized sentiment lexicon stores the polarity of these terms, together with a set of co-occurring context terms. A formal evaluation of the assigned polarities confirms that considering the usage context of ambiguous terms improves the accuracy of high-throughput sentiment detection methods. Such methods are a prerequisite for using sentiment as a metadata element in storage and distributed file-level intelligence applications, as well as in enterprise portals that provide a semantic repository of an organization's information assets.

    Performance of Gaussian Naïve Bayes for classification with dependencies from Archemedian copula

    Master's Project (M.S.) University of Alaska Fairbanks, 2022. Naive Bayes is an application of Bayes' theorem in which the likelihood function is factored into marginals by assuming that the variables are independent. Naive Bayes is typically used for classification problems in which the goal is to find the class with the largest probability given the data on hand. When the data on hand are continuous real numbers, we can further assume they are class-conditionally normally distributed, which gives a particular version of Naive Bayes called Gaussian Naive Bayes. This paper explores when Gaussian Naive Bayes classification works well and when it does not. Typically, when assumptions are not valid, valid conclusions cannot be drawn. However, Naive Bayes is known to be robust even when the independence assumption is not met. We show using simulations that the binary classification accuracy of Naive Bayes is much more sensitive to differences in the class-conditional marginal distributions than to the correlation between predictors. Additionally, we show that Naive Bayes completely fails when predictors are generated using a Gumbel copula, and we compare the results with a general Bayes classifier and the K-Nearest Neighbors classifier.
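
    The factorization described above can be made concrete with a minimal sketch of Gaussian Naive Bayes in plain Python. It is an illustration under stated assumptions, not the project's code: the function names, the variance floor of 1e-9, and the toy data are all hypothetical. Each class score is the log prior plus a sum of per-feature normal log densities, which is exactly the independence assumption at work.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Floor keeps variances strictly positive (an illustrative choice).
        variances = [
            max(sum((v - m) ** 2 for v in c) / len(c), 1e-9)
            for c, m in zip(cols, means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def log_posterior(model, x):
    """Unnormalized log posterior: log prior + sum of marginal log densities."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, x):
    """Return the class with the largest posterior score."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


# Toy data (hypothetical): two well-separated classes with two features.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.1]), predict(model, [4.1, 4.9]))
```

    Because the score only ever uses per-feature marginals, any dependence structure between predictors (such as one induced by a copula) is invisible to the model, which is the situation the abstract investigates.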

    Machine Learning Algorithms for Smart Data Analysis in Internet of Things Environment: Taxonomies and Research Trends

    Machine learning techniques will contribute towards making Internet of Things (IoT) symmetric applications among the most significant sources of new data in the future. In this context, network systems are endowed with the capacity to access varieties of experimental symmetric data across a plethora of network devices, study the data, obtain knowledge, and make informed decisions based on the dataset at their disposal. This study is limited to supervised and unsupervised machine learning (ML) techniques, regarded as the bedrock of IoT smart data analysis. It reviews and discusses substantial issues related to supervised and unsupervised machine learning techniques, highlights the advantages and limitations of each algorithm, and discusses research trends and recommendations for further study.

    Hate Speech Detection for Banjarese Languages on Instagram Using Machine Learning Methods

    Hate speech refers to verbal expression or communication that aims to provoke or discriminate against individuals. The Ministry of Communication and Information of Indonesia encountered and dealt with 3,640 cases of hate speech transmitted through digital channels between 2018 and 2021. Particularly in South Kalimantan, hate speech in the local language, Banjarese, has become increasingly prevalent in recent years. Surprisingly, there is a lack of research on using machine learning to detect hate speech in the Banjarese language, specifically on Instagram. Therefore, this study aimed to address this gap by constructing a dataset of Banjarese language hate speech and comparing various feature extraction and machine learning models to detect Banjarese language hate speech effectively. The feature extraction methods used were Word N-Gram, Term Frequency-Inverse Document Frequency (TF-IDF), a combination of Word N-Gram and TF-IDF, Word2Vec, and GloVe, while the machine learning methods used were Support Vector Machine (SVM), Naïve Bayes, and Decision Tree. The results revealed that the combination of TF-IDF for feature extraction and SVM as the model achieved exceptional performance. The average Recall, Precision, Accuracy, and F1-Score exceeded 90%, demonstrating the model’s ability to identify Banjarese hate speech accurately.
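
    As a rough illustration of the TF-IDF feature extraction named above, the following sketch computes weights in plain Python. It assumes one common variant, term frequency (count divided by document length) times log(N / document frequency); the study does not state which variant or library it used. The function name and the English toy sentences are hypothetical, and the SVM classification step is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Toy corpus (hypothetical, English for readability).
docs = ["you are awful", "you are kind", "awful awful weather"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

    Terms that occur in every document receive a weight of zero under this variant, so the resulting vectors emphasize words that distinguish one message from the rest, which is the property a downstream classifier such as an SVM relies on.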