13 research outputs found

    Implementation and Comparison of Deep Learning with Naïve Bayes for Language Processing

    Text classification is one of the most important tasks in natural language processing. In this research, we carried out several experiments on three of the most popular text classifiers in NLP: Convolutional Neural Network (CNN), Multinomial Naive Bayes (MNB), and Support Vector Machine (SVM). Given enough training data, the deep learning CNN performed best on all evaluation measures with 77% accuracy, followed by SVM with 76% accuracy and Multinomial Naive Bayes with the lowest accuracy of 69%. CNN performs best when the training dataset is large enough because its filters (kernels) help to identify patterns in text data regardless of their position in the sentence. We then repeated the training with just one-third of our data. At this point SVM gave the best performance; the performance of CNN dropped noticeably but remained better than that of Multinomial Naive Bayes. We attribute the good performance of SVM on the reduced training data to its ability to find a hyperplane that forms a boundary between different classes of data so as to classify them properly; we believe that finding this hyperplane was more efficient on the reduced dataset. Multinomial Naive Bayes had the lowest performance, which we attribute to its assumption of independence between features, an assumption that sometimes does not hold. We conclude that the availability of data should be an important factor when choosing a classifier for NLP text classification tasks. CNN should be used when enough data is available, and SVM should be used when data is limited. Multinomial Naive Bayes must not be trusted with state-of-the-art NLP tasks because of its assumption of independence between features.
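
The independence assumption that the abstract above holds against Multinomial Naive Bayes can be made concrete with a minimal sketch. This is not the authors' implementation: the function names `train_mnb` and `predict_mnb`, the add-one smoothing, and the four-document toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing.

    docs is a list of token lists; labels is a parallel list of class names.
    Returns {label: (log_prior, {word: log_likelihood})}.
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        log_prior = math.log(n_label / len(docs))
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict_mnb(model, tokens):
    """Return the label with the highest log-posterior score.

    The independence assumption shows up here: the score is the log-prior
    plus a plain sum of per-word terms, so word order and word
    co-occurrence are ignored. Words outside the vocabulary are skipped.
    """
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_like) in model.items():
        score = log_prior + sum(log_like[w] for w in tokens if w in log_like)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, for illustration only (not the paper's data).
docs = [
    ["win", "money", "now"],
    ["cheap", "money", "offer"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_mnb(docs, labels)
print(predict_mnb(model, ["money", "offer"]))
print(predict_mnb(model, ["meeting", "notes"]))
```

Because each word contributes to the score separately, the model cannot represent a phrase whose meaning depends on word order or word combination, which is one plausible reading of the limitation the abstract describes.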

    Comparative Analysis of Deep Learning and Naïve Bayes for Language Processing Task

    Text classification is one of the most important tasks in natural language processing. In this research, we carried out several experiments on three of the most popular text classifiers in NLP: Convolutional Neural Network (CNN), Multinomial Naive Bayes (MNB), and Support Vector Machine (SVM). Given enough training data, the deep learning CNN performed best on all evaluation measures with 77% accuracy, followed by SVM with 76% accuracy and Multinomial Naive Bayes with the lowest accuracy of 69%. CNN performs best when the training dataset is large enough because its filters (kernels) help to identify patterns in text data regardless of their position in the sentence. We then repeated the training with just one-third of our data. At this point SVM gave the best performance; the performance of CNN dropped noticeably but remained better than that of Multinomial Naive Bayes. We attribute the good performance of SVM on the reduced training data to its ability to find a hyperplane that forms a boundary between different classes of data so as to classify them properly; we believe that finding this hyperplane was more efficient on the reduced dataset. Multinomial Naive Bayes had the lowest performance, which we attribute to its assumption of independence between features, an assumption that sometimes does not hold. We conclude that the availability of data should be an important factor when choosing a classifier for NLP text classification tasks. CNN should be used when enough data is available, and SVM should be used when data is limited. Multinomial Naive Bayes must not be trusted with state-of-the-art NLP tasks because of its assumption of independence between features.

    SURVEY OF E-MAIL CLASSIFICATION: REVIEW AND OPEN ISSUES

    Email is an economical means of communication whose importance is increasing despite the availability of alternatives such as electronic messaging, social networks, and phone applications. The business arena depends largely on email, which calls for proper email management in the face of disruptive factors such as spam, phishing emails, and multi-folder categorization. The present study reviews studies on email published during 2016-2020, analysing their problem descriptions in terms of datasets, application areas, classification techniques, and feature sets. In addition, other areas involving email classification were identified and comprehensively reviewed. The results indicate four email application areas, and the open issues and research directions of email classification are outlined for further investigation.

    Smart Substation Network Fault Classification Based on a Hybrid Optimization Algorithm

    Accurate network fault diagnosis in smart substations is key to strengthening grid security. To solve fault classification problems and enhance classification accuracy, we propose a hybrid optimization algorithm consisting of three parts: anti-noise processing (ANP), an improved separation interval method (ISIM), and a genetic algorithm-particle swarm optimization (GA-PSO) method. ANP cleans out the outliers and noise in the dataset. ISIM uses a support vector machine (SVM) architecture to optimize SVM kernel parameters. Finally, we propose the GA-PSO algorithm, which combines the advantages of both genetic and particle swarm optimization algorithms to optimize the penalty parameter. The experimental results show that our proposed hybrid optimization algorithm enhances the classification accuracy of smart substation network faults and outperforms existing methods.

    APLICAÇÃO DE MACHINE LEARNING NA IDENTIFICAÇÃO DE E-MAILS COMO SPAM (Application of Machine Learning to the Identification of E-mails as Spam)

    E-mail is one of the main tools in use today and an example of how technology facilitates the exchange of information. On the other hand, one of the biggest obstacles faced by e-mail services is spam, the name given to an unsolicited message received by a user. The application of machine learning has gained prominence in recent years as an alternative for efficient spam identification. In this area, different algorithms can be evaluated to identify which performs best. The objective of this study is to assess the ability of machine learning algorithms to classify e-mails correctly and to identify which algorithm achieved the highest accuracy. The dataset was taken from the Kaggle platform, and the data were processed in the Orange software with four algorithms: Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naive Bayes (NB). The data were split into 80% for training and 20% for testing. The results show that Random Forest was the best-performing algorithm, with an accuracy of 99%.
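
The 80%/20% train/test split and the accuracy measure mentioned in the abstract above can be sketched as follows. The study itself used the Orange software, so this is only an illustrative stand-in: the helper names `split_train_test` and `accuracy`, the fixed seed, and the placeholder rows are assumptions, not part of the original work.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test).

    The seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical placeholder data: 100 row identifiers, not the Kaggle dataset.
rows = list(range(100))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Accuracy is computed on the held-out 20% only, so that the reported figure reflects e-mails the classifier did not see during training.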

    A review of spam email detection: analysis of spammer strategies and the dataset shift problem

    Spam emails have traditionally been seen as merely annoying, unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. To ensure security and integrity for users, organisations and researchers aim to develop robust filters for spam email detection. Most recently published spam filters based on machine learning algorithms report very high performance in academic journals, yet users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one focuses particularly on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating emails, and we review the state-of-the-art techniques for developing filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%. Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under the Programa Operativo 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Actuación:20007-CL - Apoyo Consorcio BUCL

    A new hybridized dimensionality reduction approach using genetic algorithm and folded linear discriminant analysis applied to hyperspectral imaging for effective rice seed classification

    Hyperspectral imaging (HSI) has been reported to produce promising results in the classification of rice seeds. However, HSI data often require dimensionality reduction techniques to remove redundant data. Folded linear discriminant analysis (F-LDA) is an extension of linear discriminant analysis (LDA, a commonly used technique for dimensionality reduction) that was recently proposed to address the limitations of LDA, particularly its poor performance with a small number of training samples, which is a common scenario in HSI applications. This article presents an improved version of F-LDA, exploring the feasibility of hybridizing a genetic algorithm (GA) and F-LDA for effective dimensionality reduction in HSI-based rice seed classification. The proposed approach, inspired by the previous combination of GA with principal component analysis, is evaluated on rice seed datasets containing 256 spectral bands. Experimental results show that, in addition to attaining promising classification accuracies of up to 96.21%, this novel combination of GA and F-LDA (GA + F-LDA) can further reduce the computational complexity and memory requirements of standalone F-LDA. These benefits come with a slight reduction in classification accuracy relative to the figures reported for standard F-LDA (up to 96.99%).