    DESIGNING A PYTHON BASED TEXT PRE-PROCESSING APPLICATION FOR TEXT CLASSIFICATION

    Text pre-processing is the first step every document passes through in natural language processing. It converts text from human language into a machine-readable format for further processing. However, few dedicated text pre-processing applications exist, so each natural language processing study has had to write its own program code for this phase. The main focus of this research is to create an integrated text pre-processing application that any researcher who needs it can access. The study covers the design, implementation, testing, and integration of each text pre-processing feature. The features integrated in this research are case folding, tokenizing, and feature selection. The tools used are the NLTK library for Python and the Django framework. The application can be designed using the waterfall method. In the application stage, the NLTK library can be applied precisely and systematically. The library also eases the implementation phase because it provides a large number of NLP classes that can be applied directly.
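The three pre-processing steps named in the abstract (case folding, tokenizing, feature selection) can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the authors' NLTK/Django application. The function names, the tiny stop-word list, and the choice of "drop stop words, keep the most frequent terms" as the feature selection rule are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real application
# would use a fuller list (for example, one shipped with an NLP library).
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "for"}


def case_fold(text):
    """Case folding: map every character to lower case."""
    return text.lower()


def tokenize(text):
    """Tokenizing: split case-folded text into word tokens.

    Keeps runs of letters and digits, allowing one internal apostrophe
    (e.g. "don't"). Expects input that has already been case-folded.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text)


def select_features(tokens, top_k=5):
    """Feature selection (assumed rule): drop stop words, then keep the
    top_k most frequent remaining terms."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_k)]


def preprocess(text, top_k=5):
    """Run the three steps in order."""
    return select_features(tokenize(case_fold(text)), top_k)


if __name__ == "__main__":
    print(preprocess("The cat chased the other cat.", top_k=1))
```

Keeping each step as its own function mirrors the abstract's description of separately designed and tested features that are then integrated.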

    Improved scheme of e-mail spam classification using meta-heuristics feature selection and support vector machine

    With the technological revolution of the 21st century, electronic mail (e-mail) has reduced the time and distance of communication. The growing use of e-mail has, in turn, led to the emergence and growth of problems caused by unsolicited bulk e-mail, commonly referred to as spam e-mail. Many existing supervised algorithms, such as the Support Vector Machine (SVM), have been developed to stop spam e-mail. However, large data volumes and a high-dimensional feature space can lead to long execution times and low spam e-mail classification accuracy. Removing irrelevant and redundant features and finding the optimal (or near-optimal) subset of features significantly influence the performance of spam e-mail classification, and this has become one of the important challenges. Therefore, to optimize spam e-mail classification accuracy, the dimensionality reduction problem needs to be solved. Feature selection schemes are very important because they reduce dimensionality by selecting a proper subset of features, which facilitates the classification process. The objective of this study is to investigate and improve schemes that reduce the execution time and increase the accuracy of spam e-mail classification. The methodology comprises four schemes: the one-way ANOVA F-test; Binary Differential Evolution (BDE); Opposition Differential Evolution (ODE) and Opposition Particle Swarm Optimization (OPSO); and a combination of Differential Evolution (DE) and Particle Swarm Optimization (PSO). The four schemes were used to improve spam e-mail classification accuracy. The classification accuracy of the proposed schemes was 95.05%, with a population size of 50, 1000 iterations, 20 runs, and 41 features. The experiments were carried out on the spambase and spamassassin benchmark datasets to evaluate the feasibility of the proposed schemes. The experimental findings demonstrate that the improved schemes efficiently reduced the number of features and improved e-mail classification accuracy.
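As an illustration of the first scheme only, a one-way ANOVA F-test can rank features by how well each one separates the classes (for example, spam and non-spam). The sketch below is a plain-Python filter under stated assumptions: the function names, the toy data, and the "keep the top_k highest-scoring features" rule are hypothetical, and the code does not reproduce the BDE, ODE/OPSO, or DE/PSO schemes or the authors' implementation.

```python
from collections import defaultdict


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for a single feature.

    F = (between-class sum of squares / (k - 1))
        / (within-class sum of squares / (n - k))
    where k is the number of classes and n the number of samples.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")

    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)

    if ss_within == 0:
        # No within-class spread: perfectly separating or constant feature.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(rows, labels, top_k):
    """Return the column indices of the top_k features by F score."""
    n_features = len(rows[0])
    scores = [anova_f_score([row[j] for row in rows], labels)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k]


if __name__ == "__main__":
    # Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
    rows = [[1.0, 5.0], [2.0, 4.0], [9.0, 5.0], [10.0, 4.0]]
    labels = [0, 0, 1, 1]
    print(rank_features(rows, labels, top_k=1))
```

A filter like this scores each feature independently, which is cheap; the selected columns would then be passed to a classifier such as an SVM.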