209 research outputs found

    Improving the Prediction Accuracy of Text Data and Attribute Data Mining with Data Preprocessing

    Data mining is the extraction of valuable information from patterns in data and its transformation into useful knowledge. Data preprocessing is an important step in the data mining process: the quality of the data affects the accuracy of the mining results. Hence, data preprocessing is one of the critical steps in a data mining process. Within text mining, document classification is a growing field. Although many classification approaches exist, the Naïve Bayes classifier performs well because of its simplicity and effectiveness. The aim of this paper is to identify the impact of preprocessing the dataset on the performance of a Naïve Bayes classifier, which is suggested as a suitable method for identifying spam emails. The impact of the preprocessing phase on the classifier's performance is analyzed by comparing the results obtained on the preprocessed dataset with those obtained on the non-preprocessed dataset. The test results show that combining Naïve Bayes classification with proper data preprocessing can improve prediction accuracy. In attribute data mining, the decision tree is an important classification technique; decision trees have proved to be valuable tools for the classification, description, and generalization of data. J48 is a decision tree algorithm used to create classification models; it is an open-source Java implementation of the C4.5 algorithm in the Weka data mining tool. In this paper, we present a method for improving the accuracy of decision tree mining with data preprocessing: we apply supervised filter discretization before running the J48 algorithm to construct a decision tree, and compare the results with J48 without discretization. The experimental results show that the accuracy of J48 after discretization is better than that of J48 before discretization.
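    As a rough illustration of the comparison this abstract describes, the sketch below trains the same multinomial Naïve Bayes on raw versus preprocessed tokens. The toy corpus and the preprocessing steps chosen here (case folding, punctuation stripping) are illustrative assumptions, not the paper's actual pipeline.

```python
import math
import re

# Toy spam corpus -- an illustrative stand-in for the paper's dataset.
emails = [("WIN cash NOW!!!", 1), ("Meeting at noon tomorrow", 0),
          ("win a FREE prize now", 1), ("lunch at noon?", 0)]

def tokenize(text, preprocess):
    if preprocess:
        text = text.lower()                    # case folding
        text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation/digits
    return text.split()

def train_nb(data, preprocess):
    # Multinomial Naive Bayes with Laplace smoothing.
    counts = {0: {}, 1: {}}
    class_n = {0: 0, 1: 0}
    for text, y in data:
        class_n[y] += 1
        for w in tokenize(text, preprocess):
            counts[y][w] = counts[y].get(w, 0) + 1
    vocab = {w for c in counts.values() for w in c}

    def predict(text):
        scores = {}
        for y in (0, 1):
            total = sum(counts[y].values())
            s = math.log(class_n[y] / len(data))
            for w in tokenize(text, preprocess):
                s += math.log((counts[y].get(w, 0) + 1)
                              / (total + len(vocab) + 1))
            scores[y] = s
        return max(scores, key=scores.get)
    return predict

clf = train_nb(emails, preprocess=True)
print(clf("Win a free PRIZE"), clf("lunch tomorrow at noon"))  # -> 1 0
```

    With preprocessing enabled, "WIN" and "win" collapse into a single vocabulary entry, which is exactly the kind of effect that can shift the classifier's accuracy between the two runs the paper compares.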

    Handling Continuous Features with Expectation Maximization Clustering-Based Feature Discretization for Spam Email Classification Using the ID3 Algorithm

    Use of the internet is growing rapidly today; one such use is electronic mail, or email. Recently, spam email has become a widely discussed problem. Spam emails are unsolicited and unwanted messages from strangers, sent in bulk to mailing lists, usually with a commercial character. Spam reduces employee productivity, since time must be spent deleting the spam messages. To address this problem, an email filter is needed that detects spam so that it does not appear in the inbox. Many researchers have tried to build email filters with a variety of methods, but none has yet achieved maximal accuracy. In this study, classification is performed using the Iterative Dichotomiser 3 (ID3) decision tree algorithm, because ID3 is the most widely used decision tree algorithm, known for its high classification speed, strong learning ability, and easy construction. However, ID3 cannot handle continuous features, so classification cannot be performed directly. In this study, feature discretization based on Expectation Maximization (EM) clustering is used to convert continuous features into discrete features, so that spam email classification can be carried out. Experimental results show that ID3 can classify spam email with an accuracy of 91.96% when using 90% of the data for training, an improvement of 28.05% over ID3 classification using binning.
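    A minimal sketch of the discretization idea: fit a one-dimensional Gaussian mixture to each continuous feature by EM, then replace each value with the index of its most probable mixture component. The deterministic initialization and the toy feature values below are assumptions for illustration, not the paper's setup.

```python
import math

def em_discretize(values, k=2, iters=100):
    """Fit a 1-D Gaussian mixture to one continuous feature by EM,
    then map each value to its most probable component index -- the
    discrete bin ID3 can consume."""
    lo, hi = min(values), max(values)
    mu = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]  # spread means
    var = [((hi - lo) / k) ** 2 + 1e-6] * k
    pi = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            w = [pi[j] / math.sqrt(2 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 for j in range(k)]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M-step: re-estimate mixture weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(values)
            mu[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, values)) / nj + 1e-6
    return [max(range(k), key=lambda j: r[j]) for r in resp]

lengths = [1.2, 0.8, 1.1, 9.5, 10.2, 9.9]  # e.g. a message-length feature
print(em_discretize(lengths, k=2))
```

    Unlike fixed-width binning, the bin boundaries here adapt to the clusters the EM procedure finds in the data, which is the advantage the abstract's accuracy comparison rests on.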

    Variable selection for Naive Bayes classification

    Naive Bayes has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naive Bayes assumption of conditional independence and may deteriorate the method's performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method's execution. In this paper we propose a sparse version of the Naive Bayes classifier that is characterized by three properties. First, sparsity is achieved taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search, which yields competitive running times, while integrating flexibility in the choice of performance measure for classification. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naive Bayes obtains competitive results regarding accuracy, sparsity, and running times for balanced datasets. In the case of datasets with unbalanced (or differently important) classes, a better compromise between classification rates for the different classes is achieved. This research is partially supported by research grants and projects MTM2015-65915-R (Ministerio de Economia y Competitividad, Spain) and PID2019-110886RB-I00 (Ministerio de Ciencia, Innovacion y Universidades, Spain), FQM-329 and P18-FR-2369 (Junta de Andalucia, Spain), PR2019-029 (Universidad de Cadiz, Spain), Fundacion BBVA, and the EC H2020 MSCA RISE NeEDS Project (Grant agreement ID: 822214). This support is gratefully acknowledged.
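    The paper's smart search is correlation-aware and supports group constraints; as a much-simplified stand-in, the sketch below does plain greedy forward selection around a Gaussian Naive Bayes, guided by a performance measure (here, training accuracy). The data and the stopping rule are illustrative assumptions.

```python
import math

def nb_accuracy(X, y, feats):
    # Gaussian Naive Bayes restricted to the feature subset `feats`,
    # scored by training accuracy -- the guiding performance measure.
    classes = sorted(set(y))
    stats = {}
    for c in classes:
        rows = [x for x, yc in zip(X, y) if yc == c]
        per = []
        for f in feats:
            vals = [r[f] for r in rows]
            m = sum(vals) / len(vals)
            v = sum((u - m) ** 2 for u in vals) / len(vals) + 1e-6
            per.append((m, v))
        stats[c] = (len(rows), per)

    def predict(x):
        def score(c):
            n, per = stats[c]
            s = math.log(n / len(X))
            for f, (m, v) in zip(feats, per):
                s += -((x[f] - m) ** 2) / (2 * v) \
                     - 0.5 * math.log(2 * math.pi * v)
            return s
        return max(classes, key=score)

    return sum(predict(x) == yc for x, yc in zip(X, y)) / len(X)

def greedy_select(X, y, max_feats=2):
    # Forward selection: add the feature that most improves the
    # performance measure; stop when no candidate helps.
    selected = []
    while len(selected) < max_feats:
        cand = max((f for f in range(len(X[0])) if f not in selected),
                   key=lambda f: nb_accuracy(X, y, selected + [f]))
        if nb_accuracy(X, y, selected + [cand]) <= nb_accuracy(X, y, selected):
            break
        selected.append(cand)
    return selected

# Feature 0 separates the classes; feature 1 is pure noise.
X = [[0.5, 5.0], [1.0, 4.9], [0.8, 5.1],
     [9.5, 5.0], [10.2, 5.1], [9.8, 4.9]]
y = [0, 0, 0, 1, 1, 1]
print(greedy_select(X, y))  # -> [0]
```

    Swapping the accuracy function for another measure (e.g. a per-class rate) is what gives selection schemes of this kind the flexibility the abstract highlights for unbalanced classes.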

    Multi-Model Network Intrusion Detection System Using Distributed Feature Extraction and Supervised Learning

    Intrusion Detection Systems (IDSs) monitor network traffic and system activities to identify any unauthorized or malicious behaviors. These systems usually leverage the principles of data science and machine learning to detect deviations from normalcy by learning from data associated with normal and abnormal patterns. IDSs continue to suffer from issues like distributed high-dimensional data, inadequate robustness, slow detection, and high false-positive rates (FPRs). We investigate these challenges, determine suitable strategies, and propose relevant solutions based on appropriate mathematical and computational concepts. To handle high-dimensional data in a distributed network, we optimize the feature space in a distributed manner using a PCA-based feature extraction method. The experimental results show that classifiers built on the features extracted in this way perform well, achieving a level of accuracy similar to that of classifiers using centrally extracted features. This method also significantly reduces the cumulative time needed for extraction. Utilizing the extracted features, we construct a distributed probabilistic classifier based on Naïve Bayes. Each node counts the local frequencies and passes them on to the central coordinator, which accumulates them into global frequencies; the nodes then use these to compute the prior probabilities required to perform classifications. Each node, being evenly trained, is capable of detecting intrusions individually, improving the overall robustness of the system. We also propose a similarity measure-based classification (SMC) technique that computes the cosine similarities between the class-specific frequential weights of the values in an observed instance and the average frequency-based data centroid.
    An instance is classified into the class whose weights for the values in it share the highest level of similarity with the centroid. SMC contributes alongside Naïve Bayes in a multi-model classification approach, which we introduce to reduce the FPR and improve detection accuracy. This approach utilizes the similarities associated with each class label determined by SMC and the probabilities associated with each class label determined by Naïve Bayes. The similarities and probabilities are aggregated, separately, to form new features that are used to train and validate a tertiary classifier. We demonstrate that such a multi-model approach can attain a higher level of accuracy than single-model classification techniques. The contributions made by this dissertation to enhance scalability, robustness, and accuracy can help improve the efficacy of IDSs.
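    The core of the similarity measure-based classification step can be sketched as follows. The tiny feature vectors and class labels are made-up placeholders, and the per-class centroid here is a plain average rather than the dissertation's frequential-weight construction.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fit_centroids(X, y):
    # Average feature vector per class: a frequency-based data centroid.
    cents = {}
    for c in set(y):
        rows = [x for x, yc in zip(X, y) if yc == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def smc_predict(cents, x):
    # Assign the class whose centroid is most cosine-similar to x.
    return max(cents, key=lambda c: cosine(cents[c], x))

# Placeholder traffic feature vectors (e.g. per-connection counts).
X = [[1, 0, 2], [2, 1, 3], [9, 8, 1], [8, 9, 0]]
y = ["normal", "normal", "attack", "attack"]
cents = fit_centroids(X, y)
print(smc_predict(cents, [2, 1, 2]))  # -> normal
```

    The per-class similarity scores produced this way, alongside the per-class probabilities from Naïve Bayes, are the kind of values the multi-model approach aggregates into new features for the tertiary classifier.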

    Multi-heuristic theory assessment with iterative selection

    Modern-day machine learning is not without its shortcomings. To start with, accuracy, the standard assessment criterion for machine learning, is not always the best heuristic for gauging the performance of machine learners. Machine learners also often produce theories that are unintelligible to people and must be assessed as automated classifiers by machines; these theories are either too large or not properly formatted for human interpretation. Furthermore, our studies have identified that most of the data sets we have encountered are saturated with worthless data that actually degrades the accuracy of machine learners. Therefore, simpler learning is often preferable, which calls for a simpler classifier that is not confused by highly correlated data. Lastly, existing machine learners are not sensitive to domains; that is, they are not tunable to search for theories that are most beneficial to specific domains.