    Information Retrieval System with Ranking by the Vector Space Model Method (original title: Sistem Temu Kembali Informasi dengan Pemeringkatan Metode Vector Space Model)

    This work designs an information retrieval system (IRS) based on the Vector Space Model (VSM) method to help users search Indonesian-language documents. The IRS software is designed to return an optimal number of documents (low recall) with high accuracy (high precision), so that users get fast and accurate results. The VSM method assigns a different weight to each document stored in the database in order to determine which document is most similar to the query; the documents with the highest weights are placed at the top of the search results. Search results are evaluated with recall and precision tests. The resulting system completes preprocessing (tokenizing, filtering, and stemming) in a computation time of four minutes and forty-one seconds.
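
    The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes TF-IDF term weights with a smoothed IDF and cosine similarity between query and document, which is a common way to realise VSM but is not spelled out in the abstract. The function names (`tokenize`, `tf_idf_vectors`, `cosine`, `rank`) are hypothetical, and the filtering and stemming stages are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no filtering or stemming here)."""
    return text.lower().split()


def tf_idf_vectors(docs, query):
    """Return sparse TF-IDF vectors for every document and for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption): terms found in every document stay positive.
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1.0 for t, f in df.items()}

    def weigh(tokens):
        tf = Counter(tokens)
        # Terms that never occur in the collection receive no weight.
        return {t: count * idf[t] for t, count in tf.items() if t in idf}

    return [weigh(tokens) for tokens in tokenized], weigh(tokenize(query))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (score, document index) pairs, highest score first."""
    doc_vecs, q_vec = tf_idf_vectors(docs, query)
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))
```

    Calling `rank(docs, query)` on a list of document strings yields the documents ordered by similarity to the query, so the document sharing the most heavily weighted terms with the query comes first and documents with no shared terms score zero.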

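    Design and Development of an Information Retrieval System (IRS) for Ngoko Javanese in Palintangan Penjebar Semangad Using the Vector Space Model (VSM) Method (original title: Rancang Bangun Information Retrieval System (IRS) Bahasa Jawa Ngoko pada Palintangan Penjebar Semangad dengan Metode Vector Space Model (VSM))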

    Javanese is the most widely used regional language in Indonesia, yet it is beginning to be abandoned. Javanese therefore needs to be preserved in an online form that its users can access, which makes it easier to search text documents, especially documents in Ngoko Javanese. The IRS software is designed to return an optimal number of documents (low recall) with high accuracy (high precision) using the VSM method, so that users get fast and accurate search results. The VSM method weights every document in the database so that documents carry different weights, which determine which document is most similar to the query; the document with the highest weight takes the top rank in the search results. IRS search results are evaluated with recall and precision tests. In the case study carried out with this IRS, the system completed preprocessing (tokenization, filtering, and stemming) in a computation time of 18 seconds. The system searched for documents and displayed the results in an average computation time of 2 seconds, with an average recall of 0.04 and an average precision of 0.84. The system also reports each document's weight and location, which makes it easier for users to search Indonesian-language text documents.
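
    The recall and precision tests mentioned above can be computed per query as in the sketch below. It uses the standard set-based definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved); the function name `precision_recall` and the document identifiers in the usage note are hypothetical and are not taken from the study.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from document identifiers."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    # Documents that were both returned by the system and judged relevant.
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

    As an invented example, a query that returns 5 documents of which 4 are relevant, from a collection holding 100 relevant documents, gives a precision of 4/5 and a recall of 4/100. This shows how a system that returns only a short result list can combine high precision with low recall. Averages such as those reported in the abstract would then be the mean of these values over all test queries.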

    Optimal Feature Subset Selection Based on Combining Document Frequency and Term Frequency for Text Classification

    Feature selection plays a vital role in reducing the high dimensionality of the feature space in text document classification. Reducing the dimension of the feature space lowers the computation cost and improves the accuracy of the text classification system. Hence, a proper subset of the significant features of the text corpus must be identified in order to classify the data in less computational time and with higher accuracy. In this research, a novel feature selection method that combines the document frequency and the term frequency (FS-DFTF) is used to measure the significance of a term. The optimal feature subset selected by the proposed method is evaluated using Naive Bayes and Support Vector Machine classifiers on several popular benchmark text corpora. The experimental results confirm that the proposed method achieves better classification accuracy than other feature selection techniques.

    Spam Email Recognition Using Neural Networks (original title: Αναγνώριση ανεπιθύμητης αλληλογραφίας με χρήση νευρωνικών δικτύων)

    At a time when spam messages flood every available mailbox, automated recognition and filtering of spam has become a necessity. This project first studied the spam filtering techniques used in recent years, with emphasis on approaches based on machine learning algorithms. A spam filter was then developed using neural networks, specifically the Multilayer Perceptron (MLP) algorithm, with the Weka collection of machine learning and data mining algorithms from the University of Waikato. The parameters of the neural network were studied and the most suitable values were chosen in order to maximize the classification performance of the filter. Both the techniques used and the experimental results are presented in this work.