7 research outputs found

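Klasifikasi Dokumen Menggunakan Kombinasi Algoritma Principal Component Analysis dan SVM (Document Classification Using a Combination of the Principal Component Analysis Algorithm and SVM)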

Author: MICHAEL FREDDY HAMONANGAN SIANTURI
Publication venue: Universitas Telkom
Publication date: 03/11/2017

ABSTRACT Text document classification is a simple but very important problem, because its benefits are considerable given that the number of documents grows every day. However, most existing document classification techniques require a large number of labeled documents for the training and testing stages. In this final project, document classification is performed using the Principal Component Analysis algorithm combined with Support Vector Machines for supervised documents. Principal Component Analysis is a technique that can extract structure from high-dimensional data without losing information that is significant to the data as a whole. An algorithm is then needed to produce predictions and accuracy figures for the documents, namely Support Vector Machines (SVM). SVM is a machine learning method based on the principle of Structural Risk Minimization (SRM), with the goal of finding the best hyperplane separating two classes in the input space. The best separating hyperplane between the two classes can be found by measuring the hyperplane's margin and finding its maximum. System tests on data reduced by Principal Component Analysis (PCA) yielded slightly lower accuracy for certain datasets than tests without PCA. The data used is the R8 subset of the Reuters-21578 Text Categorization Collection Data Set. The best accuracy in this study was obtained with the SVM method, with an average accuracy of 98.95%, while the SVM + PCA method obtained an average accuracy of 96.7866%. Keywords: Document Classification, Principal Component Analysis, Support Vector Machines
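
The maximum-margin idea mentioned in the abstract above is usually written in the standard hard-margin form below. This is a textbook formulation added for orientation, not the thesis's own notation: the symbols w (normal vector), b (offset), x_i (document vectors), and y_i in {-1, +1} (class labels) are assumptions.

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```

Under these constraints the margin between the two classes is 2 / ||w||, so minimizing ||w|| is equivalent to maximizing the margin.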

Open Library

A Comprehensive Filter Feature Selection for Improving Document Classification

Author: Ho Bao Quoc
Le Nguyen Hoai Nam
Publication date: 01/01/2015

Waseda University Repository

A Comparative Analysis of Machine Learning Models for Banking News Extraction by Multiclass Classification With Imbalanced Datasets of Financial News: Challenges and Solutions

Author: Dogra Varun
Ghosh Uttam
Jhanjhi NZ
Le Dac-Nhuong
Verma Kavita
Verma Sahil
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 20/05/2022

Online portals provide an enormous number of news articles every day. Over the years, numerous studies have concluded that news events have a significant impact on forecasting and interpreting the movement of stock prices. The creation of a framework for storing news articles and collecting information for specific domains is an important and untested problem for the Indian stock market. When online news portals produce financial news articles about many subjects simultaneously, finding the articles that matter to a specific domain is nontrivial. A critical component of such a system should therefore include one module for extracting and storing news articles and another module for classifying these text documents into one or more specific domains. In the current study, we performed extensive experiments to classify financial news articles into four predefined classes: Banking, Non-Banking, Governmental, and Global. The aim of the multi-class classification was to extract the Banking news and its most correlated news articles from the pool of financial news articles scraped from various web news portals. The news articles divided into these classes were imbalanced. Imbalanced data is a major difficulty for most classifier learning algorithms. However, as recent works suggest, class imbalances are not in themselves a problem, and degradation in performance is often correlated with certain variables relevant to data distribution, such as the existence of noisy and ambiguous instances near adjacent class boundaries. A variety of solutions for addressing data imbalance have been proposed recently: over-sampling, down-sampling, and ensemble approaches. We present the various challenges that occur with data imbalance in multiclass classification and solutions for dealing with them.
The paper also compares the performance of various machine learning models on imbalanced data and on data balanced using sampling and ensemble techniques. The results show that the Random Forest classifier, with data balanced using the over-sampling technique SMOTE, performs best in terms of precision, recall, F1, and accuracy. Among the ensemble classifiers, the Balanced Bagging classifier showed results similar to those of the Random Forest classifier with SMOTE. The Random Forest classifier's accuracy, however, was 100%, while the Balanced Bagging classifier's was 99%.
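
The over-sampling step named in the abstract above can be illustrated with a simplified, SMOTE-style sketch. It is not the paper's code and not a library implementation: the function name `smote_like_oversample`, the toy 2-D feature vectors, and the parameters `k` and `seed` are all assumptions made for illustration. Each synthetic minority sample is drawn on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    Hypothetical helper for illustration; requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy data (assumed): six majority samples, three minority samples.
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.4, 0.4], [0.2, 0.5]]
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because the new points are interpolated rather than duplicated, a classifier trained on the balanced set sees a denser minority region instead of repeated copies of the same few documents.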

Re-UNIR

PCA document reconstruction for email classification

Author: Gomez Juan Carlos
Moens Marie-Francine
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that were used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. Thus, the classifier computes the PCA separately for each document class. When a new instance arrives to be classified, it is projected onto the set of PCs computed for each class and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error, or divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g. spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
The publisher is Elsevier, not North-Holland. Status: published
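
The classification-by-reconstruction procedure described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy four-term count vectors, the choice of one component per class, and the use of power iteration to find the components are all assumptions made here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_class_pca(rows, k, iters=100, seed=0):
    """Mean and top-k principal components of one class (power iteration)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Dense covariance matrix; only sensible for small illustrative vocabularies.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        found = True
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            # Remove directions already found so the next component emerges.
            for c in components:
                p = dot(w, c)
                w = [wi - p * ci for wi, ci in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no variance left in the remaining directions
                found = False
                break
            v = [wi / norm for wi in w]
        if not found:
            break
        components.append(v)
    return mean, components


def reconstruction_error(x, mean, components):
    """Squared error after projecting x onto a class's PCs and reconstructing."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    recon = [0.0] * len(x)
    for c in components:
        score = dot(centered, c)
        recon = [ri + score * ci for ri, ci in zip(recon, c)]
    return sum((ci - ri) ** 2 for ci, ri in zip(centered, recon))


def classify(x, models):
    """Assign x to the class whose PCs reconstruct it with the smallest error."""
    return min(models, key=lambda label: reconstruction_error(x, *models[label]))


# Toy term-count vectors (assumed): [meeting, report, free, winner].
training = {
    "ham": [[2, 1, 0, 0], [4, 2, 0, 0], [6, 3, 0, 0], [8, 4, 0, 0]],
    "spam": [[0, 0, 1, 2], [0, 0, 2, 4], [0, 0, 3, 6], [0, 0, 4, 8]],
}
models = {label: fit_class_pca(rows, k=1) for label, rows in training.items()}

print(classify([10, 5, 0, 0], models))
print(classify([0, 0, 5, 10], models))
```

Each class keeps its own mean and components, so a message that follows one class's dominant term pattern is reconstructed almost exactly by that class and poorly by the other.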

Lirias

PCA document reconstruction for email classification

Author: Abu-Nimeh
Anderson
Androutsopoulos
Barman
Berry
Blei
Bratko
Brutlag
Bíró
Carreras
Deerwester
Drucker
Fawcett
Fette
Fisher
Gansterer
Gansterer
Gee
Gomez
Goodman
Guzella
Hoffmann
Hofmann
Hotelling
Janecek
Jolliffe
Juan Carlos Gomez
Kanaris
Kim
Malagón-Borja
Mann
Marie-Francine Moens
Moler
Morita
Pearson
Platt
Robinson
Sculley
Silva
Torkkola
Vidal
Witten
Xia
Yu
Publication venue: 'Elsevier BV'

Crossref