115,219 research outputs found

    KLASIFIKASI DOKUMEN BERITA KECELAKAAN TRANSPORTASI BERBAHASA INDONESIA MENGGUNAKAN METODE SUPPORT VECTOR MACHINE

    According to data from the Indonesian State Intelligence Agency (BIN), transportation accidents are the third-largest cause of death in Indonesia, after coronary heart disease and tuberculosis. Because accident rates are high, news about transportation accidents appears frequently on Indonesian-language online news portals. The category of each published news item is currently assigned manually, so a system that classifies news automatically is needed. In this research, text-mining classification methods are applied to process and analyze web data on the topic of transportation accidents. The classifier is a Support Vector Machine (SVM) used with three multiclass approaches: one-against-all, one-against-one, and Directed Acyclic Graph Support Vector Machine (DAGSVM). The three approaches are compared, and the best is the one whose f-measure is closest to 1. Web data are classified into four categories: land accidents, sea accidents, air accidents, and others. The research consists of data collection, data cleaning, n-gram dictionary construction, feature generation, model building, and classification. The training data comprise 10,948 web pages; the test data comprise 550 web pages, of which 500 are labeled and 50 are unlabeled. Three feature patterns are each generated from two kinds of dictionary (with and without stopwords), yielding six different datasets. In model testing, DAGSVM is the best of the three approaches, with the highest f-measure, 0.958. DAGSVM also has the best f-measure on the labeled test data, so it is then applied to the unlabeled data to determine the category of each news item.
    Keywords: web classification, SVM multiclass, one-against-all, one-against-one, DAGSVM
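
    The difference between one-against-one voting and DAGSVM evaluation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pairwise decider is a hypothetical stand-in for trained binary SVMs (it simply compares made-up per-class scores), and the category names are shortened from the four categories in the abstract.

```python
from collections import Counter
from itertools import combinations

# Shortened names for the four categories in the abstract.
CLASSES = ["land", "sea", "air", "other"]


def make_pairwise(scores):
    """Build a toy pairwise decider (hypothetical stand-in for binary SVMs).

    A real system would call the binary SVM trained on classes a and b;
    here the class with the higher made-up score wins.
    """
    def decide(a, b):
        return a if scores[a] >= scores[b] else b
    return decide


def one_against_one(classes, decide):
    """Evaluate every pair of classes and return the class with most votes."""
    votes = Counter(decide(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def dag_predict(classes, decide):
    """Walk a decision DAG: each pairwise test eliminates one class.

    Only len(classes) - 1 pairwise evaluations are needed, compared with
    len(classes) * (len(classes) - 1) / 2 for one-against-one voting.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if decide(first, last) == first:
            remaining.pop()       # the last class is eliminated
        else:
            remaining.pop(0)      # the first class is eliminated
    return remaining[0]


# Made-up scores for a single document, for illustration only.
scores = {"land": 0.2, "sea": 0.9, "air": 0.4, "other": 0.1}
decide = make_pairwise(scores)
print(one_against_one(CLASSES, decide))
print(dag_predict(CLASSES, decide))
```

    With four classes, the voting scheme evaluates six pairwise classifiers per document, while the DAG walk evaluates three.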

    Named entity recognition using a new fuzzy support vector machine.

    Recognizing and extracting named entities such as persons, locations, organizations, dates, and times is very useful for mining information from electronic resources and text. Learning to extract these types of data is called the Named Entity Recognition (NER) task. Accurate named entity recognition and extraction is important for many active research areas, such as question answering and summarization systems, information retrieval and information extraction, machine translation, video annotation, semantic web search, and bioinformatics, especially the identification of gene, protein, and DNA names. Researchers currently use three types of approach to identify names: rule-based NER, machine-learning-based NER, and hybrid NER. Machine learning methods are more popular and more widely applicable than the others because they are more portable and domain independent. Machine learning algorithms used in NER include the Support Vector Machine (SVM), the Hidden Markov Model, the Maximum Entropy Model (MEM), and the decision tree. In this paper, we review these methods and compare their recognition precision and portability, using the Message Understanding Conference (MUC) named entity definition and its standard data set, to identify the strengths and weaknesses of each method. We improve the precision of NER from text with a newly proposed method called FSVM for NER. The method employs the Support Vector Machine, one of the best machine learning algorithms for classification, and contributes a new fuzzy membership function that addresses the SVM's weaknesses in NER precision and multiclass classification. The design is a kind of one-against-all multiclass technique that overcomes the binary nature of the traditional SVM classifier.

    Stemming text-based web page classification using machine learning algorithms: a comparison

    The aim of this research is to determine the effect of word stemming on web page classification using different machine learning classifiers, namely Naive Bayes (NB), k-Nearest Neighbour (k-NN), Support Vector Machine (SVM), and Multilayer Perceptron (MP). Each classifier's performance is evaluated in terms of accuracy and processing time. The research uses the BBC dataset, which has five predefined categories. The results show that the classifiers perform better without word stemming: all classifiers show higher classification accuracy, with the highest produced by NB and SVM at an F1 score of 97%, while NB takes less training time than SVM. With word stemming, the effect on training and classification time is negligible, except for the Multilayer Perceptron, where word stemming effectively reduces the training time.

    Semantic Learning and Web Image Mining with Image Recognition and Classification

    Image mining is more than just an extension of data mining to the image domain. Web image mining is a technique commonly used to extract knowledge directly from images on the WWW. Because the main targets of conventional web mining are numerical and textual data, web mining for image data is in demand. The Web holds a huge amount of image data as well as text data. However, mining image data from the Web receives less attention than mining text data, because handling the semantics of images is much more difficult. This paper proposes a novel image recognition and image classification technique that uses a large number of images automatically gathered from the Web as learning images. For classification, the system uses the image-feature-based search employed in content-based image retrieval (CBIR), which, unlike conventional image recognition methods, does not restrict the target images, together with a support vector machine (SVM), one of the most efficient and widely used statistical methods for generic image classification, which suits the learning task. Experiments show that the proposed system outperforms some existing search systems.

    Analisis Data Bank Direct Marketing dengan Perbandingan Klasifikasi Data Mining Berbasis Optimize Selection (Evolutionary)

    In determining marketing strategies, a bank classifies its customer database, which is then analyzed by a decision maker. This is not easy, because the sheer volume of data and the many attributes it holds become an obstacle to decision making. This can harm the company's business processes, because it delays the choice of marketing strategy. Data mining methods can classify large data sets and measure the accuracy of that classification. To overcome these problems, the database must be analyzed to determine the accuracy of the classification of the company's data. In this study, classification is carried out on the Bank Direct Marketing dataset taken from the UCI Machine Learning Repository website, using the Naïve Bayes, K-Nearest Neighbor, and Support Vector Machine algorithms with Optimize Selection (Evolutionary) optimization. The calculations are run in a data mining application, RapidMiner 5.3, to find the algorithm with the highest accuracy, and are tested with 10-fold cross validation. With Optimize Selection (Evolutionary) optimization, the highest accuracy was obtained by the Naïve Bayes algorithm at 90.18%, followed by Support Vector Machine at 89.40% and K-Nearest Neighbor at 86.66%.

    Aspect term extraction for sentiment analysis in large movie reviews using Gini Index feature selection method and SVM classifier

    With the rapid development of the World Wide Web, electronic word-of-mouth interaction has made consumers active participants. A large number of reviews posted by consumers on the Web now provide valuable information to other consumers. This information is essential for decision making and is therefore popular among internet users; it is valuable not only to prospective consumers making decisions but also to businesses predicting success and sustainability. In this paper, a Gini Index based feature selection method with a Support Vector Machine (SVM) classifier is proposed for sentiment classification of a large movie review data set. The results show that the Gini Index method gives better classification performance, with a reduced error rate and improved accuracy.
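
    As a rough illustration of Gini Index feature selection, the sketch below scores each term by the purity of its class distribution, Gini(t) = Σ_c P(c|t)², where P(c|t) is the fraction of documents containing term t that belong to class c. This is one common formulation and an assumption here: the abstract does not say which variant the authors use. The tiny review corpus is invented for the example.

```python
from collections import defaultdict


def gini_scores(docs):
    """Score terms by class purity: sum over classes of P(class | term)^2.

    docs is a list of (tokens, label) pairs. A score of 1.0 means the term
    appears in documents of one class only; lower scores mean the term is
    spread across classes and is less useful for classification.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in docs:
        for term in set(tokens):          # document frequency, not term frequency
            counts[term][label] += 1
    scores = {}
    for term, by_label in counts.items():
        total = sum(by_label.values())
        scores[term] = sum((n / total) ** 2 for n in by_label.values())
    return scores


# Invented toy corpus, for illustration only.
docs = [
    (["great", "plot", "movie"], "pos"),
    (["great", "acting", "movie"], "pos"),
    (["boring", "plot", "movie"], "neg"),
    (["boring", "acting", "movie"], "neg"),
]
scores = gini_scores(docs)
# Keep the two highest-scoring terms (ties broken alphabetically).
top = sorted(scores, key=lambda t: (-scores[t], t))[:2]
print(top)
```

    The selected terms would then form the feature set passed to the SVM classifier; terms such as "movie" that occur equally in both classes score lowest and are dropped first.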

    Automatic document classification of biological literature

    Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test set (Reuters 21578) than previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

    A framework for supporting knowledge representation – an ontological based approach

    Dissertation submitted for the degree of Master in Electrical and Computer Engineering. The World Wide Web has had a tremendous impact on society and business in just a few years by making information instantly available. During this transition from physical to electronic means of information transport, the content and encoding of information has remained natural language, identified only by its URL. Today, this is perhaps the most significant obstacle to streamlining business processes via the web. For processes to execute without human intervention, knowledge sources, such as documents, must become more machine understandable and must contain information beyond their main contents and URLs. The Semantic Web is a vision of a future web of machine-understandable data. On a machine-understandable web, programs will be able to determine easily what knowledge sources are about. This work introduces a conceptual framework and its implementation to support the classification and discovery of knowledge sources in line with that vision, where each source's information is structured and represented as a mathematical vector that semantically pinpoints the relevance of the source within each user's domain of interest. The work also addresses the enrichment of such knowledge representations: it uses the statistical relevance of keywords, based on the classical vector space model, and extends it with ontological support, using the concepts and semantic relations contained in a domain-specific ontology to enrich the knowledge sources' semantic vectors. Semantic vectors are compared against each other to obtain their similarity and to better support end users in retrieving knowledge sources.
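
    The comparison of ontology-enriched semantic vectors might look like the sketch below. It assumes cosine similarity as the comparison measure and a simple enrichment rule that passes a fixed share of each keyword's weight to its related concepts; the abstract specifies neither, and the `RELATED` mapping, the `weight` parameter, and the keyword names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def enrich(vector, related, weight=0.5):
    """Add a share of each keyword's weight to its related ontology concepts.

    `related` maps a keyword to concepts linked to it in a domain ontology
    (hypothetical here); `weight` is an assumed damping factor.
    """
    enriched = dict(vector)
    for term, w in vector.items():
        for concept in related.get(term, ()):
            enriched[concept] = enriched.get(concept, 0.0) + weight * w
    return enriched


# Hypothetical ontology fragment: both keywords relate to one concept.
RELATED = {"svm": ["classifier"], "naive_bayes": ["classifier"]}

doc_a = {"svm": 1.0}
doc_b = {"naive_bayes": 1.0}

plain = cosine(doc_a, doc_b)
rich = cosine(enrich(doc_a, RELATED), enrich(doc_b, RELATED))
print(plain, rich)
```

    Two sources that share no keywords have zero similarity under the plain vector space model, but become comparable once both vectors carry weight on a shared ontology concept.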