3,788 research outputs found

    An Intelligent System For Arabic Text Categorization

    Text categorization (classification) is the process of assigning documents to a predefined set of categories based on their content. This paper presents an intelligent Arabic text categorization system built on machine learning algorithms. Several algorithms for stemming and feature selection are tried. The documents are represented using several term weighting schemes, and the k-nearest neighbor and Rocchio classifiers are used for classification. Experiments are performed on a self-collected data corpus, and the results show that the suggested hybrid of statistical and light stemmers is the most suitable stemming algorithm for the Arabic language. The results also show that a hybrid approach of document frequency and information gain is the preferable feature selection criterion and that normalized tf-idf is the best weighting scheme. Finally, the Rocchio classifier has the advantage over the k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is an efficient method and gives a generalization accuracy of about 98%.
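    The abstract above names normalized tf-idf weighting and the Rocchio classifier without giving details. The following is a minimal sketch of that combination, not the authors' implementation: it assumes the simple centroid form of Rocchio (no negative-prototype term), an idf of log(N/df) + 1, and cosine similarity. The function names and the toy English tokens are hypothetical placeholders; stemming and feature selection are omitted.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Inverse document frequency per term, learned from tokenized training docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def normalized_tfidf(doc, idf):
    """L2-normalized tf-idf vector (as a dict); terms unseen in training are dropped."""
    counts = Counter(term for term in doc if term in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def train_rocchio(docs, labels):
    """Build one centroid per class by averaging its documents' vectors."""
    idf = fit_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter(labels)
    for doc, label in zip(docs, labels):
        for term, w in normalized_tfidf(doc, idf).items():
            sums[label][term] += w
    centroids = {
        label: {term: w / sizes[label] for term, w in terms.items()}
        for label, terms in sums.items()
    }
    return idf, centroids


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, idf, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = normalized_tfidf(doc, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy, made-up corpus (already tokenized); not data from the paper.
train_docs = [
    ["goal", "match", "team"],
    ["team", "player", "goal"],
    ["market", "stock", "price"],
    ["price", "bank", "market"],
]
train_labels = ["sports", "sports", "economy", "economy"]

idf, centroids = train_rocchio(train_docs, train_labels)
sports_guess = classify(["player", "match"], idf, centroids)
economy_guess = classify(["stock", "bank"], idf, centroids)
```

    Because each class is reduced to a single centroid, classification cost grows with the number of classes rather than with the number of training documents, which is one practical difference from k-nearest neighbor.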

    Sentiment Analysis in Karonese Tweet using Machine Learning

    Recently, many social media users have expressed their conditions, ideas, and emotions in local languages on social media, for example via tweets or status updates. Because of the large volume of text, sentiment analysis is used to identify opinions, ideas, or thoughts from social media. Sentiment analysis research has also been widely applied to local languages. Karonese is one of the largest local languages in North Sumatera, Indonesia, and Karo society actively uses the language on Twitter. This study contributes two things: a Karonese tweet dataset for classification, and a sentiment analysis of Karonese. Several machine learning algorithms are implemented in this research, namely logistic regression, Naive Bayes, k-nearest neighbor, and Support Vector Machine (SVM). The Karonese tweets were obtained from Twitter timelines based on several keywords and hashtags. Transcribers drawn from Karo ethnic figures helped annotate the tweets into three classes: positive, negative, and neutral. To get the best model, several scenarios were run with various compositions of training and test data. The SVM algorithm has higher accuracy, precision, recall, and F1 scores than the others. As this is preliminary research on sentiment analysis of the Karonese language, much future work remains for improvement.
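    The abstract above compares models by accuracy, precision, recall, and F1 over three sentiment classes. As an illustration of how those four numbers can be computed, here is a minimal sketch that assumes macro averaging (an unweighted mean over classes); the abstract does not state which averaging the study used, and the labels below are made up.

```python
def macro_scores(y_true, y_pred, classes):
    """Return (accuracy, (macro_precision, macro_recall, macro_f1))."""
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # predicted c, was not c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Macro average: every class counts equally, regardless of its size.
    macro = tuple(sum(s[i] for s in per_class) / len(classes) for i in range(3))
    return accuracy, macro


# Hypothetical gold labels and predictions for six tweets.
classes = ["positive", "negative", "neutral"]
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "neutral", "negative", "negative", "positive"]

accuracy, (macro_p, macro_r, macro_f1) = macro_scores(y_true, y_pred, classes)
```

    Macro averaging is a common choice when classes are imbalanced, since a model cannot score well by favoring the majority class alone.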

    Analysis and Implementation Machine Learning for YouTube Data Classification by Comparing the Performance of Classification Algorithms

    Every day, people around the world upload 1.2 million videos to YouTube, or more than 100 hours per minute, and this number is increasing. This continuous stream of data is useless if it is not put to further use. Data mining is one way to extract information from large-scale data, and classification is one of its techniques. For most YouTube users, the video titles returned by a search do not match the desired video category. Therefore, this research was conducted to classify YouTube data based on its search text. This article focuses on comparing three algorithms for classifying YouTube data into the Kesenian (Arts) and Sains (Science) categories. The data were collected by scraping the YouTube website for links, titles, descriptions, and searches. The research follows an experimental method consisting of data collection, data processing, model proposal, testing, and model evaluation. The models applied are Random Forest, SVM, and Naïve Bayes. The results showed that the accuracy of the Random Forest model was better by 0.004% when the label encoder was not applied to the target class, and that the label encoder had no effect on the accuracy of the classification models. The most appropriate model for classifying the YouTube data collected in this study is Naïve Bayes, with an accuracy of 88% and an average precision of 90%.

    Comparative Analysis of KNN, Naïve Bayes and SVM Algorithms for Movie Genres Classification Based on Synopsis.

    Text classification is the process of assigning a text to the correct label. It is a challenging task in natural language processing that requires accuracy to get correct results. Manual text classification tends to be inefficient because it requires a lot of time as well as experts. Using machine learning for automatic text classification can be a solution to this problem. KNN, Naive Bayes, and SVM are among the most commonly used algorithms for classification problems, especially text classification. In this study, we compare the KNN, Naive Bayes, and SVM algorithms on the problem of classifying movie genres based on a synopsis, using datasets obtained from Kaggle.com and the IMDB Dataset. The results indicate that, across the 12 experiments, Support Vector Machine (SVM) is the best-performing algorithm, with accuracies of 90%, 93%, 65%, and 63%. It is hoped that this research can help determine the best algorithm for text classification.
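    Of the three algorithms compared above, KNN is the simplest to show end to end. The sketch below is an illustration under stated assumptions, not the study's pipeline: it assumes raw bag-of-words counts, cosine similarity, whitespace tokenization, and a majority vote among the k closest training synopses. The synopses and genre labels are invented for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, training, k=3):
    """Majority label among the k training texts most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        training,
        key=lambda item: cosine(q, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented (synopsis, genre) pairs; not taken from the Kaggle or IMDB data.
training = [
    ("a detective hunts a killer in the city", "thriller"),
    ("a killer stalks a detective at night", "thriller"),
    ("two friends fall in love in paris", "romance"),
    ("a wedding brings old love back", "romance"),
    ("love letters reunite two friends", "romance"),
]

predicted_genre = knn_predict("the detective chases the killer", training, k=3)
```

    KNN stores the whole training set and defers all work to prediction time, so its cost per query grows with corpus size; that trade-off is part of what a comparison against Naive Bayes and SVM measures.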

    Deep Learning for Data Privacy Classification

    The ubiquity of electronic services and communication has allowed organizations to collect increasingly large volumes of data on private citizens. As this trend continues, more advanced and automated methods are required to protect the privacy of these individuals. This project explores a number of machine learning techniques for classifying arbitrary text documents into three distinct privacy tiers: non-personal information, personal information, and sensitive personal information. We find that applying feed-forward neural networks to bag-of-words representations of documents achieves the best performance while ensuring low training and prediction times.