3 research outputs found

    Performance Analysis of Machine Learning Approaches in Automatic Classification of Arabic Language

    Text classification (TC) is a crucial subject, given the enormous number of digital files available on the internet. The goal of TC is to assign texts to a set of predetermined categories. Far more studies have been conducted on English datasets than on Arabic ones. Therefore, this research analyzes the performance of automatic TC of the Arabic language using Machine Learning (ML) approaches. Further, the Single-label Arabic News Articles Datasets (SANAD) are introduced, which comprise three different datasets, namely Akhbarona, Khaleej, and Arabiya. Initially, the collected texts are pre-processed by tokenization and stemming. Three stemming settings are employed, namely light stemming, Khoja stemming, and no stemming, to evaluate the effect of the pre-processing technique on Arabic TC performance. Feature extraction and feature weighting are then performed; term weighting uses the term frequency-inverse document frequency (tf-idf) method. In addition, this research selects C4.5, Support Vector Machine (SVM), and Naïve Bayes (NB) as classification algorithms. The results indicated that the SVM and NB methods attained higher accuracy than the C4.5 method. NB achieved the maximum accuracy, 99.9%.
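The tf-idf term weighting mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the common formulation tf × log(N / df); the abstract does not say which tf-idf variant the authors used. The whitespace tokenizer, the omission of stemming, and the English toy documents are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only; stemming is deliberately omitted here.
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Toy documents (illustrative only, not taken from SANAD).
docs = ["economy market growth", "sports match goal", "market sports news"]
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document receives weight zero, while a term confined to one document receives the largest weight, which is the behaviour that makes tf-idf useful as a feature weighting step before a classifier such as SVM or NB.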

    A Method of Combining Machine Learning Algorithms with Different Optimization Policies for Accurate Data Classification

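    Data classification is the process of correctly determining the class to which each instance belongs. It is useful and is applied across a wide range of fields, including medicine, finance, education, and business. However, one problem with using only a single type of classifier is that its accuracy may be insufficient. For this reason, methods that combine multiple classifiers have been proposed to improve classification performance. Such combinations are designed to exploit the strengths of each classifier, but their accuracy is still not always sufficient. This paper proposes a design method for classification in which two classifiers, optimized under different policies, are combined in series. In the proposed method, the first classifier is optimized to reduce missed detections (false negatives), and the second classifier is optimized to reduce false detections (false positives). The evaluation experiments used the Wisconsin Breast Cancer (WBC) dataset and the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which are the datasets most commonly used by researchers for breast cancer diagnosis. For comparison, the same datasets were used to evaluate two single classifiers, Sequential Minimal Optimization (SMO) and Naive Bayes (NB), as well as an existing classifier-combination method (using WEKA's “Vote class”). The experimental results show that the proposed method improved accuracy over a single classifier, achieving accuracies of 99.13% and 98.49%, respectively, on the WBC datasets. Furthermore, since the existing classifier-combination method achieved accuracies of 96.99% and 97.53%, respectively, the usefulness of the proposed method was demonstrated. The University of Electro-Communications 201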
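The idea of chaining two classifiers tuned under different policies can be sketched as follows. This is only one plausible reading of the abstract, not the authors' design: it assumes that a negative from the first stage is final and that the second stage re-examines only the first stage's positives. The threshold classifiers and toy samples are hypothetical stand-ins for trained models and real data.

```python
def make_threshold_classifier(feature_index, threshold):
    """Hypothetical stand-in for a trained classifier: predicts positive
    when one feature exceeds a threshold."""
    def predict(x):
        return x[feature_index] > threshold
    return predict


def cascade_predict(first, second, x):
    # Stage 1 is tuned for few false negatives, so its negatives are trusted.
    if not first(x):
        return False
    # Stage 2 is tuned for few false positives and re-checks stage-1 positives.
    return second(x)


def confusion_counts(predict, samples):
    """Return (tp, fp, fn, tn) over (features, label) pairs."""
    tp = fp = fn = tn = 0
    for x, label in samples:
        pred = predict(x)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy samples (illustrative only, not WBC/WDBC data): ((feature_a, feature_b), label)
samples = [
    ((0.9, 0.9), True),
    ((0.5, 0.8), True),
    ((0.4, 0.7), True),
    ((0.6, 0.2), False),
    ((0.1, 0.9), False),
    ((0.2, 0.1), False),
]

first = make_threshold_classifier(0, 0.3)   # permissive: aims to miss no positives
second = make_threshold_classifier(1, 0.6)  # strict: aims to reject false alarms

first_only = confusion_counts(first, samples)
cascade = confusion_counts(lambda x: cascade_predict(first, second, x), samples)
```

On this toy data the first stage alone produces one false positive, and the cascade removes it without introducing a false negative. Real data would not separate this cleanly; the sketch only shows the control flow of a serial combination.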

    Detecting cyberstalking from social media platform(s) using data mining analytics

    Cybercrime is an increasing activity that can lead to cyberstalking, which makes the use of data mining algorithms to detect or prevent cyberstalking on social media platforms imperative. The aim of this study was to determine the prevalence of cyberstalking on social media platforms, using Twitter. To achieve this objective, machine learning models that perform data mining were used alongside security metrics to detect cyberstalking on social media platforms. The derived security metrics were used to flag any suspicious cyberstalking content. Two datasets of detailed tweets were analysed using NVivo and R. The dominant occurrence of cyberstalking was assessed using fifteen unigrams identified from the preliminary dataset: “abuse”, “annoying”, “creep or creepy”, “fear”, “follow or followers”, “gender”, “harassment”, “messaging”, “relationships p/p”, “scared”, “stalker”, “technology”, “unwanted”, “victim”, and “violent”. Ordinal regression was used to analyse the use of the fifteen unigrams, which were categorised according to their degree of relationship or link to cyberstalking on Twitter. Moreover, two lightweight machine learning algorithms were used to demonstrate model performance on cyberstalking-indicative content. K Nearest Neighbour (KNN) and K Means Clustering were both coded in R for the extraction, refinement, analysis and visualisation steps of this research. Results showed that emotional terms like “bad”, “sad” and “hate” were attached to the unigrams linked to cyberstalking. Each emotional term was flagged in correspondence with one of the fifteen unigrams in tweets with cyberstalking-indicative content, proving that one must accompany the other. K Means Clustering results showed that the terms “bad” and “sad” appeared in 100 percent of the clustering results, whereas the term “hate” appeared in only 60 percent.
    Results also revealed that the accuracy of the KNN algorithm was up to 40% in predicting key-term-based cyberstalking content in a real Twitter dataset consisting of 1m data points. This study emphasises the continuous relationship between the fifteen unigrams, emotional terms, and tweets across the datasets used in this research, and reveals a general picture that cyberstalking-indicative content does occur on Twitter at a vast rate, with corresponding links or relationships relevant to the detection of cyberstalking.
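A K Nearest Neighbour classifier over unigram counts, of the general kind this abstract describes, might look like the sketch below. The study coded its models in R; this Python version is a hypothetical illustration rather than the authors' implementation. It uses only a subset of the listed unigrams, the training texts and the labels "indicative" and "other" are invented toy data, and squared Euclidean distance with majority voting is an assumed design choice.

```python
from collections import Counter

# A subset of the fifteen unigrams listed in the abstract.
UNIGRAMS = ["abuse", "fear", "harassment", "scared", "stalker", "unwanted", "victim"]


def featurize(text):
    # Represent a text by how often each tracked unigram occurs in it.
    counts = Counter(text.lower().split())
    return [counts[u] for u in UNIGRAMS]


def knn_predict(train, text, k=3):
    """Majority vote among the k training texts nearest to `text`.

    train: list of (text, label) pairs.
    Distance: squared Euclidean distance between unigram-count vectors.
    """
    x = featurize(text)
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(featurize(t), x)), i)
        for i, (t, _) in enumerate(train)
    )
    votes = Counter(train[i][1] for _, i in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training data (invented for illustration, not from the study).
TRAIN = [
    ("that stalker makes me scared", "indicative"),
    ("unwanted harassment from a stalker", "indicative"),
    ("victim of abuse and fear", "indicative"),
    ("great match last night", "other"),
    ("new phone arrived today", "other"),
    ("coffee with friends", "other"),
]

label = knn_predict(TRAIN, "a stalker left unwanted messages")
```

Texts containing none of the tracked unigrams all map to the zero vector, so a keyword-count representation like this cannot distinguish between them. That is one limitation to keep in mind when interpreting accuracy figures for key-term-based detection.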