    Multi-Category Support Vector Machines for Identifying Arabic Topics

    Support Vector Machines (SVM) were designed for binary classification. Nevertheless, it would be fruitful to extend them to what is called multi-category classification. For this reason, Multi-category Support Vector Machines (MSVM) are currently the subject of much serious research aiming to achieve high performance on multi-category classification tasks. The technique has recently been assessed in fields such as text categorization and cancer classification. Note that experiments conducted so far with MSVM have been limited to small data sets, because its computation is more expensive. In this paper, we apply the method to topic identification for the first time. The experiments concern topic identification for the Arabic language, with corpora extracted from the ALWATAN newspaper. The results show that MSVM improves on the baseline SVM method. Nevertheless, SVM still outperforms MSVM when larger vocabulary sizes are used.

    TR-Classifier and kNN Evaluation for Topic Identification tasks

    This paper studies topic identification for the Arabic language using two methods. The first is the well-known kNN (k Nearest Neighbors), which serves as the baseline. The second is the TR-Classifier, which is mainly based on computing triggers. The experiments show that the TR-Classifier outperforms kNN while using much smaller topic vocabularies. TR-Classifier performance improves when the number of triggers and the size of the topic vocabularies are increased jointly. Note that the TR-Classifier uses topic vocabularies, whereas kNN requires a general vocabulary, obtained by concatenating those used by the TR-Classifier. In addition to the standard Recall and Precision measures used for evaluation, we have drawn ROC curves for some topics to illustrate more clearly the difference in performance between the two classifiers. The corpus used in our experiments was downloaded from an online Arabic newspaper. Its size is about 10 million words, distributed over six selected topics: culture, religion, economy, local news, international news and sports.
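
The Recall and Precision measures mentioned in this abstract can be illustrated per topic with a minimal sketch. It assumes a one-vs-rest reading of each topic; the function name and the sample labels are hypothetical and are not taken from the paper.

```python
def precision_recall(true_topics, predicted_topics, topic):
    """Return (precision, recall) for one topic, treated one-vs-rest."""
    pairs = list(zip(true_topics, predicted_topics))
    # True positives: documents of this topic that were labeled with it.
    tp = sum(1 for t, p in pairs if t == topic and p == topic)
    # False positives: documents of other topics labeled with this topic.
    fp = sum(1 for t, p in pairs if t != topic and p == topic)
    # False negatives: documents of this topic labeled with another topic.
    fn = sum(1 for t, p in pairs if t == topic and p != topic)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, for illustration only (not data from the paper).
true_topics = ["sports", "sports", "economy", "culture", "sports"]
predicted_topics = ["sports", "economy", "economy", "culture", "sports"]

for topic in ("sports", "economy", "culture"):
    print(topic, precision_recall(true_topics, predicted_topics, topic))
```

Computing the two measures per topic, rather than as one global figure, matches the abstract's topic-by-topic comparison of the two classifiers.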

    Comparing TR-Classifier and KNN by using Reduced Sizes of Vocabularies

    The aim of this study is topic identification using two methods: a new one that we have proposed, the TR-Classifier, which is based on computing triggers, and the well-known k Nearest Neighbors (kNN). Performance is acceptable, particularly for the TR-Classifier, even though we used reduced vocabulary sizes. For the TR-Classifier, each topic is represented by a vocabulary built from the corresponding training corpus, whereas the kNN method uses a general vocabulary, obtained by concatenating those used by the TR-Classifier. For the evaluation task, six topics were selected for identification: culture, religion, economy, local news, international news and sports. An Arabic corpus was used for the experiments.
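
The relation between the per-topic vocabularies and the general vocabulary described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how words are selected for a topic vocabulary, so the sketch assumes the most frequent words of the topic's training corpus; it also assumes repeated words are dropped when the vocabularies are concatenated. All names and the toy English documents are hypothetical.

```python
from collections import Counter


def topic_vocabulary(documents, size):
    """Assumed criterion: the `size` most frequent words of one topic's training documents."""
    counts = Counter(word for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(size)]


def general_vocabulary(topic_vocabularies):
    """Concatenate the topic vocabularies; repeated words are kept once (assumption)."""
    seen = {}
    for vocab in topic_vocabularies.values():
        for word in vocab:
            seen.setdefault(word, None)  # dict preserves first-seen order
    return list(seen)


# Toy training corpora, for illustration only (not data from the paper).
training = {
    "sports": ["match goal match", "goal team"],
    "economy": ["market bank market", "bank team"],
}

topic_vocabs = {topic: topic_vocabulary(docs, size=3) for topic, docs in training.items()}
general = general_vocabulary(topic_vocabs)
print(topic_vocabs)
print(general)
```

With this construction the general vocabulary can never be larger than the sum of the topic vocabulary sizes, so reducing the topic vocabularies also reduces the vocabulary that kNN works with.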

    Evaluation of Topic Identification Methods on Arabic Corpora

    Topic identification is one of the important keys to the success of many applications. However, few works in this field address the Arabic language, owing to the lack of standard corpora. In this study, we provide directly comparable results for six text categorization methods on a new Arabic corpus, Alwatan-2004. Topic Unigram Language Model (TULM), Term Frequency/Inverse Document Frequency (TFIDF), Neural Network, SVM, M-SVM and the TR-Classifier (TR) were evaluated. The experiments show that the TR-Classifier is the most efficient of these classifiers, with the sole exception of binary SVM, which outperformed it thanks to its characteristics. Moreover, the Alwatan-2004 corpus used in our experiments is considered the largest Arabic corpus used so far for topic identification experiments. In addition, by using small vocabulary sizes we aim to reduce computation time. This is important for adaptive language modeling, particularly topic adaptation, which is required in real-time applications such as speech recognition and machine translation systems. Our experiments indicate that the results are better than those of other works dealing with Arabic text categorization.