
7 research outputs found

Machine Learning for Biomedical Literature Triage

Authors: Almeida Hayda; Butler Greg; Kosseim Leila; Meurs Marie-Jean; Tsang Adrian
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2014

This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the best-fitting model for handling the imbalanced datasets of the triage classification task is obtained by using domain-relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.

Concordia University Research Repository

Directory of Open Access Journals

PubMed Central
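
The under-sampling technique named in the abstract above can be illustrated with a minimal sketch. It assumes plain random under-sampling of the majority class; the function name `undersample`, the `keep_fraction` parameter, and the toy label counts are hypothetical and are not taken from the paper, which may define its sampling factors differently.

```python
import random


def undersample(labels, keep_fraction, majority_label="rejected", seed=0):
    """Return sorted indices that keep every minority instance and a
    random subset of the majority class.

    keep_fraction is a hypothetical parameter: the share of majority-class
    instances retained (the paper's own sampling factors may differ).
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = round(len(majority) * keep_fraction)
    kept = rng.sample(majority, n_keep)  # sample without replacement
    return sorted(minority + kept)


# Toy imbalanced collection: 90 rejected documents, 10 selected ones.
labels = ["rejected"] * 90 + ["selected"] * 10
idx = undersample(labels, keep_fraction=0.2)
reduced = [labels[i] for i in idx]
```

Only the majority class is thinned, so every "selected" document survives while the class ratio moves closer to balance.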

Comparative Evaluation of Sentiment Analysis Methods Across Arabic Dialects

Authors: Aoun Rita; Baly Ramy; El-Hajj Wassim; El-Khoury Georges; Hajj Hazem; Moukalled Rawan; Shaban Khaled Bashir
Publication venue: Elsevier BV
Publication date: 01/01/2017

Sentiment analysis in Arabic is challenging due to the complex morphology of the language. The task becomes more challenging when considering Twitter data that contain significant amounts of noise, such as the use of Arabizi, code-switching and different dialects that vary significantly across the Arab world, the use of non-textual objects to express sentiments, and the frequent occurrence of misspellings and grammatical mistakes. Modeling sentiment in Twitter should become easier when we understand the characteristics of Twitter data and how its usage varies from one Arab region to another. We describe our effort to create the first Multi-Dialect Arabic Sentiment Twitter Dataset (MD-ArSenTD), which is composed of tweets collected from 12 Arab countries, annotated for sentiment and dialect. We use this dataset to analyze tweets collected from Egypt and the United Arab Emirates (UAE), with the aim of discovering distinctive features that may facilitate sentiment analysis. We also perform a comparative evaluation of different sentiment models on Egyptian and UAE tweets. These models are based on feature engineering and deep learning, and have already achieved state-of-the-art accuracies in English sentiment analysis. Results indicate the superior performance of deep learning models, the importance of morphological features in Arabic NLP, and that handling dialectal Arabic leads to different outcomes depending on the country from which the tweets are collected. This work was made possible by NPRP 6-716-1-138 grant from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Qatar University Institutional Repository

An empirical study to address the problem of Unbalanced Data Sets in sentiment classification

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

SENTIMENT ANALYSIS FOR SPORTS FANATICISM IN ARABIC SOCIAL MEDIA TEXT

KFUPM ePrints

A Novel Data Mining and Knowledge Discovery Framework for Digital Library Recommendations System based on User’s Feedback and Personalization

Author: Almaghrabi Maram
Publication date: 01/01/2021

University of Canberra Research Repository

A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage

Author: Almeida Hayda
Publication date: 01/04/2015

This thesis presents the development of a machine learning system, called mycoSORT, for supporting the first step of the biological literature manual curation process, called triage. The manual triage of documents is very demanding, as researchers usually face the time-consuming and error-prone task of screening a large amount of data to identify relevant information. After querying scientific databases for keywords related to a specific subject, researchers generally find a long list of retrieved results that has to be carefully analysed to identify only a few documents that show a potential of being relevant to the topic. Such an analysis represents a severe bottleneck in the knowledge discovery and decision-making processes in scientific research. Hence, biocurators could greatly benefit from automatic support when performing the triage task. In order to support the triage of scientific documents, we have used a corpus of document instances manually labeled by biocurators as “selected” or “rejected”, with regard to their potential to indicate relevant information about fungal enzymes. This document collection is characterized by being large, since many results are retrieved and analysed to finally identify potential candidate documents, and also highly imbalanced with respect to the distribution of instances per relevance: the great majority of documents are labeled as rejected, while only a very small portion are labeled as selected. Using this dataset, we studied the design of a classification model to identify the most discriminative features to automate the triage of scientific literature and to tackle the imbalance between the two classes of documents. To identify the most suitable model, we performed a study of 324 classification models, which demonstrated the results of using 9 different data undersampling factors, 4 sets of features, and the evaluation of 2 feature selection methods as well as 3 machine learning algorithms.
Our results demonstrated that the use of an undersampling technique is effective for handling imbalanced datasets and also helps manage large document collections. We also found that the combination of undersampling and feature selection using Odds Ratio can improve the performance of our classification model. Finally, our results demonstrated that the best-fitting model to support the triage of scientific documents is composed of domain-relevant features, filtered by Odds Ratio scores, the use of dataset undersampling and the Logistic Model Trees algorithm.

Concordia University Research Repository
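
The Odds Ratio feature filtering mentioned in the thesis abstract above can be sketched as follows. This is an illustrative version only, assuming the common log odds ratio over document frequencies with additive smoothing; the function name `odds_ratio_scores`, the `smoothing` value, and the toy documents are hypothetical, and the thesis may compute or threshold the score differently.

```python
import math


def odds_ratio_scores(docs, labels, positive="selected", smoothing=0.5):
    """Score each term by the log odds ratio of appearing in positive
    versus negative documents (document-level presence, smoothed)."""
    pos_docs = [set(d) for d, y in zip(docs, labels) if y == positive]
    neg_docs = [set(d) for d, y in zip(docs, labels) if y != positive]
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for term in vocab:
        # Smoothed probability that a document of each class contains the term.
        p_pos = (sum(term in d for d in pos_docs) + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (sum(term in d for d in neg_docs) + smoothing) / (len(neg_docs) + 2 * smoothing)
        scores[term] = math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))
    return scores


# Toy tokenised documents (invented for illustration).
docs = [
    ["fungal", "enzyme", "activity"],
    ["enzyme", "kinetics"],
    ["market", "report"],
    ["enzyme", "market"],
    ["weather", "report"],
]
labels = ["selected", "selected", "rejected", "rejected", "rejected"]
scores = odds_ratio_scores(docs, labels)

# Keep the highest-scoring terms as candidate features.
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Terms concentrated in "selected" documents receive positive scores and terms concentrated in "rejected" documents receive negative ones, so a score cut-off or a top-k rule can serve as the filter.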