
17 research outputs found

A Supervised Learning Approach for Imbalanced Text Classification of Biomedical Literature Triage

Author: Almeida Hayda
Publication date: 01/04/2015

This thesis presents the development of a machine learning system, called mycoSORT, for supporting the first step of the manual curation process for biological literature, called triage. Manual triage of documents is very demanding, as researchers usually face the time-consuming and error-prone task of screening a large amount of data to identify relevant information. After querying scientific databases for keywords related to a specific subject, researchers generally find a long list of retrieved results that has to be carefully analysed to identify the few documents that show potential relevance to the topic. Such an analysis represents a severe bottleneck in the knowledge discovery and decision-making processes in scientific research. Hence, biocurators could greatly benefit from automatic support when performing the triage task. In order to support the triage of scientific documents, we have used a corpus of document instances manually labeled by biocurators as “selected” or “rejected”, with regard to their potential to indicate relevant information about fungal enzymes. This document collection is large, since many results are retrieved and analysed to finally identify potential candidate documents, and also highly imbalanced in the distribution of instances per relevance class: the great majority of documents are labeled as rejected, while only a very small portion are labeled as selected. Using this dataset, we studied the design of a classification model to identify the most discriminative features to automate the triage of scientific literature and to tackle the imbalance between the two classes of documents. To identify the most suitable model, we performed a study of 324 classification models, which demonstrated the results of using 9 different data undersampling factors, 4 sets of features, and the evaluation of 2 feature selection methods as well as 3 machine learning algorithms.
Our results demonstrated that the use of an undersampling technique is effective for handling imbalanced datasets and also helps manage large document collections. We also found that the combination of undersampling and feature selection using Odds Ratio can improve the performance of our classification model. Finally, our results demonstrated that the best-fitting model to support the triage of scientific documents is composed of domain-relevant features, filtered by Odds Ratio scores, the use of dataset undersampling, and the Logistic Model Trees algorithm.
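As a rough illustration of the Odds Ratio filtering mentioned in the abstract, the sketch below scores each term by the log odds ratio of its presence in "selected" versus "rejected" documents. It is not the mycoSORT implementation: the function name `odds_ratio_scores`, the binary term-presence representation, and the 0.5 smoothing constant are all assumptions made for the example.

```python
from math import log

def odds_ratio_scores(docs, labels, positive="selected", smoothing=0.5):
    """Return {term: log odds ratio} based on binary term presence per document.

    docs   -- list of token lists (hypothetical input format)
    labels -- list of class labels aligned with docs
    """
    pos_docs = [set(d) for d, y in zip(docs, labels) if y == positive]
    neg_docs = [set(d) for d, y in zip(docs, labels) if y != positive]
    vocab = set().union(*pos_docs, *neg_docs)

    scores = {}
    for term in vocab:
        pos_with = sum(term in d for d in pos_docs)
        neg_with = sum(term in d for d in neg_docs)
        # Smoothed 2x2 contingency counts; the constant is an assumption
        # added here only to avoid division by zero.
        a = pos_with + smoothing                   # positive docs containing term
        b = len(pos_docs) - pos_with + smoothing   # positive docs without term
        c = neg_with + smoothing                   # negative docs containing term
        d = len(neg_docs) - neg_with + smoothing   # negative docs without term
        scores[term] = log((a * d) / (b * c))
    return scores
```

Terms with a high positive score appear disproportionately in the positive class, so a filter could keep only terms whose score exceeds some threshold; the threshold itself would have to be tuned and is not specified in the abstract.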

Concordia University Research Repository

Machine Learning for Biomedical Literature Triage

Author: Almeida Hayda
Butler Greg
Kosseim Leila
Meurs Marie-Jean
Tsang Adrian
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the best-fitting model for handling the imbalanced datasets of the triage classification task is obtained by using domain-relevant features, an under-sampling technique, and the Logistic Model Trees algorithm.
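The under-sampling step referred to above can be pictured with a minimal sketch that randomly discards part of the majority ("rejected") class. The abstract does not define how its sampling factors are computed, so here `factor` is assumed to be the fraction of majority-class instances that is kept; the function name `undersample` and its parameters are likewise hypothetical.

```python
import random

def undersample(instances, labels, majority="rejected", factor=0.5, seed=0):
    """Keep all minority instances and a random fraction `factor` of the majority class.

    `factor` is an assumed definition: 0.2 means 20% of majority instances are retained.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep_n = round(len(majority_idx) * factor)
    kept = set(rng.sample(majority_idx, keep_n))
    # Preserve the original ordering of the retained instances.
    selected = [i for i, y in enumerate(labels) if y != majority or i in kept]
    return [instances[i] for i in selected], [labels[i] for i in selected]
```

In a setup like this, the reduced set would typically be used for training only, with evaluation carried out on data that keeps its original class distribution.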

Concordia University Research Repository

Directory of Open Access Journals

PubMed Central

mycoCLAP, the database for characterized lignocellulose-active proteins of fungal origin: resource and text mining curation support

Author: Almeida Hayda
Butler Greg
Kosseim Leila
McDonnell Erin
Meurs Marie-Jean
Nyaga Carol
Powlowski Justin
Strasser Kimchi-Audrey
Tsang Adrian
Wu Min
Wu Sherry
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2015

Enzymes active on components of lignocellulosic biomass are used for industrial applications ranging from food processing to biofuels production. These include a diverse array of glycoside hydrolases, carbohydrate esterases, polysaccharide lyases and oxidoreductases. Fungi are prolific producers of these enzymes, spurring fungal genome sequencing efforts to identify and catalogue the genes that encode them. To facilitate the functional annotation of these genes, biochemical data on over 800 fungal lignocellulose-degrading enzymes have been collected from the literature and organized into the searchable database, mycoCLAP (http://mycoclap.fungalgenomics.ca). First implemented in 2011, and updated as described here, mycoCLAP is capable of ranking search results according to closest biochemically characterized homologues: this improves the quality of the annotation, and significantly decreases the time required to annotate novel sequences. The database is freely available to the scientific community, as are the open source applications based on natural language processing developed to support the manual curation of mycoCLAP. Database URL: http://mycoclap.fungalgenomics.ca

Crossref

Concordia University Research Repository

PubMed Central

mycoSet Feature Vector Representation.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Feature Occurrence Represented in the Feature Vector.

The Francis Crick Institute

mycoSORT Results - Set of Features F5.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Results of Positive Class on Feature Setting #3, Using Only Bag-of-Words as Features.

The Francis Crick Institute

mycoSORT F-measure scores.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Results of the Best Classifiers for Each Classification Model.

The Francis Crick Institute

mycoSORT F-2 scores.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Results of the Best Classifiers for Each Classification Model.

The Francis Crick Institute

mycoSORT Results - Set of Features F1+F2+F3+F4.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Results of Positive Class on Feature Setting #4, Using Bio-entities, Content and EC Numbers as Features.

The Francis Crick Institute

mycoSet Confusion Matrix.

Author: Adrian Tsang (9935)
Greg Butler (27859)
Hayda Almeida (680542)
Leila Kosseim (680544)
Marie-Jean Meurs (680543)

Confusion Matrix of a Binary Classification.

The Francis Crick Institute
