
16 research outputs found

Joint Entity Extraction and Assertion Detection for Clinical Text

Author: Bhatia Parminder
Celikkaya Busra
Khalilia Mohammed
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially for low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset. Comment: Accepted at the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019).

arXiv.org e-Print Archive

Crossref

Relational data clustering algorithms with biomedical applications

Author: Khalilia Mohammed A.
Publication venue: University of Missouri--Columbia

University of Missouri: MOspace

SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks

Author: Hammouda Tymaa
Jarrar Mustafa
Khalilia Mohammed
Malaysha Sanad
Publication date: 29/10/2023

SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, which are all sense-annotated. The corpus is annotated using two different sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how tokens and senses are associated. Instead of linking a token to only one intended sense, SALMA links a token to multiple senses and provides a score to each sense. A smart web-based annotation tool was developed to support scoring multiple senses against a given word. In addition to sense annotations, we also annotated the corpus using six types of named entities. The quality of our annotations was assessed using various metrics (Kappa, Linear Weighted Kappa, Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error), which show very high inter-annotator agreement. To establish a Word Sense Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word Sense Disambiguation system using Target Sense Verification. We used this system to evaluate three Target Sense Verification models available in the literature. Our best model achieved an accuracy of 84.2% using Modern and 78.7% using Ghani. The full corpus and the annotation tool are open-source and publicly available at https://sina.birzeit.edu/salma/

arXiv.org e-Print Archive

ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic

Author: Birim Ahmet
Erden Mustafa
Ghanem Sana
Jarrar Mustafa
Khalilia Mohammed
Publication date: 29/10/2023

This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain. Our dataset was arabized and localized from the original English Banking77 dataset, which consists of 13,083 queries, into the ArBanking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect, with each query classified into one of the 77 classes (intents). Furthermore, we present a neural model, based on AraBERT and fine-tuned on ArBanking77, which achieved F1-scores of 0.9209 and 0.8995 on MSA and Palestinian dialect, respectively. We performed extensive experimentation in which we simulated low-resource settings, where the model is trained on a subset of the data and augmented with noisy queries to simulate colloquial terms, mistakes and misspellings found in real NLP systems, especially live chat queries. The data and the models are publicly available at https://sina.birzeit.edu/arbanking77

arXiv.org e-Print Archive

Arabic Fine-Grained Entity Recognition

Author: Abdul-Mageed Muhammad
El-Shangiti Ahmed Oumar
Jarrar Mustafa
Khalilia Mohammed
Liqreina Haneen
Publication date: 18/12/2023

Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings (flat NER, nested NER, and nested NER with subtypes), achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/

arXiv.org e-Print Archive
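
The Arabic Fine-Grained Entity Recognition abstract above reports inter-annotator agreement as Cohen's Kappa. As a rough illustration of that metric only, here is a minimal sketch computing kappa for two annotators who label the same items. The function name `cohen_kappa` and the toy labels are hypothetical, and how the authors aligned nested mentions before scoring is not described in the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative sketch only: assumes one label per item per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels, not data from the corpus).
annotator_1 = ["GPE", "LOC", "ORG", "GPE"]
annotator_2 = ["GPE", "LOC", "ORG", "LOC"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa discounts the agreement expected by chance, which is why it is usually reported alongside a raw overlap measure such as F1.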

WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task

Author: Abdul-Mageed Muhammad
Elmadany AbdelRahim
Hamad Nagham
Jarrar Mustafa
Khalilia Mohammed
Omar Alaa'
Talafha Bashar
Publication date: 24/10/2023

We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate meaningful comparisons between different NER approaches. WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in FlatNER, while 8 teams tackled NestedNER. The winning teams achieved F1 scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.

arXiv.org e-Print Archive

Predicting disease risks from highly imbalanced data using random forest

Author: Mihail Popescu
Mohammed Khalilia
Sounak Chakraborty
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: We present a method that uses the Healthcare Cost and Utilization Project (HCUP) dataset to predict the disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. Methods: We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest (RF) classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases. Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. Conclusions: By combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central
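
The random forest abstract above describes repeated random sub-sampling that yields fully balanced sub-samples. The following is a minimal sketch of one way such sub-samples could be drawn, by undersampling every class to the size of the smallest class. The function name `balanced_subsamples`, the `seed` parameter, and the toy data are assumptions for illustration. The abstract does not give the authors' exact procedure, and the step of training one classifier per sub-sample and combining their predictions is omitted here.

```python
import random


def balanced_subsamples(samples, labels, n_subsamples, seed=0):
    """Draw n_subsamples class-balanced subsets by random undersampling.

    Illustrative sketch: each subset holds the same number of items from
    every class, equal to the size of the smallest class.
    """
    rng = random.Random(seed)
    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    minority_size = min(len(items) for items in by_class.values())

    subsamples = []
    for _ in range(n_subsamples):
        subset = []
        for y, items in by_class.items():
            # Sample without replacement within this subset.
            for x in rng.sample(items, minority_size):
                subset.append((x, y))
        rng.shuffle(subset)
        subsamples.append(subset)
    return subsamples


# Toy imbalanced data: 10 positive and 90 negative records (made up).
records = list(range(100))
targets = [1] * 10 + [0] * 90
subsets = balanced_subsamples(records, targets, n_subsamples=5)
```

Each subset could then be used to train a separate classifier, so that every majority-class record has a chance to contribute while no single model sees imbalanced data.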

PDA project [abstract]

Author: Khalilia Mohammed
Publication venue: University of Missouri--Columbia. Office of Undergraduate Research
Publication date: 01/01/2004

Faculty Mentor: Dr. Marjorie Skubic, Computer Engineering and Computer Science. Abstract only available. The goal of this project is to create a robot interface that allows a user to guide and control a robot to perform some task. The assumption is that, although the user may be a domain expert in how the task should be done, he is not an expert in robotics. During the actual robot use, he should focus on the task to be done rather than worrying about the robot or the interaction modality. To address this goal, we have been investigating the use of hand-drawn route maps, in which the user sketches an approximate representation of the environment and then sketches the desired robot trajectory with respect to that environment. The objective in the sketch interface is to extract spatial information about the map and a qualitative path through the landmarks drawn on the sketch. This information is used to build a task representation for the robot, which operates as a semiautonomous vehicle. The stylus interface of the PDA allows the user to sketch a map much as you would on paper. The PDA captures the string of (x,y) coordinates sketched on the screen, which forms a digital representation suitable for processing. The user first draws a representation of the environment by sketching the approximate boundary of each object. During the sketching process, a delimiter is included to separate the string of coordinates for each object in the environment. After all of the environment objects have been drawn, another delimiter is included to indicate the start of the robot trajectory, and the user sketches the desired path of the robot, relative to the sketched environment.

University of Missouri: MOspace
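
The PDA project abstract above describes a stream of (x,y) coordinates in which one delimiter separates environment objects and another marks the start of the robot trajectory. A minimal sketch of how such a stream might be split is shown below. The delimiter tokens `OBJECT_DELIM` and `PATH_DELIM`, the function `parse_sketch`, and the in-memory list format are all hypothetical, since the abstract does not specify the actual encoding.

```python
OBJECT_DELIM = "OBJ"   # hypothetical token separating environment objects
PATH_DELIM = "PATH"    # hypothetical token marking the start of the trajectory


def parse_sketch(stream):
    """Split a sketched coordinate stream into object outlines and a path.

    Illustrative sketch: points are (x, y) tuples; delimiters are the
    string tokens defined above.
    """
    objects = []
    current = []
    path = []
    in_path = False
    for token in stream:
        if token == PATH_DELIM:
            # Close any object still being drawn, then switch to the path.
            if current:
                objects.append(current)
                current = []
            in_path = True
        elif token == OBJECT_DELIM:
            if in_path:
                raise ValueError("object delimiter found after the path started")
            if current:
                objects.append(current)
                current = []
        elif in_path:
            path.append(token)
        else:
            current.append(token)
    # Handle a stream that ends without a path delimiter.
    if current:
        objects.append(current)
    return objects, path


# Toy stream: two objects followed by a short trajectory (made-up points).
stream = [(0, 0), (1, 0), (1, 1), OBJECT_DELIM,
          (5, 5), (6, 5), PATH_DELIM,
          (0, 2), (3, 2)]
objects, path = parse_sketch(stream)
```

Keeping object outlines and the trajectory as separate point lists makes it straightforward to compute spatial relations between the path and each landmark afterwards.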