Joint Entity Extraction and Assertion Detection for Clinical Text
Negative medical findings are prevalent in clinical reports, yet
discriminating them from positive findings remains a challenging task for
information extraction. Most of the existing systems treat this task as a
pipeline of two separate tasks, i.e., named entity recognition (NER) and
rule-based negation detection. We consider this as a multi-task problem and
present a novel end-to-end neural model to jointly extract entities and
negations. We extend a standard hierarchical encoder-decoder NER model and
first adopt a shared encoder followed by separate decoders for the two tasks.
This architecture performs considerably better than the previous rule-based and
machine learning-based systems. To overcome the problem of increased parameter
size especially for low-resource settings, we propose the Conditional Softmax
Shared Decoder architecture which achieves state-of-the-art results for NER and
negation detection on the 2010 i2b2/VA challenge dataset and a proprietary
de-identified clinical dataset.
Comment: Accepted at the 57th Annual Meeting of the Association for
Computational Linguistics (ACL 2019)
SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks
SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens,
which are all sense-annotated. The corpus is annotated using two different
sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how
tokens and senses are associated. Instead of linking a token to only one
intended sense, SALMA links a token to multiple senses and provides a score to
each sense. A smart web-based annotation tool was developed to support scoring
multiple senses against a given word. In addition to sense annotations, we also
annotated the corpus using six types of named entities. The quality of our
annotations was assessed using various metrics (Kappa, Linear Weighted Kappa,
Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error),
which show very high inter-annotator agreement. To establish a Word Sense
Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word
Sense Disambiguation system using Target Sense Verification. We used this
system to evaluate three Target Sense Verification models available in the
literature. Our best model achieved an accuracy of 84.2% using Modern and
78.7% using Ghani. The full corpus and the annotation tool are open-source and
publicly available at https://sina.birzeit.edu/salma/
ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic
This paper presents ArBanking77, a large Arabic dataset for intent
detection in the banking domain. We arabized and localized the original
English Banking77 dataset, which consists of 13,083 queries, to produce
ArBanking77, with 31,404 queries in both Modern Standard Arabic (MSA)
and Palestinian dialect, with each query classified into one of the 77 classes
(intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned
on ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and
Palestinian dialect, respectively. We performed extensive experimentation in
which we simulated low-resource settings, where the model is trained on a
subset of the data and augmented with noisy queries to simulate colloquial
terms, mistakes and misspellings found in real NLP systems, especially live
chat queries. The data and the models are publicly available at
https://sina.birzeit.edu/arbanking77
Arabic Fine-Grained Entity Recognition
Traditional NER systems are typically trained to recognize coarse-grained
entities, and less attention is given to classifying entities into a hierarchy
of fine-grained lower-level subtypes. This article aims to advance Arabic NER
with fine-grained entities. We chose to extend Wojood (an open-source Nested
Arabic Named Entity Corpus) with subtypes. In particular, four main entity
types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),
and facility (FAC), are extended with 31 subtypes. To do this, we first revised
Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's
ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC,
ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE
sub-types. We refer to this extended version of Wojood as WojoodFine. To
evaluate our annotations, we measured the inter-annotator agreement (IAA) using
both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.
To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic
BERT encoders in three settings: flat NER, nested NER and nested NER with
subtypes, and achieved F1 scores of 0.920, 0.866, and 0.885, respectively. Our
corpus and models are open-source and available at
https://sina.birzeit.edu/wojood/
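The inter-annotator agreement reported above is measured with Cohen's Kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of that statistic for two annotators, not the authors' evaluation code; the function name `cohen_kappa` and the toy labels in the usage note are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

For example, with annotator A giving `["GPE", "GPE", "ORG", "ORG"]` and annotator B giving `["GPE", "GPE", "ORG", "GPE"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.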
WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task
We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER)
Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering
novel NER datasets (i.e., Wojood) and the definition of subtasks designed to
facilitate meaningful comparisons between different NER approaches.
WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45
unique teams registered for this shared task, with 11 of them actively
participating in the test phase. Specifically, 11 teams participated in
FlatNER, while teams tackled NestedNER. The winning teams achieved F1
scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.
Predicting disease risks from highly imbalanced data using random forest
Background: We present a method utilizing Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare.
Methods: We employed the National Inpatient Sample (NIS) data, which is publicly available through Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases.
Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.
Conclusions: In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
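The sub-sampling step described above can be sketched as follows. This is one plausible reading of "repeated random sub-sampling" and not the authors' implementation: it assumes binary labels (0/1), reuses every minority-class example in each sub-sample, and pairs it with a disjoint random chunk of the majority class. The function name `balanced_subsamples` and the `seed` parameter are hypothetical.

```python
import random


def balanced_subsamples(labels, seed=0):
    """Split indices into sub-samples with equal numbers of each class.

    Assumes binary labels (0/1). Each sub-sample reuses every minority
    index and takes a disjoint chunk of the shuffled majority indices.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    k = len(minority)
    subsamples = []
    # Leftover majority items that cannot fill a whole chunk are dropped.
    for start in range(0, len(shuffled) - k + 1, k):
        chunk = shuffled[start:start + k]
        subsamples.append(sorted(minority + chunk))
    return subsamples
```

One classifier could then be trained per sub-sample and the predictions combined by voting or averaging; that combination step is left out of this sketch because the abstract does not specify it.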
PDA project [abstract]
Faculty Mentor: Dr. Marjorie Skubic, Computer Engineering and Computer Science. Abstract only available. The goal of this project is to create a robot interface that allows a user to guide and control a robot to perform some task. The assumption is that, although the user may be a domain expert in how the task should be done, he is not an expert in robotics. During the actual robot use, he should focus on the task to be done rather than worrying about the robot or the interaction modality. To address this goal, we have been investigating the use of hand-drawn route maps, in which the user sketches an approximate representation of the environment and then sketches the desired robot trajectory with respect to that environment. The objective in the sketch interface is to extract spatial information about the map and a qualitative path through the landmarks drawn on the sketch. This information is used to build a task representation for the robot, which operates as a semiautonomous vehicle. The stylus interface of the PDA allows the user to sketch a map much as you would on paper. The PDA captures the string of (x,y) coordinates sketched on the screen, which forms a digital representation suitable for processing. The user first draws a representation of the environment by sketching the approximate boundary of each object. During the sketching process, a delimiter is included to separate the string of coordinates for each object in the environment. After all of the environment objects have been drawn, another delimiter is included to indicate the start of the robot trajectory, and the user sketches the desired path of the robot, relative to the sketched environment.
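The delimited coordinate stream described above could be split into object outlines and a trajectory roughly as follows. This is only an illustrative sketch: the abstract does not say how the delimiters are encoded, so the marker values `OBJECT_DELIM` and `PATH_DELIM` and the function name `parse_sketch` are assumptions.

```python
OBJECT_DELIM = "OBJ"   # hypothetical marker between environment objects
PATH_DELIM = "PATH"    # hypothetical marker before the robot trajectory


def parse_sketch(stream):
    """Split a stylus stream into object outlines and one trajectory."""
    objects, current, path = [], [], None
    for item in stream:
        if item == PATH_DELIM:
            if path is not None:
                raise ValueError("trajectory delimiter seen twice")
            if current:
                objects.append(current)  # close the last object outline
            current, path = [], []
        elif item == OBJECT_DELIM:
            if path is not None:
                raise ValueError("object delimiter inside trajectory")
            if current:
                objects.append(current)
            current = []
        elif path is not None:
            path.append(item)      # points after PATH_DELIM form the path
        else:
            current.append(item)   # points before it belong to an object
    if path is None and current:
        objects.append(current)    # stream ended without a trajectory
    return objects, path or []
```

Given a stream such as `[(0, 0), (1, 0), (1, 1), "OBJ", (5, 5), (6, 5), "PATH", (0, 2), (3, 3)]`, this yields two object outlines and a two-point trajectory.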