Morphological Analysis of the Dravidian Language Family
The Dravidian family is one of the most
widely spoken sets of languages in the
world, yet there are very few annotated resources
available to NLP researchers. To
remedy this, we create DravMorph, a corpus
annotated for morphological segmentation
and part-of-speech. Also, we exploit
novel features and higher-order models to
achieve promising results on these corpora
on both tasks, beating techniques proposed
in the literature by as much as 4 points in
segmentation F1.
Thread-level information for comment classification in community question answering
Community Question Answering (cQA) is a new application of QA in social contexts (e.g., fora). It presents interesting new challenges and research directions, e.g., exploiting the dependencies between the different comments of a thread to select the best answer for a given question. In this paper, we explored two ways of modeling such dependencies: (i) by designing specific features looking globally at the thread; and (ii) by applying structured prediction models. We trained and evaluated our models on data from SemEval-2015 Task 3 on Answer Selection in cQA. Our experiments show that: (i) the thread-level features consistently improve the performance for a variety of machine learning models, yielding state-of-the-art results; and (ii) sequential dependencies between the answer labels captured by structured prediction models are not enough to improve the results, indicating that more information is needed in the joint model.
Precursor-induced conditional random fields: connecting separate entities by induction for improved clinical named entity recognition
Background
This paper presents a conditional random fields (CRF) method that enables the capture of specific high-order label transition factors to improve clinical named entity recognition performance. Consecutive clinical entities in a sentence are usually separated from each other, and the textual descriptions in clinical narrative documents frequently indicate causal or posterior relationships that can be used to facilitate clinical named entity recognition. However, the CRF that is generally used for named entity recognition is a first-order model that constrains label transition dependency of adjoining labels under the Markov assumption.
Methods
Based on the first-order structure, our proposed model utilizes non-entity tokens between separated entities as an information transmission medium by applying a label induction method. The model is referred to as precursor-induced CRF because its non-entity state memorizes precursor entity information, and the model's structure allows the precursor entity information to propagate forward through the label sequence.
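The label induction step described above can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes BIO-style tags, and the function names and the `O@TYPE` naming for induced non-entity labels are hypothetical. Each non-entity label is rewritten to carry the type of the most recent entity, so that a first-order CRF trained on the induced labels can see the precursor entity through its ordinary adjacent-label transitions.

```python
from typing import List, Optional


def induce_precursor_labels(labels: List[str], outside: str = "O") -> List[str]:
    """Rewrite non-entity labels so they remember the preceding entity type.

    Assumes BIO-style tags such as "B-PROBLEM" / "I-PROBLEM" (an assumption;
    the abstract does not specify the tagging scheme).
    """
    induced: List[str] = []
    precursor: Optional[str] = None  # type of the most recent entity, if any
    for label in labels:
        if label == outside:
            # Non-entity token: attach the precursor entity type when one exists.
            induced.append(outside if precursor is None else f"{outside}@{precursor}")
        else:
            # Entity token: update the precursor, e.g. "B-PROBLEM" -> "PROBLEM".
            precursor = label.split("-", 1)[-1]
            induced.append(label)
    return induced


def strip_precursor_labels(labels: List[str], outside: str = "O") -> List[str]:
    """Map induced labels back to the original tag set for evaluation."""
    prefix = outside + "@"
    return [outside if label.startswith(prefix) else label for label in labels]
```

For example, the sequence `O B-PROBLEM I-PROBLEM O O B-TREATMENT O` would become `O B-PROBLEM I-PROBLEM O@PROBLEM O@PROBLEM B-TREATMENT O@TREATMENT` under this scheme, and stripping the induced labels recovers the original tags before scoring.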
Results
We compared the proposed model with both first- and second-order CRFs in terms of their F1-scores, using two clinical named entity recognition corpora (the i2b2 2012 challenge and the Seoul National University Hospital electronic health record). The proposed model demonstrated better entity recognition performance than both the first- and second-order CRFs and was also more efficient than the higher-order model.
Conclusion
The proposed precursor-induced CRF, which uses non-entity labels as label transition information, improves entity recognition F1 score by exploiting long-distance transition factors without exponentially increasing the computational time. In contrast, a conventional second-order CRF model that uses longer distance transition factors showed even worse results than the first-order model and required the longest computation time. Thus, the proposed model could offer a considerable performance improvement over current clinical named entity recognition methods based on the CRF models.
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education [No. NRF-2015R1D1A1A01058075]; and also supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea [grant number HI14C1277].
Neural networks versus Logistic regression for 30 days all-cause readmission prediction
Heart failure (HF) is one of the leading causes of hospital admissions in the
US. Readmission within 30 days after a HF hospitalization is both a recognized
indicator for disease progression and a source of considerable financial burden
to the healthcare system. Consequently, the identification of patients at risk
for readmission is a key step in improving disease management and patient
outcome. In this work, we used a large administrative claims dataset to
(1) explore the systematic application of neural network-based models versus
logistic regression for predicting 30-day all-cause readmission after
discharge from a HF admission, and (2) examine the additive value of
patients' hospitalization timelines on prediction performance. Based on data
from 272,778 (49% female) patients with a mean (SD) age of 73 years (14) and
343,328 HF admissions (67% of total admissions), we trained and tested our
predictive readmission models following a stratified 5-fold cross-validation
scheme. Among the deep learning approaches, a recurrent neural network (RNN)
combined with conditional random fields (CRF) model (RNNCRF) achieved the best
performance in readmission prediction with 0.642 AUC (95% CI, 0.640-0.645).
Other models, such as those based on RNN, convolutional neural networks and CRF
alone had lower performance, with a non-timeline based model (MLP) performing
worst. A competitive model based on logistic regression with LASSO achieved a
performance of 0.643 AUC (95% CI, 0.640-0.646). We conclude that data from
patient timelines improve 30-day readmission prediction for neural
network-based models, that a logistic regression with LASSO has equal
performance to the best neural network model and that the use of administrative
data results in competitive performance compared to published approaches based
on richer clinical datasets.
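The evaluation protocol mentioned in this abstract (stratified 5-fold cross-validation scored by AUC) can be sketched in plain Python. This is an illustrative sketch only, not the authors' pipeline: the function names are hypothetical, labels are assumed to be binary (1 = readmitted within 30 days, 0 = not), and no model training is shown.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[int]:
    """Assign each example to one of k folds, keeping class ratios similar per fold."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled examples of each class round-robin across folds.
        for rank, idx in enumerate(indices):
            fold_of[idx] = rank % k
    return fold_of


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5).

    Quadratic in the number of examples; fine for a sketch, too slow for
    hundreds of thousands of admissions.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, each fold in turn would serve as the test set while a model is fitted on the remaining four, and the per-fold AUC values would then be summarized. How the reported 95% confidence intervals were derived is not stated in the abstract, so that step is left out here.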