Extrinsic Factors Affecting the Accuracy of Biomedical NER
Biomedical named entity recognition (NER) is a critical task that aims to
identify structured information in clinical text, which is often replete with
complex, technical terms and a high degree of variability. Accurate and
reliable NER can facilitate the extraction and analysis of important biomedical
information for downstream applications, including those in healthcare.
However, NER in the biomedical domain is challenging because data availability
is limited: annotating such data requires considerable expertise, time, and
expense. In this paper, working with this limited data, we explore various
extrinsic factors, including the corpus annotation scheme, data augmentation
techniques, semi-supervised learning, and Brill transformation, to improve the
performance of a NER model on a clinical text dataset (i2b2 2012,
\citet{sun-rumshisky-uzuner:2013}). Our experiments demonstrate that these
approaches can significantly improve the model's F1 score from 73.74 to 77.55.
Our findings suggest that considering different extrinsic factors and
combining these techniques is a promising approach for improving NER
performance in the biomedical domain, where data size is limited.
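The abstract lists data augmentation among the extrinsic factors but does not describe a concrete procedure. The following minimal Python sketch illustrates one common choice for NER, mention-replacement augmentation over BIO-tagged sentences; the function names and the replacement rate are hypothetical, and this is not the paper's implementation.

```python
# A minimal sketch (not the paper's implementation) of mention-replacement
# data augmentation for NER: entity mentions in BIO-tagged training sentences
# are swapped with other mentions of the same type to create new examples.
import random

def collect_mentions(sentences):
    """Gather surface forms per entity type from (tokens, tags) pairs."""
    mentions = {}
    for tokens, tags in sentences:
        i = 0
        while i < len(tags):
            if tags[i].startswith("B-"):
                etype, j = tags[i][2:], i + 1
                while j < len(tags) and tags[j] == f"I-{etype}":
                    j += 1
                mentions.setdefault(etype, []).append(tokens[i:j])
                i = j
            else:
                i += 1
    return mentions

def augment(sentences, mentions, rate=0.3, seed=0):
    """Return new sentences with some mentions replaced by same-type ones."""
    rng = random.Random(seed)
    augmented = []
    for tokens, tags in sentences:
        new_tokens, new_tags, i = [], [], 0
        while i < len(tags):
            if tags[i].startswith("B-") and rng.random() < rate:
                etype, j = tags[i][2:], i + 1
                while j < len(tags) and tags[j] == f"I-{etype}":
                    j += 1
                repl = rng.choice(mentions[etype])
                new_tokens.extend(repl)
                new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(repl) - 1))
                i = j
            else:
                new_tokens.append(tokens[i])
                new_tags.append(tags[i])
                i += 1
        augmented.append((new_tokens, new_tags))
    return augmented
```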
Recommending the Meanings of Newly Coined Words
In this paper, we investigate how to recommend the meanings of newly coined words, such as newly coined named entities and Internet jargon. Our approach automatically chooses a document explaining a given newly coined word among candidate documents from multiple web references using Probabilistic Latent Semantic Analysis [1]. Briefly, it involves finding the topic of a document containing the newly coined word and computing the conditional probability of that topic given each candidate document. We validate our methodology with two real datasets, from MySpace forums and Twitter, by referencing three web services, Google, Urbandictionary, and Wikipedia, and we show that we correctly recommend the meanings of a set of given newly coined words with 69.5% and 80.5% accuracy, respectively, based on our three recommendations. Moreover, we compare our approach against three baselines, each of which references the result from a single web service, and our approach outperforms them.
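As a rough illustration of the ranking step described above, the sketch below assumes PLSA has already produced topic posteriors P(topic | document) and ranks candidate explanation documents by the probability of the context document's dominant topic. The function names and toy numbers are hypothetical, not the paper's code.

```python
# A minimal sketch of the ranking step: pick the dominant topic of a context
# document that uses the newly coined word, then rank candidate explanation
# documents by the conditional probability of that topic.
def dominant_topic(p_topic_given_doc):
    """Index of the most probable topic for one document."""
    return max(range(len(p_topic_given_doc)), key=lambda z: p_topic_given_doc[z])

def recommend(context_posterior, candidate_posteriors, top_k=3):
    """Rank candidates by P(z* | candidate) for the context's dominant topic z*."""
    z_star = dominant_topic(context_posterior)
    ranked = sorted(
        range(len(candidate_posteriors)),
        key=lambda d: candidate_posteriors[d][z_star],
        reverse=True,
    )
    return ranked[:top_k]

# Toy usage: three topics, one context posterior, three candidate documents.
context = [0.1, 0.7, 0.2]
candidates = [[0.6, 0.1, 0.3], [0.2, 0.5, 0.3], [0.1, 0.8, 0.1]]
print(recommend(context, candidates))  # -> [2, 1, 0]
```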
Yet Another Format of Universal Dependencies for Korean
In this study, we propose a morpheme-based scheme for Korean dependency
parsing and apply the proposed scheme to Universal Dependencies. We present the
linguistic rationale that illustrates the motivation and the necessity of
adopting the morpheme-based format, and develop scripts that automatically
convert between the original format used by Universal Dependencies and the
proposed morpheme-based format. The effectiveness of the proposed format
for Korean dependency parsing is then verified by both statistical and neural
models, including UDPipe and Stanza, with our carefully constructed
morpheme-based word embeddings for Korean. morphUD outperforms parsing results
for all Korean UD treebanks, and we also present detailed error analyses. Comment: COLING 2022, Poster
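To make the conversion idea concrete, here is a heavily simplified, hypothetical sketch of splitting one eojeol-level CoNLL-U token into morpheme-level tokens. It assumes the LEMMA column holds "+"-joined morphemes and XPOS holds "+"-joined tags, and it uses a placeholder relation for non-initial morphemes; the authors' actual conversion scripts and relation labels may differ.

```python
# An illustrative sketch (not the authors' scripts) of the kind of conversion
# the abstract describes: one eojeol-level token becomes several
# morpheme-level rows. Re-indexing of IDs across the sentence is omitted.
def split_eojeol(token_id, form, lemma, xpos, head, deprel):
    """Yield morpheme-level (form, xpos, head, deprel) rows for one eojeol."""
    morphs = lemma.split("+")
    tags = xpos.split("+")
    rows = []
    for i, (m, t) in enumerate(zip(morphs, tags)):
        if i == 0:
            # The first morpheme keeps the original head and relation.
            rows.append((m, t, head, deprel))
        else:
            # Later morphemes attach to the eojeol's first morpheme; "morph"
            # is a placeholder label, not the scheme's actual relation.
            rows.append((m, t, token_id, "morph"))
    return rows

print(split_eojeol(3, "프랑스의", "프랑스+의", "ncn+jcm", 5, "nmod"))
```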
Neural Automated Writing Evaluation with Corrective Feedback
The utilization of technology in second language learning and teaching has
become ubiquitous. For the assessment of writing specifically, automated
writing evaluation (AWE) and grammatical error correction (GEC) have become
immensely popular and effective methods for enhancing writing proficiency and
delivering instant, individualized feedback to learners. By leveraging the
power of natural language processing (NLP) and machine learning algorithms, AWE
and GEC systems have been developed separately to provide language learners
with automated corrective feedback and scoring that is more accurate and less
biased than assessment that would otherwise depend on individual examiners. In
this paper, we propose an integrated system for automated writing evaluation
with corrective feedback as a means of bridging the gap between AWE and GEC
results for second language learners. This system enables language learners to
simulate essay writing tests: a student writes and submits an essay, and the
system returns an assessment of the writing along with suggested grammatical
error corrections. Given that automated scoring and grammatical correction are
more efficient and cost-effective than human grading, this integrated system
would also alleviate the burden of manually correcting large numbers of essays. Comment: Supported by the SoTL Seed Program at UB
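Below is a minimal sketch of the integrated workflow, under the assumption that an AWE scorer and a GEC corrector are available as separate components; `score_essay` and `correct_grammar` are hypothetical stand-ins, not the system's actual API.

```python
# A minimal sketch of the integrated workflow described above: a submitted
# essay is scored by an AWE component and corrected by a GEC component, and
# both results are returned together as one piece of feedback.
from dataclasses import dataclass

@dataclass
class Feedback:
    score: float            # holistic essay score from the AWE component
    corrected_text: str     # essay after grammatical error correction
    edits: list             # individual (original, correction) pairs

def evaluate_submission(essay: str, score_essay, correct_grammar) -> Feedback:
    """Run AWE and GEC on one essay and package the combined feedback."""
    score = score_essay(essay)
    corrected, edits = correct_grammar(essay)
    return Feedback(score=score, corrected_text=corrected, edits=edits)

# Toy usage with dummy components standing in for real models.
demo = evaluate_submission(
    "She go to school yesterday.",
    score_essay=lambda text: 3.5,
    correct_grammar=lambda text: ("She went to school yesterday.",
                                  [("go", "went")]),
)
print(demo.score, demo.corrected_text)
```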
Trance parser model for Korean: Sejong treebank
Trance parsing model and embedding vector. See https://github.com/tarowatanabe/trance for the parser and its usage. We also provide the parsing and learning scripts for the Trance parser that we used for the paper:
1/ parsing model: ptb_train.txt.model-d100.tar.gz
2/ embedding vector: embedding-d100.vec.gz
3/ trance parser parsing script: trance-parsing.sh
4/ trance parser (batch) learning script: trance-training-batch.sh
5/ test.txt (gold file) and test.txt.leaf serve as the parser input.
Jungyeul Park, A Note on Constituent Parsing for Korean Using the Sejong Treebank (submitted to TALLIP). October 2017. See https://github.com/jungyeul/tallip-sjtree-parsing for more detail.
Universal Dependencies for Korean: Hani (ver1.0)