3 research outputs found
A realistic and robust model for Chinese word segmentation
A realistic Chinese word segmentation tool must adapt to textual variations
with minimal training input and yet be robust enough to yield reliable
segmentation results for all variants. Various lexicon-driven approaches to
Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive
training for any variation. Text-driven approaches, e.g. [12], can be easily
adapted to domain and genre changes yet have difficulty matching the high
f-scores of the lexicon-driven approaches. In this paper, we refine and
implement an innovative text-driven word boundary decision (WBD) segmentation
model proposed in [15]. The WBD model treats word segmentation simply and
efficiently as a binary decision on whether to realize the natural textual
break between two adjacent characters as a word boundary. The WBD model allows
simple and quick preparation of training data by converting characters into
contextual vectors for learning the word boundary decision. Machine learning experiments
with four different classifiers show that training with 1,000 vectors and with
1 million vectors achieves comparable and reliable results. In addition, when
applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall
rates that are higher than all published results. Unlike all previous work, our
OOV recall rate is comparable to our own F-score. Both experiments support the
claim that the WBD model is a realistic model for Chinese word segmentation as
it can be easily adapted to new variants while yielding robust results. In
conclusion, we will discuss linguistic ramifications as well as future
implications for the WBD approach.
Comment: Proceedings of the 20th Conference on Computational Linguistics and Speech Processing
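The boundary-decision framing lends itself to a very simple training-data preparation step. The following Python sketch illustrates the idea under stated assumptions (it is not the authors' code; the function name, the two-character context window, and the string-pair feature form are all hypothetical):

```python
# Hypothetical sketch of WBD-style training-data preparation:
# every gap between two adjacent characters becomes one training
# example, labelled 1 if the gold segmentation puts a boundary there.

def boundary_examples(segmented, window=2):
    """Convert a gold-segmented sentence (a list of words) into
    (context, label) pairs, one per character gap."""
    chars = "".join(segmented)
    # character positions at which a gold word boundary occurs
    boundaries, pos = set(), 0
    for word in segmented[:-1]:
        pos += len(word)
        boundaries.add(pos)
    examples = []
    for gap in range(1, len(chars)):
        left = chars[max(0, gap - window):gap]   # characters before the gap
        right = chars[gap:gap + window]          # characters after the gap
        examples.append(((left, right), 1 if gap in boundaries else 0))
    return examples

pairs = boundary_examples(["中文", "分詞"])
# gaps 中|文, 文|分, 分|詞 → labels 0, 1, 0
```

Any off-the-shelf binary classifier can then be trained on such pairs, which is what makes the preparation step quick compared with lexicon-driven training.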
Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi
In Natural Language Processing (NLP) pipelines, Named Entity Recognition
(NER) is one of the preliminary problems, which marks proper nouns and other
named entities such as Location, Person, Organization, Disease, etc. Without a
NER module, such entities adversely affect the performance of a machine
translation system. NER helps to overcome this problem by recognising and
handling such entities separately; it is also useful in Information Extraction
systems. Bhojpuri, Maithili and Magahi are low resource
languages, usually known as Purvanchal languages. This paper focuses on the
development of a NER benchmark dataset for the Machine Translation systems
developed to translate from these languages to Hindi by annotating parts of
their available corpora. Bhojpuri, Maithili and Magahi corpora of 228,373,
157,468 and 56,190 tokens, respectively, were annotated using 22 entity labels.
The annotation uses coarse-grained labels following the tagset used in one of
the Hindi NER datasets. We also report a Deep Learning-based baseline that
uses an LSTM-CNNs-CRF model. As a lower baseline, the F1-scores obtained from
an NER tool using Conditional Random Fields models are 96.73 for Bhojpuri,
93.33 for Maithili and 95.04 for Magahi. The Deep Learning-based technique
(LSTM-CNNs-CRF) achieved 96.25 for Bhojpuri, 93.33 for Maithili and 95.44 for
Magahi.
Comment: 34 pages; 7 figures
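The F1-scores above are of the kind usually computed at the entity level. As a minimal sketch, assuming exact-match spans over a single evaluation set (the paper's own evaluation protocol may differ):

```python
# Hedged sketch: entity-level F1 from exact-match entity spans.
# The span format (start, end, label) is an assumption for illustration.

def span_f1(gold, pred):
    """gold, pred: sets of (start, end, label) entity spans."""
    tp = len(gold & pred)                        # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (5, 7, "LOC")}
pred = {(0, 2, "PER"), (5, 6, "LOC")}
# one exact match out of two predictions and two gold spans:
# precision = recall = 0.5, so F1 = 0.5
```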
A systematic cross-comparison of sequence classifiers
In the CoNLL 2003 NER shared task, more than two thirds of the submitted systems used a feature-rich representation of the task. Most of them used the maximum entropy principle to combine the features; others used large margin linear classifiers, such as SVM and RRM. In this paper, we compare several common classifiers under exactly the same conditions, demonstrating that the ranking of systems in the shared task is due to feature selection and other causes rather than to inherent qualities of the algorithms, which by themselves would be ranked differently. We demonstrate that whole-sequence models generally outperform local models, and that large margin classifiers generally outperform maximum entropy-based classifiers.