5 research outputs found
A realistic and robust model for Chinese word segmentation
A realistic Chinese word segmentation tool must adapt to textual variations
with minimal training input and yet be robust enough to yield reliable
segmentation results for all variants. Various lexicon-driven approaches to
Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive
training for any variation. Text-driven approaches, e.g. [12], can be easily
adapted for domain and genre changes yet have difficulty matching the high
f-scores of the lexicon-driven approaches. In this paper, we refine and
implement an innovative text-driven word boundary decision (WBD) segmentation
model proposed in [15]. The WBD model treats word segmentation simply and
efficiently as a binary decision on whether to realize the natural textual
break between two adjacent characters as a word boundary. The WBD model allows
simple and quick preparation of training data by converting characters into
contextual vectors for learning the word boundary decision. Machine learning
experiments with four different classifiers show that training with 1,000
vectors and with 1 million vectors achieves comparable and reliable results.
In addition, when
applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall
rates that are higher than all published results. Unlike all previous work, our
OOV recall rate is comparable to our own F-score. Both experiments support the
claim that the WBD model is a realistic model for Chinese word segmentation as
it can be easily adapted to new variants with robust results. In
conclusion, we will discuss linguistic ramifications as well as future
implications for the WBD approach.
Comment: Proceedings of the 20th Conference on Computational Linguistics and Speech Processing
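As a sketch, the boundary-decision formulation described in this abstract can be illustrated in a few lines: each gap between two adjacent characters becomes one binary classification instance described by a small contextual feature vector. The window size, the feature choice, and the toy bigram "classifier" below are illustrative assumptions, not the paper's actual features or learned classifiers.

```python
# Sketch of the word boundary decision (WBD) idea: segmentation reduces to a
# binary decision at every gap between adjacent characters.

def gap_features(chars, i, window=2):
    """Contextual features for the gap between chars[i-1] and chars[i]."""
    left = "".join(chars[max(0, i - window):i])
    right = "".join(chars[i:i + window])
    return {"left": left, "right": right, "bigram": chars[i - 1] + chars[i]}

def segment(text, is_boundary):
    """Insert a word break wherever the classifier calls the gap a boundary."""
    chars = list(text)
    words, start = [], 0
    for i in range(1, len(chars)):
        if is_boundary(gap_features(chars, i)):
            words.append("".join(chars[start:i]))
            start = i
    words.append("".join(chars[start:]))
    return words

# Toy stand-in classifier: break everywhere except inside a known bigram.
known_bigrams = {"中国", "北京"}
toy = lambda f: f["bigram"] not in known_bigrams
print(segment("中国北京", toy))  # prints ['中国', '北京']
```

In the paper's setting, `is_boundary` would be one of the four trained classifiers operating on the contextual vectors rather than this bigram lookup.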
Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Named entity recognition, and other information extraction tasks, frequently
use linguistic features such as part of speech tags or chunkings. For languages
where word boundaries are not readily identified in text, word segmentation is
a key first step to generating features for an NER system. While using word
boundary tags as features is helpful, the signals that aid in identifying
these boundaries may provide richer information for an NER system. New
state-of-the-art word segmentation systems use neural models to learn
representations for predicting word boundaries. We show that these same
representations, jointly trained with an NER system, yield significant
improvements in NER for Chinese social media. In our experiments, jointly
training NER and word segmentation with an LSTM-CRF model yields nearly 5%
absolute improvement over previously published results.
Comment: This is the camera-ready version of our ACL'16 paper. We also added supplementary material containing the results of our systems on a cleaner dataset (much higher F1 scores). For more information, please refer to the repo https://github.com/hltcoe/golden-hors
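The joint-training idea above can be sketched as two task heads sharing one encoder, so gradients from the segmentation objective also shape the representation the NER head consumes. The tiny linear encoder and softmax heads below are illustrative assumptions; the paper uses an LSTM-CRF.

```python
import numpy as np

# Sketch of joint training: a representation shared between word segmentation
# and NER, so boundary signals flow into the NER objective.

rng = np.random.default_rng(0)
D, H, SEG_TAGS, NER_TAGS = 8, 16, 4, 5   # char dim, hidden dim, tag set sizes

W_shared = rng.normal(size=(D, H))       # shared encoder (stands in for the LSTM)
W_seg = rng.normal(size=(H, SEG_TAGS))   # segmentation head
W_ner = rng.normal(size=(H, NER_TAGS))   # NER head

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(x, seg_gold, ner_gold):
    """Sum of cross-entropies; gradients w.r.t. W_shared come from both tasks."""
    h = np.tanh(x @ W_shared)            # shared representation per character
    p_seg = softmax(h @ W_seg)
    p_ner = softmax(h @ W_ner)
    n = np.arange(len(x))
    return -(np.log(p_seg[n, seg_gold]).mean()
             + np.log(p_ner[n, ner_gold]).mean())

x = rng.normal(size=(6, D))              # 6 characters' input features
loss = joint_loss(x, seg_gold=np.array([0, 1, 2, 3, 0, 1]),
                  ner_gold=np.array([0, 0, 1, 2, 0, 4]))
print(round(loss, 3))
```

Minimizing this summed loss updates `W_shared` from both objectives, which is the mechanism by which segmentation signals enrich the NER representation.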
Chinese Named Entity Recognition Augmented with Lexicon Memory
Inspired by a concept of content-addressable retrieval from cognitive
science, we propose a novel fragment-based model augmented with a lexicon-based
memory for Chinese NER, in which both the character-level and word-level
features are combined to generate better feature representations for possible
name candidates. It is observed that locating the boundary information of
entity names is useful in order to classify them into pre-defined categories.
Position-dependent features, including prefixes and suffixes, are introduced for NER
in the form of distributed representation. The lexicon-based memory is used to
help generate such position-dependent features and deal with the problem of
out-of-vocabulary words. Experimental results showed that the proposed model,
called LEMON, achieved state-of-the-art performance on four datasets.
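The fragment-based view with a lexicon memory can be sketched as follows: enumerate candidate name fragments up to a maximum length and attach position-dependent lexicon evidence to each (does any lexicon entry start or end with this fragment?). The toy lexicon and the boolean features are illustrative assumptions, not the paper's distributed representations.

```python
# Sketch of fragment enumeration with lexicon-based evidence for each
# candidate, in the spirit of a lexicon "memory" lookup.

LEXICON = {"北京大学", "大学", "北京"}   # toy lexicon standing in for the memory

def fragment_features(text, max_len=4):
    feats = []
    for start in range(len(text)):
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            frag = text[start:end]
            feats.append({
                "fragment": frag,
                "in_lexicon": frag in LEXICON,
                # prefix/suffix evidence: frag is a proper prefix/suffix of
                # some lexicon word
                "is_prefix": any(w.startswith(frag) and w != frag
                                 for w in LEXICON),
                "is_suffix": any(w.endswith(frag) and w != frag
                                 for w in LEXICON),
            })
    return feats

hits = [f["fragment"] for f in fragment_features("北京大学") if f["in_lexicon"]]
print(hits)  # fragments that are full lexicon matches
```

A real model would embed these position-dependent signals as vectors and score each fragment against the pre-defined entity categories.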
Simplify the Usage of Lexicon in Chinese NER
Recently, many works have tried to utilize word lexicons to improve the
performance of Chinese named entity recognition (NER). As a representative work
in this line, Lattice-LSTM \cite{zhang2018chinese} has achieved new
state-of-the-art performance on several benchmark Chinese NER datasets.
However, Lattice-LSTM suffers from a complicated model architecture, resulting
in low computational efficiency. This will heavily limit its application in
many industrial areas that require real-time NER responses. In this work, we
ask: can we simplify the usage of the lexicon and, at the same time, achieve
performance comparable to Lattice-LSTM for Chinese NER?
Starting from this question and motivated by the idea of Lattice-LSTM, we
propose a concise but effective method to incorporate the lexicon information
into the vector representations of characters. In this way, our method avoids
introducing a complicated sequence modeling architecture to model the lexicon
information. Instead, it only needs to subtly adjust the character
representation layer of the neural sequence model. Experimental studies on four
benchmark Chinese NER datasets show that our method achieves much faster
inference speed and comparable or better performance than Lattice-LSTM and its
followers. They also show that our method can be easily transferred across
different neural architectures.
Comment: Use Lexicon for Chinese NER as simply as possible
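Folding lexicon information into the character representation layer can be sketched as follows: for every character, collect the lexicon words in which it occupies the Begin, Middle, End, or Single position (a BMES grouping in the spirit of the method; the toy lexicon and the raw word sets are illustrative assumptions, since the actual method turns these sets into weighted vectors).

```python
# Sketch of per-character lexicon word sets: each character gets four sets of
# matched words, grouped by the position the character takes in the word.

LEXICON = {"南京", "南京市", "市长", "长江大桥", "大桥"}

def char_word_sets(sentence):
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()}
            for _ in sentence]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            w = sentence[i:j]
            if w not in LEXICON:
                continue
            if len(w) == 1:
                sets[i]["S"].add(w)          # single-character word
            else:
                sets[i]["B"].add(w)          # word begins at character i
                sets[j - 1]["E"].add(w)      # word ends at character j-1
                for k in range(i + 1, j - 1):
                    sets[k]["M"].add(w)      # character inside the word
    return sets

s = char_word_sets("南京市长江大桥")
print(s[0]["B"])  # words beginning at '南'
```

Because these sets are computed once per sentence and attached at the embedding layer, no lattice-shaped recurrence is needed at inference time, which is the source of the speedup the abstract describes.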
Chinese NER Using Lattice LSTM
We investigate a lattice-structured LSTM model for Chinese NER, which encodes
a sequence of input characters as well as all potential words that match a
lexicon. Compared with character-based methods, our model explicitly leverages
word and word sequence information. Compared with word-based methods, lattice
LSTM does not suffer from segmentation errors. Gated recurrent cells allow our
model to choose the most relevant characters and words from a sentence for
better NER results. Experiments on various datasets show that lattice LSTM
outperforms both word-based and character-based LSTM baselines, achieving the
best results.
Comment: Accepted at ACL 2018 as a long paper
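The lattice construction described above can be sketched as an edge set: alongside the character-by-character path, every lexicon word matching a substring contributes an extra edge that the gated recurrent cells can later choose to use. The toy lexicon is an illustrative assumption.

```python
# Sketch of lattice construction for a character sequence: one edge per
# character, plus one edge per multi-character lexicon match.

LEXICON = {"长江", "长江大桥", "大桥"}   # toy lexicon

def build_lattice(sentence):
    """Return edges (start, end, token) over gap positions 0..len(sentence)."""
    edges = [(i, i + 1, ch) for i, ch in enumerate(sentence)]  # character path
    for i in range(len(sentence)):
        for j in range(i + 2, len(sentence) + 1):              # word edges
            if sentence[i:j] in LEXICON:
                edges.append((i, j, sentence[i:j]))
    return edges

for edge in build_lattice("长江大桥"):
    print(edge)
```

In the model, each word edge feeds an extra cell state into the character LSTM at the word's end position, and gating decides how much of each incoming path to keep, which is how the lattice sidesteps hard segmentation decisions.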