Exploiting Multiple Embeddings for Chinese Named Entity Recognition
Identifying the named entities mentioned in text would enrich many downstream
semantic applications. However, owing to the predominance of colloquial
language in microblogs, named entity recognition (NER) in Chinese microblogs
suffers a significant performance drop compared with NER on formal Chinese
corpora. In this paper, we propose a
simple yet effective neural framework to derive the character-level embeddings
for NER in Chinese text, named ME-CNER. A character embedding is derived with
rich semantic information harnessed at multiple granularities, ranging from
radical, character to word levels. The experimental results demonstrate that
the proposed approach achieves a large performance improvement on the Weibo
dataset and comparable performance on the MSRA news dataset at lower
computational cost than existing state-of-the-art alternatives.
Comment: accepted at CIKM 201
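The core idea of combining granularities can be sketched as concatenating a radical-level, a character-level, and a word-level vector into one character representation. This is a minimal numpy illustration, not the paper's actual architecture; the dimensions and vocabulary entries are invented for the example:

```python
import numpy as np

# Hypothetical embedding dimensions (not taken from the paper).
RADICAL_DIM, CHAR_DIM, WORD_DIM = 8, 16, 32

rng = np.random.default_rng(0)
# Toy lookup tables standing in for trained embedding layers.
radical_emb = {"氵": rng.normal(size=RADICAL_DIM)}
char_emb = {"河": rng.normal(size=CHAR_DIM)}
word_emb = {"河流": rng.normal(size=WORD_DIM)}

def multi_granularity_embedding(char, radical, word):
    """Concatenate radical-, character-, and word-level vectors
    into a single character representation."""
    return np.concatenate([radical_emb[radical], char_emb[char], word_emb[word]])

vec = multi_granularity_embedding("河", "氵", "河流")
print(vec.shape)  # (56,) = 8 + 16 + 32
```

In a real system each table would be a trainable embedding layer and the concatenated vector would feed a sequence labeler.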
Learning Character-level Compositionality with Visual Features
Previous work has modeled the compositionality of words by creating
character-level models of meaning, reducing problems of sparsity for rare
words. However, in many writing systems compositionality operates even at the
character level: the meaning of a character is derived from the sum of its
component parts. In this paper, we model this effect by creating embeddings for
characters based on their visual characteristics, creating an image for the
character and running it through a convolutional neural network to produce a
visual character embedding. Experiments on a text classification task
demonstrate that such a model allows for better processing of instances with rare
characters in languages such as Chinese, Japanese, and Korean. Additionally,
qualitative analyses demonstrate that our proposed model learns to focus on the
parts of characters that carry semantic content, resulting in embeddings that
are coherent in visual space.
Comment: Accepted to ACL 201
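The pipeline described above, rendering a character as an image and convolving it to get a visual embedding, can be sketched in plain numpy. The glyph bitmap below is hand-drawn for illustration (a real system would rasterize the character with a CJK font), and the random filters stand in for a trained convolutional layer:

```python
import numpy as np

# Toy 8x8 "glyph" bitmap standing in for a rendered character image.
glyph = np.zeros((8, 8))
glyph[1:7, 3] = 1.0   # a vertical stroke
glyph[4, 1:7] = 1.0   # a horizontal stroke

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation, as in a CNN conv layer."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernels = rng.normal(size=(4, 3, 3))  # four random 3x3 filters

# One conv layer + global max pooling -> a 4-dim visual embedding.
embedding = np.array([conv2d(glyph, k).max() for k in kernels])
print(embedding.shape)  # (4,)
```

Because the embedding is computed from pixels rather than a lookup table, rare characters that share strokes with common ones get related representations.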
Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for Chinese word segmentation (CWS) usually rely on a large
number of labeled sentences to train segmentation models, which are expensive
and time-consuming to annotate. Fortunately, unlabeled data is usually easy to
collect, and many high-quality Chinese lexicons are available off the shelf;
both can provide
useful information for CWS. In this paper, we propose a neural approach for
Chinese word segmentation which can exploit both lexicon and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm:
the unlabeled data and the lexicon are incorporated into model training as
indirect supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference
(WWW '19)
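One simple way a lexicon yields the kind of indirect supervision the abstract mentions is by inducing a (noisy) reference segmentation of unlabeled text, toward which the model's predictions can be regularized. The sketch below shows only that lexicon-matching ingredient, via greedy forward maximum matching; it is not the paper's posterior regularization procedure:

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon word starting there (falling back to one character)."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in lexicon:
                words.append(cand)
                i += l
                break
    return words

lexicon = {"自然", "语言", "处理"}
print(max_match("自然语言处理", lexicon))  # ['自然', '语言', '处理']
```

In a posterior-regularization setup, such lexicon-derived segmentations would constrain the model's output distribution on unlabeled sentences rather than serve as hard labels.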