2 research outputs found

Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning

Authors: Dredze Mark, Peng Nanyun
Publication date: 28/03/2017

Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part of speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields a nearly 5% absolute improvement over previously published results.

Comment: This is the camera ready version of our ACL'16 paper. We also added supplementary material containing the results of our systems on a cleaner dataset (much higher F1 scores). For more information, please refer to the repo https://github.com/hltcoe/golden-hors
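The abstract above contrasts learned segmentation representations with the simpler baseline of using word boundary tags as features. As a rough illustration of that baseline only, the following sketch turns an already segmented sentence into one boundary tag per character. The B/M/E/S tag scheme, the function name `boundary_tags`, and the example sentence are assumptions made for illustration and are not taken from the paper or its repository.

```python
def boundary_tags(words):
    """Map a segmented sentence to one boundary tag per character.

    Assumed tag scheme: B = word begin, M = word middle,
    E = word end, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Hypothetical segmenter output for a short sentence.
segmented = ["我", "爱", "北京", "天安门"]
chars = [c for w in segmented for c in w]

# Each character is paired with its boundary tag; a character-level
# NER tagger could consume these pairs as input features.
features = list(zip(chars, boundary_tags(segmented)))
```

Here `features` holds one `(character, tag)` pair per character. The approach described in the abstract goes further by sharing the segmenter's learned representations with the NER model instead of passing only these discrete tags.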

arXiv.org e-Print Archive

Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text

Authors: Chen Xiaojun, Ding Ling, E Shijia, Jia Shengbin, Xiang Yang
Publication date: 15/06/2020

Chinese word segmentation is necessary to provide word-level information for Chinese named entity recognition (NER) systems. However, segmentation error propagation is a challenge for Chinese NER when processing colloquial data such as social media text. In this paper, we propose a model (UIcwsNN) that specializes in identifying entities in Chinese social media text, especially by leveraging ambiguous word segmentation information. Such uncertain information contains all the potential segmentation states of a sentence, which provides a channel for the model to infer deep word-level characteristics. We propose a trilogy (i.e., candidate position embedding -> position selective attention -> adaptive word convolution) to encode uncertain word segmentation information and acquire an appropriate word-level representation. Experimental results on the social media corpus show that our model effectively alleviates the cascading of segmentation errors and achieves a significant performance improvement of more than 2% over previous state-of-the-art methods.

Comment: SocialNLP@ACL202
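To make the idea of "all the potential segmentation states of a sentence" more concrete, the sketch below collects, for each character, every word position it could occupy under any dictionary match. This is only one simple way to picture candidate positions and is not the UIcwsNN procedure: the lexicon-matching strategy, the B/M/E/S labels, the function name `candidate_positions`, and the toy lexicon are all assumptions made for illustration.

```python
def candidate_positions(sentence, lexicon, max_len=4):
    """Return, per character, the set of positions it may take.

    Assumed labels: B = word begin, M = word middle,
    E = word end, S = single-character word.
    """
    n = len(sentence)
    cands = [set() for _ in range(n)]
    for i in range(n):
        cands[i].add("S")  # any character may stand alone
        # Try every multi-character span starting at i, up to max_len.
        for j in range(i + 2, min(n, i + max_len) + 1):
            if sentence[i:j] in lexicon:
                cands[i].add("B")
                cands[j - 1].add("E")
                for k in range(i + 1, j - 1):
                    cands[k].add("M")
    return cands


# Toy lexicon (hypothetical) for a classically ambiguous sentence.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
sentence = "南京市长江大桥"
cands = candidate_positions(sentence, lexicon)

for ch, positions in zip(sentence, cands):
    print(ch, sorted(positions))
```

In this toy case the character 市 can end 南京市, begin 市长, or stand alone, so it keeps three candidate positions. A model in the spirit of the abstract would then have to weigh such competing candidates, for example with attention, rather than commit to a single segmentation up front.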

arXiv.org e-Print Archive