Improving Cross-Domain Chinese Word Segmentation with Word Embeddings
Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite
recent progress in neural-based CWS. The limited amount of annotated data in
the target domain has been the key obstacle to a satisfactory performance. In
this paper, we propose a semi-supervised word-based approach to improving
cross-domain CWS given a baseline segmenter. Particularly, our model only
deploys word embeddings trained on raw text in the target domain, discarding
complex hand-crafted features and domain-specific dictionaries. Innovative
subsampling and negative sampling methods are proposed to derive word
embeddings optimized for CWS. We conduct experiments on five datasets in
special domains, covering novels, medicine, and patents. Results show
that our model substantially improves cross-domain CWS, especially in the
segmentation of domain-specific noun entities. The word F-measure increases by
over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and
unsupervised cross-domain CWS approaches by a large margin. We make our code
and data available on GitHub.
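The subsampling idea this abstract builds on can be illustrated with the standard word2vec frequent-word subsampling rule; the paper's CWS-specific variants are not reproduced here, and the threshold value below is an illustrative assumption:

```python
import math
import random

def keep_probability(word_freq: float, threshold: float = 1e-4) -> float:
    """Standard word2vec subsampling: probability of KEEPING a token
    whose relative corpus frequency is word_freq. Frequent words are
    aggressively downsampled so rare (often domain-specific) words
    dominate the embedding training signal."""
    if word_freq <= 0:
        return 1.0
    return min(1.0, math.sqrt(threshold / word_freq))

def subsample(tokens, freqs, threshold=1e-4, rng=random.Random(0)):
    """Stochastically drop frequent tokens before embedding training."""
    return [t for t in tokens
            if rng.random() < keep_probability(freqs.get(t, 0.0), threshold)]
```

A word at or below the threshold frequency is always kept, while a word taking up 1% of the corpus is kept only about 10% of the time.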
RSpell: Retrieval-augmented Framework for Domain Adaptive Chinese Spelling Check
Chinese Spelling Check (CSC) refers to the detection and correction of
spelling errors in Chinese texts. In practical application scenarios, it is
important for CSC models to be able to correct errors across different
domains. In this paper, we propose a retrieval-augmented spelling
check framework called RSpell, which searches corresponding domain terms and
incorporates them into CSC models. Specifically, we employ pinyin fuzzy
matching to search for terms, which are combined with the input and fed into
the CSC model. Then, we introduce an adaptive process control mechanism to
dynamically adjust the impact of external knowledge on the model. Additionally,
we develop an iterative strategy for the RSpell framework to enhance reasoning
capabilities. We conducted experiments on CSC datasets in three domains: law,
medicine, and official document writing. The results show that RSpell
achieves state-of-the-art performance in both zero-shot and fine-tuning
settings, demonstrating the effectiveness of the retrieval-augmented CSC
framework. Our code is available at https://github.com/47777777/Rspell.
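The pinyin fuzzy matching step described above can be sketched with stdlib string similarity; the term dictionary, its pinyin transcriptions, and the cutoff are hypothetical stand-ins (a real system would derive pinyin with a library such as pypinyin):

```python
import difflib

# Hypothetical domain-term dictionary mapping terms to space-separated pinyin.
TERM_PINYIN = {
    "心肌梗死": "xin ji geng si",
    "心肌梗塞": "xin ji geng se",
    "行政复议": "xing zheng fu yi",
}

def fuzzy_retrieve(query_pinyin: str, k: int = 2, cutoff: float = 0.6):
    """Return up to k domain terms whose pinyin is most similar to the
    query pinyin, ranked by difflib's similarity ratio."""
    scored = [(difflib.SequenceMatcher(None, query_pinyin, p).ratio(), term)
              for term, p in TERM_PINYIN.items()]
    scored = [(s, t) for s, t in scored if s >= cutoff]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]
```

The retrieved terms would then be concatenated with the input sentence and fed to the CSC model, with the framework's control mechanism deciding how strongly they influence correction.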
Integrating Dictionary and Web N-grams for Chinese Spell Checking
Chinese spell checking is an important component of many NLP applications, including word processors, search engines, and automatic essay rating. Nevertheless, compared with spell checkers for alphabetic languages (e.g., English or French), Chinese spell checkers are more difficult to develop because there are no word boundaries in the Chinese writing system and errors may be introduced by various Chinese input methods. In this paper, we propose a novel method for detecting and correcting Chinese typographical errors. Our approach combines word segmentation, detection rules, and phrase-based machine translation. The error detection module detects errors by segmenting words and checking word and phrase frequencies against compiled and Web corpora. The phonological or morphological typographical errors found are then corrected by running a decoder based on a statistical machine translation (SMT) model. The results show that the proposed system achieves significantly better accuracy in error detection and more satisfactory performance in error correction than state-of-the-art systems.
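The frequency-based detection rule in this abstract can be sketched as follows; the corpus counts and the threshold are illustrative assumptions, and segmentation is reduced to a pre-tokenized input:

```python
# Illustrative unigram counts standing in for compiled and Web corpora.
UNIGRAM_COUNTS = {
    "我们": 9500,
    "提出": 4200,
    "方法": 8800,
    "错别字": 350,
}

def detect_suspects(segmented_words, min_count=10):
    """Flag words whose corpus frequency falls below min_count:
    rare or unseen segments are likely typographical errors and
    would be passed on to the SMT-based correction decoder."""
    return [w for w in segmented_words
            if UNIGRAM_COUNTS.get(w, 0) < min_count]
```

In the full system, phrase frequencies and hand-written detection rules refine this check before the SMT decoder proposes phonologically or morphologically similar corrections.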