
    Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization

    Existing methods for Chinese word segmentation (CWS) usually rely on a large number of labeled sentences to train segmentation models, and these sentences are expensive and time-consuming to annotate. Fortunately, unlabeled data is usually easy to collect and many high-quality Chinese lexicons are available off the shelf, both of which can provide useful information for CWS. In this paper, we propose a neural approach for CWS which can exploit both lexicon and unlabeled data. Our approach is based on a variant of the posterior regularization algorithm, and the unlabeled data and lexicon are incorporated into model training as indirect supervision by regularizing the prediction space of CWS models. Extensive experiments on multiple benchmark datasets in both in-domain and cross-domain scenarios validate the effectiveness of our approach. Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19).

    Neural Networks Incorporating Dictionaries for Chinese Word Segmentation

    In recent years, deep neural networks have achieved significant success in Chinese word segmentation and many other natural language processing tasks. Most of these algorithms are end-to-end trainable systems and can effectively process and learn from large-scale labeled datasets. However, these methods typically lack the capability to handle rare words and data whose domains differ from that of the training data. Previous statistical methods have demonstrated that human knowledge can provide valuable information for handling rare cases and domain shift. In this paper, we seek to address the problem of incorporating dictionaries into neural networks for the Chinese word segmentation task. We propose two different methods that extend the bi-directional long short-term memory neural network to perform the task. To evaluate the proposed methods, we compare them with state-of-the-art supervised methods and domain adaptation approaches on nine datasets from different domains. The experimental results demonstrate that the proposed methods can achieve better performance than other state-of-the-art neural network methods and domain adaptation approaches in most cases.

    A reception study of machine translated subtitles for MOOCs

    As MOOCs (Massive Open Online Courses) grow rapidly around the world, the language barrier is becoming a serious issue. Removing this obstacle by creating translated subtitles is an indispensable part of developing MOOCs and improving accessibility. Given the large quantity of MOOCs available worldwide and the considerable demand for them, machine translation (MT) appears to offer an alternative or complementary translation solution, thus providing the motivation for this research. The main goal of this research is to test the impact machine translated subtitles have on Chinese viewers’ reception of MOOC content. More specifically, the author is interested in whether there is any difference between viewers’ reception of raw machine translated subtitles as opposed to fully post-edited machine translated subtitles and human translated subtitles. Reception is operationalized by adapting Gambier's (2007) model, which divides ‘reception’ into ‘the three Rs’: (i) response, (ii) reaction and (iii) repercussion. Response refers to the initial physical response of a viewer to an audio-visual stimulus, in this case the subtitle and the rest of the image. Reaction involves the cognitive follow-on from initial response, and is linked to how much effort is involved in processing the subtitling stimulus and what is understood by the viewer. Repercussion refers to attitudinal and sociocultural dimensions of AVT consumption. The research contains a pilot study and a main experiment. Mixed methods of eye-tracking, questionnaires, translation quality assessment and frequency analysis were adopted. Over 60 native Chinese speakers were recruited as participants for this research. They were divided into three groups, those who read subtitles created by raw MT, post-edited MT (PE) and human translation (HT). Results show that most participants had a positive attitude towards the subtitles regardless of their type. Participants who were offered PE subtitles scored the best overall on the selected reception metrics. Participants who were offered HT subtitles performed the worst in some of the selected reception metrics.