
    Can MDL Improve Unsupervised Chinese Word Segmentation?

    It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower Description Length does not necessarily correspond to better segmentation results. Finally, we show that we can use very basic linguistic knowledge to coerce the MDL towards a linguistically plausible hypothesis and obtain better results than any previously proposed method for unsupervised Chinese word segmentation, with minimal human effort.
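    The two-part MDL criterion the abstract refers to can be sketched as follows. This is a generic illustration of Description Length for a segmentation (lexicon cost plus corpus cost under a unigram code), not the paper's exact coding scheme:

```python
import math
from collections import Counter

def description_length(segmented_corpus):
    """Two-part MDL score for a word segmentation (smaller is better).

    Part 1: cost of the lexicon, spelled out character by character.
    Part 2: cost of the corpus, coded as word tokens under the
            empirical unigram distribution of the segmentation.
    """
    tokens = [w for sent in segmented_corpus for w in sent]
    counts = Counter(tokens)
    n = len(tokens)

    # Part 2: -sum log2 P(word) over all word tokens.
    corpus_cost = -sum(c * math.log2(c / n) for c in counts.values())

    # Part 1: each lexicon entry coded over the character alphabet,
    # plus one separator symbol per entry.
    alphabet = {ch for w in counts for ch in w}
    char_cost = math.log2(len(alphabet) + 1)
    lexicon_cost = sum((len(w) + 1) * char_cost for w in counts)

    return lexicon_cost + corpus_cost

# A segmentation that reuses lexicon entries can shrink the total DL:
flat = [list("ABAB"), list("ABAB")]      # every character its own word
chunked = [["AB", "AB"], ["AB", "AB"]]   # bigram words
assert description_length(chunked) < description_length(flat)
```

    The paper's point is precisely that minimizing a score like this one does not by itself guarantee a linguistically good segmentation.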

    Dual Long Short-Term Memory Networks for Sub-Character Representation Learning

    Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). But many non-Latin languages have hieroglyphic writing systems, involving a large alphabet with thousands or millions of characters. Each character is composed of even smaller parts, which are often ignored by previous work. In this paper, we propose a novel architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to learn sub-character level representations and capture deeper levels of semantic meaning. To build a concrete study and substantiate the efficiency of our neural architecture, we take Chinese Word Segmentation as a case study. Among those languages, Chinese is a typical case, in which every character contains several components called radicals. Our networks employ a shared radical-level embedding to solve both Simplified and Traditional Chinese Word Segmentation, without extra Traditional-to-Simplified Chinese conversion, in such a highly end-to-end way that word segmentation can be significantly simplified compared to previous work. Radical-level embeddings can also capture deeper semantic meaning below the character level and improve system performance. By tying radical and character embeddings together, the parameter count is reduced while semantic knowledge is shared and transferred between the two levels, boosting performance considerably. On 3 out of 4 Bakeoff 2005 datasets, our method surpassed state-of-the-art results by up to 0.4%. Our results are reproducible; source code and corpora are available on GitHub. Comment: Accepted & forthcoming at ITNG-201
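    The shared radical-level embedding idea can be illustrated with a toy sketch. The decomposition table and embedding values below are hypothetical, and mean pooling stands in for the paper's stacked LSTMs:

```python
# Toy illustration of tying parameters at the radical level.
# The decomposition table is hypothetical, not a real radical dictionary.
RADICALS = {"好": ["女", "子"], "妈": ["女", "马"], "吗": ["口", "马"]}

DIM = 4

def radical_vector(r):
    """Deterministic stand-in for a learned radical embedding."""
    return [((ord(r) * (i + 3)) % 11) / 11.0 for i in range(DIM)]

# One shared table at the radical level; no character-level table needed,
# so Simplified and Traditional characters reuse the same parameters.
radical_emb = {r: radical_vector(r)
               for parts in RADICALS.values() for r in parts}

def char_embedding(ch):
    """Compose a character vector from its radicals by mean pooling."""
    vecs = [radical_emb[r] for r in RADICALS[ch]]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# "妈" and "吗" share the radical "马", so their vectors share a component:
shared = set(RADICALS["妈"]) & set(RADICALS["吗"])
assert shared == {"马"}
assert len(char_embedding("好")) == DIM
```

    Because parameters grow with the radical inventory rather than the much larger character inventory, tying the two levels reduces the parameter count, as the abstract notes.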

    Name Strategy: Its Existence and Implications

    It is argued that colour name strategy, object name strategy, and chunking strategy in memory are all aspects of the same general phenomenon, called stereotyping, and this in turn is an example of a know-how representation. Such representations are argued to have their origin in a principle called the minimum duplication of resources. For most of the subsequent discussion, the existence of colour name strategy suffices. It is pointed out that the Berlin-Kay universal partial ordering of colours and the frequency of traffic accidents classified by colour are surprisingly similar; a detailed analysis is not carried out as the specific colours recorded are not identical. Some consequences of the existence of a name strategy for the philosophy of language and mathematics are discussed: specifically, it is argued that in accounts of truth and meaning it is necessary throughout to use real numbers as opposed to bi-valent quantities; and also that the concomitant label associated with sentences should be not one of unconditional truth, but rather several real-valued quantities associated with visual communication. The implication of real-valued truth quantities is that the Continuum Hypothesis of pure mathematics is side-stepped, because real-valued quantities occur ab initio. The existence of name strategy shows that thought/sememes and talk/phonemes can be separate, and this vindicates the assumption of thought occurring before talk used in psycho-linguistic speech production models.

    Orthographic input and phonological representations in learners of Chinese as a foreign language.

    This paper provides evidence that second language (L2) orthographic input affects the mental representations of L2 phonology in instructed beginner L2 learners. Previous research has shown that orthographic representations affect monolinguals' performance in phonological awareness tasks; in instructed L2 learners such representations could also affect pronunciation. This study looked at the phonological representations of Chinese rimes in beginner learners of Chinese as a foreign language, using a phoneme counting task and a phoneme segmentation task. Results show that learners do not count or segment the main vowel in those syllables where it is not represented in the pinyin (romanisation) orthographic representations. It appears that the pinyin orthographic input is reinterpreted according to L1 phonology-orthography correspondences, and interacts with the phonological input in shaping the phonological representations of Chinese syllables in beginner learners. This explains previous findings that learners of Chinese do not pronounce the main vowel in these syllables.

    Fast and Accurate Neural Word Segmentation for Chinese

    Neural models with minimal feature engineering have achieved competitive performance against traditional methods on the task of Chinese word segmentation. However, both the training and inference procedures of current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate these drawbacks. Our segmenter is truly end-to-end, capable of performing segmentation much faster and even more accurately than state-of-the-art neural models on Chinese benchmark datasets. Comment: To appear in ACL201
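    Greedy decoding, the source of the segmenter's speed, can be sketched as follows. The toy lexicon-based scorer below is a hypothetical stand-in for the paper's neural scorer over balanced word and character embeddings:

```python
def greedy_segment(sentence, score, max_len=4):
    """Greedy left-to-right decoding: at each position commit to the
    highest-scoring candidate word and never revisit the choice.
    `score(word)` stands in for a learned word scorer."""
    out, i = [], 0
    while i < len(sentence):
        best = max((sentence[i:i + k] for k in range(1, max_len + 1)
                    if i + k <= len(sentence)),
                   key=score)
        out.append(best)
        i += len(best)
    return out

# Toy scorer: known words score by length; unknown single chars score 0.
LEXICON = {"研究", "生命", "起源", "研究生"}
toy_score = lambda w: len(w) if w in LEXICON else (0 if len(w) == 1 else -1)

# Greedy decoding commits to "研究生" early, unlike exact search,
# which would prefer "研究 / 生命 / 起源".
assert greedy_segment("研究生命起源", toy_score) == ["研究生", "命", "起源"]
```

    Each position is decided in one pass with no beam or lattice, which is why greedy decoding is fast; the paper's contribution is making this cheap strategy also accurate.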

    The Zero Resource Speech Challenge 2017

    We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed. Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017. Okinawa, Japan
