13,176 research outputs found

    Semi-supervised learning for natural language

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.Includes bibliographical references (p. 75-82).Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available "for free" in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using the Perception algorithm. We also compare Markov and semi-Markov models on the two segmentation tasks. Our results show that features derived from unlabeled data substantially improves performance, both in terms of reducing the amount of labeled data needed to achieve a certain performance level and in terms of reducing the error using a fixed amount of labeled data. We find that sometimes semi-Markov models can also improve performance over Markov models.by Percy Liang.M.Eng

    Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition

    Get PDF
    Online handwritten Chinese text recognition (OHCTR) is a challenging problem as it involves a large-scale character set, ambiguous segmentation, and variable-length input sequences. In this paper, we exploit the outstanding capability of path signature to translate online pen-tip trajectories into informative signature feature maps using a sliding window-based method, successfully capturing the analytic and geometric properties of pen strokes with strong local invariance and robustness. A multi-spatial-context fully convolutional recurrent network (MCFCRN) is proposed to exploit the multiple spatial contexts from the signature feature maps and generate a prediction sequence while completely avoiding the difficult segmentation problem. Furthermore, an implicit language model is developed to make predictions based on semantic context within a predicting feature sequence, providing a new perspective for incorporating lexicon constraints and prior knowledge about a certain language in the recognition procedure. Experiments on two standard benchmarks, Dataset-CASIA and Dataset-ICDAR, yielded outstanding results, with correct rates of 97.10% and 97.15%, respectively, which are significantly better than the best result reported thus far in the literature.Comment: 14 pages, 9 figure
    • …
    corecore