
    Applied morphological processing of English

    We describe two newly developed computational tools for morphological processing: a program for analysis of English inflectional morphology, and a morphological generator automatically derived from the analyser. The tools are fast, being based on finite-state techniques; have wide coverage, incorporating data from various corpora and machine-readable dictionaries; and are robust, in that they deal effectively with unknown words. The tools are freely available. We evaluate the accuracy and speed of both tools and discuss a number of practical applications in which they have been put to use.

    Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction

    Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite state recognizer. Such recognition has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications. In the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of the agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology is fully captured by a single (and possibly very large) finite state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate correct candidate forms from a given misspelled string within a certain edit distance. Again, it can be applied to any language with a word list comprising all inflected forms, or whose morphology is fully described by a finite state transducer. We present experimental results for spelling correction for a number of languages. These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages such as English, Dutch, French, German, Italian (and others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a SparcStation 10/41. For spelling correction in Turkish, error-tolerant
    Comment: Replaces 9504031. Gzipped, uuencoded PostScript file. To appear in Computational Linguistics, Volume 22, No. 1, 1996. Also available as ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.
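
The candidate-generation idea in this abstract (enumerating lexicon entries within a given edit distance of a misspelled string) can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a plain word list stored as a trie rather than a finite state transducer, and it uses plain Levenshtein distance, so a transposition counts as two edits. The names `build_trie` and `candidates` are hypothetical.

```python
def build_trie(words):
    """Build a nested-dict trie; the reserved key "$" marks the end of a word.

    Assumption: no word in the list contains the character "$".
    """
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def candidates(trie, query, max_dist=1):
    """Return {word: distance} for every trie word within max_dist edits of query."""
    found = {}
    # Edit distances between the empty prefix and each prefix of the query.
    first_row = list(range(len(query) + 1))

    def walk(node, prefix, prev_row):
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the dynamic-programming table by one row for this trie character.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,           # skip a query character
                               prev_row[i] + 1,          # skip a trie character
                               prev_row[i - 1] + cost))  # match or substitute
            word = prefix + ch
            if "$" in child and row[-1] <= max_dist:
                found[word] = row[-1]
            # Prune: if every cell exceeds the threshold, no extension can recover.
            if min(row) <= max_dist:
                walk(child, word, row)

    walk(trie, "", first_row)
    return found


lexicon = build_trie(["walk", "walks", "walked", "talk", "wall"])
print(candidates(lexicon, "walx", max_dist=1))
```

The pruning step is what keeps the search cheap on large word lists: whole subtrees of the lexicon are abandoned as soon as the threshold is exceeded.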

    Processing of regular and irregular past tense morphology in highly proficient second language learners of English: a self-paced reading study

    Dual-system models suggest that English past tense morphology involves two processing routes: rule application for regular verbs and memory retrieval for irregular verbs (Pinker, 1999). In second language (L2) processing research, Ullman (2001a) suggested that both verb types are retrieved from memory, but more recently Clahsen and Felser (2006) and Ullman (2004) argued that past tense rule application can be automatised with experience by L2 learners. To address this controversy, we tested highly proficient Greek-English learners with naturalistic or classroom L2 exposure, compared to native English speakers, in a self-paced reading task involving past tense forms embedded in plausible sentences. Our results suggest that, irrespective of the type of exposure, proficient L2 learners with extended L2 exposure apply rule-based processing.

    Morphological Analysis as Classification: an Inductive-Learning Approach

    Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast, we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages over the traditional approach of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent.
    Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP proceedings style nemlap.sty; inputs ipamacs (international phonetic alphabet) and epsf macro
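
One way to picture segmentation recast as classification is to turn each character position into a fixed-width window with a boundary label, which any classifier can then learn from. The sketch below is an assumption about how such instances might be built, not the paper's actual encoding; the function name `windowed_instances`, the window width, and the padding symbol are all hypothetical.

```python
def windowed_instances(segments, width=2, pad="_"):
    """Turn a segmented word into (window, label) pairs, one per character.

    The window holds the focus character plus `width` characters of context
    on each side; the label is 1 if a new morpheme starts at the focus
    character (excluding the first character of the word), else 0.
    """
    word = "".join(segments)
    # Collect the character offsets at which a non-initial morpheme begins.
    boundaries = set()
    offset = 0
    for segment in segments[:-1]:
        offset += len(segment)
        boundaries.add(offset)
    padded = pad * width + word + pad * width
    instances = []
    for i in range(len(word)):
        window = padded[i:i + 2 * width + 1]
        instances.append((window, 1 if i in boundaries else 0))
    return instances


for window, label in windowed_instances(["walk", "ed"]):
    print(window, label)
```

A memory-based learner would store such instances and label a new window by its nearest stored neighbours, so no morpheme lexicon or hand-written rules are needed.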

    Developmental changes in the role of different metalinguistic awareness skills in Chinese reading acquisition from preschool to third grade

    Copyright © 2014 Wei et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
    The present study investigated the relationship between Chinese reading skills and metalinguistic awareness skills such as phonological, morphological, and orthographic awareness for 101 Preschool, 94 Grade-1, 98 Grade-2, and 98 Grade-3 children from two primary schools in Mainland China. The aim of the study was to examine how each of these metalinguistic awareness skills would exert its influence on the success of reading in Chinese with age. The results showed that all three metalinguistic awareness skills significantly predicted reading success. They further revealed that orthographic awareness played a dominant role in the early stages of reading acquisition, and its influence decreased with age, while the opposite was true for the contribution of morphological awareness. The results were in stark contrast with studies in English, where phonological awareness is typically shown as the single most potent metalinguistic awareness factor in literacy acquisition. In order to account for the current data, a three-stage model of reading acquisition in Chinese is discussed.
    National Natural Science Foundation of China and Knowledge Innovation Program of the Chinese Academy of Sciences

    Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

    In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even using normalization on large texts, the curse of dimensionality can disturb the performance of summarizers. This paper describes a new method for normalization of words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of summaries produced by this representation, but can often dramatically improve the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
    Comment: 22 pages, 12 figures, 9 table
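
Reducing each word to its initial letters is simple enough to sketch directly. The version below is an illustration only: the function name `ultra_stem`, the prefix-length parameter `n`, whitespace tokenization, and lowercasing are all assumptions rather than details taken from the paper.

```python
def ultra_stem(text, n=1):
    """Lowercase the text, split on whitespace, and keep the first n letters of each token."""
    return [token[:n] for token in text.lower().split()]


sentence = "Stemming and lemmatization normalize words"
print(ultra_stem(sentence, n=1))  # one-letter representation
print(ultra_stem(sentence, n=3))  # three-letter representation

# The vocabulary can only shrink or stay the same size as n decreases.
vocab_full = set(sentence.lower().split())
vocab_stemmed = set(ultra_stem(sentence, n=1))
print(len(vocab_full), len(vocab_stemmed))
```

With `n=1` the vocabulary can never exceed the size of the alphabet, which is the kind of reduction in representation space the abstract describes.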

    Building Morphological Chains for Agglutinative Languages

    In this paper, we build morphological chains for agglutinative languages by using a log-linear model for the morphological segmentation task. The model is based on the unsupervised morphological segmentation system called MorphoChains. We extend the MorphoChains log-linear model by expanding the candidate space recursively to cover more split points for agglutinative languages such as Turkish, whereas in the original model candidates are generated by considering only binary segmentation of each word. The results show that we improve the state-of-the-art Turkish scores by 12%, reaching an F-measure of 72%, and we improve the English scores by 3%, reaching an F-measure of 74%. Overall, the system outperforms both MorphoChains and other well-known unsupervised morphological segmentation systems. The results indicate that candidate generation plays an important role in such an unsupervised log-linear model that is learned using contrastive estimation with negative samples.
    Comment: 10 pages, accepted and presented at the CICLing 2017 (18th International Conference on Intelligent Text Processing and Computational Linguistics)
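
The difference between binary and recursive candidate generation can be seen by counting the segmentations each produces. This sketch only illustrates the size of the two candidate spaces; it is not the MorphoChains candidate generator, and the function names `binary_splits` and `recursive_splits` and the `min_len` parameter are hypothetical.

```python
def binary_splits(word):
    """All ways to cut the word once into a (prefix, suffix) pair."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def recursive_splits(word, min_len=1):
    """All segmentations into two or more pieces of at least min_len characters.

    Each binary split is kept, and the right-hand piece is then split again
    recursively, so words with several suffixes get multi-way candidates.
    """
    results = []
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        results.append((head, tail))
        for rest in recursive_splits(tail, min_len):
            results.append((head,) + rest)
    return results


word = "evlerde"  # Turkish: ev + ler + de, "in the houses"
print(len(binary_splits(word)), len(recursive_splits(word)))
print(("ev", "ler", "de") in recursive_splits(word))
```

A binary generator can never propose the three-way analysis in a single step, while the recursive one does, at the cost of a candidate space that grows exponentially with word length; raising `min_len` is one simple way to limit it.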