82,928 research outputs found
Applied morphological processing of English
We describe two newly developed computational tools for morphological processing: a program for analysis of English inflectional morphology, and a morphological generator, automatically derived from the analyser. The tools are fast, being based on finite-state techniques, have wide coverage, incorporating data from various corpora and machine readable dictionaries, and are robust, in that they are able to deal effectively with unknown words. The tools are freely available. We evaluate the accuracy and speed of both tools and discuss a number of practical applications in which they have been put to use
Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction
Error-tolerant recognition enables the recognition of strings that deviate
mildly from any string in the regular set recognized by the underlying finite
state recognizer. Such recognition has applications in error-tolerant
morphological processing, spelling correction, and approximate string matching
in information retrieval. After a description of the concepts and algorithms
involved, we give examples from two applications: In the context of
morphological analysis, error-tolerant recognition allows misspelled input word
forms to be corrected, and morphologically analyzed concurrently. We present an
application of this to error-tolerant analysis of agglutinative morphology of
Turkish words. The algorithm can be applied to morphological analysis of any
language whose morphology is fully captured by a single (and possibly very
large) finite state transducer, regardless of the word formation processes and
morphographemic phenomena involved. In the context of spelling correction,
error-tolerant recognition can be used to enumerate correct candidate forms
from a given misspelled string within a certain edit distance. Again, it can be
applied to any language with a word list comprising all inflected forms, or
whose morphology is fully described by a finite state transducer. We present
experimental results for spelling correction for a number of languages. These
results indicate that such recognition works very efficiently for candidate
generation in spelling correction for many European languages such as English,
Dutch, French, German, Italian (and others) with very large word lists of root
and inflected forms (some containing well over 200,000 forms), generating all
candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a
SparcStation 10/41. For spelling correction in Turkish, error-tolerantComment: Replaces 9504031. gzipped, uuencoded postscript file. To appear in
Computational Linguistics Volume 22 No:1, 1996, Also available as
ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.
Processing of regular and irregular past tense morphology in highly proficient second language learners of English: a self-paced reading study
Dual-system models suggest that English past tense morphology involves two processing routes: rule application for regular verbs and memory retrieval for irregular verbs (Pinker, 1999). In second language (L2) processing research, Ullman (2001a) suggested that both verb types are retrieved from memory, but more recently Clahsen and Felser (2006) and Ullman (2004) argued that past tense rule application can be automatised with experience by L2 learners. To address this controversy, we tested highly proficient Greek-English learners with naturalistic or classroom L2 exposure compared to native English speakers in a self-paced reading task involving past tense forms embedded in plausible sentences. Our results suggest that, irrespective to the type of exposure, proficient L2 learners of extended L2 exposure apply rule-based processing
Morphological Analysis as Classification: an Inductive-Learning Approach
Morphological analysis is an important subtask in text-to-speech conversion,
hyphenation, and other language engineering tasks. The traditional approach to
performing morphological analysis is to combine a morpheme lexicon, sets of
(linguistic) rules, and heuristics to find a most probable analysis. In
contrast we present an inductive learning approach in which morphological
analysis is reformulated as a segmentation task. We report on a number of
experiments in which five inductive learning algorithms are applied to three
variations of the task of morphological analysis. Results show (i) that the
generalisation performance of the algorithms is good, and (ii) that the lazy
learning algorithm IB1-IG performs best on all three tasks. We conclude that
lazy learning of morphological analysis as a classification task is indeed a
viable approach; moreover, it has the strong advantages over the traditional
approach of avoiding the knowledge-acquisition bottleneck, being fast and
deterministic in learning and processing, and being language-independent.Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP
proceedings style nemlap.sty; inputs ipamacs (international phonetic
alphabet) and epsf macro
Developmental changes in the role of different metalinguistic awareness skills in Chinese reading acquisition from preschool to third grade
Copyright @ 2014 Wei et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.The present study investigated the relationship between Chinese reading skills and metalinguistic awareness skills such as phonological, morphological, and orthographic awareness for 101 Preschool, 94 Grade-1, 98 Grade-2, and 98 Grade-3 children from two primary schools in Mainland China. The aim of the study was to examine how each of these metalinguistic awareness skills would exert their influence on the success of reading in Chinese with age. The results showed that all three metalinguistic awareness skills significantly predicted reading success. It further revealed that orthographic awareness played a dominant role in the early stages of reading acquisition, and its influence decreased with age, while the opposite was true for the contribution of morphological awareness. The results were in stark contrast with studies in English, where phonological awareness is typically shown as the single most potent metalinguistic awareness factor in literacy acquisition. In order to account for the current data, a three-stage model of reading acquisition in Chinese is discussed.National Natural Science Foundation of China and Knowledge Innovation Program of the
Chinese Academy of Sciences
Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization
In Automatic Text Summarization, preprocessing is an important phase to
reduce the space of textual representation. Classically, stemming and
lemmatization have been widely used for normalizing words. However, even using
normalization on large texts, the curse of dimensionality can disturb the
performance of summarizers. This paper describes a new method for normalization
of words to further reduce the space of representation. We propose to reduce
each word to its initial letters, as a form of Ultra-stemming. The results show
that Ultra-stemming not only preserve the content of summaries produced by this
representation, but often the performances of the systems can be dramatically
improved. Summaries on trilingual corpora were evaluated automatically with
Fresa. Results confirm an increase in the performance, regardless of summarizer
system used.Comment: 22 pages, 12 figures, 9 table
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend MorphoChains log linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-art Turkish scores by 12% having a F-measure
of 72% and we improve the English scores by 3% having a F-measure of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.Comment: 10 pages, accepted and presented at the CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics
- …