3,142 research outputs found
Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French
This paper shows that training a lexicalized parser on a lemmatized morphologically-rich treebank such as the French Treebank slightly improves parsing results. We also show that lemmatizing a similarly sized subset of the English Penn Treebank has almost no effect on parsing performance with gold lemmas and leads to a small drop in performance when automatically assigned lemmas and POS tags are used. This highlights two facts: (i) lemmatization helps to reduce lexicon data-sparseness issues for French, and (ii) it also makes the parsing process sensitive to the correct assignment of POS tags to unknown words
Distributional effects and individual differences in L2 morphology learning
Second language (L2) learning outcomes may depend on the structure of the input and learners' cognitive abilities. This study tested whether less predictable input might facilitate learning and generalization of L2 morphology while evaluating contributions of statistical learning ability, nonverbal intelligence, phonological short-term memory, and verbal working memory. Over three sessions, 54 adults were exposed to a Russian case-marking paradigm with a balanced or skewed item distribution in the input. Whereas statistical learning ability and nonverbal intelligence predicted learning of trained items, only nonverbal intelligence also predicted generalization of case-marking inflections to new vocabulary. Neither measure of temporary storage capacity predicted learning. Balanced, less predictable input was associated with higher accuracy in generalization but only in the initial test session. These results suggest that individual differences in pattern extraction play a more sustained role in L2 acquisition than instructional manipulations that vary the predictability of lexical items in the input
The acquisition of a complex morphological paradigm by L1 and L2 children
The performances of 690 L1 and L2 submersion children of grades 4 to 6 on a test of past tense (passé simple) production in French are compared with the aim of assessing how the two groups of children cope with learning a morphological form belonging to a complex paradigm. Homophony with other verbal forms of the paradigm (syncretisms) appears to play a role in the children's answers. L2 submersion children have significantly lower scores than L1 children and they differ from L1 children in tending to overapply the regular ending. They also seem to be more attentive to agreement and to the visual form of the words than L1 children
Compositional Morphology for Word Representations and Language Modelling
This paper presents a scalable method for integrating compositional
morphological representations into a vector-based probabilistic language model.
Our approach is evaluated in the context of log-bilinear language models,
rendered suitably efficient for implementation inside a machine translation
decoder by factoring the vocabulary. We perform both intrinsic and extrinsic
evaluations, presenting results on a range of languages which demonstrate that
our model learns morphological representations that both perform well on word
similarity tasks and lead to substantial reductions in perplexity. When used
for translation into morphologically rich languages with large vocabularies,
our models obtain improvements of up to 1.2 BLEU points relative to a baseline
system using back-off n-gram models. Comment: Proceedings of the 31st International Conference on Machine Learning
(ICML
In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology
This paper investigates the ability of neural network architectures to
effectively learn diachronic phonological generalizations in a multilingual
setting. We employ models using three different types of language embedding
(dense, sigmoid, and straight-through). We find that the Straight-Through model
outperforms the other two in terms of accuracy, but the Sigmoid model's
language embeddings show the strongest agreement with the traditional
subgrouping of the Slavic languages. We find that the Straight-Through model
has learned coherent, semi-interpretable information about sound change, and
outline directions for future research
Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English
Word frequency is the most important variable in research on word processing and memory. Yet, the main criterion for selecting word frequency norms has been the availability of the measure, rather than its quality. As a result, much research is still based on the old Kucera and Francis frequency norms. By using the lexical decision times of recently published megastudies, we show how bad this measure is and what must be done to improve it. In particular, we investigated the size of the corpus, the language register on which the corpus is based, and the definition of the frequency measure. We observed that corpus size is of practical importance for small sizes (depending on the frequency of the word), but not for sizes above 16-30 million words. As for the language register, we found that frequencies based on television and film subtitles are better than frequencies based on written sources, certainly for the monosyllabic and bisyllabic words used in psycholinguistic research. Finally, we found that lemma frequencies are not superior to word form frequencies in English and that a measure of contextual diversity is better than a measure based on raw frequency of occurrence. Part of the superiority of the latter is due to the words that are frequently used as names. Assembling a new frequency norm on the basis of these considerations turned out to predict word processing times much better than did the existing norms (including Kucera & Francis and Celex). The new SUBTL frequency norms from the SUBTLEXUS corpus are freely available for research purposes from http://brm.psychonomic-journals.org/content/supplemental, as well as from the University of Ghent and Lexique Web sites