2,563 research outputs found

    Is the LAN effect in morphosyntactic processing an ERP artifact?

    Available online 4 February 2019. The left anterior negativity (LAN) is an ERP component that has often been associated with morphosyntactic processing, but recent reports have questioned whether the LAN effect in fact exists. The present project examined whether the LAN effect, observed in the grand average response to local agreement violations, is the result of overlap between two different ERP effects (N400, P600) at the level of subjects (n = 80), items (n = 120), or trials (n = 6160). By-subject, by-item, and by-trial analyses of the ERP effect between 300 and 500 ms showed a LAN for 55% of the participants, 46% of the items, and 49% of the trials. Many examples of the biphasic LAN-P600 response were observed. Mixed linear models showed that the LAN effect size was not reduced after accounting for subject variability. The present results suggest that there are cases where the grand average LAN effect represents the brain responses of individual participants, items, and trials. This work was supported by the Spanish Ministry [PSI2014-54500-P; IJCI-2016-27702; PSI2017-82941-P]; the Basque Government [PI_2015_1_25]; and the Severo Ochoa programme [SEV-2015-0490].
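The by-subject analysis this abstract describes can be illustrated with a minimal sketch: simulate per-trial amplitude differences (violation minus control) in the 300-500 ms window, average within each subject, and count how many subjects show a negativity. All numbers here are simulated for illustration, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 80, 77  # roughly matching the study's 80 subjects, ~6160 trials
# Hypothetical single-channel mean amplitude differences (µV) in the 300-500 ms
# window; negative values indicate a LAN-like effect.
subject_effects = rng.normal(loc=-0.2, scale=1.0, size=n_subjects)
trial_noise = rng.normal(scale=2.0, size=(n_subjects, n_trials))
diff_amplitudes = subject_effects[:, None] + trial_noise

# By-subject analysis: average the difference over trials for each subject,
# then count the fraction of subjects showing a negativity (a LAN).
by_subject = diff_amplitudes.mean(axis=1)
lan_fraction = (by_subject < 0).mean()
print(f"{lan_fraction:.0%} of simulated subjects show a negative (LAN-like) effect")
```

The same averaging applied along the other axis gives the by-item analysis; the study's mixed linear models additionally let subject variability enter as a random effect rather than being averaged away.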

    Diacritic Restoration and the Development of a Part-of-Speech Tagset for the Māori Language

    This thesis investigates two fundamental problems in natural language processing: diacritic restoration and part-of-speech tagging. Over the past three decades, statistical approaches to diacritic restoration and part-of-speech tagging have grown in interest as a consequence of the increasing availability of manually annotated training data in major languages such as English and French. However, these approaches are not practical for most minority languages, where appropriate training data is either non-existent or not publicly available. Furthermore, before developing a part-of-speech tagging system, a suitable tagset is required for that language. In this thesis, we make the following contributions to bridge this gap. Firstly, we propose a method for diacritic restoration based on naive Bayes classifiers that operate at the word level. Classifications are based on a rich set of features, extracted automatically from training data in the form of diacritically marked text. This method requires no additional resources, which makes it language independent. The algorithm was evaluated on one language, namely Māori, and an accuracy exceeding 99% was observed. Secondly, we present our work on creating one of the necessary resources for the development of a part-of-speech tagging system in Māori, that of a suitable tagset. The tagset described was developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora, and was the result of in-depth analysis of Māori grammar.
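The core idea of word-level naive Bayes diacritic restoration can be sketched in a few lines: train on diacritically marked text by counting, for each stripped form, its marked variants (classes) and their neighbouring words (features), then score candidates with smoothed log-probabilities. The toy corpus and the single context-word feature below are illustrative assumptions; the thesis uses a much richer feature set.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of diacritically marked Māori text.
corpus = "te whare o te wānanga , ngā tāngata o te whare".split()

def strip_diacritics(w):
    # Macron vowels map to plain vowels when diacritics are removed.
    return w.translate(str.maketrans("āēīōū", "aeiou"))

# Training: for each stripped form, count its diacritized variants (classes)
# and the neighbouring words observed next to each variant (features).
class_counts = defaultdict(Counter)    # stripped form -> Counter of variants
feature_counts = defaultdict(Counter)  # variant -> Counter of context words
for i, w in enumerate(corpus):
    class_counts[strip_diacritics(w)][w] += 1
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            feature_counts[w][strip_diacritics(corpus[j])] += 1

def restore(stripped, context):
    """Pick the most probable diacritized variant of `stripped` via naive
    Bayes with add-one smoothing over context-word features."""
    variants = class_counts[stripped]
    if not variants:
        return stripped                # unseen word: leave unchanged
    total = sum(variants.values())
    vocab = len(feature_counts) + 1    # crude smoothing denominator
    def score(v):
        s = math.log(variants[v] / total)
        n = sum(feature_counts[v].values())
        for f in context:
            s += math.log((feature_counts[v][f] + 1) / (n + vocab))
        return s
    return max(variants, key=score)

print(restore("wananga", ["te"]))      # → wānanga
print(restore("nga", ["tangata"]))     # → ngā
```

In a realistic setting the stripped form would often have several attested variants, and the feature score, rather than raw frequency alone, decides between them.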

    Functional differentiation and grammatical competition in the English Jespersen Cycle

    Wallage argues for a model of the Middle English Jespersen Cycle in which its diachronic stages are functionally equivalent competitors in the sense proposed by Kroch. However, recent work on the Jespersen Cycle in various Romance languages by Schwenter, Hansen and Hansen & Visconti has argued that the forms in competition during the Jespersen Cycle are not simply diachronic stages, but perform different pragmatic or discourse functions. Hansen and Hansen & Visconti suggest that functional change may therefore underpin the Jespersen Cycle in these languages. Hence this paper explores the interface between pragmatic or functional change and change in the syntax of sentential negation. Analysis of data from the PPCME2 (Kroch & Taylor) shows that ne (stage one) and ne...not (stage two) are similarly functionally differentiated during the ME Jespersen Cycle: ne...not is favoured in propositions that are discourse-old (given, or recoverable from the preceding discourse), whereas ne is favoured in propositions that are discourse-new. Frequency data appear to show the loss of these constraints over time. However, I argue that these frequency data are not conclusive evidence for a shift in the functions of ne or ne...not. Indeed, the results of a regression analysis indicate that these discourse constraints remain constant throughout Middle English, in spite of the overall spread of ne...not as the Jespersen Cycle progresses. Therefore, I conclude that the spread of ne...not is independent of these particular discourse constraints on its use, rather than the result of changes in, or loss of, these constraints.

    In search of grammaticalization in synchronic dialect data: General extenders in north-east England

    In this paper, we draw on a socially stratified corpus of dialect data collected in north-east England to test recent proposals that grammaticalization processes are implicated in the synchronic variability of general extenders (GEs), i.e., phrase- or clause-final constructions such as and that and or something. Combining theoretical insights from the framework of grammaticalization with the empirical methods of variationist sociolinguistics, we operationalize key diagnostics of grammaticalization (syntagmatic length, decategorialization, semantic-pragmatic change) as independent factor groups in the quantitative analysis of GE variability. While multivariate analyses reveal rapid changes in apparent time to the social conditioning of some GE variants in our data, they do not reveal any evidence of systematic changes in the linguistic conditioning of variants in apparent time that would confirm an interpretation of ongoing grammaticalization. These results lead us to question the extent to which grammaticalization processes are implicated in the synchronic variability of GEs in these data.

    Statistical parsing of morphologically rich languages (SPMRL): what, how and whither

    The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at the word level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosts a variety of contributions which show that, despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state of affairs with respect to parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests itself as a source of directions for future investigations.

    Adjectivization in Russian: Analyzing participles by means of lexical frequency and constraint grammar

    This dissertation explores the factors that restrict and facilitate adjectivization in Russian, an affixless part-of-speech change leading to ambiguity between participles and adjectives. I develop a theoretical framework based on major approaches to adjectivization, and assess the effect of the factors on ambiguity in the empirical data. I build a linguistic model using the Constraint Grammar formalism. The model utilizes the factors of adjectivization and corpus frequencies as formal constraints for differentiating between participles and adjectives in a disambiguation task. The main question that is explored in this dissertation is which linguistic factors allow for the differentiation between adjectivized and unambiguous participles. Another question concerns which factors, syntactic or morphological, predict ambiguity in the corpus data and resolve it in the disambiguation model. In the theoretical framework, the syntactic context signals whether a participle is adjectivized, whereas internal morphosemantic properties (that is, tense, voice, and lexical meaning) cause or prevent adjectivization. The exploratory analysis of these factors in the corpus data reveals diverse results. The syntactic factor, the adverb of measure and degree očenʹ ‘very’, which is normally used with adjectives, also combines with participles, and is strongly associated with semantic classes of their base verbs. Nonetheless, the use of očenʹ with a participle only indicates ambiguity when other syntactic factors of adjectivization are in place. The lexical frequency (including the ranks of base verbs and the ratios of participles to other verbal forms) and several morphological types of participles strongly predict ambiguity. Furthermore, past passive and transitive perfective participles not only have the highest mean ratios among the other morphological types of participles, but are also strong predictors of ambiguity. 
The linguistic model using weighted syntactic rules shows the highest accuracy in disambiguation compared to the models with weighted morphological rules or the model based on weights only. All of the syntactic, morphological, and weighted rules combined show the best performance results. Weights are the most effective for removing residual ambiguity (similar to the statistical baseline model), but are outperformed by the models that use factors of adjectivization as constraints.
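The division of labour this abstract describes, with categorical syntactic rules firing first and corpus-derived weights resolving residual ambiguity, can be sketched in miniature. The rule, the weight values, and the examples below are invented for illustration; a real Constraint Grammar model would contain hundreds of context conditions.

```python
# Each ambiguous token carries two candidate readings: participle (PTCP) or
# adjectivized form (ADJ). Syntactic rules apply first; weights break ties.
RULES = [
    # (condition on the left neighbour, reading selected when it holds)
    (lambda left: left == "очень", "ADJ"),   # degree adverb očenʹ favours ADJ
]
WEIGHTS = {"PTCP": 0.6, "ADJ": 0.4}          # hypothetical corpus-derived weights

def disambiguate(tokens, i):
    """Return (reading, source) for the ambiguous token at index i."""
    left = tokens[i - 1] if i > 0 else None
    for condition, reading in RULES:
        if condition(left):
            return reading, "rule"
    # No syntactic rule fired: fall back to the heavier-weighted reading.
    return max(WEIGHTS, key=WEIGHTS.get), "weight"

print(disambiguate(["очень", "увлечённый"], 1))    # → ('ADJ', 'rule')
print(disambiguate(["увлечённый", "человек"], 0))  # → ('PTCP', 'weight')
```

The dissertation's finding maps onto this structure: models in which the syntactic rules do most of the work outperform those relying on weights alone, while weights remain useful as a last-resort tiebreaker.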

    Learning morphology with Morfette

    Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. A third module dynamically combines the predictions of the Maximum Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.
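The lemmatization-as-classification idea can be made concrete with a sketch of one common class encoding: strip the longest shared prefix of wordform and lemma, and let the class label record the suffix substitution. The English examples are invented for illustration and this is only one possible encoding, not necessarily Morfette's exact scheme.

```python
def edit_class(form, lemma):
    """Derive a class label encoding the form-to-lemma mapping as a
    (suffix to drop, suffix to add) pair after the shared prefix."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (form[i:], lemma[i:])

def apply_class(form, cls):
    """Apply a learned class label to a (possibly unseen) wordform."""
    drop, add = cls
    if drop and not form.endswith(drop):
        return form                      # class not applicable to this form
    return form[: len(form) - len(drop)] + add

cls = edit_class("walked", "walk")       # class: drop 'ed', add nothing
print(apply_class("talked", cls))        # → talk
cls = edit_class("flies", "fly")         # class: drop 'ies', add 'y'
print(apply_class("tries", cls))         # → try
```

Because many wordforms share the same suffix substitution, the classifier only has to choose among a modest set of such labels, and the learned classes generalize to wordforms never seen in training.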

    Discovering words and rules from speech input: an investigation into early morphosyntactic acquisition mechanisms

    To acquire language proficiently, learners have to segment fluent speech into units – that is, words – and to discover the structural regularities underlying word structure. Yet these problems are not independent: to varying degrees, all natural languages express syntax as relations between nonadjacent word subparts. This thesis explores how developing infants come to solve both tasks successfully. The experimental work contained in the thesis approaches this issue from two complementary directions: investigating the computational abilities of infants, and assessing the distributional properties of the linguistic input directed to children. To study the nature of the computational mechanisms infants use to segment the speech stream into words, and to discover the structural regularities underlying words, I conducted seventeen artificial grammar studies. Across these experiments, I test the hypothesis that infants may use different mechanisms to learn words and word-internal rules. These mechanisms are supposed to be triggered by different signal properties, and possibly become available at different stages of development. One mechanism is assumed to compute the distributional properties of the speech input. The other is hypothesized to be non-statistical in nature, projecting structural regularities without relying on the distributional properties of the speech input. Infants at different ages (namely 7, 12, and 18 months) are tested on their abilities to detect statistically defined patterns and to generalize structural regularities appearing inside word-like units. Results show that 18-month-old infants can both extract statistically defined sequences from a continuous stream (Experiment 12) and find word-internal rules, though only if the familiarization stream is segmented (Experiments 13 and 14).
Twelve-month-olds can also segment words from a continuous stream (Experiment 5), but they cannot detect word-straddling sequences even if these are statistically informative (Experiments 15 and 16). In contrast, they readily generalize word-internal regularities to novel instances after exposure to a segmented stream (Experiments 1-3 and 17), but not after exposure to a continuous stream (Experiment 4). Seven-month-olds, in turn, compute neither statistics (Experiments 10 and 11) nor within-word relations (Experiments 6 and 7), regardless of input properties. Overall, the results suggest that word segmentation and structural generalization rely on distinct mechanisms requiring different signal properties to be activated: the presence of segmentation cues is mandatory for the discovery of structural properties, while a continuous stream supports the extraction of statistically recurring patterns. Importantly, the two mechanisms have different developmental trajectories: generalizations become readily available from 12 months, while statistical computations remain rather limited throughout the first year. To understand how the selectivities and limits of these computational mechanisms match up with the properties of natural language, I evaluate the distributional properties of speech directed to children. These analyses assess, with quantitative and qualitative measures, whether the input children listen to offers a reliable basis for the acquisition of morphosyntactic rules. I examine Italian, a language with a rich and complex morphology, evaluating whether the word forms used in speech directed to children provide sufficient evidence of the morphosyntactic rules of this language. Results show that the speech directed to children is highly systematic and consistent.
The most frequently used word forms are also morphologically well-formed words in Italian: frequency information thus correlates with structural information, such as the morphological structure of words. While a statistical analysis of the speech input may yield a small set of words occurring with high frequency, how learners come to extract structural properties from them is another problem. In accord with the results of the infant studies, I propose that structural generalizations are projected on a different basis than statistical computations. Overall, the results of both the artificial grammar studies and the corpus analysis are compatible with the hypothesis that the tasks of segmenting words from fluent speech and of learning the structural regularities underlying word structure rely on statistical and non-statistical cues, respectively, and are carried out by computational mechanisms of different natures and selectivities in early development.

    Morphosyntactic Linguistic Wavelets for Knowledge Management
