604 research outputs found

    Context-Aware Prediction of Derivational Word-forms

    Derivational morphology is a fundamental and complex characteristic of language. In this paper we propose the new task of predicting the derivational form of a given base-form lemma that is appropriate for a given context. We present an encoder-decoder neural network that produces the derived form character by character, based on character-level representations of the base form and the context. We demonstrate that our model is able to generate valid context-sensitive derivations from known base forms, but is less accurate in a lexicon-agnostic setting.
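    A minimal sketch of the kind of character-level encoder-decoder the abstract describes, written in PyTorch. The layer sizes, the fixed-size context vector, and the character vocabulary size are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch: character-level seq2seq for derivation, conditioned on context.
# All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class DerivationSeq2Seq(nn.Module):
    def __init__(self, n_chars, ctx_dim=64, emb=32, hid=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)      # shared character embeddings
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.ctx_proj = nn.Linear(ctx_dim, hid)      # context vector conditions the decoder
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, n_chars)

    def forward(self, base_chars, ctx_vec, target_chars):
        # Encode the base form character by character.
        _, h = self.encoder(self.embed(base_chars))
        # Mix the context representation into the initial decoder state.
        h = h + self.ctx_proj(ctx_vec).unsqueeze(0)
        # Teacher-forced decoding of the derived form.
        dec_out, _ = self.decoder(self.embed(target_chars), h)
        return self.out(dec_out)  # per-step logits over the character vocabulary

# Toy usage: batch of 2, base forms of length 5, derived forms of length 7.
model = DerivationSeq2Seq(n_chars=60)
logits = model(torch.randint(0, 60, (2, 5)), torch.randn(2, 64),
               torch.randint(0, 60, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 60])
```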

    Predicting the Growth of Morphological Families from Social and Linguistic Factors

    We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP), defined as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on it. Our experiments demonstrate strong performance on MFEP.
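    As a hedged illustration of the family-size quantity behind MFEP, the toy sketch below groups tokens into a morphological family by stripping a small, invented affix list and compares the family's size across two time slices. The paper's Reddit benchmark and its baselines are far more elaborate than this.

```python
# Toy family-membership test via affix stripping; the affix list, the base
# word, and the corpora are invented for the example.
AFFIXES = ["anti", "de", "pro", "ism", "ify", "ist", "er"]

def in_family(token: str, base: str) -> bool:
    """Crude membership test: token reduces to the base once known affixes are stripped."""
    t = token.lower()
    for a in AFFIXES:
        if t.startswith(a):
            t = t[len(a):]
        if t.endswith(a):
            t = t[: -len(a)]
    return t == base

def family_size(tokens, base):
    return len({t for t in tokens if in_family(t, base)})

# Two time slices; MFEP asks whether (and how much) the family grows.
year1 = ["trump", "trumpism"]
year2 = ["trump", "trumpism", "antitrumpism", "detrumpify"]
print(family_size(year1, "trump"), "->", family_size(year2, "trump"))  # 2 -> 4
```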

    Statistical parsing of morphologically rich languages (SPMRL): what, how and whither

    The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at the word level. There is ample evidence that applying readily available statistical parsing models to such languages leads to serious performance degradation. The first workshop on statistical parsing of MRLs hosted a variety of contributions which show that, despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state of affairs in parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests directions for future investigations.

    Computational investigations of derivational morphology

    The notion that it is difficult to make predictions about derivational morphology has been a recurring theme in morphological research over the last decades. It can be unclear whether a derivative exists at all, what exactly a derivative means, and which affix is used to form it. The central goal of this thesis is to demonstrate that recent progress in natural language processing (NLP) allows for a fresh view on the (un-)predictability of derivational morphology. Prior research in morphology has identified semantic and extralinguistic factors as two key challenges for successfully predicting derivational morphology. The first set of papers in the thesis leverages novel NLP methods and applies them to large-scale, socially stratified datasets. I find that this computational approach yields substantially improved models, demonstrating that derivational morphology is predictable to a larger extent than previously thought. A side result of the first part of the thesis is that tokenization (i.e., the way in which words are segmented) affects the capability of NLP systems to predict derivational morphology, raising the question of whether it degrades performance on a larger scale. The second set of papers in the thesis shows that this is indeed the case. As a remedy, I devise tokenization strategies that are directly informed by morphology, with beneficial effects on performance. More broadly, the results of this thesis suggest that NLP, and deep learning more generally, can greatly benefit linguistic research, a view that is still contested by many scholars in linguistics. At the same time, the thesis shows that even, or perhaps especially, in the age of large language models, linguistic insights remain relevant for the development of human language technology.
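    To make the tokenization point concrete, the sketch below contrasts a greedy, WordPiece-style subword segmentation with a morphology-informed one on an invented derivative. Both vocabularies are toy assumptions, not the thesis's actual systems.

```python
# Greedy longest-match-first segmentation over two toy vocabularies.
def greedy_segment(word, vocab):
    """WordPiece-style greedy segmentation: always take the longest match."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i]); i += 1  # fall back to single characters
    return pieces

subword_vocab = {"un", "twe", "eta", "ble"}      # frequency-driven pieces
morph_vocab = {"un", "tweet", "able"}            # prefix + stem + suffix

print(greedy_segment("untweetable", subword_vocab))  # ['un', 'twe', 'eta', 'ble']
print(greedy_segment("untweetable", morph_vocab))    # ['un', 'tweet', 'able']
```

    The second segmentation keeps each morpheme intact, which is the property the thesis's morphology-informed tokenization strategies aim for.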

    Word Knowledge and Word Usage

    Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergic integration of a wide range of methods, techniques, and empirical and experimental findings. The present book approaches a few central issues concerning the organization, structure and functioning of the Mental Lexicon by asking domain experts to look at common, central topics from complementary standpoints and to discuss the advantages of developing converging perspectives. The book explores the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information-theoretic measures of word families, statistical correlations across psycholinguistic and cognitive evidence, principles of machine learning, and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula, and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines.
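    As a small worked example of one information-theoretic measure of word families mentioned above, the sketch below computes the Shannon entropy of a family's frequency distribution. The family and its counts are invented for the illustration.

```python
# Shannon entropy of the relative frequencies of a word family's members.
import math

def family_entropy(counts):
    """Entropy (in bits) of the family's frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# A family dominated by one member carries less uncertainty (lower entropy)
# than one whose members are used evenly.
print(round(family_entropy({"bake": 90, "baker": 5, "bakery": 5}), 3))    # ~0.569
print(round(family_entropy({"bake": 34, "baker": 33, "bakery": 33}), 3))  # ~1.585
```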

    Induction, Semantic Validation and Evaluation of a Derivational Morphology Lexicon for German

    This thesis is about computational morphology for German derivation. Derivation is a word-formation process that creates new words from existing ones, where the base and the derived word share the same stem; mostly, it proceeds by relatively regular affixation rules, as in to bake - bakery. In German, derivation is highly productive, leading to high linguistic variability: similar facts can be expressed in different ways, as derivationally related words are often also semantically related (or transparent). However, this variance is a challenge for computational applications, particularly in semantic processing, because it makes it harder to automatically grasp the meaning of texts and to match similar information onto each other. Computational systems therefore require linguistic knowledge. We develop methods to induce and represent derivational knowledge and to apply it in language processing.

    The main outcome of our study is DErivBase, a German derivational lexicon. It groups derivationally related words (words derived from the same stem) into derivational families. To achieve high quality and high coverage, we induce DErivBase by combining rule-based and data-driven methods: we implement linguistic derivation rules to define derivational processes and feed lemmas extracted from a German corpus into these rules to derive new lemmas. All words that are connected, directly or indirectly, by such rules are considered a derivational family.

    A derivational relationship often implies a semantic relationship, but not always: semantic drift can produce semantically unrelated (opaque) derivational relations, such as to depart - department. Capturing the difference between transparent and opaque relations is important from both a linguistic and a practical point of view. We therefore conduct a semantic refinement of DErivBase, determining which lemma pairs are derivationally and semantically related and which are not. We establish a second, semantically validated version of the lexicon in which families are sub-clustered according to semantic coherence, using supervised machine learning: we train a binary classifier on features derived from structural information about the derivation rules and from distributional information about the semantic relatedness of lemmas, and subdivide the derivational families into semantically coherent clusters.

    To demonstrate the utility of the two lexicon versions, we evaluate them on three extrinsic, broadly semantic tasks. The underlying assumption is that derivational relatedness is a reasonable approximation of semantic relatedness, since derivation is often semantically transparent. First, we incorporate DErivBase into distributional semantic models to overcome sparsity problems and improve the prediction quality of the underlying model; we test this method, which we call derivational smoothing, on semantic similarity prediction and on synonym choice. Second, we employ DErivBase to model a psycholinguistic experiment that examines priming effects of transparent and opaque derivations, in order to draw conclusions about mental lexical representation in German; derivational information is again incorporated into a distributional model, but this time it introduces a kind of morphological generalisation. Third, to address Recognising Textual Entailment, we integrate DErivBase into a matching-based entailment system by means of query expansion: assuming that derivational relationships between two texts make entailment more likely, the expansion increases the chance of lexical overlap, which should improve the system's entailment predictions.

    Incorporating DErivBase indeed improves the performance of the underlying system in each task, but its suitability differs across settings. In the first experiment, the semantically validated lexicon yields improvements over the purely morphological lexicon, and the coarser-grained similarity prediction profits more from DErivBase than synonym choice does. In the second experiment, purely morphological information clearly outperforms the validated version, as the latter cannot model opaque derivations. On the entailment task, DErivBase has only minor impact, because textual entailment is hard to solve by addressing a single linguistic phenomenon. In sum, our findings show that inducing a high-quality, high-coverage derivational lexicon is beneficial for very different applications in computational linguistics. It would be worthwhile to further investigate the semantic aspects of derivation to better understand its impact on language and thus on language processing.
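    A hedged sketch of the induction idea described above: toy English derivation rules (stand-ins for the thesis's German rules) are applied to corpus lemmas, and everything the rules connect is grouped into a family with union-find. The semantic validation step would subsequently split opaque pairs such as depart - department into separate clusters.

```python
# Toy derivation rules; the thesis uses hand-written German affixation rules.
RULES = [lambda l: l + "ery",    # bake -> bakery
         lambda l: l + "r",      # bake -> baker
         lambda l: l + "ment"]   # depart -> department (derivational, but opaque)

def induce_families(lemmas):
    """Group lemmas connected (directly or indirectly) by rules into families."""
    parent = {l: l for l in lemmas}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for lemma in lemmas:
        for rule in RULES:
            derived = rule(lemma)
            if derived in parent:          # only keep derivations attested in the corpus
                union(lemma, derived)
    families = {}
    for l in lemmas:
        families.setdefault(find(l), set()).add(l)
    return list(families.values())

print(induce_families(["bake", "bakery", "baker", "depart", "department"]))
# Two families: {bake, bakery, baker} and {depart, department}
```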

    Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

    How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored as wholes or else computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization to new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically informed vocabulary of input tokens were used.
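    The contrast at the heart of the paper can be sketched with the real BERT tokenizer: compare its WordPiece segmentation of a derivative with a derivational segmentation. The derivational split is hard-coded here for one word; DelBERT obtains such segmentations from a morphological analysis.

```python
# Requires the Hugging Face transformers package (downloads the vocabulary
# on first use). The expected outputs in the comments are illustrative.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

word = "superbizarre"
# WordPiece splits by vocabulary statistics, not morpheme boundaries:
print(tok.tokenize(word))  # subword pieces, e.g. ['super', '##biz', '##arre']

# A derivational segmentation keeps the meaningful units (prefix + base):
derivational = ["super", "bizarre"]
# Vocabulary ids for the derivational units ([UNK] if a unit is out of vocabulary):
print(tok.convert_tokens_to_ids(derivational))
```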