133 research outputs found

A Graph Auto-encoder Model of Derivational Morphology

Author: Hofmann Valentin
Pierrehumbert Janet
Schütze Hinrich
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2020

There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics (Bauer, 2019). We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.

Crossref

Open Access LMU

Oxford University Research Archive

Predicting the Growth of Morphological Families from Social and Linguistic Factors

Author: Hofmann Valentin
Pierrehumbert Janet
Schütze Hinrich
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2020

We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP.

Crossref

Open Access LMU

Oxford University Research Archive

Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

Author: Hofmann Valentin
Pierrehumbert Janet B.
Schütze Hinrich
Publication date: 01/08/2021

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used
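The contrast between derivational and WordPiece segmentation can be illustrated with a toy segmenter. This is a minimal sketch, not the DelBERT procedure: the affix lists, the stem set, and the function name `segment` are all hypothetical and chosen only for illustration. It strips known prefixes and suffixes until a known stem remains, and falls back to the unsegmented word otherwise.

```python
# Toy derivational segmenter. PREFIXES, SUFFIXES and STEMS are
# hypothetical illustration data, not the resources used for DelBERT.
PREFIXES = ("super", "anti", "de", "un")
SUFFIXES = ("ness", "ism", "ify", "able")
STEMS = {"bizarre", "trump", "read"}


def segment(word):
    """Split a word into prefixes + stem + suffixes, or return [word]."""
    prefixes, suffixes = [], []
    rest = word
    changed = True
    while rest not in STEMS and changed:
        changed = False
        # Try to strip one prefix from the left.
        for p in PREFIXES:
            if rest.startswith(p) and len(rest) > len(p):
                prefixes.append(p)
                rest = rest[len(p):]
                changed = True
                break
        if rest in STEMS:
            break
        # Try to strip one suffix from the right.
        for s in SUFFIXES:
            if rest.endswith(s) and len(rest) > len(s):
                suffixes.insert(0, s)
                rest = rest[:-len(s)]
                changed = True
                break
    if rest not in STEMS:
        return [word]  # no analysis found: keep the word whole
    return prefixes + [rest] + suffixes


print(segment("superbizarre"))
print(segment("detrumpify"))
```

Under these assumptions "superbizarre" is split at the morpheme boundary into "super" and "bizarre", rather than into a subword such as "superb" that happens to be in a frequency-based vocabulary.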

Open Access LMU

Word Knowledge and Word Usage

Publication venue: Walter de Gruyter GmbH

Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergic integration of a wide range of methods, techniques, and empirical and experimental findings. The present book approaches a few central issues concerning the organization, structure and functioning of the Mental Lexicon by asking domain experts to look at common, central topics from complementary standpoints and to discuss the advantages of developing converging perspectives. The book explores the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information-theoretical measures of word families, statistical correlations across psycholinguistic and cognitive evidence, principles of machine learning, and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula, and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines.

OAPEN Library

Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure

Author: Chavula Catherine
Suleman Hussein
Publication venue: American College of Medical Physics (ACMP)
Publication date: 01/01/2017

Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words use string similarity metrics that are better suited to languages with simple morphological systems. The paper proposes a weighted similarity measure that calculates Ordered Weighted Aggregator (OWA) operator weights based on the normal distribution. The weighting favours shared character sequences that have a high likelihood of being part of stems in highly agglutinative languages. The approach is evaluated on text in Chichewa and Citumbuka, which belong to group N in Guthrie's classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated using the OWA-weighted metric are comparable to those of state-of-the-art morphological analysis tools.
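The two ingredients named in this abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say how the paper combines OWA weights with shared character sequences, so the code shows just the baseline Dice coefficient over character bigrams and one commonly used way of deriving OWA weights from a normal distribution (weights proportional to a Gaussian centred on the middle position). The function names and the choice of bigrams are assumptions.

```python
import math


def char_bigrams(word):
    """Return the set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(a, b):
    """Dice coefficient over character-bigram sets: 2|X & Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def normal_owa_weights(n):
    """OWA weights for n positions, proportional to a Gaussian centred
    on the middle position (assumed scheme, not taken from the paper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    mu = (1 + n) / 2
    sigma = math.sqrt(sum((i - mu) ** 2 for i in range(1, n + 1)) / n)
    if sigma == 0:  # only happens for n == 1
        return [1.0]
    raw = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2))
           for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]


print(dice_coefficient("night", "nacht"))
print(normal_owa_weights(5))
```

The weights sum to 1, are symmetric, and give the middle positions the most influence, which is what makes a scheme like this usable for emphasising some character positions over others.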

UCT Computer Science Research Document Archive

Borrowings, Derivational Morphology, and Perceived Productivity in English, 1300-1600.

Author: Palmer Chris C.

This dissertation examines how borrowed derivational morphemes such as -age, -ity, -cion, and -ment became productive in the English language, particularly in the fourteenth through sixteenth centuries. It endeavors to expand our current understanding of morphological productivity as a historical phenomenon--to account for not only aggregate quantitative measures of the products of morphological processes, but also some of the linguistic mechanisms that made those processes more productive for language users. Judgments about the productivity of different suffixes in the late ME period cannot be made on counts of frequency alone, since the vast majority of uses were not neologisms or newly coined hybrid forms but rather borrowings from Latin and French. It is not immediately clear to the historical linguist if Middle English speakers perceived a derivative such as enformacion as an undecomposable word or as a morphologically complex word. By examining usage patterns of these derivatives in guild records, the Wycliffite Bible, end-rhymed poetry, medical texts, and personal correspondence, this project argues that several mechanisms helped contribute to the increased transparency and perceived productivity of these affixes. These mechanisms include the following: the use of rhetorical sequences of derivatives with the same base or derivatives ending in the same suffix; the frequent use of derivatives as end rhymes in poetry; the lexical variety of derivatives ending in the same suffix; and the more frequent use of certain bases compared to their derivatives. All of these textual and linguistic features increased readers' and listeners' ability to analyze borrowed derivatives as suffixed words. Ultimately, the dissertation finds that several borrowed affixes were seen as potentially productive units of language in the late ME period, though some were seen as more productive than others in different discourses and contexts. 
It also emphasizes the value of register studies for understanding the specific motivations for the use of borrowed derivatives in different discourses, as well as the morphological consequences of salient usage patterns within different registers.

Ph.D., English Language & Literature, University of Michigan, Horace H. Rackham School of Graduate Studies

http://deepblue.lib.umich.edu/bitstream/2027.42/64624/1/palmercc_1.pd

Deep Blue Documents at the University of Michigan

Acronyms as an integral part of multi-word term recognition - A token of appreciation

Author: Spasic Irena
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi-word terms from a domain-specific corpus. It uses a range of methods to normalize three types of term variation: orthographic, morphological and syntactic. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas the index compression factor increased by 7 percentage points. The evidence therefore suggests that integrating acronyms provides a non-trivial improvement in term conflation.
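The index compression factor mentioned in this abstract can be sketched as below. The abstract does not define the measure, so this assumes the definition commonly used in conflation and stemming evaluation: ICF = (N - S) / N, where N is the number of distinct term variants and S the number of distinct normalized representatives. The function name and the example mapping are hypothetical and are not FlexiTerm output.

```python
def index_compression_factor(variant_to_representative):
    """Fraction by which conflation shrinks the set of index terms.

    Assumed definition: (N - S) / N, with N distinct variants (dict keys)
    and S distinct representatives (dict values).
    """
    n = len(variant_to_representative)
    if n == 0:
        raise ValueError("at least one term variant is required")
    s = len(set(variant_to_representative.values()))
    return (n - s) / n


# Hypothetical conflation result: an acronym and two spelled-out
# variants share one representative, as do two number variants.
conflated = {
    "MRI": "magnetic resonance imaging",
    "magnetic resonance imaging": "magnetic resonance imaging",
    "magnetic resonance image": "magnetic resonance imaging",
    "knee joint": "knee joint",
    "knee joints": "knee joint",
}
print(index_compression_factor(conflated))
```

A higher value means more variants were merged into fewer index terms; with no conflation at all (every variant its own representative) the factor is 0.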

Online Research @ Cardiff

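Constraints on the extension of affixal combinations: structural constraints and processing conditions (Ograničenja pri proširivanju afiksalnih kombinacija: strukturna ograničenja i uvjeti procesiranja)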

Author: Alexandra Soares Rodrigues
Publication venue: Croatian Philological Society
Publication date: 01/01/2017

The study of the mental lexicon has been fostered by the analysis of the way complex words are mentally represented and processed. This paper concerns the syntagmatic extension of multiple affixation; specifically, the processing of complex words that contain four suffixes that operate in word-formation patterns of Portuguese. Although the individual addition of suffixes obeys structural constraints, the multiple combination results in complex words with low frequency and low expectedness by the speaker, which contribute to the lack of semantic transparency and of affixal salience of the combination. Our study demonstrates a relation between these factors and the experience of the speaker with the affixal combination, which determines the pattern character of the combination. We suggest that a suffix predicts other suffixes as long as the combination is expected. Non-frequent heterocategorial complex words with a combination of four suffixes are contrasted with non-frequent words containing pleonastic affixation. In the latter type of words, the redundancy of semantic structures increases the semantic transparency of the word, which suggests a prediction effect operating on the semantic level of the affixal combination. Processing of complex words is dependent on the level of expectedness of the speaker towards the affix combination, which constrains the level of word acceptance by speakers.

Hrčak - Portal of scientific journals of Croatia