133 research outputs found
A Graph Auto-encoder Model of Derivational Morphology
There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics (Bauer, 2019). We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon
Predicting the Growth of Morphological Families from Social and Linguistic Factors
We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as "trump", "antitrumpism", and "detrumpify", in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP
Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words
How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used
Word Knowledge and Word Usage
Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies, requiring synergic integration of a wide range of methods, techniques and empirical and experimental findings. The present book intends to approach a few central issues concerning the organization, structure and functioning of the Mental Lexicon, by asking domain experts to look at common, central topics from complementary standpoints, and discuss the advantages of developing converging perspectives. The book will explore the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information theoretical measures of word families, statistical correlations across psycho-linguistic and cognitive evidence, principles of machine learning and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines
Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure
Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words use string similarity metrics that are more suited to languages with simple morphological systems. The paper proposes a weighted similarity measure that uses an approach for calculating Ordered Weighted Aggregator (OWA) operator weights based on the normal distribution. The weighting favours shared character sequences with a high likelihood of being part of stems for highly agglutinative languages. The approach is evaluated on text for Chichewa and Citumbuka, which belong to group N of Guthrie's classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated using the OWA weights metric are comparable to those of state-of-the-art morphological analysis tools
Borrowings, Derivational Morphology, and Perceived Productivity in English, 1300-1600.
This dissertation examines how borrowed derivational morphemes such as -age, -ity, -cion, and -ment became productive in the English language, particularly in the fourteenth through sixteenth centuries. It endeavors to expand our current understanding of morphological productivity as a historical phenomenon--to account for not only aggregate quantitative measures of the products of morphological processes, but also some of the linguistic mechanisms that made those processes more productive for language users. Judgments about the productivity of different suffixes in the late ME period cannot be made on counts of frequency alone, since the vast majority of uses were not neologisms or newly coined hybrid forms but rather borrowings from Latin and French. It is not immediately clear to the historical linguist if Middle English speakers perceived a derivative such as enformacion as an undecomposable word or as a morphologically complex word. By examining usage patterns of these derivatives in guild records, the Wycliffite Bible, end-rhymed poetry, medical texts, and personal correspondence, this project argues that several mechanisms helped contribute to the increased transparency and perceived productivity of these affixes. These mechanisms include the following: the use of rhetorical sequences of derivatives with the same base or derivatives ending in the same suffix; the frequent use of derivatives as end rhymes in poetry; the lexical variety of derivatives ending in the same suffix; and the more frequent use of certain bases compared to their derivatives. All of these textual and linguistic features increased readers' and listeners' ability to analyze borrowed derivatives as suffixed words. Ultimately, the dissertation finds that several borrowed affixes were seen as potentially productive units of language in the late ME period, though some were seen as more productive than others in different discourses and contexts. It also emphasizes the value of register studies for understanding the specific motivations for the use of borrowed derivatives in different discourses, as well as the morphological consequences of salient usage patterns within different registers.
Ph.D., English Language & Literature, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/64624/1/palmercc_1.pd
Acronyms as an integral part of multi-word term recognition - A token of appreciation
Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi-word terms from a domain-specific corpus. It uses a range of methods to normalize three types of term variation: orthographic, morphological and syntactic. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas index compression factor increased by 7 percentage points. Therefore, evidence suggests that integration of acronyms provides non-trivial improvement of term conflation
Ograničenja pri proširivanju afiksalnih kombinacija: strukturna ograničenja i uvjeti procesiranja
The study of the mental lexicon has been fostered by the analysis of the way complex words are mentally represented and processed. This paper concerns the syntagmatic extension of multiple affixation; specifically, the processing of complex words that contain four suffixes that operate in word-formation patterns of Portuguese. Although the individual addition of suffixes obeys structural constraints, the multiple combination results in complex words with low frequency and low expectedness by the speaker, which contribute to the lack of semantic transparency and of affixal salience of the combination. Our study demonstrates a relation between these factors and the experience of the speaker with the affixal combination, which determines the pattern character of the combination. We suggest that a suffix exerts the prediction of other suffixes as long as the combination is expected. Non-frequent heterocategorial complex words with a combination of four suffixes are contrasted with non-frequent words containing pleonastic affixation. In the latter type of words, the redundancy of semantic structures increases the semantic transparency of the word, which suggests a prediction effect operating on the semantic level of the affixal combination. Processing of complex words is dependent on the level of expectedness of the speaker towards the affix combination, which constrains the level of word acceptance
by speakers.