220 research outputs found

    Computational Etymology: Word Formation and Origins

    While there are over seven thousand languages in the world, substantial language technologies exist for only a small percentage of them. The large majority of the world's languages do not have enough bilingual or even monolingual data for developing technologies like machine translation using current approaches. The computational study and modeling of word origins and word formation is a key step in developing comprehensive translation dictionaries for low-resource languages. This dissertation presents novel foundational work in computational etymology, a promising field which this work is pioneering. The dissertation also includes novel models of core vocabulary, of dictionary information distillation, and of the diverse linguistic processes of word formation and concept realization between languages, including compounding, derivation, sense-extension, borrowing, and historical cognate relationships, utilizing statistical and neural models trained on the unprecedented scale of thousands of languages. Collectively, these are important components in tackling the grand challenges of universal translation, endangered language documentation and revitalization, and supporting technologies for speakers of thousands of underserved languages.

    A theoretical approach to automatic loanword detection

    For several years, computational methods have been finding their way into the humanities. In computational linguistics especially, a range of analyses and methods are being studied. It is not surprising that computational analysis has aroused interest in historical linguistics, where such methods allow language evolution to be studied from another point of view. Biological and linguistic evolution show certain parallels, and the parallels between phylogenetics and linguistics in particular have prompted interest in combining the two fields. Phylogenetics provides a great number of mathematical and computational methods for different tasks. Based on these parallels, the methods can be adapted to historical linguistics. In historical linguistics, borrowing is a well-known evolutionary process in which words are taken from one language and adapted into another. Borrowing has a corresponding parallel within phylogenetics, namely horizontal gene transfer, the process of transferring genes from one organism to another. What borrowing and horizontal gene transfer have in common is the transfer of genes or words between organisms or languages that need not be related. Phylogenetics provides several computational methods and analyses to detect horizontal gene transfer, and these might be adapted to linguistics to detect borrowing. This paper introduces the background of borrowing and phylogenetics as well as the combination of both fields. The new tree-based approach is intended to indicate whether existing phylogenetic methods can be adapted to linguistics for the detection of borrowing.

    Computational approaches to semantic change (Volume 6)

    Semantic change (how the meanings of words change over time) has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge for over a century, encompassing many languages and language families. Historical linguists also realized early on the potential of computers as research tools, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea-change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans.

    The Linguistic Market of Codeswitching in U.S. Latino Literature

    This dissertation is a multidisciplinary study that brings together the fields of literature, sociolinguistics, and cultural studies in order to understand the motivation and meaning of English-Spanish codeswitching, or language alternation, in Latino literature produced in the United States. Codeswitching was first introduced in Latino literature around the time of the Chicano Movement in the 1970s and has been used as a distinctive feature of Latino literary works to this day. Through a close linguistic analysis of narratives by four different authors belonging to the largest Latino communities in the country (Chicanos, Puerto Ricans, Dominican Americans, and Cuban Americans), this study examines whether codeswitching is used as a mere decorative element to add ethnic flavor, performs a mimetic role of oral codeswitching, or responds to a political strategy. To reach representative conclusions, the political, social, cultural, and linguistic backgrounds of each community are studied in order to establish commonalities or differences in the experiences of these immigrant communities in the United States and how these experiences inform their writing. Considering the negative views held by speakers of both English and Spanish regarding the use of oral codeswitching, the need to study its use in literature is compelling. To that end, I have adopted social and sociolinguistic theories to identify whether codeswitching operates as linguistic and symbolic capital in Latino literature, which authors may profit from to advance a Latino agenda. This work concludes that how codeswitching is used in Latino literature and the goals it ultimately achieves, if any, hinge on the positioning of the authors vis-à-vis hegemonic English monolingualism and their own experience as members of the Latino community to which they belong. Thus, the role of codeswitching may indeed be solely ornamental or ethnic, or it may be a political one: that of expanding the space in which Latinos are allowed to operate. The narratives studied include Rudolfo Anaya’s Bless Me, Ultima (1972), Esmeralda Santiago’s When I Was Puerto Rican (1993), Cristina García’s Dreaming in Cuban (1992), and Junot Díaz’s The Brief Wondrous Life of Oscar Wao (2007).

    Save the trees

    Skepticism regarding the tree model has a long tradition in historical linguistics. Although scholars have emphasized that the tree model and its long-standing counterpart, the wave theory, are not necessarily incompatible, the opinion that family trees are unrealistic and should be completely abandoned in the field of historical linguistics has always enjoyed a certain popularity. This skepticism has further increased with the advent of recently proposed techniques for data visualization which seem to confirm that we can study language history without trees. In this article, we show that the concrete arguments that have been brought up in favor of achronistic wave models do not hold. By comparing the phenomenon of incomplete lineage sorting in biology with processes in linguistics, we show that data which do not seem explainable using trees can indeed be explained without turning to diffusion as an explanation. At the same time, methodological limits in historical reconstruction might easily lead to an overestimation of regularity, which may in turn appear as conflicting patterns when the researcher is trying to reconstruct a coherent phylogeny. We illustrate how, in several instances, trees can benefit language comparison, although we also discuss their shortcomings in modeling mixed languages. While acknowledging that not all aspects of language history are tree-like, and that integrated models which capture both vertical and lateral language relations may depict language history more realistically than trees do, we conclude that all models claiming that vertical language relations can be completely ignored are essentially wrong: either they still tacitly draw upon family trees or they only provide a static display of data and thus fail to model temporal aspects of language history.

    Automatic language similarity comparison using N-gram analysis


    Detection and Morphological Analysis of Novel Russian Loanwords

    This paper investigates recent English loanwords in Russian and explores ways in which computational methods can help further theoretical research. The goal of the study is two-fold: to find new, previously unattested loanwords borrowed over the last decade, and to examine the rate of adaptation of the new borrowings, as attested by the degree to which they conform to the constraints of the Russian language. First, we train a finite-state pipeline that combines character n-gram language models, which encode phonotactic and lexical properties of loanwords, with a binary classifier to detect loanwords. The model achieves state-of-the-art performance during evaluation, surpassing previously established benchmarks. Second, we introduce a new and extended corpus of recent Russian loanwords that have been detected in Web texts by our model. The corpus includes loanwords together with their morphological features, part-of-speech tags, and the sentences in which they occur. We conduct an analysis of the inflectional morphology of the identified loanwords, investigating the rate of indeclinability of recent loanwords and stem-final consonant alternations in verbs.
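
The general idea of scoring a word with character n-gram language models and then making a binary decision can be sketched as follows. This is only an illustrative sketch and not the paper's finite-state pipeline: the class `CharNgramLM`, the function `looks_like_loanword`, the add-one smoothing, the simple log-likelihood threshold standing in for the binary classifier, and the toy transliterated word lists are all assumptions made for the example.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram model with add-one smoothing (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-character contexts
        self.vocab = set()         # characters seen in training

    def _pad(self, word):
        # Mark word boundaries so the model can learn typical onsets and endings.
        return "^" * (self.n - 1) + word + "$"

    def fit(self, words):
        for word in words:
            padded = self._pad(word)
            self.vocab.update(padded)
            for i in range(len(padded) - self.n + 1):
                gram = padded[i:i + self.n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1
        return self

    def avg_logprob(self, word):
        # Average smoothed log-probability per n-gram of the padded word.
        padded = self._pad(word)
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total, count = 0.0, 0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v))
            count += 1
        return total / count


def looks_like_loanword(word, native_lm, loan_lm, threshold=0.0):
    # Binary decision: is the word more probable under the loanword model?
    return loan_lm.avg_logprob(word) - native_lm.avg_logprob(word) > threshold


# Toy, hand-picked transliterations used purely for illustration.
native_lm = CharNgramLM().fit(["moloko", "gorod", "derevo", "solntse", "voda"])
loan_lm = CharNgramLM().fit(["marketing", "brifing", "shoping", "treyding", "miting"])
```

With these toy lists, `looks_like_loanword("parking", native_lm, loan_lm)` compares how well each model explains the character sequence; a realistic system would train on far larger word lists and would likely feed the two scores into a trained classifier instead of a fixed threshold.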

    On the effects of English elements in German print advertisements

    This thesis studies the influence of English elements in German print advertisements on the emotional appeal of the advertisement, the evaluation of the advertised product and brand, and the evaluation of the implied target group. Four specially designed print advertisements, which differed only in their use of English elements, were evaluated by 297 participants in a quantitative online study. Statistically significant differences between the evaluations of the German versions and the mixed English-German versions were found in only a few cases. Since each participant was shown only one version of the advertisement and the linguistic background of the study was not disclosed, the results mirror the effects of English elements in actual contact situations. Prior to this advertising study, a study on language assignment was conducted to test which variables influence whether a visually presented word is perceived as German or English. Besides the etymological origin of a word, its integration into the German lexicon (operationalised by consulting the Duden Universalwörterbuch, 7th ed.) proved to be an especially good predictor. Moreover, graphemic markers of foreignness significantly influenced the language to which lexemes were assigned. This effect was observed for words of English origin as well as for words of non-English origin (e.g. LINEAL, CREMIG), which emphasises the importance of visual word form for language assignment.

    Crosslinguistic interplay between semantics and phonology in late bilinguals: neurophysiological evidence

    We investigated effects of crosslinguistic phonological and semantic similarity on the bilingual lexicon of late unbalanced bilinguals. Our masked priming paradigm used L1 (Russian) words as masked primes and L2 (English) words as targets. The primes and the targets either overlapped (phonologically, semantically, or both phonologically and semantically) or did not overlap. Participants maintained the targets in memory and matched them against occasionally presented catch stimuli. N170 and N400 components of the word-elicited high-density ERPs were identified and analysed in signal and source space. Crosslinguistic semantic similarity shortened the reaction times. The semantics-related N400 amplitude difference correlated with individual L2 proficiency, while phonological similarity suppressed the N400 amplitude in the semantically unrelated condition. ERP source analysis suggests that these ERP dynamics are underpinned by cortical generators in the left IFG and the temporal pole. We conclude that the semantic and phonological interplay between L1 and L2 suggests an integrated bilingual lexicon.

    Chamic and beyond : studies in mainland Austronesian languages
