
    Dr. Pangloss's Puzzle

    We wrote to Dr. Pangloss the other day, to ask him which languages he spoke and what places he had visited. To the first question, he sent us this mixed-up answer

    Theoretical issues in the interpretation of Cappadocian, a not-so-dead Greek contact language

    Cappadocian is a mixed Greek-Turkish dialect continuum spoken in the Turkish Central Anatolia Region until the population exchange between Greece and Turkey in the 1920s. Only a few Cappadocian dialects are still spoken in present-day Greece. Since the publication of Thomason and Kaufman’s Language Contact, Creolization, and Genetic Linguistics in 1988, Cappadocian has attracted the attention of historical and contact linguists because of its unique mixed character. In this paper, I will discuss a number of theoretical issues in the interpretation of the linguistic structure of Cappadocian, focusing on the following topics: (1) the status of loan phonemes and loan morphemes in contact languages, (2) the distinction between code switching and code mixing in relation to Poplack’s Free Morpheme Constraint, (3) the schizoid typology of contact languages.

    Polyglot Semantic Parsing in APIs

    Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks. Comment: accepted for NAACL-2018 (camera ready version)

    OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

    Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages. We experiment with several models on this dataset and observe that BanglishBERT outperforms other transformer-based models and GPT-3.5. Comment: arXiv admin note: substantial text overlap with arXiv:2310.1802

    Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

    The lack of code-switch training data is one of the major concerns in the development of end-to-end code-switching automatic speech recognition (ASR) models. In this work, we propose a method to train an improved end-to-end code-switching ASR using only monolingual data. Our method encourages the distributions of output token embeddings of monolingual languages to be similar, and hence helps the ASR model to code-switch easily between languages. Specifically, we propose to use Jensen-Shannon divergence and cosine distance based constraints. The former will enforce output embeddings of monolingual languages to possess similar distributions, while the latter simply brings the centroids of the two distributions close to each other. Experimental results demonstrate high effectiveness of the proposed method, yielding up to 4.5% absolute mixed error rate improvement on a Mandarin-English code-switching ASR task. Comment: 5 pages, 3 figures, accepted to INTERSPEECH 201
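
The two quantities named in this abstract, Jensen-Shannon divergence and cosine distance between centroids, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes each language's output embeddings are summarized either as a discrete probability distribution (for the divergence) or as a list of plain vectors (for the centroids), and all function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy data (assumed, not from the paper): two small sets of 2-D embeddings.
embeddings_a = [[1.0, 0.0], [1.0, 0.0]]
embeddings_b = [[0.0, 1.0], [0.0, 1.0]]
centroid_gap = cosine_distance(centroid(embeddings_a), centroid(embeddings_b))
```

Both quantities are zero when the two languages' summaries coincide, which is why minimizing them as penalties pushes the embedding sets toward each other. With natural logarithms, the divergence is bounded above by `math.log(2)`.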