Dr. Pangloss's Puzzle
We wrote to Dr. Pangloss the other day, to ask him which languages he spoke and what places he had visited. To the first question, he sent us this mixed-up answer
Theoretical issues in the interpretation of Cappadocian, a not-so-dead Greek contact language
Cappadocian is a mixed Greek-Turkish dialect continuum spoken in the Turkish Central Anatolia Region until the population exchange between Greece and Turkey in the 1920s.
Only a few Cappadocian dialects are still spoken in present-day Greece. Since the publication of Thomason and Kaufman’s Language Contact, Creolization, and Genetic Linguistics in 1988, Cappadocian has attracted the attention of historical and contact linguists, because of its unique mixed character. In this paper, I will discuss a number of theoretical issues in the interpretation of the linguistic structure of Cappadocian, focusing on the following topics: (1) the status of loan phonemes and loan morphemes in contact languages, (2) the distinction between code switching and code mixing in relation to Poplack’s Free Morpheme Constraint, (3) the schizoid typology of contact languages
Polyglot Semantic Parsing in APIs
Traditional approaches to semantic parsing (SP) work by training individual
models for each available parallel dataset of text-meaning pairs. In this
paper, we explore the idea of polyglot semantic translation, or learning
semantic parsing models that are trained on multiple datasets and natural
languages. In particular, we focus on translating text to code signature
representations using the software component datasets of Richardson and Kuhn
(2017a,b). The advantage of such models is that they can be used for parsing a
wide variety of input natural languages and output programming languages, or
mixed input languages, using a single unified model. To facilitate modeling of
this type, we develop a novel graph-based decoding framework that achieves
state-of-the-art performance on the above datasets, and apply this method to
two other benchmark SP tasks. Comment: accepted for NAACL-2018 (camera-ready version)
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Code-mixing is a well-studied linguistic phenomenon in which two or more
languages are mixed in text or speech. Several works have been conducted on
building datasets and performing downstream NLP tasks on code-mixed data.
Although it is not uncommon to observe code-mixing of three or more languages,
most available datasets in this domain contain code-mixed data from only two
languages. In this paper, we introduce OffMix-3L, a novel offensive language
identification dataset containing code-mixed data from three different
languages. We experiment with several models on this dataset and observe that
BanglishBERT outperforms other transformer-based models and GPT-3.5. Comment: arXiv admin note: substantial text overlap with arXiv:2310.1802
Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data
The lack of code-switch training data is one of the major concerns in the
development of end-to-end code-switching automatic speech recognition (ASR)
models. In this work, we propose a method to train an improved end-to-end
code-switching ASR using only monolingual data. Our method encourages the
distributions of output token embeddings of monolingual languages to be
similar, and hence helps the ASR model to code-switch easily between
languages. Specifically, we propose to use Jensen-Shannon divergence and cosine
distance-based constraints. The former will enforce output embeddings of
monolingual languages to possess similar distributions, while the latter simply
brings the centroids of the two distributions close to each other.
Experimental results demonstrate the high effectiveness of the proposed method,
yielding up to 4.5% absolute mixed error rate improvement on a Mandarin-English
code-switching ASR task. Comment: 5 pages, 3 figures, accepted to INTERSPEECH 201