894 research outputs found

    Learning the Ordering of Coordinate Compounds and Elaborate Expressions in Hmong, Lahu, and Chinese

    Coordinate compounds (CCs) and elaborate expressions (EEs) are coordinate constructions common in languages of East and Southeast Asia. Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) these phonological hierarchies lack a clear phonetic rationale. These claims are significant because morphosyntax has often been seen as standing in a feed-forward relationship with phonology, and phonological generalizations have often been assumed to be phonetically "natural". We investigate whether the ordering of CCs and EEs can be learned empirically and whether computational models (classifiers and sequence labeling models) learn unnatural hierarchies similar to those posited by Mortensen (2006). We find that decision trees and SVMs learn to predict the order of CCs/EEs on the basis of phonology, with DTs learning hierarchies strikingly similar to those proposed by Mortensen. However, we also find that a neural sequence labeling model is able to learn the ordering of elaborate expressions in Hmong very effectively without using any phonological information. We argue that EE ordering can be learned through two independent routes: phonology and lexical distribution, presenting a more nuanced picture than previous work. [ISO 639-3: hmn, lhu, cmn]
    Comment: To be published in NAACL202
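    As a rough illustration of the classifier setup described in the abstract (not the authors' released code), the sketch below trains a scikit-learn decision tree to predict which of two coordinated syllables comes first from coded phonological features; the feature coding, names, and toy data are hypothetical.

```python
# Hypothetical sketch: predict whether syllable A precedes syllable B in a
# coordinate compound from coded phonological features (e.g. tone category,
# vowel height). The feature encoding and toy data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [tone_A, vowel_height_A, tone_B, vowel_height_B]; label 1 = "A before B".
X = [
    [1, 2, 3, 1],
    [2, 1, 1, 3],
    [3, 3, 2, 2],
    [1, 1, 2, 3],
]
y = [1, 0, 0, 1]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Reading the learned splits is how a phonological hierarchy (e.g. an ordering
# of tone categories) could be recovered from the trained tree.
print(export_text(clf, feature_names=["tone_A", "height_A", "tone_B", "height_B"]))
```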

    Lexical prefixes and Tibeto-Burman laryngeal contrasts

    Proceedings of the 37th Annual Meeting of the Berkeley Linguistics Society (2013), pp. 272-28

    ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

    Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT). Previous studies explore aspects of LLMs' MT capabilities. However, there exist a wide variety of languages for which recent LLM MT performance has never before been evaluated. Without published experimental evidence on the matter, it is difficult for speakers of the world's diverse languages to know how and whether they can use LLMs for their languages. We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis, using the FLORES-200 benchmark. Trends reveal that GPT models approach or exceed traditional MT model performance for some high-resource languages (HRLs) but consistently lag for low-resource languages (LRLs), under-performing traditional MT for 84.1% of the languages we covered. Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it, and suggests that ChatGPT is especially disadvantaged for LRLs and African languages.
    Comment: 27 pages, 9 figures, 14 tables
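    A minimal sketch of how per-language translation quality can be scored on FLORES-style parallel text with sacrebleu and reported alongside a resource-level label; the language codes, sentences, and resource grouping below are illustrative assumptions, not the paper's evaluation pipeline.

```python
# Hypothetical sketch: corpus-level chrF per language, grouped by resource level.
# The sentences, language codes, and resource labels are placeholders.
import sacrebleu

# Illustrative per-language data: (resource level, system outputs, references).
data = {
    "fra_Latn": ("high", ["Le chat dort sur le tapis."], ["Le chat dort sur le tapis."]),
    "kin_Latn": ("low",  ["<system output sentence>"],   ["<reference sentence>"]),
}

for lang, (level, hyps, refs) in data.items():
    # chrF is often preferred over BLEU for low-resource and morphologically
    # rich languages; sacrebleu computes it at corpus level.
    chrf = sacrebleu.corpus_chrf(hyps, [refs]).score
    print(f"{lang} ({level}-resource): chrF = {chrf:.1f}")
```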

    Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods

    We perform a statistical analysis of the phenomenon of neology, the process by which new words emerge in a language, using large diachronic corpora of English. We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm. We show that both factors are predictive of word emergence, although we find more support for the latter hypothesis. Besides presenting a new linguistic application of distributional semantics, this study tackles the linguistic question of the role of language-internal factors (in our case, sparsity) in language change motivated by language-external factors (reflected in frequency growth).
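    The two predictors described above could be operationalized roughly as follows; the embedding matrix, neighbor indices, and per-period frequency tables are assumed inputs, not the paper's actual pipeline.

```python
# Hypothetical sketch of the two predictors: semantic sparsity of a word's
# neighborhood and the mean frequency growth rate of its semantic neighbors.
# The embedding matrix and frequency tables are placeholder inputs.
import numpy as np

def neighborhood_sparsity(vec, emb, k=20):
    """Lower mean cosine similarity to the k nearest neighbors = sparser region."""
    sims = emb @ vec / (np.linalg.norm(emb, axis=1) * np.linalg.norm(vec) + 1e-9)
    return 1.0 - np.sort(sims)[-k:].mean()

def neighbor_growth(neighbor_ids, freq_t0, freq_t1):
    """Mean log frequency growth rate of the semantic neighbors between periods."""
    return float(np.mean([np.log((freq_t1[i] + 1) / (freq_t0[i] + 1)) for i in neighbor_ids]))

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))                  # toy embedding matrix (vocab x dims)
print(neighborhood_sparsity(emb[0], emb))
print(neighbor_growth([1, 2, 3], {1: 10, 2: 5, 3: 2}, {1: 30, 2: 6, 3: 8}))
```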

    Interarm differences in systolic blood pressure and mortality among US army veterans: aetiological associations and risk prediction in the Vietnam Experience Study

    Background Differences between the arms in systolic blood pressure (SBP) of ≥10 mmHg have been associated with an increased risk of mortality in patients with hypertension and chronic renal disease. For the first time, we examined these relationships in a non-clinical population. Design Cohort study. Methods Participants were 4419 men (mean age 38.37 years) from the Vietnam Experience Study. Bilateral SBP and diastolic BP (DBP), serum lipids, fasting glucose, erythrocyte sedimentation rate, metabolic syndrome, and ankle brachial index were assessed in 1986. Results Ten per cent of men had an interarm difference of ≥10 mmHg and 2.4% of ≥15 mmHg. A 15-year follow-up period gave rise to 246 deaths (64 from cardiovascular disease, CVD). Interarm differences of ≥10 mmHg were associated with an elevated risk of all-cause mortality (hazard ratio, HR, 1.49; 95% confidence interval, CI, 1.04–2.14) and CVD mortality (HR 1.93, 95% CI 1.01–3.69). After adjusting for SBP, DBP, lipids, fasting glucose, and erythrocyte sedimentation rate, the associations between interarm differences of ≥10 mmHg and all-cause mortality (HR 1.35, 95% CI 0.94–1.95) and CVD mortality (HR 1.62, 95% CI 0.84–3.14) were attenuated and no longer statistically significant. Conclusions In this non-clinical cohort study, interarm differences in SBP were not associated with mortality after accounting for traditional CVD risk factors. Interarm differences might not be valuable as an additional risk factor for mortality in populations with a low risk of CVD.
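    For readers unfamiliar with how such hazard ratios are obtained, the following is a minimal sketch of a Cox proportional-hazards fit with a binary flag for an interarm SBP difference of ≥10 mmHg; the column names and toy data are invented and this is not the study's analysis code.

```python
# Hypothetical sketch: Cox proportional-hazards model with an interarm-difference
# flag and one adjuster. Toy data only; exp(coef) for iad_ge_10 would be the
# adjusted hazard ratio in a real analysis.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "followup_years": [15.0, 12.3, 15.0, 8.7, 15.0, 10.1],
    "died":           [0,    1,    0,    1,   1,    0],
    "iad_ge_10":      [0,    1,    0,    0,   1,    1],   # interarm SBP difference >= 10 mmHg
    "sbp":            [155,  142,  125,  150, 138,  130],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="died")
cph.print_summary()   # hazard ratios appear as exp(coef)
```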

    Towards Zero-shot Learning for Automatic Phonemic Transcription

    Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data. The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes. In this work, we address this problem by adopting the idea of zero-shot learning. Our model is able to recognize unseen phonemes in the target language without any training data. In our model, we decompose phonemes into corresponding articulatory attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over articulatory attributes, and then compute phoneme distributions with a customized acoustic model. We evaluate our model by training it on 13 languages and testing it on 7 unseen languages. We find that it achieves a 7.7% better phoneme error rate on average than a standard multilingual model.
    Comment: AAAI 202
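    A rough sketch of the attribute-to-phoneme composition described in the abstract: per-frame articulatory-attribute probabilities are mapped onto a phoneme distribution through a phoneme-by-attribute signature matrix, which also covers phonemes unseen in training; the attribute inventory and signature values below are illustrative, not the paper's model.

```python
# Hypothetical sketch: compose phoneme scores from predicted articulatory
# attributes via a phoneme-by-attribute signature matrix. Attributes and
# signatures are illustrative only.
import numpy as np

attributes = ["vowel", "consonant", "voiced", "nasal"]
# Rows: target-language phonemes; 1 if the phoneme carries the attribute.
signatures = np.array([
    [1, 0, 1, 0],   # /a/
    [0, 1, 1, 1],   # /m/
    [0, 1, 0, 0],   # /t/
])

def phoneme_distribution(attr_probs):
    """Score each phoneme by how well the predicted attribute probabilities
    match its signature, then normalize into a distribution."""
    match = signatures * attr_probs + (1 - signatures) * (1 - attr_probs)
    scores = match.prod(axis=1)
    return scores / scores.sum()

frame_attr_probs = np.array([0.1, 0.9, 0.8, 0.7])   # one acoustic frame's attribute probabilities
print(phoneme_distribution(frame_attr_probs))       # unseen phonemes only need a signature row
```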