    Natural language processing for similar languages, varieties, and dialects: A survey

    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects. Non peer reviewed.

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    Master of Arts

    Thesis. High-vowel lenition is attested in various forms in a number of languages, including Shoshoni, Lezgian, East Cree, Andean Spanish, and Japanese, along with many others. It is also attested in the development of the various Romance languages from Proto-Romance. High-vowel deletion and devoicing are both attested in Quebec French, with some authors reporting devoicing but no deletion, and others reporting frequent deletion and devoicing. Research indicates that both surrounding consonantal context and sociolinguistic factors contribute to (non)lenition of Quebec French high vowels, with some authors treating deletion and devoicing as separate phenomena and others treating them as different manifestations of the same phenomenon. Few studies have investigated high-vowel lenition in other varieties of French. This study investigates deletion and devoicing of the high-vowel phonemes /i/, /y/, and /u/ in the French spoken in Quebec and Paris, and identifies which phonetic and social factors, including left and right context, vowel phoneme, provenance, gender, and style, best predict these phenomena. It also addresses whether high-vowel deletion and devoicing are different manifestations of a single phenomenon or two separate phenomena in these varieties of French. Data are from recordings of native French speakers from the Phonologie du Français Contemporain (PFC) corpus project. Each speaker participated in two different interviews representing two levels of style. For each speaker, each interview type, and each high-vowel phoneme, twenty interconsonantal tokens were transcribed and coded as deleted or present, and as voiced or devoiced, along with the surrounding consonantal context. Tokens were subjected to statistical analysis. Contrary to most expectations, there are no statistical differences between the rates of deletion and devoicing in Quebec and Paris, and neither phenomenon is unique to Quebec French. The best predictors of deletion were place and manner of articulation of surrounding consonants, while the best predictor of devoicing was voiceless surrounding consonants. These results indicate that deletion and devoicing are separate processes. Although not significant at the aggregate level, sociolinguistic factors were significant predictors in more specific models. Deletion and devoicing of French high vowels are both more complex and more widespread than previous studies have suggested.

    “You’re trolling because…” – A Corpus-based Study of Perceived Trolling and Motive Attribution in the Comment Threads of Three British Political Blogs

    This paper investigates the linguistically marked motives that participants attribute to those they call trolls in 991 comment threads of three British political blogs. The study is concerned with how these motives affect the discursive construction of trolling and trolls. Another goal of the paper is to examine whether the mainly emotional motives ascribed to trolls in the academic literature correspond with those that the participants attribute to the alleged trolls in the analysed threads. The paper identifies five broad motives ascribed to trolls: emotional/mental health-related/social reasons, financial gain, political beliefs, being employed by a political body, and unspecified political affiliation. It also points out that depending on these motives, trolling and trolls are constructed in various ways. Finally, the study argues that participants attribute motives to trolls not only to explain their behaviour but also to insult them.

    Problems of Social Language Variation from the Perspective of Translation (Проблеми соціального варіювання мови в аспекті перекладу)

    There is little question that English is the most widely taught, read, and spoken language that the world has ever known. It may seem strange, on some moments' reflection, that the native language of a relatively small island nation could have developed and spread to this status. Its path was foreseen, however, by John Adams, who, in the late eighteenth century, made the following insightful prophecy: English will be the most respectable language in the world and the most universally read and spoken in the next century, if not before the close of this one. When you are citing the document, use the following link http://essuir.sumdu.edu.ua/handle/123456789/1647

    Segmental Content Effects on Text-dependent Automatic Accent Recognition

    This paper investigates the effects of an unknown speech sample’s segmental content (the specific vowels and consonants it contains) on its chances of being successfully classified by an automatic accent recognition system. While there has been some work to investigate this effect in automatic speaker recognition, it has not been explored in relation to automatic accent recognition. This is a task where we would hypothesise that segmental content has a particularly large effect on the likelihood of a successful classification, especially for shorter speech samples. By focussing on one particular text-dependent automatic accent recognition system, the Y-ACCDIST system, we uncover the phonemes that appear to contribute more or less to successful classifications using a corpus of Northern English accents. We also relate these findings to the sociophonetic literature on these specific spoken varieties to attempt to account for the patterns that we see and to consider other factors that might contribute to a sample’s successful classification.

    An exploration of the rhythm of Malay

    In recent years there has been a surge of interest in speech rhythm. However, we still lack a clear understanding of the nature of rhythm and rhythmic differences across languages. Various metrics have been proposed as means for measuring rhythm on the phonetic level and making typological comparisons between languages (Ramus et al., 1999; Grabe & Low, 2002; Dellwo, 2006) but the debate is ongoing on the extent to which these metrics capture the rhythmic basis of speech (Arvaniti, 2009; Fletcher, in press). Furthermore, cross-linguistic studies of rhythm have covered a relatively small number of languages and research on previously unclassified languages is necessary to fully develop the typology of rhythm. This study examines the rhythmic features of Malay, for which, to date, relatively little work has been carried out on aspects of rhythm and timing. The material for the analysis comprised 10 sentences produced by 20 speakers of standard Malay (10 males and 10 females). The recordings were first analysed using rhythm metrics proposed by Ramus et al. (1999) and Grabe & Low (2002). These metrics (∆C, %V, rPVI, nPVI) are based on durational measurements of vocalic and consonantal intervals. The results indicated that Malay clustered with other so-called syllable-timed languages like French and Spanish on the basis of all metrics. However, underlying the overall findings for these metrics there was a large degree of variability in values across speakers and sentences, with some speakers having values in the range typical of stress-timed languages like English.
Further analysis has been carried out in light of Fletcher’s (in press) argument that measurements based on duration do not wholly reflect speech rhythm as there are many other factors that can influence values of consonantal and vocalic intervals, and Arvaniti’s (2009) suggestion that other features of speech should also be considered in descriptions of rhythm to discover what contributes to listeners’ perception of regularity. Spectrographic analysis of the Malay recordings brought to light two parameters that displayed consistency and regularity for all speakers and sentences: the duration of individual vowels and the duration of intervals between intensity minima. This poster presents the results of these investigations and points to connections between the features which seem to be consistently regulated in the timing of Malay connected speech and aspects of Malay phonology. The results are discussed in light of the current debate on descriptions of rhythm.
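
As an illustration of the duration-based metrics named in the abstract above (∆C, %V, rPVI, nPVI), the following is a minimal Python sketch based on their commonly cited definitions from Ramus et al. (1999) and Grabe & Low (2002). The function names and the toy interval durations are hypothetical and are not taken from the study. The use of the population standard deviation for ∆C is also an assumption.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total utterance duration occupied by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta_c(consonantal):
    """Delta-C: standard deviation of consonantal interval durations.

    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    return pstdev(consonantal)


def rpvi(durations):
    """Raw PVI: mean absolute difference between successive intervals."""
    pairs = zip(durations, durations[1:])
    return mean(abs(a - b) for a, b in pairs)


def npvi(durations):
    """Normalised PVI: each difference is divided by the pair's mean
    duration, and the result is scaled by 100."""
    pairs = zip(durations, durations[1:])
    return 100.0 * mean(abs(a - b) / ((a + b) / 2.0) for a, b in pairs)


if __name__ == "__main__":
    # Hypothetical interval durations in milliseconds for one sentence.
    vocalic = [80, 120, 100]
    consonantal = [60, 100, 80]

    print("%V   :", round(percent_v(vocalic, consonantal), 2))
    print("DeltaC:", round(delta_c(consonantal), 2))
    print("rPVI :", round(rpvi(consonantal), 2))  # usually applied to consonantal intervals
    print("nPVI :", round(npvi(vocalic), 2))      # usually applied to vocalic intervals
```

Because all four functions take only lists of interval durations, any rate-related or segmentation-related variability in those durations feeds directly into the scores, which is the kind of limitation the abstract attributes to purely durational measures.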