    Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

    Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that researchers working in this field can better utilize the contributions of their peers. As such, we shall be uploading this paper to arXiv and updating it periodically to reflect the advances made in the field.

    Leveraging Auxiliary Domain Parallel Data in Intermediate Task Fine-tuning for Low-resource Translation

    NMT systems trained on Pre-trained Multilingual Sequence-Sequence (PMSS) models flounder when sufficient amounts of parallel data are not available for fine-tuning. This specifically holds for languages missing/under-represented in these models. The problem gets aggravated when the data comes from different domains. In this paper, we show that intermediate-task fine-tuning (ITFT) of PMSS models is extremely beneficial for domain-specific NMT, especially when target domain data is limited/unavailable and the considered languages are missing or under-represented in the PMSS model. We quantify the variation in domain-specific results using a domain-divergence test, and show that ITFT can mitigate the impact of domain divergence to some extent. Comment: Accepted for poster presentation at the Practical Machine Learning for Developing Countries (PML4DC) workshop, ICLR 202

    Tamils and the nation: India and Sri Lanka compared

    This dissertation examines the divergent trajectories of ethnic and national politics in the Tamil-speaking regions of India and Sri Lanka. Despite comparable historical experiences and conditions, the south Indian Tamil-speaking areas were peaceably accommodated within a pan-Indian framework whilst Sri Lankan politics was marked by escalating Tamil-Sinhala ethnic polarisation and violent conflict. The dissertation explains these contrasting outcomes by setting out a novel theoretical framework that draws on the work of Reinhart Koselleck and his analysis of the links between concepts and political conflict. It argues that in the era of popular sovereignty the nation and ethnicity have become central and unavoidable concepts of political order, but concepts that can be deliberately constructed through political activity in more or less inclusive ways. Setting out the conceptual connections between the nation, ethnicity and popular sovereignty, the dissertation shows how the conceptual tension between a unified national identity / interest and ethnic pluralism becomes a central and unavoidable locus of political contestation in the era of popular sovereignty. Tracing the politics of ethnicity and nationalism in India and Sri Lanka from the late nineteenth century to the late 1970s, the analysis shows that the accommodation of Tamil identity within Indian nationalist frameworks and the escalation of Tamil-Sinhala ethnic conflict in Sri Lanka cannot be linked to differences in ethnic demography, political system, historical experiences or the structure of economic incentives. It reveals instead that these divergent outcomes are best explained as effects of contingent and competitive processes of political organisation and mobilisation through which deliberately more or less ethnically inclusive national identities are asserted, established and then contested.

    ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English

    We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic - English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems. Comment: Accepted to the Seventh Arabic Natural Language Processing Workshop (WANLP 2022

    A reflection on the design and user acceptance of Tamil talk

    Tamil talk is a speech-to-text application that was designed from a perspective of language and philosophy. This paper takes an indigenous approach in reflecting on the design and user acceptance of Tamil talk. The paper makes use of literature in critically reflecting on the design and the potential user acceptance of the application. It takes a multidisciplinary approach and explores the influence of factors like language shift, language maintenance and philosophy in the context of user acceptance of speech-to-text. The application may appeal to a section of native Tamil speakers, as suggested in the literature, but there are complex challenges that need further research. Further research should develop an application that conforms to the conceptual framework and test it widely with native speakers to arrive at a more precise prediction of user acceptance.

    Between Text and Talk: Expertise, Normativity, and Scales of Belonging in the Montreal Tamil Diasporas.

    In the global city-region of Montreal, Tamil-speaking residents are orienting themselves to multiple homelands, nations, and diasporas of different spatial and temporal scales. These scales of belonging are constituted by regimenting linguistic forms, practices, and speakers into a series of hierarchical relationships that are recursively modeled on the ideological distinctions between “text” and “talk”. Various language ideologies contribute to this politics of regimentation, including the globally dominant ethnolinguistic language ideology, the locally-specific language ideology of sociolinguistic compartmentalization, and the regionally-specific diglossia language ideology. Out of these mutually reinforcing ideologies and institutions have emerged two morally incommensurable Tamil sociolinguistic personas. In the Indian Tamil diaspora, the cultivation of talk-like expertise in Tamil is celebrated as an index of speakers’ globalizing and modernist moral sensibilities. In the Sri Lankan Tamil diaspora, the cultivation of text-like expertise in Tamil is celebrated as an index of speakers’ purist and primordialist moral sensibilities. There is a complementarity to this division of language labor, with Indian Tamils entrusted to modernize the prestige of the mother tongue and Sri Lankan Tamils entrusted to preserve the purity of the literary standard. The expansion of the Sri Lankan Tamil diaspora, with its heritage language institutions and textual facades, and the increase in Indian Tamil linguistic entrepreneurs testifies to the profitability of this arrangement for both Montreal Tamil groups. Each Tamil diaspora also socializes its youth to endorse mutually-opposed ethnonational Tamil personas while cultivating similar linguistic repertoires. 
    Thus, even though 2nd generation Indian Tamils are socialized to speak English and colloquial Tamil and Sri Lankan Tamils are socialized to speak French and literary-stylized Tamil, incentives to habitually code-switch between Tamil, English, and/or French have caused these linguistic repertoires to converge. Sometimes, such acts of code-switching/code-mixing are intended to shift the normative scale of the communicative encounter or the discursive frame. For Sri Lankan Tamil nationalists, the political uncertainties of the refugee experience will precipitate a shift in the inter-discursive frame between diaspora and homeland. For other Montreal Tamils, the racialization of “tamouls” as permanent “étrangers” will prompt attempts to shift the scales of communicative encounters between majority and minority interlocutors. Ph.D., Anthropology, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61571/1/sndas_1.pd

    Survey of Low-Resource Machine Translation

    We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.