1,493 research outputs found

Rule Based Transliteration Scheme for English to Punjabi

Authors: Bhalla Deepti; Joshi Nisheeth; Mathur Iti
Publication date: 01/04/2013

Machine transliteration has emerged as an important research area within machine translation. Transliteration aims to preserve the phonological structure of words, and proper transliteration of named entities plays a significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of separating a word into its syllables. We calculate probabilities for named entities (proper names and locations). For words that are not named entities, separate probabilities are calculated by relative frequency using the statistical machine translation toolkit MOSES. Using these probabilities, we transliterate input text from English to Punjabi.

arXiv.org e-Print Archive

CogPrints Cognitive Sciences Eprint Archive

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

Authors: Adelani David I.; Axelrod Vera; Caswell Isaac; Cherry Colin; Clark Jonathan H.; Dickinson Dana L.; Garrette Dan; Gupta Nitish; Gutkin Alexander; Ingle Reeve; Johnson Melvin; Kale Mihir; Katanova Anna; Kirov Christo; Ma Min; Nicosia Massimo; Panteleev Dmitry; Rijhwani Shruti; Riley Parker; Roark Brian; Ruder Sebastian; Samanta Bidisha; Sarr Jean-Michel A.; Talukdar Partha; Tao Connie; Wang Xinyi; Wieting John
Publication date: 24/05/2023

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate model

arXiv.org e-Print Archive

Dialogic Heteroglossia in Asian-Canadian Literature

Author: AHMADI Sawssen
Publication venue: Mohammad Nassar for Researches (MNFR)
Publication date: 22/06/2023

A feature of some Asian-Canadian narratives is that they give a linguistic voice to voiceless/wordless characters to defend their rights as marginalized and invisible identities. The focus of this paper is the study of Obasan, Chorus of Mushrooms, What the Body Remembers and Everything Was Good-Bye as literary productions in a transitional era within which language and translation influence identity construction and representation. The purpose is to tackle the dialogic translingual and heteroglossic technique used by diasporic writers to represent ethnic minorities’ melancholic history and hybrid identities.

GSSRR.ORG: International Journals: Publishing Research Papers in all Fields

Linguistic variation and ethnicity in a super-diverse community: The case of Vancouver English

Author: Presnyakova Irina
Publication date: 11/12/2020

Today, people with British/European heritage comprise about half (49.3%) of the total population of Metro Vancouver, while the other half is represented by visible minorities, with Chinese (20.6%) and South Asians (11.9%) being the largest groups (Statistics Canada 2017). However, the non-White population is largely unrepresented in sociolinguistic research on the variety of English spoken locally. The objective of this study is to determine whether and to what extent young people with non-White ethnic backgrounds participate in some of the ongoing sound changes in Vancouver English. Data from 45 participants with British/Mixed European, Chinese and South Asian heritage, all native speakers of English, were analyzed instrumentally to obtain formant measurements of each speaker's vowels. Interview data were subjected to thematic analysis that aimed to describe the extent to which each participant affiliated with their heritage. The results of the descriptive and inferential statistical analysis showed that, first, the vowel systems of these young people are similar and they are all undoubtedly speakers of modern Canadian English as described in previous research (Boberg 2010). Second, all three groups participate in the most important changes in Canadian English: the Canadian Shift, Canadian Raising, the fronting of back vowels, and allophonic variation of /æ/ in pre-nasal and pre-velar positions. The differences along ethnic lines that were discovered concern the degree of advancement of a given change, not its presence or absence. Socio-ethnic profiles of the participants created on the basis of the thematic analysis can be roughly put into two categories, mono- and bicultural identity orientation (Comănaru et al. 2018). Great variability is described both within and across groups, with language emerging as one of the most important factors in the participants’ identity construction.
Exploratory analysis showed some tendencies in vowel production by speakers with mono- and bicultural orientations, with differences both among and within two non-White groups. The findings of the study call into question both our understanding of the mechanisms of language acquisition and our approach to delimiting and describing speech communities in super-diverse urban centers

Simon Fraser University Institutional Repository

Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

Authors: Ahuja Kabir; Bali Kalika; Balloli Vaibhav; Ganu Tanuja; Nambi Akshay; Ranjit Mercy; Sitaram Sunayana
Publication date: 28/05/2023

Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs, specifically focusing on Generative models. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the true potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield remarkable improvements in multilingual proficiency. First, by meticulously optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance boosts across languages. Second, we introduce a new hybrid approach that synergizes GPT generation with multilingual embeddings and achieves significant multilingual performance improvement on critical tasks like QA and retrieval. Finally, to further propel the performance of polyglot LLMs, we introduce a novel learning algorithm that dynamically selects the optimal prompt strategy, LLM model, and embeddings per query. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming best static and random strategies. Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages

arXiv.org e-Print Archive

Language choice as a gate-keeping practice : an exploration into the psycho-social impacts of multilingualism through case studies from the educational and judicial sectors of Pakistan

Author: Nasir Aftab
Publication venue: Universitäts- und Landesbibliothek Bonn

Pakistan is a multilingual and multi-ethnic country: all the provinces have their own regional languages as lingua franca, e.g., Punjabi in Punjab, Pashto in Khyber Pakhtunkhwa, etc.; Urdu is the national language, i.e., it is the language of the majority of state schools and of the media; whereas English, owing to its colonial past, is the official language of Pakistan, i.e., it is the language of official transactions, the constitution, law and higher education in the country. Such a division means that individuals may have to switch from one language to another when they move from home to school, to work settings, or to official business in public or private offices. The analysis of the data collected from different sites (educational and judicial sectors) reveals how and why the discourses of differential use of languages, created and shaped by the educational institutes, are affected by the overall linguistic attitudes existing in society towards different languages. This research concludes that, on the societal level, this differential language system excludes those who do not know a particular language, i.e., English, and disempowers them structurally from getting their due share of state-provided services, such as justice and education. In 2013, all the stakeholders working for the development of Pakistan, both public and private, came together and agreed upon an agenda for development, called Vision 2025. This document is an aspirational tool to be used as a conceptual framework for steering the country in the direction of sustainable and inclusive development. The aim of this vision is to bring Pakistan among the top 25 world economies by the year 2025. Whether such ambitious attempts, as outlined in the document, complement or contradict the already existing social realities and discourses of development remains a central theme of the current research.
In this regard, language policies, perceptions, attitudes, and daily practices become the lens that is used to investigate the relation between the theoretical aspirations and practical situations on the ground. Language choice becomes a contested field in a multilingual society. Any act of speech in such societies is a political act, where different languages are chosen for different purposes. In a contemporary globalized world where one language, i.e., English, enjoys the most acceptance, those who know English better acquire added leverage and symbolic power over others in the everyday interactions of members of Pakistani society. This dissertation maintains that fascination with English in the context of Pakistan is actually a colonial legacy that has worked towards establishing and perpetuating symbolic superiority over other languages and speakers of those languages in contemporary Pakistani society. My contribution departs from the traditional themes of political economy of nation-state models that focus on the broader themes of nation-building and the issues of governance, identity and marginality in a post-colonial nation(s). I attempt to address the questions of power and distribution of linguistic resources in Pakistani polity from a sociological angle. In the following pages, I specifically conceptualise and analyse the social practices, attitudes and discourses of marginality and identity construction along linguistic lines by using the concepts of habitus, field, capital, and symbolic power. This dissertation tries to untangle multilingualism along two broad themes: 1) it addresses the questions of the sociocultural dominance of the English and Urdu languages over regional languages; 2) it shows how the distribution of linguistic resources is contested, negotiated and reproduced in the praxis of the stakeholders interacting in a multilingual setting.
In order to conduct an empirical investigation, two sectors are selected, i.e., the educational and judicial sectors of Pakistan. The rationale behind this choice is both theoretical and practical. The educational sector is selected as it becomes the seedbed where discourses and perceptions are produced and reproduced for the official and legitimate practices of language use, whereas the judicial sector is selected as it links directly with the manifestation of these discourses and perceptions. It is in the judicial sector that one sees the direct effects of knowing or not knowing a specific language manifested, as every letter, word, and comma matters in judicial proceedings. An institution that is responsible for disseminating justice to the citizens of the state, the judicial system uses a language (English) that is alien to the majority of the Pakistani population. For example, only in the Punjab province, around 45% of the total population speaks Punjabi, whereas only 4-7% can speak or understand English, yet the judicial system in its official discourse conducts all its business in English. The laws, court proceedings, and verdicts disseminated in various trials in the judicial courts are conducted in English. This research aims at finding out whether this very act of conducting judicial proceedings in English disenfranchises the masses from the system. A mixed-method research design was used to investigate these questions at the public universities and judicial courts in Pakistan. The question remains if and how the choice of language in education can become a tool for estrangement and exclusion. Discourses of development and language-based inequality seem to exist next to each other, woven seamlessly into the overall social fabric, habitus, of contemporary Pakistani society. The empirical evidence from the educational institutes further elaborates why a certain language, e.g., English, is preferred at the expense of others.
It also shows what kinds of benefits and disadvantages are entailed in knowing or not knowing English and what kinds of identities are associated with English and other languages, such as Punjabi. The underlying generative principle, habitus, combines this language-based inequality with development in a way that current education policy actually perpetuates ideologies, perceptions and practices of social inequalities. These structurally inculcated distinctive principles inadvertently “convince” the dominated, the less-advantaged, into accepting the conditions of his/her own dominance as natural, thereby resulting in symbolic constraint. Moreover, this research shows how language is used as an aspirational capacity for social mobility and what hurdles, both social and psychological, students face in using this capacity in their prospective lives. Historically speaking, after the independence of Pakistan, the British rulers left in 1947, but the unequal social spaces they created stayed behind as these arrangements suited those who were already working under the British rule. Under such conditions and neo-colonial patterns of life, there emerges a hybrid form of speech, one where words of Urdu, English, Punjabi or other regional languages are inter-mixed. An act of using a signifier of one language, say English, while speaking in another language, say Urdu or Punjabi, results in providing extra leverage, symbolic superiority, and authority to the speaker. This hybrid speech serves two purposes: a) it keeps the power inequality intact as it renders one language, i.e., English, superior over all other local languages, and b) it helps to appease those who, not having the capacity to compete in the English-dominant market, nevertheless remain at the periphery of the circle, trying to carve out their own spaces.
Thus the linguistic interactions, eventually and inadvertently, result in shaping, reproducing, and reinforcing the sociological habitus that in the first place creates social inequalities generated by varied use of languages for various purposes. Therefore, it is argued that the official discourse of sustainable development, though promising in principle, stands miles away from the social realities of development of Pakistan. The development experts, both national and international, have to consider the socially participative model of development in order to address the pressing challenges of nation-building as compared to state-building as far as the language-related problems of Pakistani society are concerned.

bonndoc – Der Publikationsserver der Universität Bonn