47 research outputs found
Constraint Based Hybrid Approach to Parsing Indian Languages
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Assessing Translation capabilities of Large Language Models involving English and Indian Languages
Generative Large Language Models (LLMs) have achieved remarkable advancements
in various NLP tasks. In this work, our aim is to explore the multilingual
capabilities of large language models by using machine translation as a task
involving English and 22 Indian languages. We first investigate the translation
capabilities of raw large language models, followed by exploring the in-context
learning capabilities of the same raw models. We fine-tune these large language
models using parameter efficient fine-tuning methods such as LoRA and
additionally with full fine-tuning. Through our study, we have identified the
best-performing large language model for the translation task, which is based
on LLaMA.
Our results demonstrate significant progress, with average BLEU scores of
13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99,
42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for
English to Indian languages on IN22 (conversational), IN22 (general),
flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for
Indian languages to English, we achieved average BLEU scores of 14.03, 16.65,
16.17, 15.35 and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51,
and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational),
IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets.
Overall, our findings highlight the potential and strength of large language
models for machine translation capabilities, including for languages that are
currently underrepresented in LLMs.
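As background for the chrF figures reported in this abstract, the following is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplified illustration, not the evaluation code used in the study: the function names `char_ngrams` and `chrf`, the removal of spaces, and the defaults `max_n=6` and `beta=2.0` are assumptions made here, and standard toolkits may differ in details such as whitespace handling and averaging.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One side is too short to contain n-grams of this order; skip it.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(chrf("the cat sat on the mat", "the cat sat on a mat"))
```

Because the score is computed over characters rather than words, it gives partial credit for near-matches in inflected word forms, which is one reason it is often reported alongside BLEU for morphologically rich languages.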
The crosslinguistic acquisition of sentence structure: Computational modeling and grammaticality judgments from adult and child speakers of English, Japanese, Hindi, Hebrew and K'iche'
This preregistered study tested three theoretical proposals for how children form productive yet restricted linguistic generalizations, avoiding errors such as *The clown laughed the man, across three age groups (5–6 years, 9–10 years, adults) and five languages (English, Japanese, Hindi, Hebrew and K'iche'). Participants rated, on a five-point scale, correct and ungrammatical sentences describing events of causation (e.g., *Someone laughed the man; Someone made the man laugh; Someone broke the truck; ?Someone made the truck break). The verb-semantics hypothesis predicts that, for all languages, by-verb differences in acceptability ratings will be predicted by the extent to which the causing and caused event (e.g., amusing and laughing) merge conceptually into a single event (as rated by separate groups of adult participants). The entrenchment and preemption hypotheses predict, for all languages, that by-verb differences in acceptability ratings will be predicted by, respectively, the verb's relative overall frequency, and frequency in nearly-synonymous constructions (e.g., X made Y laugh for *Someone laughed the man). Analysis using mixed effects models revealed that entrenchment/preemption effects (which could not be distinguished due to collinearity) were observed for all age groups and all languages except K'iche', which suffered from a thin corpus and showed only preemption sporadically. All languages showed effects of event-merge semantics, except K'iche' which showed only effects of supplementary semantic predictors. We end by presenting a computational model which successfully simulates this pattern of results in a single discriminative-learning mechanism, achieving by-verb correlations of around r = 0.75 with human judgment data. Additional co-authors: Rukmini Bhaya Nair, Seth Campbell, Clifton Pye, Pedro Mateo Pedro, Sindy Fabiola Can Pixabaj, Mario Marroquín Pelíz, Margarita Julajuj Mendoz
A Preliminary Work on Causative Verbs in Hindi
This paper introduces a preliminary work on Hindi causative verbs: their classification, a linguistic model for their classification and their verb frames. The main objective of this work is to come up with a classification of the Hindi causative verbs. In the classification we show how different types of Hindi verbs have different types of causative forms. It will be a linguistic resource for Hindi causative verbs which can be used in various NLP applications. This resource enriches the already available linguistic resource on Hindi verb frames (Begum et al., 2008b). This resource will be helpful in getting proper insight into Hindi verbs. In this paper, we present the morphology, semantics and syntax of the causative verbs. The morphology is captured by the word generation process; semantics is captured by the linguistic model followed for classifying the verbs and the syntax has been captured by the verb frames using relations given by Panini.
CREATING LANGUAGE RESOURCES FOR NLP IN INDIAN LANGUAGES
1. BACKGROUND
Non-availability of lexical resources in electronic form is a major bottleneck for anyone working in the field of NLP on Indian languages. Some measures were taken to alleviate this bottleneck in a quick and efficient way. It was felt that if the development of these resources is linked with an example application, then the application can act as a test bed for the developing resources and provide constant feedback. Moreover, immediate results in the form of a working system also enthuse the developers for such time-consuming jobs. It was decided to take up the building of a machine translation system as an example application, which would also serve as a vehicle for building lexical resources.
2. DEVELOPING LEXICAL RESOURCES
The following lexical resources were built or are being built as part of a planned effort:
a) Electronic dictionary (Shabdanjali English-Hindi dictionary)
b) Transfer lexicon and grammar (TransLexGram)
c) Part-of-Speech tagged corpora
These are described below.
2.1 SHABDANJALI ELECTRONIC DICTIONARY
As a first step in this direction, a collaborative effort was undertaken to develop a bilingual electronic dictionary in the free software model. The interesting aspect of this effort was that the work was carried out by school children, teachers and others. People in about 8 cities were involved in the exercise. The school teachers participated, to some extent, in correcting and refining the work. The development of the dictionary resource took advantage of the bilingual ability of the contributors. The contributors provided the basic data:
a) A number of Hindi equivalents required to cover various senses of the English lexical item in various contexts.
b) An English example sentence for every Hindi equivalent.
(The developed resource is now available as an "open resource" under General Public License
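The contributor data described for the dictionary (several Hindi equivalents per English lexical item, each paired with an English example sentence) could be represented roughly as in the sketch below. This is an illustration only: the class names `Sense` and `Entry`, the method `add_sense`, and the sample entry for "bank" are hypothetical and are not taken from the Shabdanjali data or its file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    """One Hindi equivalent together with its English example sentence."""
    hindi_equivalent: str
    english_example: str


@dataclass
class Entry:
    """An English headword with one Sense per Hindi equivalent."""
    headword: str
    senses: List[Sense] = field(default_factory=list)

    def add_sense(self, hindi_equivalent: str, english_example: str) -> None:
        # Enforce the stated requirement: every Hindi equivalent
        # must come with an English example sentence.
        if not hindi_equivalent.strip() or not english_example.strip():
            raise ValueError("each Hindi equivalent needs an English example sentence")
        self.senses.append(Sense(hindi_equivalent, english_example))


# Hypothetical sample entry, written for this sketch.
entry = Entry("bank")
entry.add_sense("किनारा", "They sat on the bank of the river.")
entry.add_sense("बैंक", "She deposited the money in the bank.")

for sense in entry.senses:
    print(f"{entry.headword} -> {sense.hindi_equivalent}: {sense.english_example}")
```

Keeping the example sentence inside each sense, rather than in a separate list, makes it hard for an equivalent to lose the context that distinguishes it from the other senses of the same headword.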