47 research outputs found

    Constraint Based Hybrid Approach to Parsing Indian Languages

    PACLIC 23 / City University of Hong Kong / 3-5 December 200

    Assessing Translation capabilities of Large Language Models involving English and Indian Languages

    Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best-performing large language model for the translation task, which is based on LLaMA. Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for English to Indian languages on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35, and 12.55, along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets. Overall, our findings highlight the potential and strength of large language models for machine translation, including for languages that are currently underrepresented in LLMs.
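The chrF scores in the abstract above are character n-gram F-scores. As a rough illustration of what such a score measures, here is a minimal, simplified sentence-level sketch. It is not the evaluation code used in the paper, and it does not reproduce any particular toolkit's exact behaviour. The function names `char_ngrams` and `simple_chrf`, the choice to strip spaces, the n-gram range 1 to 6, and `beta = 2` are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring spaces (an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score on a 0-100 scale.

    Precision and recall are averaged over the n-gram orders for which
    both strings contain at least one n-gram, then combined into an
    F-beta score. beta > 1 weights recall more heavily than precision.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 100, and strings with no characters in common score 0. Published scores should be computed with a standard evaluation toolkit rather than a sketch like this.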

    The crosslinguistic acquisition of sentence structure: Computational modeling and grammaticality judgments from adult and child speakers of English, Japanese, Hindi, Hebrew and K'iche'

    This preregistered study tested three theoretical proposals for how children form productive yet restricted linguistic generalizations, avoiding errors such as *The clown laughed the man, across three age groups (5–6 years, 9–10 years, adults) and five languages (English, Japanese, Hindi, Hebrew and K'iche'). Participants rated, on a five-point scale, correct and ungrammatical sentences describing events of causation (e.g., *Someone laughed the man; Someone made the man laugh; Someone broke the truck; ?Someone made the truck break). The verb-semantics hypothesis predicts that, for all languages, by-verb differences in acceptability ratings will be predicted by the extent to which the causing and caused event (e.g., amusing and laughing) merge conceptually into a single event (as rated by separate groups of adult participants). The entrenchment and preemption hypotheses predict, for all languages, that by-verb differences in acceptability ratings will be predicted by, respectively, the verb's relative overall frequency, and frequency in nearly-synonymous constructions (e.g., X made Y laugh for *Someone laughed the man). Analysis using mixed effects models revealed that entrenchment/preemption effects (which could not be distinguished due to collinearity) were observed for all age groups and all languages except K'iche', which suffered from a thin corpus and showed only preemption sporadically. All languages showed effects of event-merge semantics, except K'iche', which showed only effects of supplementary semantic predictors. We end by presenting a computational model which successfully simulates this pattern of results in a single discriminative-learning mechanism, achieving by-verb correlations of around r = 0.75 with human judgment data.
    Additional co-authors: Rukmini Bhaya Nair, Seth Campbell, Clifton Pye, Pedro Mateo Pedro, Sindy Fabiola Can Pixabaj, Mario Marroquín Pelíz, Margarita Julajuj Mendoz
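The by-verb correlation reported in the abstract above is a Pearson r between model predictions and mean human ratings, with one pair of values per verb. The following minimal sketch shows how such a coefficient is computed. It is not the study's analysis code, and the function name `pearson_r` and its input layout are assumptions made for this example.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences.

    In a by-verb analysis, xs could hold one model prediction per verb
    and ys the mean human acceptability rating for the same verb.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the model ranks and spaces the verbs much as human raters do, while a value near 0 means there is no linear relationship.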

    A Preliminary Work on Causative Verbs in Hindi

    This paper introduces preliminary work on Hindi causative verbs: their classification, a linguistic model for their classification, and their verb frames. The main objective of this work is to come up with a classification of the Hindi causative verbs. In the classification we show how different types of Hindi verbs have different types of causative forms. It will be a linguistic resource for Hindi causative verbs which can be used in various NLP applications. This resource enriches the already available linguistic resource on Hindi verb frames (Begum et al., 2008b). This resource will be helpful in getting proper insight into Hindi verbs. In this paper, we present the morphology, semantics and syntax of the causative verbs. The morphology is captured by the word generation process; the semantics is captured by the linguistic model followed for classifying the verbs; and the syntax is captured by the verb frames using relations given by Panini.

    CREATING LANGUAGE RESOURCES FOR NLP IN INDIAN LANGUAGES

    1. BACKGROUND
    Non-availability of lexical resources in electronic form is a major bottleneck for anyone working in the field of NLP on Indian languages. Some measures were taken to alleviate this bottleneck in a quick and efficient way. It was felt that if the development of these resources is linked with an example application, then it can act as a test bed for the developing resources and provide constant feedback. Moreover, immediate results in terms of a performing system also enthuse the developers for such time-consuming jobs. It was decided to take up the building of a machine translation system as an example application, which would also serve as a vehicle for building lexical resources.
    2. DEVELOPING LEXICAL RESOURCES
    The following lexical resources were built or are being built as part of a planned effort:
    a) Electronic dictionary (Shabdanjali English-Hindi dictionary)
    b) Transfer lexicon and grammar (TransLexGram)
    c) Part-of-Speech tagged corpora.
    These are described below.
    2.1 SHABDANJALI ELECTRONIC DICTIONARY
    As a first step in this direction, a collaborative effort was undertaken to develop a bilingual electronic dictionary in the free software model. The interesting aspect of this effort was that the work was carried out by school children, teachers and others. People in about 8 cities were involved in the exercise. The school teachers participated, to some extent, in correcting and refining the work. The development of the dictionary resource took advantage of the bilingual ability of the contributors. The contributors provided the basic data:
    a) A number of Hindi equivalents required to cover various senses of the English lexical item in various contexts.
    b) An English example sentence for every Hindi equivalent.
    (The developed resource is now available as an "open resource" under General Public License