Assessing Translation capabilities of Large Language Models involving English and Indian Languages
Generative Large Language Models (LLMs) have achieved remarkable advancements
in various NLP tasks. In this work, we aim to explore the multilingual
capabilities of large language models, using machine translation between
English and 22 Indian languages as the task. We first investigate the translation
capabilities of raw large language models, followed by exploring the in-context
learning capabilities of the same raw models. We then fine-tune these models
using parameter-efficient fine-tuning methods such as LoRA, as well as with
full fine-tuning. Through this study, we identify the best-performing large
language model for this translation task, which is based on LLaMA.
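LoRA, mentioned above, freezes the base weights and trains only a low-rank update ΔW = BA on selected matrices. The parameter saving is easy to see from a count; a minimal sketch (the 4096-dimensional projection size below is illustrative, not taken from the paper):

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on a d_out x d_in weight W:
    the update is delta_W = B @ A with B (d_out x rank) and A (rank x d_in),
    so only rank * (d_in + d_out) parameters are trained."""
    return rank * (d_in + d_out)

full = 4096 * 4096                      # full fine-tuning of one projection matrix
lora = lora_param_count(4096, 4096, 8)  # rank-8 LoRA adapter for the same matrix
ratio = lora / full                     # fraction of parameters actually trained
```

At rank 8 the adapter trains well under 1% of the parameters of the full matrix, which is what makes parameter-efficient fine-tuning of 13B-scale models practical.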
Our results demonstrate significant progress, with average BLEU scores of
13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99,
42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for
English to Indian languages on IN22 (conversational), IN22 (general),
flores200-dev, flores200-devtest, and newstest2019 test sets. Similarly, for
Indian languages to English, we achieved average BLEU scores of 14.03, 16.65,
16.17, 15.35, and 12.55, along with chrF scores of 36.71, 40.44, 40.26, 39.51,
and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational),
IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets.
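The chrF scores above are corpus-level character n-gram F-scores, in practice computed with a toolkit such as sacreBLEU. As a rough illustration of the metric only, here is a simplified sentence-level chrF sketch (default n-gram order 6 and beta = 2, matching the standard definition):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with spaces removed, as in chrF."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF score in [0, 100]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        total_hyp, total_ref = sum(hyp.values()), sum(ref.values())
        if total_hyp == 0 or total_ref == 0:
            continue  # n-gram order longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / total_hyp)
        recalls.append(overlap / total_ref)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall twice as heavily as precision
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 100, and strings sharing no characters score 0; real evaluations average over a whole test set.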
Overall, our findings highlight the potential of large language models for
machine translation, including for languages that are currently
underrepresented in LLMs.
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages
Cross-lingual dubbing of lecture videos requires the transcription of the
original audio, correction and removal of disfluencies, domain term discovery,
text-to-text translation into the target language, chunking of text using
target language rhythm, text-to-speech synthesis followed by isochronous
lipsyncing to the original video. This task becomes challenging when the source
and target languages belong to different language families, resulting in
differences in generated audio duration. This is further compounded by the
original speaker's rhythm, especially for extempore speech. This paper
describes the challenges in regenerating English lecture videos in Indian
languages semi-automatically. A prototype is developed for dubbing lectures
into 9 Indian languages. A mean opinion score (MOS) is obtained for two
languages, Hindi and Tamil, on two different courses. The output video is
compared with the original video in terms of MOS (1-5) and lip synchronisation,
with scores of 4.09 and 3.74, respectively. Human effort is also reduced by
75%.
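The duration mismatch described above is typically handled by stretching or compressing the synthesized speech so each dubbed chunk fits its source segment. A minimal sketch of that step, assuming a hypothetical helper (the function name and the 1.25x stretch limit are illustrative, not from the paper):

```python
def tts_rate_factor(source_dur, target_dur, max_stretch=1.25):
    """Speaking-rate multiplier that makes synthesized target-language audio
    fit the source segment duration (isochronous dubbing).

    A factor > 1 speeds up the TTS output, < 1 slows it down; the factor is
    clamped to [1/max_stretch, max_stretch] to keep the speech natural.
    """
    factor = target_dur / source_dur
    return min(max(factor, 1.0 / max_stretch), max_stretch)
```

For example, a 12 s synthesized chunk dubbed over a 10 s source segment would be played back 1.2x faster, while a mismatch beyond the clamp would instead require re-chunking or rephrasing the translation.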
GermEval 2018: Machine Learning and Neural Network Approaches for Offensive Language Identification
Social media has been an effective carrier of information since its inception. People worldwide are able to interact and communicate freely without much hassle due to its wide reach. Though the advantages of this mode of communication are many, its severe drawbacks cannot be ignored. One such instance is the rampant use of offensive language in the form of hurtful, derogatory, or obscene comments. There is a growing need for checks on social media websites to curb the menace of offensive language. The GermEval 2018 shared task is an initiative in this direction: automatically identifying offensive language in German Twitter posts. In this paper, we describe our approaches to the different subtasks of GermEval 2018. Two kinds of approaches, machine learning and neural networks, were explored for these subtasks. We observed that character n-grams in Support Vector Machine (SVM) approaches outperformed their neural network counterparts most of the time. The machine learning approaches used TF-IDF features over character n-grams, while the neural networks used word embeddings. We submitted the outputs of three runs, all using SVMs: one run for Task 1 and two for Task 2.
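The winning character n-gram TF-IDF plus SVM setup can be sketched with scikit-learn; the toy German sentences and labels below are illustrative stand-ins, not data from the GermEval corpus, and the exact n-gram range used by the authors is not stated here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (NOT the GermEval corpus), labeled with the
# binary Task 1 categories OFFENSE vs. OTHER.
train_texts = [
    "du bist so dumm",         # toy OFFENSE example
    "halt die klappe idiot",   # toy OFFENSE example
    "schoenes wetter heute",   # toy OTHER example
    "danke fuer den hinweis",  # toy OTHER example
]
train_labels = ["OFFENSE", "OFFENSE", "OTHER", "OTHER"]

# TF-IDF over character n-grams ('char_wb' respects word boundaries),
# feeding a linear SVM -- the combination described in the abstract.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(),
)
clf.fit(train_texts, train_labels)
prediction = clf.predict(["dumm und idiot"])[0]
```

Character n-grams are robust to the spelling variation and creative obfuscation common in tweets, which is one plausible reason they outperformed word-embedding neural models on this task.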