335 research outputs found

GIRNet: Interleaved Multi-Task Recurrent State Sequence Models

Author: Chakrabarti Soumen
Chakraborty Tanmoy
Gupta Divam
Publication date: 25/12/2018

In several natural language tasks, labeled sequences are available in separate domains (say, languages), but the goal is to label sequences with mixed domain (such as code-switched text). Or, we may have available models for labeling whole passages (say, with sentiments), which we would like to exploit toward better position-specific label inference (say, target-dependent sentiment annotation). A key characteristic shared across such tasks is that different positions in a primary instance can benefit from different 'experts' trained from auxiliary data, but labeled primary instances are scarce, and labeling the best expert for each position entails unacceptable cognitive burden. We propose GIRNet, a unified position-sensitive multi-task recurrent neural network (RNN) architecture for such applications. Auxiliary and primary tasks need not share training instances. Auxiliary RNNs are trained over auxiliary instances. A primary instance is also submitted to each auxiliary RNN, but their state sequences are gated and merged into a novel composite state sequence tailored to the primary inference task. Our approach is in sharp contrast to recent multi-task networks like the cross-stitch and sluice network, which do not control state transfer at such fine granularity. We demonstrate the superiority of GIRNet using three applications: sentiment classification of code-switched passages, part-of-speech tagging of code-switched text, and target position-sensitive annotation of sentiment in monolingual passages. In all cases, we establish new state-of-the-art performance beyond recent competitive baselines. Comment: Accepted at AAAI 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Bringing order into the realm of Transformer-based language models for artificial intelligence and law

Author: Greco Candida M.
Tagarelli Andrea
Publication date: 03/02/2024

Transformer-based language models (TLMs) have been widely recognized as a cutting-edge technology for the successful development of deep-learning-based solutions to problems and applications that require natural language processing and understanding. As in other textual domains, TLMs have indeed pushed the state of the art of AI approaches for many tasks of interest in the legal domain. Although the first Transformer model was proposed only about six years ago, this technology has progressed at an unprecedented rate, with BERT and related models representing a major reference, also in the legal domain. This article provides the first systematic overview of TLM-based methods for AI-driven problems and tasks in the legal sphere. A major goal is to highlight research advances in this field so as to understand, on the one hand, how the Transformers have contributed to the success of AI in supporting legal processes, and on the other hand, what the current limitations and opportunities for further research development are. Comment: Please refer to the published version: Greco, C.M., Tagarelli, A. (2023) Bringing order into the realm of Transformer-based language models for artificial intelligence and law. Artif Intell Law, Springer Nature. November 2023. https://doi.org/10.1007/s10506-023-09374-

arXiv.org e-Print Archive

Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages

Author: Jha Saurav
Singh Anil Kumar
Sudhakar Akhilesh
Publication date: 01/01/2019

Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular, for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to perform transduction of such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that our models can be effectively used for language pairs that have limited parallel corpora; our models work at the character level to grasp phonetic and orthographic similarities across multiple types of word adaptations, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions can generalize well to other languages by applying them successfully to Hindi--Bangla cognate pairs. Our work can be seen as an important step in the process of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks. Comment: 47 pages, 4 figures, 21 tables (including Appendices

arXiv.org e-Print Archive

Biblioteka Nauki - repozytorium artykułów

Predicting the Type and Target of Offensive Social Media Posts in Marathi

Author: Chaudhari Mrinal
Gaikwad Saurabh
Krishna Prajwal
Nene Mayuresh
Paygude Shrunali
Ranasinghe Tharindu
Zampieri Marcos
Publication date: 09/07/2022

The presence of offensive language on social media is very common, motivating platforms to invest in strategies to make communities safer. This includes developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0 or MOLD 2.0 and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD with expanded annotation to the levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID. Comment: This is a preprint of an article published in the Journal of Intelligent Information Systems, Springer. The final authenticated version is available online at https://link.springer.com/article/10.1007/s13278-022-00906-

arXiv.org e-Print Archive

Aston Publications Explorer

Lancaster E-Prints

Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

Author: Ahuja Kabir
Bali Kalika
Balloli Vaibhav
Ganu Tanuja
Nambi Akshay
Ranjit Mercy
Sitaram Sunayana
Publication date: 28/05/2023

Large language models (LLMs) are at the forefront of transforming numerous domains globally. However, their inclusivity and effectiveness remain limited for non-Latin scripts and low-resource languages. This paper tackles the imperative challenge of enhancing the multilingual performance of LLMs, specifically focusing on generative models. Through systematic investigation and evaluation of diverse languages using popular question-answering (QA) datasets, we present novel techniques that unlock the true potential of LLMs in a polyglot landscape. Our approach encompasses three key strategies that yield remarkable improvements in multilingual proficiency. First, by meticulously optimizing prompts tailored for polyglot LLMs, we unlock their latent capabilities, resulting in substantial performance boosts across languages. Second, we introduce a new hybrid approach that synergizes GPT generation with multilingual embeddings and achieves significant multilingual performance improvement on critical tasks like QA and retrieval. Finally, to further propel the performance of polyglot LLMs, we introduce a novel learning algorithm that dynamically selects the optimal prompt strategy, LLM, and embeddings per query. This dynamic adaptation maximizes the efficacy of LLMs across languages, outperforming the best static and random strategies. Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.

arXiv.org e-Print Archive

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Author: Homan Christopher
Ranasinghe Tharindu
Sampatrao Gaikwad Saurabh
Zampieri Marcos
Publication venue: INCOMA Ltd
Publication date: 01/09/2021

Lancaster E-Prints