542 research outputs found

    A Review on Machine Translation in Indian Languages

    Machine translation (MT) is considered an important task that can be used to obtain information from documents written in different languages. In the current paper, we discuss different approaches to MT, the problems faced in MT for Indian languages, and the limitations of some existing MT systems, and present a review of the work done so far on MT from an Indian-language perspective.

    HiPHET: A Hybrid Approach to Translate Code Mixed Language (Hinglish) to Pure Languages (Hindi and English)

    Bilingual code-mixed (hybrid) languages have become very popular in India as a result of the spread of Western technology in the form of television, the Internet and social media. This increase in the use of code-mixed languages in day-to-day communication has created a need to preserve the integrity of Indian languages, which motivated the development of the tool named Hinglish to Pure Hindi and English Translator. The tool translates in three ways: Hinglish to Pure Hindi and Pure English, Pure Hindi to Pure English, and vice versa. It achieves an accuracy of 91% when producing Hindi sentences as output and 84% when producing English sentences as output, where the input sentences are in Hinglish. The tool is also compared with a similar existing tool in the paper. A minimal sketch of the word-level language-identification step such a hybrid pipeline typically begins with appears below.
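
    The following is a rough, hedged illustration of word-level language identification for code-mixed text, the usual first step before routing tokens to a Hindi or English translation path. The toy lexicons, tag names and example sentence are invented for illustration and are not taken from HiPHET.

```python
# Minimal sketch of word-level language tagging for code-mixed (Hinglish)
# input. Real systems use much larger lexicons plus character-level models.

HINDI_WORDS = {"mujhe", "pasand", "hai", "nahi", "kya", "aap"}     # toy lexicon (assumed)
ENGLISH_WORDS = {"movie", "coffee", "office", "weekend", "phone"}  # toy lexicon (assumed)

def tag_tokens(sentence: str) -> list[tuple[str, str]]:
    """Tag each token as Hindi (HI), English (EN), or unknown (UNK)."""
    tags = []
    for token in sentence.lower().split():
        if token in HINDI_WORDS:
            tags.append((token, "HI"))
        elif token in ENGLISH_WORDS:
            tags.append((token, "EN"))
        else:
            tags.append((token, "UNK"))
    return tags

if __name__ == "__main__":
    # "I like coffee" written in Hinglish
    print(tag_tokens("mujhe coffee pasand hai"))
    # [('mujhe', 'HI'), ('coffee', 'EN'), ('pasand', 'HI'), ('hai', 'HI')]
```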

    Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts

    In this paper, we investigate the issue of hate speech by presenting a novel task: translating hate-speech text into non-hate-speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research on the task, and we evaluate our baseline results using multiple metrics, including BLEU scores. The aim of this study is to contribute to the development of more effective methods for reducing the spread of hate speech in online communities.
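
    As a hedged illustration of the kind of BLEU evaluation the baselines rely on, the sketch below scores a hypothetical system output against a reference with the sacrebleu library; the Spanish sentences are invented and not drawn from the paper's dataset.

```python
# Minimal sketch of BLEU scoring for generated text, as used to evaluate
# the baselines. Requires `pip install sacrebleu`.
import sacrebleu

# One hypothesis from a hypothetical detoxification model, and one
# reference stream (a list parallel to the hypotheses).
hypotheses = ["no estoy de acuerdo con tu opinion"]   # invented example
references = [["no comparto tu opinion"]]             # invented example

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```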

    Readability assessment and automatic text simplification: the analysis of Basque complex structures

    In this thesis, we take the first steps toward automatically analyzing the complexity of Basque texts and simplifying them. To analyze text complexity, we build on work in other languages on automatic text simplification and on a linguistic analysis of Basque corpora. From these analyses, we establish the linguistic foundations for simplifying texts automatically. To analyze complexity automatically, we create and implement the ErreXail system, based on linguistic features and machine-learning techniques. In addition, we design the architecture of the Euskarazko Testuen Sinplifikatzailea (EuTS) system, which will simplify texts automatically, defining the operations to be carried out in each module and, as a case study, implementing Biografix, a multilingual tool that simplifies parenthetical structures containing biographical information. Finally, we compile the Euskarazko Testu Sinplifikatuen Corpusa (ETSC), a corpus of simplified Basque texts, and use it to compare the approach derived from our simplification analyses with others. To carry out these comparisons, we also define an annotation scheme.
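
    To make the feature-plus-classifier idea behind a system like ErreXail concrete, here is a minimal sketch that labels texts as simple or complex from two shallow features using scikit-learn. The features, training sentences and labels are invented for illustration; the real system uses far richer Basque-specific linguistic cues.

```python
# Minimal sketch of readability assessment as supervised classification
# over shallow linguistic features. Requires `pip install scikit-learn`.
from sklearn.linear_model import LogisticRegression

def features(text: str) -> list[float]:
    """Two toy features: average word length and sentence length."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    return [avg_word_len, float(len(words))]

# Invented training data: 0 = simple, 1 = complex.
train_texts = [
    "short simple words here",
    "notwithstanding considerable morphosyntactic agglutination characteristics",
]
labels = [0, 1]

clf = LogisticRegression().fit([features(t) for t in train_texts], labels)
print(clf.predict([features("plain easy text")]))  # expected: [0] (simple)
```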

    Enhancing the Performance of NMT Models Using the Data-Based Domain Adaptation Technique for Patent Translation

    During today’s age of unparalleled connectivity, language and data have become powerful tools capable of enabling effective communication and cross-cultural collaboration. Neural machine translation (NMT) models are especially capable of leveraging linguistic knowledge and parallel corpora to increase global connectivity and act as a vehicle for the transmission of knowledge. In this thesis, we apply a data-based domain adaptation technique to fine-tune three pre-existing NMT transformer models with attention mechanisms for the task of patent translation from English to Japanese. Language, especially in the context of patents, can be highly nuanced; a clear understanding of the intended meaning requires comprehensive domain knowledge and expert linguistic ability, which can be expensive and time-consuming to obtain. Automating translation helps, but commercially available NMT models perform poorly on this task because they are not trained on highly technical terms whose meaning may depend on the domain in which they are used. Our aim is to enhance the performance of translation models on highly technical inputs through a range of essential steps centered on data-based domain adaptation. Together, these steps improve the NMT model's performance, yielding a 41.22% increase over the baseline BLEU score.
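
    As a hedged sketch of data-based domain adaptation, the code below continues training a pretrained translation transformer on a tiny stand-in for an in-domain patent corpus using Hugging Face Transformers. The model name, example sentence pair and hyperparameters are assumptions for illustration, not the thesis's actual setup.

```python
# Minimal sketch: fine-tune a general-domain NMT model on in-domain
# parallel text so it adapts to patent vocabulary and style.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Pretrained general-domain En->Ja baseline (assumed model name).
model_name = "Helsinki-NLP/opus-mt-en-jap"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny invented sample standing in for a patent parallel corpus.
pairs = Dataset.from_dict({
    "en": ["The semiconductor substrate comprises a dielectric layer."],
    "ja": ["半導体基板は誘電体層を含む。"],
})

def preprocess(batch):
    # Tokenize source sentences; tokenized targets become training labels.
    enc = tokenizer(batch["en"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["ja"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

tokenized = pairs.map(preprocess, batched=True, remove_columns=["en", "ja"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="patent-nmt",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # continues training, adapting the model toward the domain
```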

    A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

    Large language models (LLMs) are a special class of pretrained language models obtained by scaling up model size, pretraining corpus and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities that allow them to achieve remarkable performance on many natural language processing tasks without any task-specific training. The era of LLMs started with the OpenAI GPT-3 model, and the popularity of LLMs has grown rapidly since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey that summarizes recent research progress along multiple dimensions and can guide the research community with insightful directions for future work. We start the survey with foundational concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains and multiple languages. We also discuss the data-labelling and data-augmentation abilities of GLLMs, their robustness, and their effectiveness as evaluators, and finally conclude with multiple insightful directions for future research. In summary, this comprehensive survey will serve as a good resource for both academics and industry practitioners to stay updated on the latest research related to GPT-3 family large language models. (Preprint under review, 58 pages.)
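
    To illustrate the "no task-specific training" ability the survey highlights, here is a minimal few-shot prompting sketch using the OpenAI Python client: the model classifies sentiment purely from in-context examples. The model name, prompt and expected output are illustrative assumptions; it assumes `pip install openai` and an OPENAI_API_KEY in the environment.

```python
# Minimal sketch of in-context (few-shot) learning with a GPT-family model:
# no fine-tuning, the task is specified entirely in the prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Classify the sentiment as positive or negative.\n"
    "Review: The plot was dull. -> negative\n"
    "Review: A delightful surprise. -> positive\n"
    "Review: I would watch it again. ->"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice of GPT-3-family successor
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content.strip())  # expected: "positive"
```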