
    Fine-Tuning BERT Models for Intent Recognition Using a Frequency Cut-Off Strategy for Domain-Specific Vocabulary Extension

    The work leading to these results was supported by the Spanish Ministry of Science and Innovation through the R&D&i projects GOMINOLA (PID2020-118112RB-C21 and PID2020-118112RB-C22, funded by MCIN/AEI/10.13039/501100011033), CAVIAR (TEC2017-84593-C2-1-R, funded by MCIN/AEI/10.13039/501100011033/FEDER "Una manera de hacer Europa"), and AMICPoC (PDC2021-120846-C42, funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR"). This research also received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 823907 (http://menhirproject.eu, accessed on 2 February 2022). Furthermore, R.K.'s research was supported by the Spanish Ministry of Education (FPI grant PRE2018-083225).

    Intent recognition is a key component of any task-oriented conversational system. The intent recognizer can first be used to classify the user's utterance into one of several predefined classes (intents) that help to understand the user's current goal, so that the most adequate response can be provided accordingly. Intent recognizers also often appear as joint models that perform the natural language understanding and dialog management tasks together as a single process, thus simplifying the set of problems a conversational system must solve. This is especially true for frequently asked question (FAQ) conversational systems. In this work, we first present an exploratory analysis in which different deep learning (DL) models for intent detection and classification were evaluated. In particular, we experimentally compare and analyze conventional recurrent neural networks (RNN) and state-of-the-art transformer models. Our experiments confirmed that the best performance is achieved with transformers; specifically, by fine-tuning the so-called BETO model (a Spanish pretrained bidirectional encoder representations from transformers (BERT) model from the Universidad de Chile) on our intent detection task. Then, as the main contribution of the paper, we analyze the effect of inserting unseen domain words to extend the vocabulary of the model as part of the fine-tuning or domain-adaptation process. In particular, a very simple word frequency cut-off strategy is experimentally shown to be a suitable method for driving the vocabulary learning decisions over unseen words. The results of our analysis show that the proposed method effectively extends the original vocabulary of the pretrained models. We validated our approach on a selection of the corpus acquired with the Hispabot-Covid19 system, obtaining satisfactory results.
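
    The vocabulary-extension step is simple enough to sketch. Below is a minimal illustration, assuming the Hugging Face transformers library, of how a frequency cut-off can decide which unseen domain words get added to the BETO tokenizer before fine-tuning; the toy corpus and the threshold value are invented for this sketch and do not reproduce the paper's pipeline.

```python
# Minimal sketch of frequency cut-off vocabulary extension (assumptions noted above).
from collections import Counter

from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dccuchile/bert-base-spanish-wwm-cased"  # a public BETO checkpoint
MIN_FREQ = 5  # assumed cut-off; the paper selects this experimentally

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Placeholder for the real domain-specific corpus (e.g., Hispabot-Covid19 utterances).
domain_corpus = [
    "primera frase del corpus del dominio",
    "segunda frase con vocabulario nuevo del dominio",
]

# Count whole words that are absent from the pretrained vocabulary.
vocab = tokenizer.get_vocab()
counts = Counter(
    word
    for text in domain_corpus
    for word in text.lower().split()
    if word not in vocab
)

# Frequency cut-off: only unseen words frequent enough in the domain are added.
new_words = [word for word, count in counts.items() if count >= MIN_FREQ]
tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
```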

    Coping with Data Scarcity: First Steps towards Word Expansion for a Chatbot in the Urban Transportation Domain

    Text expansion techniques have traditionally been used in some subfields of Natural Language Processing (NLP), such as Information Retrieval and Question-Answering systems. This Master's Thesis presents two approaches for expansion within the context of Dialogue Systems (DS), more precisely for the Natural Language Understanding (NLU) module of a chatbot for the urban transportation domain in San Sebastian (Gipuzkoa). The first approach uses word vectors, in this case pre-trained Spanish FastText embeddings, to obtain semantically similar terms, while the second involves word-sense disambiguation and synonym extraction from a lexical database, in this case the Spanish WordNet. For this purpose, a collaborative task was designed to compile a corpus of real case scenario inputs, which was then exploited. Furthermore, the qualitative analysis of the implemented expansion techniques revealed a need to filter out-of-domain inputs. In relation to this problem, two different sets of experiments were carried out. First, the feasibility of using Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity as discrimination features was explored. Then, linear regression and Support Vector Machine (SVM) classifiers were trained and tested. Results show that pre-trained word embedding expansion constitutes a more faithful representation of real case scenario inputs, whereas lexical database expansion adds wider linguistic coverage to a hypothetically expanded version of the corpus. For out-of-domain detection, increasing the number of features improves both linear regression and SVM classification results.
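
    A rough sketch of the two techniques, pairing gensim's FastText vectors with a TF-IDF/cosine-similarity out-of-domain filter from scikit-learn, is given below; the vector file name, the toy domain corpus, and the 0.2 threshold are assumptions for illustration, not the thesis's actual configuration.

```python
# Sketch of word-vector expansion plus TF-IDF out-of-domain filtering (see caveats above).
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (1) Expansion: nearest neighbours of a query term in pre-trained Spanish
# FastText vectors (assumes the .vec file has been downloaded beforehand).
vectors = KeyedVectors.load_word2vec_format("cc.es.300.vec")
expansions = [word for word, _ in vectors.most_similar("autobus", topn=5)]

# (2) Out-of-domain filtering: score an input against the domain corpus.
domain_corpus = ["horario del autobus 26", "parada mas cercana al centro"]  # placeholder
vectorizer = TfidfVectorizer()
domain_matrix = vectorizer.fit_transform(domain_corpus)

def in_domain(utterance: str, threshold: float = 0.2) -> bool:
    """Accept an input if it is close enough to at least one corpus entry."""
    similarities = cosine_similarity(vectorizer.transform([utterance]), domain_matrix)
    return similarities.max() >= threshold

print(expansions, in_domain("cuando pasa el autobus"))
```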

    Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification

    Although the Arabic language is spoken by more than 300 million people and is one of the six official languages of the United Nations (UN), there has been less research done on Arabic text data (compared to English) in the realm of machine learning, especially in text classification. In the past decade, Arabic data such as news, tweets, etc., have begun to receive some attention. Although automatic text classification plays an important role in improving the browsability and accessibility of data, Electronic Theses and Dissertations (ETDs) have not received their fair share of attention, in spite of the huge number of benefits they provide to students, universities, and future generations of scholars. There are two main roadblocks to performing automatic subject classification on Arabic ETDs. The first is the unavailability of a public corpus of Arabic ETDs. The second is the linguistic complexity of the Arabic language; that complexity is particularly evident in academic documents such as ETDs. To address these roadblocks, this paper presents Otrouha, a framework for automatic subject classification of Arabic ETDs, which has two main goals. The first is to build a corpus of Arabic ETDs and their key metadata, such as abstracts, keywords, and titles, to pave the way for more exploratory research on this valuable genre of data. The second is to provide a framework for automatic subject classification of Arabic ETDs through different classification models that use classical machine learning as well as deep learning techniques. The first goal is aided by searching the AskZad Digital Library, which is part of the Saudi Digital Library (SDL) and provides the key metadata of Arabic ETDs, such as abstract, title, and keywords. The current search results consist of abstracts of Arabic ETDs. This raw data then undergoes a pre-processing phase that includes stop word removal using the Natural Language Toolkit (NLTK) and word lemmatization using the Farasa API. To date, abstracts of 518 ETDs across 12 subjects have been collected. For the second goal, the preliminary results show that among the machine learning models, binary classification (one-vs.-all) performed better than multiclass classification. The maximum per-subject accuracy is 95%, with an average accuracy of 68% across all subjects. It is noteworthy that the binary classification model performed better for some categories than others; for example, Applied Science and Technology shows 95% accuracy, while Administration shows 36%. Deep learning models resulted in higher accuracy but lower F-measure, and their overall performance is lower than that of the classical machine learning models. This may be due to the small size of the dataset as well as the imbalance in the number of documents per category. Work to collect additional ETDs will be aided by collaborative contributions of data from additional sources.
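
    The reported one-vs.-all setup can be sketched in a few lines; the snippet below, with invented sample data, shows the general shape of such a configuration in scikit-learn (TF-IDF features feeding one binary classifier per subject), not the Otrouha framework itself.

```python
# One-vs.-all subject classification sketch (sample data invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Placeholders standing in for pre-processed Arabic ETD abstracts and subjects.
abstracts = [
    "abstract about solar cell efficiency",
    "abstract about photovoltaic materials",
    "abstract about public sector management",
    "abstract about organizational leadership",
]
subjects = [
    "Applied Science and Technology",
    "Applied Science and Technology",
    "Administration",
    "Administration",
]

# One binary logistic-regression model is trained per subject (one-vs.-rest).
classifier = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
classifier.fit(abstracts, subjects)
print(classifier.predict(["a new abstract about battery technology"]))
```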

    Chatbot de Suporte para Plataforma de Marketing Multicanal

    E-goi is an organization that provides automated multichannel marketing solutions. Given its system's complexity, it requires a fairly steep learning curve, which means that customers sometimes run into difficulties that direct them towards the appropriate Customer Support resources. With an increase in the number of users, these Customer Support requests have become frequent and demand increased availability in Customer Support channels, which become inundated with simple, easily resolvable requests. The organization envisioned automating a significant portion of customer-generated tickets, with the possibility of scaling to deal with other types of operations. This thesis aims to present a long-term solution to that request through the development of a chatbot system, fully integrated with the existing enterprise modules and data sources. To accomplish this, prototypes using several chatbot management and Natural Language Processing frameworks were developed. Their advantages and disadvantages were then weighed, followed by the implementation of the accompanying system and testing of both the developed software and the Natural Language Processing results. Although the developed overarching system achieved its designed functionalities, the master's thesis could not offer a viable solution for the problem at hand, given that the available data could not support an intent mining model usable in a real-world context.

    Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

    Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning Indo-European language family. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of Natural Language Processing tools and research, remains a resource-poor language that has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, for various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research, so that researchers working in this field can better utilize the contributions of their peers. As such, we shall upload this paper to arXiv and update it periodically to reflect the advances made in the field.

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018: 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall "Cavallerizza Reale". The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Multilingual taxonomic text classification in an incremental number of languages

    Taxonomic text classification is the branch of Natural Language Processing (NLP) that aims at classifying text into a hierarchically organized schema. Its applications span from pure research to industrial and commercial purposes. In this thesis, in particular, the attention is focused on the process of developing a multilingual model for hierarchically classifying text independently of its language. A comparative analysis is performed to evaluate which embedding techniques and feature selection methods perform better. Moreover, since in real-world applications a multilingual model may be required to support new languages over time, we implement and benchmark a set of techniques, among which are two Continual Learning algorithms, to sequentially extend our network to an incremental number of languages. The experiments carried out show both the strengths and the weak points of the current model and set the basis for further research.
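
    The incremental-language setting can be illustrated with rehearsal, one of the simplest continual learning baselines; the PyTorch sketch below uses random tensors in place of real multilingual sentence embeddings, and all sizes and names are invented, so it shows the replay idea only, not the specific algorithms benchmarked in the thesis.

```python
# Rehearsal-based continual learning sketch (placeholder data; see caveats above).
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

EMB_DIM, NUM_CLASSES = 512, 20  # assumed embedding size and taxonomy size

# Taxonomy head on top of a (frozen) language-agnostic sentence encoder.
classifier = nn.Linear(EMB_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def make_split(num_examples: int) -> TensorDataset:
    """Stand-in for real embedded, labeled documents in one language."""
    features = torch.randn(num_examples, EMB_DIM)
    labels = torch.randint(0, NUM_CLASSES, (num_examples,))
    return TensorDataset(features, labels)

replay_buffer = make_split(200)  # small sample retained from earlier languages
new_language = make_split(1000)  # freshly labeled data for the incoming language

# Rehearsal: mix retained and new examples so the head adapts to the new
# language without forgetting the ones it already supports.
loader = DataLoader(
    ConcatDataset([replay_buffer, new_language]), batch_size=32, shuffle=True
)
for epoch in range(3):
    for features, labels in loader:
        optimizer.zero_grad()
        loss_fn(classifier(features), labels).backward()
        optimizer.step()
```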