Fine-Tuning BERT Models for Intent Recognition Using a Frequency Cut-Off Strategy for Domain-Specific Vocabulary Extension
The work leading to these results was supported by the Spanish Ministry of Science and Innovation through the R&D&i projects GOMINOLA (PID2020-118112RB-C21 and PID2020-118112RB-C22, funded by MCIN/AEI/10.13039/501100011033), CAVIAR (TEC2017-84593-C2-1-R, funded by MCIN/AEI/10.13039/501100011033/FEDER "Una manera de hacer Europa"), and AMICPoC (PDC2021-120846-C42, funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR"). This research also received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 823907 (http://menhirproject.eu, accessed on 2 February 2022). Furthermore, R.K.'s research was supported by the Spanish Ministry of Education (FPI grant PRE2018-083225).

Intent recognition is a key component of any task-oriented conversational system. The
intent recognizer can be used first to classify the user’s utterance into one of several predefined classes
(intents) that help to understand the user’s current goal. The most appropriate response can then be
provided accordingly. Intent recognizers also often appear as part of joint models that perform
the natural language understanding and dialog management tasks together as a single process, thus
simplifying the set of problems that a conversational system must solve. This is especially
true for frequently asked question (FAQ) conversational systems. In this work, we first present an
exploratory analysis in which different deep learning (DL) models for intent detection and classification
were evaluated. In particular, we experimentally compare and analyze conventional recurrent
neural networks (RNN) and state-of-the-art transformer models. Our experiments confirmed that
the best performance is achieved with transformers, specifically by
fine-tuning the so-called BETO model (a Spanish pretrained bidirectional encoder representations
from transformers (BERT) model from the Universidad de Chile) on our intent detection task. Then, as
the main contribution of the paper, we analyze the effect of inserting unseen domain words to extend
the vocabulary of the model as part of the fine-tuning or domain-adaptation process. In particular,
a very simple word-frequency cut-off strategy is experimentally shown to be a suitable method for
deciding which unseen words to add to the vocabulary. The results of our analysis show that
the proposed method helps to effectively extend the original vocabulary of the pretrained models.
We validated our approach with a selection of the corpus acquired with the Hispabot-Covid19 system
obtaining satisfactory results.
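The frequency cut-off idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation; the toy corpus, base vocabulary, and threshold below are invented. Words that appear in the domain corpus at least `min_freq` times and are missing from the pretrained model's vocabulary are selected for addition:

```python
from collections import Counter

def select_new_vocab(domain_corpus, base_vocab, min_freq=5):
    """Return unseen domain words whose corpus frequency reaches the cut-off."""
    counts = Counter(tok for sent in domain_corpus for tok in sent.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_freq and w not in base_vocab)

# Toy corpus: 'confinamiento' is frequent enough to pass the cut-off,
# while 'mascarilla' is not.
corpus = ["el confinamiento empieza hoy"] * 5 + ["mascarilla obligatoria"] * 2
base_vocab = {"el", "empieza", "hoy", "obligatoria"}
new_words = select_new_vocab(corpus, base_vocab, min_freq=5)
print(new_words)  # ['confinamiento']
```

With a Hugging Face tokenizer, the selected words would then typically be registered via `tokenizer.add_tokens(new_words)` followed by `model.resize_token_embeddings(len(tokenizer))` before fine-tuning.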
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication mediums from private mobile messaging into the public sphere through social media presented an opportunity for data science research and industry to mine the generated big data for automatic information extraction. A popular information extraction task is sentiment analysis, which aims at extracting polarity opinions, positive, negative, or neutral, from written natural language. This capability has helped organisations better understand the public’s opinion towards events, news, public figures, and products.
However, sentiment analysis has advanced for the English language ahead of Arabic. While sentiment analysis for Arabic is developing in the literature of Natural Language Processing (NLP), a popular variety of Arabic, Arabizi, has been overlooked for sentiment analysis advancements.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
Like Arabic, Arabizi is rich in inflectional morphology, but it is also code-switched with English or French and distinctively transcribed without adhering to a standard orthography. The rich morphology, inconsistent orthography, and code-switching challenges compound to have a multiplied effect on the lexical sparsity of the language: each Arabizi word can be spelled in many ways, in addition to the mixing of other languages within the same textual context. The resulting high degree of lexical sparsity defies the very basics of sentiment analysis, the classification of positive and negative words. Arabizi also faces a severe shortage of the data resources required to set out any sentiment analysis approach.
In this thesis, we tackle this gap by conducting research on sentiment analysis for Arabizi. We addressed the sparsity challenge by harvesting Arabizi data from multilingual social media text using deep learning to build Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media.
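As a rough illustration of why spelling variation defeats naive lexicon lookup, and how normalization can help, consider the sketch below. The digit map, lexicon entries, and example sentences are invented for illustration and are not the thesis's actual resources; real Arabizi normalization is far more involved.

```python
import re

# A small, illustrative subset of common Arabizi digit-for-letter substitutions.
DIGIT_MAP = {"2": "a", "3": "a", "5": "kh", "7": "h", "9": "q"}

def normalize_arabizi(token):
    """Collapse a few spelling-variation sources: case, digit-for-letter
    substitutions, and repeated-letter elongation ('7eelww' -> 'helw')."""
    token = token.lower()
    for digit, letter in DIGIT_MAP.items():
        token = token.replace(digit, letter)
    return re.sub(r"(.)\1+", r"\1", token)

def lexicon_score(text, lexicon):
    """Sum polarity scores of normalized tokens found in the lexicon."""
    return sum(lexicon.get(normalize_arabizi(t), 0) for t in text.split())

# Toy lexicon keyed by normalized forms (hypothetical entries).
lex = {"helw": 1, "zalan": -1}
print(normalize_arabizi("7eelww"))          # helw
print(lexicon_score("el film 7eelww", lex))  # 1
print(lexicon_score("ana za3laan", lex))     # -1
```

Without the normalization step, surface forms like "7eelww" and "7elw" would need separate lexicon entries, which is exactly the sparsity problem the thesis describes.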
Coping with Data Scarcity: First Steps towards Word Expansion for a Chatbot in the Urban transportation Domain
Text expansion techniques have been used in some subfields of Natural Language
Processing (NLP) such as Information Retrieval or Question-Answering Systems. This
Master's Thesis presents two approaches for expansion within the context of Dialogue
Systems (DS), more precisely for the Natural Language Understanding (NLU) module of
a chatbot for the urban transportation domain in San Sebastian (Gipuzkoa). The first
approach uses word vectors (in this case, pretrained Spanish FastText embeddings) to obtain
semantically similar terms, while the second one extracts synonyms from a lexical database
(the Spanish WordNet). For this purpose, a corpus composed
of real case scenario inputs has been exploited. Furthermore, the qualitative analysis of
the implemented expansion techniques revealed a need to filter out-of-domain inputs. In
relation to this problem, two different sets of experiments have been carried out. First,
the feasibility of using Term Frequency-Inverse Document Frequency (TF-IDF) and
cosine similarity as discrimination features was explored. Then, linear regression and
Support Vector Machine (SVM) classifiers were trained and tested. Results show that
pre-trained word embedding expansion constitutes a more faithful representation of real case
scenario inputs, whereas lexical database expansion adds wider linguistic coverage to a
hypothetically expanded version of the corpus. For out-of-domain detection, increasing
the number of features improves both the linear regression and SVM classification results.
Otrouha: A Corpus of Arabic ETDs and a Framework for Automatic Subject Classification
Although the Arabic language is spoken by more than 300 million people and is one of the six official languages of the United Nations (UN), there has been less research done on Arabic text data (compared to English) in the realm of machine learning, especially in text classification. In the past decade, Arabic data such as news, tweets, etc. have begun to receive some attention. Although automatic text classification plays an important role in improving the browsability and accessibility of data, Electronic Theses and Dissertations (ETDs) have not received their fair share of attention, in spite of the huge number of benefits they provide to students, universities, and future generations of scholars. There are two main roadblocks to performing automatic subject classification on Arabic ETDs. The first is the unavailability of a public corpus of Arabic ETDs. The second is the linguistic complexity of the Arabic language; that complexity is particularly evident in academic documents such as ETDs. To address these roadblocks, this paper presents Otrouha, a framework for automatic subject classification of Arabic ETDs, which has two main goals. The first is building a Corpus of Arabic ETDs and their key metadata such as abstracts, keywords, and title to pave the way for more exploratory research on this valuable genre of data. The second is to provide a framework for automatic subject classification of Arabic ETDs through different classification models that use classical machine learning as well as deep learning techniques. The first goal is aided by searching the AskZad Digital Library, which is part of the Saudi Digital Library (SDL). AskZad provides other key metadata of Arabic ETDs, such as abstract, title, and keywords. The current search results consist of abstracts of Arabic ETDs. This raw data then undergoes a pre-processing phase that includes stop word removal using the Natural Language Tool Kit (NLTK), and word lemmatization using the Farasa API. 
To date, abstracts of 518 ETDs across 12 subjects have been collected. For the second goal, the preliminary results show that among the machine learning models, binary classification (one-vs.-all) performed better than multiclass classification. The maximum per-subject accuracy is 95%, with an average accuracy of 68% across all subjects. It is noteworthy that the binary classification model performed better for some categories than others. For example, Applied Science and Technology shows 95% accuracy, while the category of Administration shows 36%. Deep learning models resulted in higher accuracy but lower F-measure; their overall performance is lower than that of the classical machine learning models. This may be due to the small size of the dataset as well as the imbalance in the number of documents per category. Work to collect additional ETDs will be aided by collaborative contributions of data from additional sources.
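The one-vs.-all evaluation can be illustrated with a small sketch (the subject labels and predictions below are invented, not Otrouha results): per-subject binary accuracy counts a document as correct whenever truth and prediction agree on membership in that one subject, which is why it can be high for one subject and much lower for another even under the same predictions.

```python
def one_vs_rest_accuracy(y_true, y_pred, label):
    """Binary (one-vs.-all) accuracy for a single subject: a document is
    correct when truth and prediction agree on 'this subject or not'."""
    hits = sum((t == label) == (p == label) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical subject labels, for illustration only.
y_true = ["tech", "tech", "admin", "law", "admin", "tech"]
y_pred = ["tech", "law", "admin", "law", "tech", "tech"]
for label in ("tech", "admin", "law"):
    print(label, round(one_vs_rest_accuracy(y_true, y_pred, label), 3))
```

Note that a multiclass accuracy over the same predictions would be a single number (here 4/6), while the per-subject binary view exposes which subjects the model confuses.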
Chatbot de Suporte para Plataforma de Marketing Multicanal
E-goi is an organization which provides automated multichannel marketing solutions. Given its system’s complexity, the learning curve is not smooth, which means that customers sometimes run into difficulties that direct them towards the appropriate Customer Support resources. With an increase in the number of users, these Customer Support requests have become frequent and demand greater availability in Customer Support channels, which become inundated with simple, easily resolvable requests. The organization envisioned automating a significant portion of customer-generated tickets, with the possibility of scaling to other types of operations. This thesis aims to present a long-term solution to that request through the development of a chatbot system, fully integrated with the existing enterprise modules and data sources. To accomplish this, prototypes using several chatbot management and Natural Language Processing frameworks were developed; their advantages and disadvantages were weighed, followed by the implementation of the accompanying system and the testing of the developed software and Natural Language Processing results. Although the developed system achieved its designed functionalities, the thesis could not offer a viable solution for the problem at hand, given that the available data could not produce an intent mining model usable in a real-world context.
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
language tree, Indo-European. However, due to poverty in both linguistic and
economic capital, Sinhala, in the perspective of Natural Language Processing
tools and research, remains a resource-poor language which has neither the
economic drive its cousin English has nor the sheer push of the law of numbers
a language such as Chinese has. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap with a comprehensive literature survey of the
publicly available Sinhala natural language tools and research so that the
researchers working in this field can better utilize contributions of their
peers. As such, we shall be uploading this paper to arXiv and perpetually
update it periodically to reflect the advances made in the field
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.
Multilingual taxonomic text classification in an incremental number of languages
Taxonomic text classification is the Natural Language Processing (NLP) branch that aims at classifying text into a hierarchically organized schema. Its applications span from pure research to industrial and commercial purposes.
In this thesis, in particular, the attention is focused on the process of developing a multilingual model for hierarchically classifying text independently of its language. A comparative analysis is performed to evaluate which embedding techniques and feature selection methods perform best.
Moreover, since in real-life scenarios a multilingual model may be required to support new languages over time, we implement and benchmark a set of techniques, among which are two Continual Learning algorithms, to sequentially extend our network to an incremental number of languages.
The experiments carried out show both the strengths and the critical points of the current model and lay the basis for further research.