12 research outputs found
Leveraging Feedback in Conversational Question Answering Systems
172 p. The goal of this thesis is to exploit the interaction that deployed systems have with humans, using human feedback as a learning and adaptation signal. We focus on the domain shift that conversational systems undergo once deployed. To this end, we study the case of explicit binary feedback, as it is the easiest feedback signal for humans to provide. To improve systems after deployment, we first build DoQA, a dataset of question-answering conversations containing 2,437 dialogues collected via crowdsourcing. Compared to previous work, DoQA reflects real information needs, and its conversations are more natural and coherent. After creating the dataset, we design an algorithm called feedback-weighted learning (FWL) that is able to improve a pretrained supervised system using only binary feedback. Finally, we analyze the limitations of this algorithm when the collected feedback is noisy and adapt FWL to cope with the noisy scenario. The negative results obtained in this case show the challenge of modeling noisy user feedback, which remains an open research question.
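The feedback-weighted learning idea above can be pictured with a minimal sketch: outputs that received positive binary feedback are kept (weight 1) and negatively rated ones are discarded (weight 0) when accumulating the training signal. All names here (`feedback_weighted_update`, `model_logprob`) and the 0/1 weighting scheme are illustrative assumptions, not the thesis' exact formulation.

```python
def feedback_weighted_update(examples, model_logprob, learning_rate=0.1):
    """Hypothetical FWL-style update: weight each system output by its
    binary feedback before accumulating the (pseudo) training signal."""
    total = 0.0
    for x, y, feedback in examples:  # feedback is 1 (good) or 0 (bad)
        weight = 1.0 if feedback == 1 else 0.0  # keep only positively rated outputs
        total += weight * model_logprob(x, y)
    # average the weighted log-likelihood signal over the batch
    return learning_rate * total / max(len(examples), 1)
```

Under noisy feedback, the weights themselves become unreliable, which is exactly the open problem the thesis reports negative results on.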
Building a dialogue system for question-answer forum websites
Dialogue systems are automatic systems developed to help humans in their daily routines. Their main characteristic is that they are able to communicate using natural language. Lately, dialogue agents have become increasingly popular and are already part of our lives, as they are implemented in many tools (Siri, Cortana, Alexa, etc.). This incursion of voice agents has increased the interest in accessing Community Question Answering (CQA) and Frequently Asked Questions (FAQ) information by dialogue means, especially in the industrial world. Nowadays, dialogue systems have very limited conversational ability, as they are defined by hand-crafted rules. This hand-crafted nature makes domain adaptation an extremely costly and time-consuming task. On the other hand, deep-learning-based techniques, which have achieved state-of-the-art results in many Natural Language Processing (NLP) tasks, suffer from lack of data, as they need huge amounts of labelled records for training. The main aim of this project is therefore to develop a neural system together with a CQA dataset, enabling future research in CQA dialogue systems.
Training Language Models with Language Feedback at Scale
Pretrained language models often generate outputs that are not in line with
human preferences, such as harmful text or factually incorrect summaries.
Recent work approaches the above issues by learning from a simple form of human
feedback: comparisons between pairs of model-generated outputs. However,
comparison feedback only conveys limited information about human preferences.
In this paper, we introduce Imitation learning from Language Feedback (ILF), a
new approach that utilizes more informative language feedback. ILF consists of
three steps that are applied iteratively: first, conditioning the language
model on the input, an initial LM output, and feedback to generate refinements;
second, selecting the refinement that incorporates the most feedback; third,
finetuning the language model to maximize the likelihood of the chosen
refinement given the input. We show theoretically that ILF can be viewed as
Bayesian inference, similar to reinforcement learning from human feedback. We
evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic
summarization task. Our experiments demonstrate that large language models
accurately incorporate feedback and that finetuning with ILF scales well with
the dataset size, even outperforming finetuning on human summaries. Learning
from both language and comparison feedback outperforms learning from each
alone, achieving human-level summarization performance.
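The three ILF steps can be sketched as one iteration of a training loop. Everything below (`get_feedback`, `refine`, `incorporation_score`, `finetune`, the number of candidate refinements) is a hypothetical stand-in for the paper's components, not its actual implementation.

```python
def ilf_iteration(lm, inputs, get_feedback, refine, incorporation_score, finetune):
    """One ILF iteration: refine with feedback, select, then finetune."""
    training_pairs = []
    for x in inputs:
        y0 = lm.generate(x)          # initial LM output
        fb = get_feedback(x, y0)     # human language feedback on that output
        # Step 1: condition on input, initial output, and feedback to refine
        candidates = [refine(lm, x, y0, fb) for _ in range(4)]
        # Step 2: pick the refinement that best incorporates the feedback
        best = max(candidates, key=lambda y: incorporation_score(y, fb))
        training_pairs.append((x, best))
    # Step 3: finetune to maximize likelihood of the chosen refinements
    finetune(lm, training_pairs)
    return lm
```

Iterating this loop is what lets the procedure scale with dataset size, as the abstract reports.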
NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
In this position paper, we argue that the classical evaluation on Natural
Language Processing (NLP) tasks using annotated benchmarks is in trouble. The
worst kind of data contamination happens when a Large Language Model (LLM) is
trained on the test split of a benchmark, and then evaluated in the same
benchmark. The extent of the problem is unknown, as it is not straightforward
to measure. Contamination causes an overestimation of the performance of a
contaminated model on a target benchmark and associated task with respect to
its non-contaminated counterpart. The consequences can be very harmful, with
wrong scientific conclusions being published while other correct ones are
discarded. This position paper defines different levels of data contamination
and argues for a community effort, including the development of automatic and
semi-automatic measures to detect when data from a benchmark was exposed to a
model, and suggestions for flagging papers with conclusions that are
compromised by data contamination. Comment: Accepted at EMNLP 2024 Findings.
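As a concrete (and deliberately naive) illustration of why contamination is hard to measure, a first-pass probe might just look for verbatim word n-gram overlap between a benchmark example and the training corpus; paraphrased or re-tokenized contamination would evade such a check. The function name and the choice of n are assumptions for illustration.

```python
def ngram_overlap(test_example, training_corpus, n=8):
    """Naive contamination probe: flag a benchmark example whose word
    n-grams appear verbatim in the training corpus text."""
    words = test_example.split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    # a single verbatim n-gram hit is treated as evidence of exposure
    return any(g in training_corpus for g in grams)
```

Detection at scale, across closed training corpora, is far harder, which is why the paper calls for community-level measures rather than ad hoc checks like this one.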
Improving Code Generation by Training with Natural Language Feedback
The potential for pre-trained large language models (LLMs) to use natural
language feedback at inference time has been an exciting recent development. We
build upon this observation by formalizing an algorithm for learning from
natural language feedback at training time instead, which we call Imitation
learning from Language Feedback (ILF). ILF requires only a small amount of
human-written feedback during training and does not require the same feedback
at test time, making it both user-friendly and sample-efficient. We further
show that ILF can be seen as a form of minimizing the KL divergence to the
ground truth distribution and demonstrate a proof-of-concept on a neural
program synthesis task. We use ILF to improve a CodeGen-Mono 6.1B model's
pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python
Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and
fine-tuning on repaired programs written by humans. Overall, our results
suggest that learning from human-written natural language feedback is both more
effective and sample-efficient than training exclusively on demonstrations for
improving an LLM's performance on code generation tasks.
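The pass@1 metric quoted above is commonly computed with the unbiased pass@k estimator of Chen et al. (2021), which for k=1 reduces to the fraction of correct samples; a small sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of them correct) passes the
    tests. For k=1 this is simply c/n, the pass@1 rate."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample without a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A 10-point absolute gain over a baseline of roughly 26% pass@1 corresponds to the ~38% relative improvement the abstract reports.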
Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
The lack of time-efficient and reliable evaluation methods hamper the
development of conversational dialogue systems (chatbots). Evaluations
requiring humans to converse with chatbots are time and cost-intensive, put
high cognitive demands on the human judges, and yield low-quality results. In
this work, we introduce \emph{Spot The Bot}, a cost-efficient and robust
evaluation framework that replaces human-bot conversations with conversations
between bots. Human judges then only annotate for each entity in a conversation
whether they think it is human or not (assuming there are human participants
in these conversations). These annotations then allow us to rank chatbots
regarding their ability to mimic the conversational behavior of humans. Since
we expect that all bots are eventually recognized as such, we incorporate a
metric that measures which chatbot can uphold human-like behavior the longest,
i.e., \emph{Survival Analysis}. This metric makes it possible to correlate a
bot's performance with certain of its characteristics (e.g., fluency or
sensibleness), yielding interpretable results. The comparably low cost of our
framework allows for frequent evaluations of chatbots during their development
cycle. We empirically validate our claims by applying \emph{Spot The Bot} to
three domains, evaluating several state-of-the-art chatbots, and drawing
comparisons to related work. The framework is released as a ready-to-use tool.
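The survival-analysis metric can be pictured with a simplified curve: the fraction of conversations in which a bot has not yet been spotted after each turn. The data layout (`detection_turns`, with `None` for never-spotted bots) is an assumption for illustration; the paper's actual analysis fits proper survival models.

```python
def survival_curve(detection_turns, max_turn):
    """Fraction of conversations where the bot is still judged human
    after each turn; detection_turns[i] is the turn at which bot i was
    spotted, or None if it was never spotted."""
    n = len(detection_turns)
    curve = []
    for t in range(1, max_turn + 1):
        surviving = sum(1 for d in detection_turns if d is None or d > t)
        curve.append(surviving / n)
    return curve
```

A bot whose curve decays more slowly upholds human-like behavior longer, which is exactly what the framework ranks.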
DoQA: accessing domain-specific FAQs via conversational QA
The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available on FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing, strong system show that, thanks to transfer learning from a Wikipedia QA dataset and fine-tuning on a single FAQ domain, it is possible to build high-quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upper bound.
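The IR scenario described above (finding the answer in any of the FAQ documents rather than in a single given one) can be sketched with a toy word-overlap retriever; the actual DoQA experiments use a stronger retrieval pipeline, and `retrieve_faq` is purely illustrative.

```python
def retrieve_faq(question, faq_docs):
    """Toy retriever: score each FAQ document by word overlap with the
    question and return the best-scoring one for downstream QA."""
    q_words = set(question.lower().split())

    def score(doc):
        # number of question words that also appear in the document
        return len(q_words & set(doc.lower().split()))

    return max(faq_docs, key=score)
```

In the full setting, the conversational QA model would then read the retrieved document to answer each turn of the dialogue.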