12 research outputs found

    Leveraging Feedback in Conversational Question Answering Systems

    172 p. The goal of this thesis is to exploit the interaction that deployed systems have with humans, using human feedback as a learning and adaptation signal for those systems. We focus on the domain shift that conversational systems undergo once deployed. To this end, we study the case of explicit binary feedback, as this is the easiest feedback signal for humans to provide. To improve systems after deployment, we first build DoQA, a dataset of question-answering conversations. The dataset contains 2,437 dialogues collected through crowdsourcing. Compared to previous work, DoQA reflects real information needs, and its conversations are more natural and coherent. After creating the dataset, we design an algorithm called feedback-weighted learning (FWL), which can improve a pretrained supervised system using binary feedback alone. Finally, we study the limits of this algorithm when the collected feedback is noisy, and we adapt FWL to cope with the noisy scenario. The negative results we obtain in this case show the challenge of modelling noisy feedback collected from users, which remains an open research question.
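    The abstract does not detail how FWL turns binary feedback into a training signal. As a rough illustration of the general idea — weighting each deployed interaction's loss by the user's accept/reject signal — here is a minimal PyTorch sketch; the function name, batch layout, and weighting scheme are assumptions for illustration, not the thesis' actual algorithm.

```python
import torch.nn.functional as F

def feedback_weighted_step(model, optimizer, batch):
    """One update that reuses deployment interactions as training data.

    batch["inputs"]:   encoded user questions
    batch["answers"]:  answer ids the deployed system produced
    batch["feedback"]: 1.0 if the user accepted the answer, 0.0 otherwise

    Illustrative only: the thesis' FWL may weight examples differently.
    """
    logits = model(batch["inputs"])                       # (B, num_answers)
    per_example = F.cross_entropy(logits, batch["answers"],
                                  reduction="none")       # (B,)
    # Reinforce answers users accepted; here, rejected ones get zero weight.
    weights = batch["feedback"]
    loss = (weights * per_example).sum() / weights.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```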

    Building a dialogue system for question-answer forum websites

    Dialogue systems are automatic systems developed to help humans in their daily routines. Their main characteristic is that they are able to communicate using natural language. Lately, dialogue agents have become increasingly popular and are already part of our lives, as they are implemented in many tools (Siri, Cortana, Alexa...). This incursion of voice agents has increased the interest in accessing Community Question Answering (CQA) and Frequently Asked Questions (FAQ) information through dialogue, especially in the industrial world. The conversational ability of current dialogue systems is very limited, however, as they are defined by hand-crafted rules. This hand-crafted nature makes domain adaptation an extremely costly and time-consuming task. On the other hand, deep-learning-based techniques, which have achieved state-of-the-art results in many Natural Language Processing (NLP) tasks, suffer from lack of data, as they need huge amounts of labelled records for training. The main aim of this project is therefore to develop a neural system together with a CQA dataset, enabling future research in CQA dialogue systems.

    Training Language Models with Language Feedback at Scale

    Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches these issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback conveys only limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and the feedback to generate refinements; second, selecting the refinement that incorporates the most feedback; third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback. We evaluate ILF's effectiveness on a carefully controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.
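    The three ILF steps map naturally onto a short training loop. The sketch below is a generic illustration of those steps, assuming hypothetical lm.generate, score_fn, and finetune_fn helpers; the paper's concrete implementation is not shown in this abstract.

```python
def ilf_iteration(lm, dataset, score_fn, finetune_fn, num_candidates=4):
    """One ILF round over a feedback dataset (illustrative sketch).

    lm          : language model with a .generate(prompt) -> str method (assumed)
    score_fn    : judges how well a refinement incorporates the feedback
    finetune_fn : supervised finetuning on (input, target) pairs
    """
    training_pairs = []
    for ex in dataset:
        x, y0, fb = ex["input"], ex["initial_output"], ex["feedback"]
        # Step 1: condition on the input, an initial LM output, and the
        # feedback to generate candidate refinements.
        prompt = f"{x}\n\nInitial answer:\n{y0}\n\nFeedback:\n{fb}\n\nRefined answer:"
        candidates = [lm.generate(prompt) for _ in range(num_candidates)]
        # Step 2: select the refinement incorporating the most feedback.
        best = max(candidates, key=lambda y: score_fn(y, fb))
        training_pairs.append((x, best))
    # Step 3: finetune to maximize the likelihood of the chosen
    # refinement given the input.
    return finetune_fn(lm, training_pairs)
```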

    NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

    In this position paper, we argue that the classical evaluation of Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model on a target benchmark and associated task with respect to its non-contaminated counterpart. The consequences can be very harmful, with wrong scientific conclusions being published while correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers whose conclusions are compromised by data contamination. Comment: Accepted at EMNLP 2024 Findings.
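    As one concrete example of the kind of automatic measure the authors call for, a simple (and admittedly coarse) heuristic is n-gram overlap between a benchmark's test split and a training corpus; 13-gram overlap has been used as a contamination threshold in earlier LLM reports such as GPT-3. The sketch below is a generic illustration, not a method proposed in this paper.

```python
def ngrams(text, n=13):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, training_corpus, n=13):
    """Fraction of test examples sharing at least one n-gram with training data."""
    examples = list(test_examples)  # materialize, in case it is a generator
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in examples if ngrams(ex, n) & train_grams)
    return flagged / max(len(examples), 1)
```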

    Improving Code Generation by Training with Natural Language Feedback

    The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground-truth distribution and demonstrate a proof of concept on a neural program synthesis task. We use ILF to improve a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. Overall, our results suggest that learning from human-written natural language feedback is both more effective and more sample-efficient than training exclusively on demonstrations for improving an LLM's performance on code generation tasks.
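    The two reported figures let us back out the approximate baseline: a 10-point absolute gain that is simultaneously a 38% relative gain implies a baseline pass@1 of roughly 10/0.38 ≈ 26%, assuming both percentages refer to the same baseline. The snippet below just checks this arithmetic.

```python
absolute_gain = 0.10   # +10% absolute pass@1 (from the abstract)
relative_gain = 0.38   # +38% relative improvement (from the abstract)

baseline = absolute_gain / relative_gain  # implied baseline pass@1
improved = baseline + absolute_gain       # implied pass@1 after ILF

print(f"implied baseline pass@1: {baseline:.1%}")  # ~26.3%
print(f"implied ILF pass@1:      {improved:.1%}")  # ~36.3%
```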

    Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

    The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate, for each entity in a conversation, whether they think it is human or not (assuming there are human participants in these conversations). These annotations allow us to rank chatbots by their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., Survival Analysis. This metric can relate a bot's performance to certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparatively low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.
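    Survival Analysis here treats "being recognized as a bot" as the event of interest. A minimal sketch of such an analysis with the lifelines library (an assumed tooling choice, with invented numbers; the paper's released tool may differ) would fit a Kaplan-Meier curve over how many exchanges a bot survives before a judge flags it.

```python
from lifelines import KaplanMeierFitter

# Hypothetical annotation outcomes for one chatbot: how many exchanges it
# survived before a judge flagged it as a bot (event=1), or the conversation
# ended with the judge still unsure (event=0, i.e. right-censored).
durations = [2, 3, 3, 4, 5, 5, 6, 6, 6, 7]
events    = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events, label="chatbot A")
print(kmf.survival_function_)      # P(still passing as human) per exchange
print(kmf.median_survival_time_)   # exchanges until half the runs are spotted
```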

    DoQA: accessing domain-specific FAQs via conversational QA

    The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available on FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with fewer factoid questions, and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing strong system show that, thanks to transfer learning from a Wikipedia QA dataset and fine-tuning on a single FAQ domain, it is possible to build high-quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upper bound.
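    The IR scenario requires retrieving the right FAQ document before answering. A minimal retrieval baseline with the rank_bm25 package — an assumed choice with toy FAQ snippets, not necessarily the retriever the DoQA authors evaluated — could look like this:

```python
from rank_bm25 import BM25Okapi

# Hypothetical FAQ snippets; DoQA draws its dialogues from Stack Exchange sites.
faq_documents = [
    "how do i propagate rosemary from cuttings",
    "what soil mix works best for succulents",
    "how often should i water an orchid indoors",
]

tokenized = [doc.split() for doc in faq_documents]
bm25 = BM25Okapi(tokenized)

query = "how often to water an orchid indoors".split()
top = bm25.get_top_n(query, faq_documents, n=1)
print(top[0])  # the FAQ document the conversational QA system should read
```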