Astraea: Grammar-based Fairness Testing
Software often produces biased outputs. In particular, machine learning (ML)-based
software is known to produce erroneous predictions when processing
discriminatory inputs. Such unfair program behavior can be caused by societal
bias. In recent years, Amazon, Microsoft, and Google have provided
software services that produce unfair outputs, mostly due to societal bias
(e.g., gender or race). In such events, developers are saddled with the task of
conducting fairness testing. Fairness testing is challenging; developers are
tasked with generating discriminatory inputs that reveal and explain biases.
We propose a grammar-based fairness testing approach (called ASTRAEA) which
leverages context-free grammars to generate discriminatory inputs that reveal
fairness violations in software systems. Using probabilistic grammars, ASTRAEA
also provides fault diagnosis by isolating the cause of observed software bias.
ASTRAEA's diagnoses facilitate the improvement of ML fairness.
ASTRAEA was evaluated on 18 software systems that provide three major natural
language processing (NLP) services. In our evaluation, ASTRAEA generated
fairness violations at a rate of ~18%. ASTRAEA generated over 573K
discriminatory test cases and found over 102K fairness violations. Furthermore,
ASTRAEA improves software fairness by ~76% via model retraining.
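The core idea of grammar-based fairness testing can be sketched as follows. This is a minimal illustration, not ASTRAEA's implementation: the toy grammar, the attribute-swapping rule, and the stand-in classifier are all assumptions made for the example. Inputs are sampled from a probabilistic context-free grammar, and a pair of inputs that differ only in a sensitive attribute is checked for divergent outputs.

```python
import random

# A toy probabilistic context-free grammar (hypothetical, for illustration):
# each non-terminal maps to a list of (production, probability) pairs.
GRAMMAR = {
    "<sent>": [(["<subj>", " is a ", "<noun>", "."], 1.0)],
    "<subj>": [(["He"], 0.5), (["She"], 0.5)],
    "<noun>": [(["doctor"], 0.5), (["nurse"], 0.5)],
}

def expand(symbol, rng):
    """Recursively expand a symbol, sampling productions by probability."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    productions, weights = zip(*GRAMMAR[symbol])
    production = rng.choices(productions, weights=weights, k=1)[0]
    return "".join(expand(s, rng) for s in production)

def generate_pair(rng):
    """Generate two inputs identical except for the sensitive token."""
    sentence = expand("<sent>", rng)
    if sentence.startswith("He"):
        return sentence, "She" + sentence[2:]
    return sentence, "He" + sentence[3:]

def fairness_violation(classify, rng):
    """Flag a violation when outputs differ across the sensitive attribute."""
    a, b = generate_pair(rng)
    return classify(a) != classify(b)
```

A classifier whose output depends on the sensitive token (here, the hypothetical `startswith("He")` check) is flagged; one that depends only on the rest of the sentence is not.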
Software-based dialogue systems: Survey, taxonomy and challenges
The use of natural language interfaces in the field of human-computer interaction is undergoing intense study in dedicated scientific and industrial research. The latest contributions in the field, including deep learning approaches such as recurrent neural networks, the potential of context-aware strategies, and user-centred design approaches, have brought the attention of the community back to software-based dialogue systems, generally known as conversational agents or chatbots. Nonetheless, given the novelty of the field, a generic, context-independent overview of the current state of research on conversational agents, covering all research perspectives involved, is missing. Motivated by this context, this paper reports a survey of the current state of research on conversational agents through a systematic literature review of secondary studies. The conducted research is designed to develop an exhaustive perspective through a clear presentation of the aggregated knowledge published in recent literature within a variety of domains, research focuses, and contexts. As a result, this research proposes a holistic taxonomy of the different dimensions involved in the conversational agents' field, which is expected to help researchers and to lay the groundwork for future research in the field of natural language interfaces.
With the support of the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. The corresponding author gratefully acknowledges the Universitat Politècnica de Catalunya and Banco Santander for the financial support of his predoctoral grant FPI-UPC. This paper has been funded by the Spanish Ministerio de Ciencia e Innovación under project / funding scheme PID2020-117191RB-I00 / AEI/10.13039/501100011033.
Automated Testing and Improvement of Named Entity Recognition Systems
Named entity recognition (NER) systems have seen rapid progress in recent
years due to the development of deep neural networks. These systems are widely
used in various natural language processing applications, such as information
extraction, question answering, and sentiment analysis. However, the complexity
and intractability of deep neural networks can make NER systems unreliable in
certain circumstances, resulting in incorrect predictions. For example, NER
systems may misidentify female names as chemicals or fail to recognize the
names of minority groups, leading to user dissatisfaction. To tackle this
problem, we introduce TIN, a novel, widely applicable approach for
automatically testing and repairing various NER systems. The key idea for
automated testing is that the NER predictions of the same named entities under
similar contexts should be identical. The core idea for automated repairing is
that similar named entities should have the same NER prediction under the same
context. We use TIN to test two SOTA NER models and two commercial NER APIs,
i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues
reported by TIN and find that 702 are erroneous issues, leading to high
precision (85.0%-93.4%) across four categories of NER errors: omission,
over-labeling, incorrect category, and range error. For automated repairing,
TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems
under test, successfully repairing 1,056 of the 1,877 reported NER errors.
Comment: Accepted by ESEC/FSE'2
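The testing idea described above — that the same named entity should receive the same NER prediction in similar contexts — can be sketched as a metamorphic check. This is a hedged illustration only: the context templates and the `ner_label` stand-in are assumptions, not TIN's actual algorithm.

```python
# Similar contexts for the same entity (assumed, for illustration).
CONTEXT_TEMPLATES = [
    "{} joined the team last year.",
    "{} joined the group last year.",
]

def metamorphic_issues(ner_label, entity):
    """Return the per-context labels when they disagree for `entity`.

    `ner_label(sentence, entity)` is a stand-in for any NER system under
    test: it returns the label assigned to `entity` in `sentence`.
    A disagreement across similar contexts is reported as a suspicious issue.
    """
    labels = {tpl: ner_label(tpl.format(entity), entity)
              for tpl in CONTEXT_TEMPLATES}
    if len(set(labels.values())) > 1:   # inconsistent predictions
        return labels
    return {}
```

A system that labels "Alice" as PERSON in one context but differently in another is flagged; a consistent system reports nothing.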
Handling Change in a Production TaskBot: Efficiently Managing the Growth of TWIZ, an Alexa Assistant
A Conversational Agent aims to converse with users, with a focus on natural behaviour
and responses. Such agents can be extremely complex: they comprise several components,
many possible courses of action, and effectively infinite possible inputs. Behaviour
checking is therefore essential, especially in a production context, where incorrect
behaviour can have serious consequences. Nevertheless, developing a robust and correctly
behaving Task Bot should not hinder research, and must allow for the continuous
improvement of state-of-the-art solutions. Manual testing of such a complex system
inevitably runs into limits, either in the coverage of the tests or in the amount of
developer time it consumes. We therefore propose a tool that automatically tests these
highly sophisticated systems with a much broader test surface. We introduce a solution
that leverages the replay and mimicking of past conversations to generate synthetic
conversations. This allows for time savings on quality assurance and better change handling.
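The replay idea above can be sketched minimally as follows. This is an assumed design for illustration, not the thesis's implementation: a recorded conversation's user turns are re-sent to the bot, and any response that diverges from the recorded one is flagged for review.

```python
def replay(bot_respond, recorded):
    """Replay a recorded conversation against the bot under test.

    `bot_respond(user_turn)` is a stand-in for the agent's response
    function; `recorded` is a list of (user_turn, expected_response)
    pairs captured from a past conversation. Returns the turns where
    the current behaviour diverges from the recorded behaviour.
    """
    divergences = []
    for i, (user_turn, expected) in enumerate(recorded):
        actual = bot_respond(user_turn)
        if actual != expected:
            divergences.append((i, expected, actual))
    return divergences
```

In practice the divergences would be triaged: some indicate regressions, others are intentional improvements that update the recorded baseline.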
A key part of a Conversational Agent is the retrieval component, which is responsible
for retrieving information that is useful to the user. In task-guiding assistants,
the retrieval component should not narrow the user's behaviour by omitting tasks that
could be relevant. However, matching information perfectly to a user's query is
arduous, since there is a plethora of ways a user might phrase a request for the
same objective. To tackle this, we make use of a semantic retrieval algorithm,
adapting it to this domain by generating a synthetic dataset.
Chatbots for Modelling, Modelling of Chatbots
Unpublished doctoral thesis, read at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 28-03-202
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential
but also introduce challenges related to content constraints and potential
misuse. Our study investigates three key research questions: (1) the number of
different prompt types that can jailbreak LLMs, (2) the effectiveness of
jailbreak prompts in circumventing LLM constraints, and (3) the resilience of
ChatGPT against these jailbreak prompts. Initially, we develop a classification
model to analyze the distribution of existing prompts, identifying ten distinct
patterns and three categories of jailbreak prompts. Subsequently, we assess the
jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a
dataset of 3,120 jailbreak questions across eight prohibited scenarios.
Finally, we evaluate the resistance of ChatGPT against jailbreak prompts,
finding that the prompts can consistently evade the restrictions in 40 use-case
scenarios. The study underscores the importance of prompt structures in
jailbreaking LLMs and discusses the challenges of robust jailbreak prompt
generation and prevention.
Knowledge-driven Natural Language Understanding of English Text and its Applications
Understanding the meaning of a text is a fundamental challenge of natural
language understanding (NLU) research. An ideal NLU system should process a
language in a way that is not exclusive to a single task or a dataset. Keeping
this in mind, we have introduced a novel knowledge-driven semantic
representation approach for English text. By leveraging the VerbNet lexicon, we
are able to map the syntax tree of the text to its commonsense meaning, represented
using basic knowledge primitives. The general-purpose knowledge produced by our
approach can be used to build any reasoning-based NLU system that can
also provide justification. We applied this approach to construct two NLU
applications that we present here: SQuARE (Semantic-based Question Answering
and Reasoning Engine) and StaCACK (Stateful Conversational Agent using
Commonsense Knowledge). Both systems work by "truly understanding" the
natural language text they process, and both provide natural language
explanations for their responses while maintaining high accuracy.
Comment: Preprint. Accepted by the 35th AAAI Conference (AAAI-21) Main Track
Software engineering for AI-based systems: A survey
AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image-, speech-recognition, and autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems.
To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study.
We considered 248 studies published between January 2010 and March 2020.
SE for AI-based systems is an emerging research area, where more than 2/3 of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges.
Our results are valuable for: researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and educators, to bridge the gap between SE and AI in their curricula.
This work has been partially funded by the "Beatriz Galindo" Spanish Program BEAGAL18/00064 and by the DOGO4ML Spanish research project (ref. PID2020-117191RB-I00).