41 research outputs found

    Astraea: Grammar-based Fairness Testing

    Software often produces biased outputs. In particular, machine learning (ML) based software is known to produce erroneous predictions when processing discriminatory inputs. Such unfair program behavior can be caused by societal bias. In recent years, Amazon, Microsoft, and Google have provided software services that produce unfair outputs, mostly due to societal bias (e.g., gender or race). In such events, developers are saddled with the task of conducting fairness testing. Fairness testing is challenging; developers must generate discriminatory inputs that reveal and explain biases. We propose a grammar-based fairness testing approach (called ASTRAEA) which leverages context-free grammars to generate discriminatory inputs that reveal fairness violations in software systems. Using probabilistic grammars, ASTRAEA also provides fault diagnosis by isolating the cause of observed software bias. ASTRAEA's diagnoses facilitate the improvement of ML fairness. ASTRAEA was evaluated on 18 software systems that provide three major natural language processing (NLP) services. In our evaluation, ASTRAEA generated fairness violations at a rate of ~18%. ASTRAEA generated over 573K discriminatory test cases and found over 102K fairness violations. Furthermore, ASTRAEA improves software fairness by ~76% via model retraining.
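    The grammar-based generation described above can be sketched as follows. This is a minimal illustration in the spirit of ASTRAEA, not the tool's actual grammar or API: a tiny context-free grammar expands sentence templates, a protected attribute (here, a name pool) is varied while all other expansion choices are held fixed, and diverging model predictions on the resulting pair flag a fairness violation.

```python
import random

# Illustrative grammar: nonterminals map to lists of productions.
GRAMMAR = {
    "<sentence>": [["<name>", " is a ", "<occupation>", "."]],
    "<occupation>": [["doctor"], ["nurse"], ["engineer"]],
}
# Hypothetical protected attribute: two name pools of equal size.
PROTECTED = {"<name>": (["Mary", "Alice"], ["John", "Bob"])}

def expand(symbol, name_pool):
    """Recursively expand a grammar symbol into a string."""
    if symbol in PROTECTED:
        return random.choice(name_pool)
    if symbol in GRAMMAR:
        production = random.choice(GRAMMAR[symbol])
        return "".join(expand(s, name_pool) for s in production)
    return symbol  # terminal

def generate_pair():
    """Generate two sentences differing only in the protected attribute."""
    seed = random.randrange(1_000_000)
    female_pool, male_pool = PROTECTED["<name>"]
    random.seed(seed)
    a = expand("<sentence>", female_pool)
    random.seed(seed)  # same seed: identical non-protected choices
    b = expand("<sentence>", male_pool)
    return a, b

def is_violation(model, a, b):
    # Same template, different protected value, different prediction.
    return model(a) != model(b)
```

    Seeding the generator identically for both members of the pair keeps every expansion choice aligned except the protected one, so any prediction difference is attributable to the protected attribute.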

    Software-based dialogue systems: Survey, taxonomy and challenges

    The use of natural language interfaces in the field of human-computer interaction is undergoing intense study through dedicated scientific and industrial research. The latest contributions in the field, including deep learning approaches like recurrent neural networks, the potential of context-aware strategies and user-centred design approaches, have brought the attention of the community back to software-based dialogue systems, generally known as conversational agents or chatbots. Nonetheless, given the novelty of the field, a generic, context-independent overview of the current state of research on conversational agents covering all research perspectives involved is missing. Motivated by this context, this paper reports a survey of the current state of research on conversational agents through a systematic literature review of secondary studies. The conducted research is designed to develop an exhaustive perspective through a clear presentation of the aggregated knowledge published by recent literature within a variety of domains, research focuses and contexts. As a result, this research proposes a holistic taxonomy of the different dimensions involved in the conversational agents' field, which is expected to help researchers and to lay the groundwork for future research in the field of natural language interfaces. With the support from the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. The corresponding author gratefully acknowledges the Universitat Politècnica de Catalunya and Banco Santander for the financial support of his predoctoral grant FPI-UPC. This paper has been funded by the Spanish Ministerio de Ciencia e Innovación under project / funding scheme PID2020-117191RB-I00 / AEI/10.13039/501100011033.

    Automated Testing and Improvement of Named Entity Recognition Systems

    Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, successfully repairing 1,056 out of the 1,877 reported NER errors. Comment: Accepted by ESEC/FSE'2
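    The testing idea above — that the same named entity under similar contexts should receive identical predictions — can be sketched as a small metamorphic check. The `ner` callable and the context-template format below are hypothetical stand-ins, not the paper's implementation.

```python
# Metamorphic consistency check for an NER system: instantiate the same
# entity in several context templates and compare the assigned labels.
# `ner` is assumed to map a sentence to a dict of {entity: label}.

def entity_label(ner, sentence, entity):
    """Label that `ner` assigns to `entity` in `sentence` (None if missed)."""
    return ner(sentence).get(entity)

def suspicious_issues(ner, entity, context_templates):
    """Return per-context labels when they disagree, else an empty dict."""
    labels = {
        tpl: entity_label(ner, tpl.format(entity), entity)
        for tpl in context_templates
    }
    if len(set(labels.values())) > 1:
        return labels  # inconsistent predictions: report for inspection
    return {}
```

    A disagreement across contexts does not by itself say which prediction is wrong; like the suspicious issues reported by TIN, it marks a candidate error for verification.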

    Handling Change in a Production TaskBot: Efficiently Managing the Growth of TWIZ, an Alexa Assistant

    A Conversational Agent aims to converse with users, with a focus on natural behaviour and responses. Such agents can be extremely complex, as they comprise several components, several possible courses of action, and infinitely many possible inputs. Behaviour checking is therefore essential, especially in a production context, where wrong behaviour can have serious consequences. Nevertheless, developing a robust and correctly behaving Task Bot should not hinder research and must allow for continuous improvement of state-of-the-art solutions. Manual testing of such a complex system is bound to encounter several limits, either in the extent of the testing or in the time it consumes from developers. We therefore propose a tool to automatically test these highly sophisticated systems with a much broader test surface. We introduce a solution that leverages the replay and mimicking of past conversations to generate synthetic conversations, saving time on quality assurance and improving change handling. A key part of a Conversational Agent is the retrieval component, which is responsible for the correct retrieval of information that is useful to the user. In task-guiding assistants, the retrieval element should not narrow the user's behaviour by omitting tasks that could be relevant. However, achieving perfect matching of information to a user's query is arduous, since there is a plethora of ways a user could phrase an attempt to accomplish an objective. To tackle this, we adapt a semantic retrieval algorithm to this domain by generating a synthetic dataset.
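    The conversation-replay testing described in this entry can be sketched as a regression check: recorded conversations are replayed turn by turn against the current agent, and any response that diverges from the recorded one is reported. The `agent` callable and transcript format below are illustrative assumptions, not the thesis's actual tool.

```python
# Replay-based regression testing for a conversational agent.
# A transcript is a list of (user_utterance, recorded_response) pairs;
# `agent` is assumed to map an utterance to a response string.

def replay(agent, transcript):
    """Replay a recorded conversation; return turns whose response changed."""
    mismatches = []
    for turn, (utterance, expected) in enumerate(transcript):
        actual = agent(utterance)
        if actual != expected:
            mismatches.append((turn, expected, actual))
    return mismatches  # empty list means behaviour is unchanged
```

    In practice a real harness would also carry dialogue state between turns and tolerate benign variation (e.g. paraphrased responses); this sketch only captures the core replay-and-compare loop.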

    Chatbots for Modelling, Modelling of Chatbots

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 28-03-202

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Large Language Models (LLMs), like ChatGPT, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. Our study investigates three key research questions: (1) the number of different prompt types that can jailbreak LLMs, (2) the effectiveness of jailbreak prompts in circumventing LLM constraints, and (3) the resilience of ChatGPT against these jailbreak prompts. Initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. Subsequently, we assess the jailbreak capability of prompts with ChatGPT versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. Finally, we evaluate the resistance of ChatGPT against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. The study underscores the importance of prompt structures in jailbreaking LLMs and discusses the challenges of robust jailbreak prompt generation and prevention.

    Knowledge-driven Natural Language Understanding of English Text and its Applications

    Understanding the meaning of a text is a fundamental challenge of natural language understanding (NLU) research. An ideal NLU system should process a language in a way that is not exclusive to a single task or a dataset. Keeping this in mind, we have introduced a novel knowledge-driven semantic representation approach for English text. By leveraging the VerbNet lexicon, we are able to map the syntax tree of the text to its commonsense meaning represented using basic knowledge primitives. The general-purpose knowledge represented by our approach can be used to build any reasoning-based NLU system that can also provide justification. We applied this approach to construct two NLU applications that we present here: SQuARE (Semantic-based Question Answering and Reasoning Engine) and StaCACK (Stateful Conversational Agent using Commonsense Knowledge). Both these systems work by "truly understanding" the natural language text they process, and both provide natural language explanations for their responses while maintaining high accuracy. Comment: Preprint. Accepted by the 35th AAAI Conference (AAAI-21) Main Track.

    Software engineering for AI-based systems: A survey

    AI-based systems are software systems with functionalities enabled by at least one AI component (e.g., for image recognition, speech recognition, or autonomous driving). AI-based systems are becoming pervasive in society due to advances in AI. However, there is limited synthesized knowledge on Software Engineering (SE) approaches for building, operating, and maintaining AI-based systems. To collect and analyze state-of-the-art knowledge about SE for AI-based systems, we conducted a systematic mapping study. We considered 248 studies published between January 2010 and March 2020. SE for AI-based systems is an emerging research area, where more than two-thirds of the studies have been published since 2018. The most studied properties of AI-based systems are dependability and safety. We identified multiple SE approaches for AI-based systems, which we classified according to the SWEBOK areas. Studies related to software testing and software quality are very prevalent, while areas like software maintenance seem neglected. Data-related issues are the most recurrent challenges. Our results are valuable for researchers, to quickly understand the state of the art and learn which topics need more research; practitioners, to learn about the approaches and challenges that SE entails for AI-based systems; and educators, to bridge the gap between SE and AI in their curricula. This work has been partially funded by the "Beatriz Galindo" Spanish Program BEAGAL18/00064 and by the DOGO4ML Spanish research project (ref. PID2020-117191RB-I00).

    Typology of risks of generative text-to-image models
