46 research outputs found
Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing
GUI testing checks if a software system behaves as expected when users
interact with its graphical interface, e.g., testing specific functionality or
validating relevant use case scenarios. Currently, deciding what to test at
this high level is a manual task since automated GUI testing tools target lower
level adequacy metrics such as structural code coverage or activity coverage.
We propose DroidAgent, an autonomous GUI testing agent for Android, for
semantic, intent-driven automation of GUI testing. It is based on Large
Language Models and support mechanisms such as long- and short-term memory.
Given an Android app, DroidAgent sets relevant task goals and subsequently
tries to achieve them by interacting with the app. Our empirical evaluation of
DroidAgent using 15 apps from the Themis benchmark shows that it can set up and
perform realistic tasks, with a higher level of autonomy. For example, when
testing a messaging app, DroidAgent created a second account and added a first
account as a friend, testing a realistic use case, without human intervention.
On average, DroidAgent achieved 61% activity coverage, compared to 51% for
current state-of-the-art GUI testing techniques. Further, manual analysis shows
that 317 out of the 374 autonomously created tasks are realistic and relevant
to app functionalities, and also that DroidAgent interacts deeply with the apps
and covers more features.Comment: 10 page
Mobile GUI Testing Fragility: A Study on Open-Source Android Applications
Android applications do not seem to be tested as thoroughly as desktop ones. In particular, GUI testing appears generally limited. Like webbased applications, mobile apps suffer from GUI
test fragility, i.e. GUI test classes failing or needing updates due to even minor modifications in the GUI or in the Application Under Test.
The objective of our study is to estimate the adoption of GUI testing frameworks among Android opensource applications, the quantity of modifications needed to keep test classes up to date, and the amount of them due to GUI test fragility. We introduce a set of 21 metrics to measure the adoption of testing tools, the evolution of test classes and test methods, and to estimate the fragility of test suites.
We computed our metrics for six GUI testing frameworks, none of which achieved a significant adoption among Android projects hosted on GitHub. When present, GUI test methods associated with the considered tools are modified often and a relevant portion (70% on average) of those modifications is induced by GUI-related fragilities. On average for the projects considered, more than 7% of the total modified lines of code between consecutive releases belong to test classes developed with the analysed testing frameworks. The measured percentage was higher on average than the one required by other generic test code, based on the JUnit testing framework.
Fragility of GUI tests constitute a relevant concern, probably an obstacle for developers to adopt test automation. This first evaluation of the fragility of Android scripted GUI testing can constitute a benchmark for developers and testers leveraging the analysed test tools, and the basis for the definition of a taxonomy of fragility causes and guidelines to mitigate the issue
A Metric Framework for the Gamification of Web and Mobile GUI Testing
System testing through the Graphical User Interface (GUI) is a valuable form of Verification & Validation for modern applications, especially in graphically-intensive domains like web and mobile applications. However, the practice is often overlooked by developers mostly because of its costly nature and the absence of immediate feedback about the quality of test sequence. This paper describes a proposal for the Gamification of exploratory GUI testing. We define - in a tool and domain- agnostic way - the basic concepts, a set of metrics, a scoring scheme and visual feedbacks to enable a gamified approach to the practice; we finally discuss the potential implications and envision a roadmap for the evaluation of the approach
Fill in the Blank: Context-aware Automated Text Input Generation for Mobile GUI Testing
Automated GUI testing is widely used to help ensure the quality of mobile
apps. However, many GUIs require appropriate text inputs to proceed to the next
page which remains a prominent obstacle for testing coverage. Considering the
diversity and semantic requirement of valid inputs (e.g., flight departure,
movie name), it is challenging to automate the text input generation. Inspired
by the fact that the pre-trained Large Language Model (LLM) has made
outstanding progress in text generation, we propose an approach named QTypist
based on LLM for intelligently generating semantic input text according to the
GUI context. To boost the performance of LLM in the mobile testing scenario, we
develop a prompt-based data construction and tuning method which automatically
extracts the prompts and answers for model tuning. We evaluate QTypist on 106
apps from Google Play and the result shows that the passing rate of QTypist is
87%, which is 93% higher than the best baseline. We also integrate QTypist with
the automated GUI testing tools and it can cover 42% more app activities, 52%
more pages, and subsequently help reveal 122% more bugs compared with the raw
tool.Comment: Accepted by IEEE/ACM International Conference on Software Engineering
2023 (ICSE 2023
Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions
Automated Graphical User Interface (GUI) testing plays a crucial role in
ensuring app quality, especially as mobile applications have become an integral
part of our daily lives. Despite the growing popularity of learning-based
techniques in automated GUI testing due to their ability to generate human-like
interactions, they still suffer from several limitations, such as low testing
coverage, inadequate generalization capabilities, and heavy reliance on
training data. Inspired by the success of Large Language Models (LLMs) like
ChatGPT in natural language understanding and question answering, we formulate
the mobile GUI testing problem as a Q&A task. We propose GPTDroid, asking LLM
to chat with the mobile apps by passing the GUI page information to LLM to
elicit testing scripts, and executing them to keep passing the app feedback to
LLM, iterating the whole process. Within this framework, we have also
introduced a functionality-aware memory prompting mechanism that equips the LLM
with the ability to retain testing knowledge of the whole process and conduct
long-term, functionality-based reasoning to guide exploration. We evaluate it
on 93 apps from Google Play and demonstrate that it outperforms the best
baseline by 32% in activity coverage, and detects 31% more bugs at a faster
rate. Moreover, GPTDroid identify 53 new bugs on Google Play, of which 35 have
been confirmed and fixed.Comment: Accepted by IEEE/ACM International Conference on Software Engineering
2024 (ICSE 2024). arXiv admin note: substantial text overlap with
arXiv:2305.0943
Software Testing with Large Language Model: Survey, Landscape, and Vision
Pre-trained large language models (LLMs) have recently emerged as a
breakthrough technology in natural language processing and artificial
intelligence, with the ability to handle large-scale datasets and exhibit
remarkable performance across a wide range of tasks. Meanwhile, software
testing is a crucial undertaking that serves as a cornerstone for ensuring the
quality and reliability of software products. As the scope and complexity of
software systems continue to grow, the need for more effective software testing
techniques becomes increasingly urgent, and making it an area ripe for
innovative approaches such as the use of LLMs. This paper provides a
comprehensive review of the utilization of LLMs in software testing. It
analyzes 52 relevant studies that have used LLMs for software testing, from
both the software testing and LLMs perspectives. The paper presents a detailed
discussion of the software testing tasks for which LLMs are commonly used,
among which test case preparation and program repair are the most
representative ones. It also analyzes the commonly used LLMs, the types of
prompt engineering that are employed, as well as the accompanied techniques
with these LLMs. It also summarizes the key challenges and potential
opportunities in this direction. This work can serve as a roadmap for future
research in this area, highlighting potential avenues for exploration, and
identifying gaps in our current understanding of the use of LLMs in software
testing.Comment: 20 pages, 11 figure
Gamified Exploratory GUI Testing of Web Applications: a Preliminary Evaluation
In the context of Software Engineering, testing is a well-known phase that plays a critical role, as is needed to ensure that the designed and produced code provides the expected results, avoiding faults and crashes. Exploratory GUI testing allows the tester to manually define test cases by directly interacting with the user interface of the finite system. However, testers often loosely perform exploratory GUI testing, as they perceive it as a time-consuming, repetitive and unappealing activity. We defined a gamified framework for GUI testing to address this issue, which we developed and integrated into the Augmented testing tool, Scout. Gamification is perceived as a means to enhance the performance of human testers by stimulating competition and encouraging them to achieve better results in terms of both efficiency and effectiveness. We performed a preliminary evaluation of the gamification layer with a small sample of testers to assess the benefits of the technique compared with the standard version of the same tool. Test sequences defined with the gamified tool achieved higher coverage (i.e., higher efficiency) and a slightly higher percentage of bugs found. The user's opinion was almost unanimously in favor of the gamified version of the tool
LLM4TDD: Best Practices for Test Driven Development Using Large Language Models
In today's society, we are becoming increasingly dependent on software
systems. However, we also constantly witness the negative impacts of buggy
software. Program synthesis aims to improve software correctness by
automatically generating the program given an outline of the expected behavior.
For decades, program synthesis has been an active research field, with recent
approaches looking to incorporate Large Language Models to help generate code.
This paper explores the concept of LLM4TDD, where we guide Large Language
Models to generate code iteratively using a test-driven development
methodology. We conduct an empirical evaluation using ChatGPT and coding
problems from LeetCode to investigate the impact of different test, prompt and
problem attributes on the efficacy of LLM4TDD