300 research outputs found

    Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements

    Full text link
    Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations -- called refinements -- to mobile app screens to improve touch efficiency. Reflow uses a pixel-based strategy to work with existing applications, and improves touch efficiency while minimally disrupting the design intent of the original application. Our system optimizes a UI by (i) extracting its layout from its screenshot, (ii) refining its layout, and (iii) re-rendering the UI to reflect these modifications. We conducted a user study with 10 participants and a heuristic evaluation with 6 experts and found that applications optimized by Reflow led to, on average, 9% faster selection time with minimal layout disruption. The results demonstrate that Reflow's refinements are useful UI adaptations for improving touch interactions
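    The abstract describes Reflow's pipeline only at a high level. As a rough, hypothetical sketch of stage (ii), the snippet below grows undersized touch targets to a minimum size while keeping their centres fixed; the element representation, the 48 px threshold, and all names are illustrative assumptions, not Reflow's actual implementation.

```python
from dataclasses import dataclass, replace

@dataclass
class Element:
    """A UI element extracted from a screenshot, in screen pixels (stage i)."""
    label: str
    x: int
    y: int
    w: int
    h: int

MIN_TARGET_PX = 48  # a common touch-target guideline; Reflow's real criteria differ

def refine_layout(elements: list[Element]) -> list[Element]:
    """Stage (ii): apply a small refinement -- grow undersized touch targets
    while keeping their centres fixed. Stage (iii) would re-render the screen
    from the refined geometry."""
    refined = []
    for e in elements:
        w, h = max(e.w, MIN_TARGET_PX), max(e.h, MIN_TARGET_PX)
        refined.append(replace(e, x=e.x - (w - e.w) // 2,
                                  y=e.y - (h - e.h) // 2, w=w, h=h))
    return refined

print(refine_layout([Element("back_button", x=10, y=30, w=32, h=32)]))
# [Element(label='back_button', x=2, y=22, w=48, h=48)]
```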

    Screen Correspondence: Mapping Interchangeable Elements between UIs

    Full text link
    Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by mapping their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semantics, appearance, and text to support correspondence computation without requiring any labeled examples. Through a comprehensive performance evaluation, we show that our approach improves upon baselines by incorporating multi-modal properties of UIs. Finally, we show three example applications where screen correspondence facilitates better UI understanding for humans and machines: (i) instructional overlay generation, (ii) semantic UI element search, and (iii) automated interface testing
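    The paper describes correspondence computation only in prose. The sketch below is a generic stand-in, not the authors' model: given precomputed per-element embeddings (which, in the paper's spirit, would fuse semantics, appearance, and text), it matches elements across two screens by maximizing total cosine similarity with a standard assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def correspond(src_emb: np.ndarray, dst_emb: np.ndarray) -> list[tuple[int, int]]:
    """Match each source element to a destination element by maximizing total
    cosine similarity between their (precomputed) multi-modal embeddings.

    src_emb, dst_emb: arrays of shape (n_elements, dim); each row stands in for
    a fused semantics + appearance + text vector.
    """
    a = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    b = dst_emb / np.linalg.norm(dst_emb, axis=1, keepdims=True)
    sim = a @ b.T                                # pairwise cosine similarity
    rows, cols = linear_sum_assignment(-sim)     # negate to maximize similarity
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage with random vectors standing in for real element features.
rng = np.random.default_rng(0)
pairs = correspond(rng.normal(size=(3, 16)), rng.normal(size=(4, 16)))
print(pairs)  # three (source index, destination index) pairs
```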

    Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus

    Full text link
    Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies can offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose \textit{Spotlight}, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen -- the focus -- as the input. This general architecture is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capabilities of the proposed models, demonstrating promising results in the multi-task learning direction
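    The abstract specifies the model's inputs -- a screenshot plus a focus region -- but no concrete format. Below is a hypothetical input record illustrating that idea; the field names, the normalized-coordinate convention, and the task strings are assumptions, not Spotlight's actual interface.

```python
from dataclasses import dataclass

@dataclass
class FocusInput:
    """Hypothetical vision-only input record: screen pixels plus a region of
    interest, with no view hierarchy."""
    screenshot_png: bytes                           # raw screen pixels
    focus_box: tuple[float, float, float, float]    # (x0, y0, x1, y1) in [0, 1]
    task: str                                       # e.g. "widget captioning"

def normalize_box(box_px, screen_w, screen_h):
    """Convert a pixel-space bounding box to the normalized coordinates a
    region-conditioned model would typically consume."""
    x0, y0, x1, y1 = box_px
    return (x0 / screen_w, y0 / screen_h, x1 / screen_w, y1 / screen_h)

example = FocusInput(
    screenshot_png=b"...",  # placeholder bytes
    focus_box=normalize_box((48, 96, 144, 160), screen_w=1080, screen_h=1920),
    task="widget captioning",
)
print(example.focus_box)
```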

    Deep reinforcement learning for multi-modal embodied navigation

    Full text link
    This work focuses on an Outdoor Micro-Navigation (OMN) task in which the goal is to navigate to a specified street address using multiple modalities including images, scene-text, and GPS. This task is a significant challenge to many Blind and Visually Impaired (BVI) people, which we demonstrate through interviews and market research, and we scope our problem definition to their needs. To investigate the feasibility of solving this task with Deep Reinforcement Learning (DRL), we first introduce two partially observable grid-worlds, Grid-Street and Grid City, containing houses, street numbers, and navigable regions. In these environments, we train an agent to find specific houses using local observations under a variety of training procedures. We parameterize our agent with a neural network and train using reinforcement learning methods. Next, we introduce the Sidewalk Environment for Visual Navigation (SEVN), which contains panoramic images with labels for house numbers, doors, and street name signs, and formulations for several navigation tasks. In SEVN, we train another neural network model using Proximal Policy Optimization (PPO) to fuse multi-modal observations in the form of variable resolution images, visible text, and simulated GPS data, and to use this representation to navigate to goal doors. Our best model used all available modalities and was able to navigate to over 100 goals with an 85% success rate. We found that models with access to only a subset of these modalities performed significantly worse, supporting the need for a multi-modal approach to the OMN task. We hope that this thesis provides a foundation for further research into the creation of agents to assist members of the BVI community to safely navigate the world
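    As a rough illustration of the multi-modal fusion described above (not SEVN's actual architecture), the sketch below concatenates a small CNN image encoding, a bag-of-tokens embedding of visible text, and an encoded GPS pair into shared features for policy and value heads, the kind of network PPO would then train. All layer choices and sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalPolicy(nn.Module):
    """Minimal sketch of fusing image, scene-text, and GPS observations."""
    def __init__(self, n_actions: int, vocab: int = 1000, text_dim: int = 32):
        super().__init__()
        self.image_enc = nn.Sequential(            # tiny CNN over 3x84x84 frames
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(128), nn.ReLU(),
        )
        self.text_enc = nn.EmbeddingBag(vocab, text_dim)   # mean over visible-text tokens
        self.gps_enc = nn.Sequential(nn.Linear(2, 16), nn.ReLU())
        self.policy = nn.Linear(128 + text_dim + 16, n_actions)
        self.value = nn.Linear(128 + text_dim + 16, 1)

    def forward(self, image, text_tokens, gps):
        fused = torch.cat([self.image_enc(image),
                           self.text_enc(text_tokens),
                           self.gps_enc(gps)], dim=-1)
        return self.policy(fused), self.value(fused)

# Toy forward pass: batch of 2 observations.
model = MultiModalPolicy(n_actions=4)
logits, value = model(torch.randn(2, 3, 84, 84),
                      torch.randint(0, 1000, (2, 5)),
                      torch.randn(2, 2))
print(logits.shape, value.shape)   # torch.Size([2, 4]) torch.Size([2, 1])
```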

    ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

    Full text link
    We present a new task and dataset, ScreenQA, for screen content understanding via question answering. Existing screen datasets focus either on structure and component-level understanding, or on much higher-level composite tasks such as navigation and task completion. We attempt to bridge the gap between these two by annotating 80,000+ question-answer pairs over the RICO dataset, in the hope of benchmarking screen reading comprehension capability
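    The abstract does not describe the annotation format. A minimal, hypothetical record layout and loader for screenshot question-answering data is sketched below; the field names and JSON Lines packaging are assumptions and may not match the released ScreenQA files.

```python
import json
from dataclasses import dataclass

@dataclass
class ScreenQARecord:
    """Hypothetical record layout for screenshot question answering."""
    screen_id: str       # RICO screenshot identifier
    question: str
    answers: list[str]   # one or more acceptable short answers

def load_records(path: str) -> list[ScreenQARecord]:
    """Read a JSON Lines file with one question-answer annotation per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            records.append(ScreenQARecord(row["screen_id"],
                                          row["question"],
                                          row["answers"]))
    return records

# Example line such a file might contain:
# {"screen_id": "rico_00123", "question": "What is the check-in date?", "answers": ["March 3"]}
```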

    GreaseVision: Rewriting the Rules of the Interface

    Get PDF
    Digital harms can manifest across any interface. Key problems in addressing these harms include the high individuality of harms and the fast-changing nature of digital systems. As a result, we still lack a systematic approach to studying harms and producing interventions for end-users. We put forward GreaseVision, a new framework that enables end-users to collaboratively develop interventions against harms in software using a no-code approach and recent advances in few-shot machine learning. The framework and tool allow individual end-users to study their usage history and create personalized interventions, and they enable researchers to study the distribution of harms and interventions at scale

    A competencies framework of visual impairments for enabling shared understanding in design

    Get PDF
    Existing work in Human Computer Interaction and accessibility research has long sought to investigate the experiences of people with visual impairments in order to address their needs through technology design and integrate their participation into different stages of the design process. Yet challenges remain regarding how disabilities are framed in technology design and the extent of involvement of disabled people within it. Furthermore, accessibility is often considered a specialised job and misunderstandings or assumptions about visually impaired people’s experiences and needs occur outside dedicated fields.
    This thesis presents an ethnomethodology-informed design critique for supporting awareness and shared understanding of visual impairments and accessibility that centres on their experiences, abilities, and participation in early-stage design. This work is rooted in an in-depth empirical investigation of the interactional competencies that people with visual impairments exhibit through their use of technology, which informs and shapes the concept of a Competencies Framework of Visual Impairments. Although past research has established stances for considering the individual abilities of disabled people and other social and relational factors in technology design, by drawing on ethnomethodology and its interest in situated competence this thesis employs an interactional perspective to investigate the practical accomplishments of visually impaired people. Thus, this thesis frames visual impairments in terms of competencies to be considered in the design process, rather than a deficiency or problem to be fixed through technology. Accordingly, this work favours supporting awareness and reflection rather than the design of particular solutions, which are also strongly needed for advancing accessible design at large.
    This PhD thesis comprises two main empirical studies branched into three different investigations. The first and second investigations are based on a four-month ethnographic study with visually impaired participants examining their everyday technology practices. The third investigation comprises the design and implementation of a workshop study developed to include people with and without visual impairments in collaborative reflections about technology and accessibility. As such, each investigation informed the ones that followed, revisiting and refining concepts and design materials throughout the thesis. Although ethnomethodology is the overarching approach running through this PhD project, each investigation has a different focus of enquiry:
    • The first is focused on analysing participants’ technology practices and unearthing the interactional competencies enabling them.
    • The second is focused on analysing technology demonstrations, which were a pervasive phenomenon recorded during fieldwork, and the work of demonstrating as exhibited by visually impaired participants.
    • Lastly, the third investigation defines a workshop approach employing video demonstrations and a deck of reflective design cards as building blocks for enabling shared understanding among people with and without visual impairments from different technology backgrounds; that is, users, technologists, designers, and researchers.
Overall, this thesis makes several contributions to audiences within and outside academia, such as the detailed accounts of some of the main technology practices of people with visual impairments and the methodological analysis of demonstrations in empirical Human Computer Interaction and accessibility research. Moreover, the main contribution lies in the conceptualisation of a Competencies Framework of Visual Impairments from the empirical analysis of interactional competencies and their practical exhibition through demonstrations, as well as the creation and use of a deck of cards that encapsulates the competencies and external elements involved in the everyday interactional accomplishments of people with visual impairments. All these contributions are lastly brought together in the implementation of the workshop approach that enabled participants to interact with and learn from each other. Thus, this thesis builds upon and advances contemporary strands of work in Human Computer Interaction that call for re-orienting how visual impairments and, overall, disabilities are framed in technology design, and ultimately for re-shaping the design practice itself

    Demonstrating Interaction: The Case of Assistive Technology

    Get PDF
    Technology 'demos' have become a staple in technology design practice, especially for showcasing prototypes or systems. However, demonstrations are also commonplace and multifaceted phenomena in everyday life, and thus have found their way into empirical research of technology use. In spite of their presence in HCI, their methodical character as a research tool has so far received little attention in our community. We analysed 102 video-recorded demonstrations performed by visually impaired people, captured in the context of a larger ethnographic study investigating their technology use. In doing so, we exhibit core features of demonstrational work and discuss the relevance of the meta-activities occurring around and within demonstrations. We reflect on their value as an approach to doing HCI research on assistive technologies, for enabling shared understanding and letting us identify opportunities for design. Lastly, we discuss their implications as a research instrument for accessibility and HCI research more broadly

    Automating Software Development for Mobile Computing Platforms

    Get PDF
    Mobile devices such as smartphones and tablets have become ubiquitous in today's computing landscape. These devices have ushered in entirely new populations of users, and mobile operating systems are now outpacing more traditional desktop systems in terms of market share. The applications that run on these mobile devices (often referred to as apps) have become a primary means of computing for millions of users and, as such, have garnered immense developer interest. These apps allow for unique, personal software experiences through touch-based UIs and a complex assortment of sensors. However, designing and implementing high quality mobile apps can be a difficult process. This is primarily due to challenges unique to mobile development including change-prone APIs and platform fragmentation, just to name a few. In this dissertation we develop techniques that aid developers in overcoming these challenges by automating and improving current software design and testing practices for mobile apps. More specifically, we first introduce a technique, called GVT, that improves the quality of graphical user interfaces (GUIs) for mobile apps by automatically detecting instances where a GUI was not implemented to its intended specifications. GVT does this by constructing hierarchical models of mobile GUIs from metadata associated with both graphical mock-ups (i.e., created by designers using photo-editing software) and running instances of the GUI from the corresponding implementation. Second, we develop an approach that completely automates prototyping of GUIs for mobile apps. This approach, called ReDraw, is able to transform an image of a mobile app GUI into runnable code by detecting discrete GUI components using computer vision techniques, classifying these components into proper functional categories (e.g., button, dropdown menu) using a Convolutional Neural Network (CNN), and assembling these components into realistic code. Finally, we design a novel approach for automated testing of mobile apps, called CrashScope, that explores a given Android app using systematic input generation with the intrinsic goal of triggering crashes. The GUI-based input generation engine is driven by a combination of static and dynamic analyses that create a model of an app's GUI and target common, empirically derived root causes of crashes in Android apps. We illustrate that the techniques presented in this dissertation represent significant advancements in mobile development processes through a series of empirical investigations, user studies, and industrial case studies that demonstrate the effectiveness of these approaches and the benefit they provide developers
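    As a toy illustration of the component-assembly stage of a ReDraw-style pipeline (detect components, classify them, assemble code), the sketch below maps already-classified components to Android widget tags and emits a skeletal layout. The label set, tag mapping, and positioning scheme are assumptions, not ReDraw's actual output.

```python
from dataclasses import dataclass

@dataclass
class DetectedComponent:
    """A GUI component detected in a screenshot and classified (e.g. by a CNN)."""
    kind: str   # classifier label, e.g. "Button"
    x: int      # bounding box in pixels
    y: int
    w: int
    h: int

# Hypothetical mapping from classifier labels to Android widget tags;
# a real system would learn or curate a much richer mapping.
WIDGET_TAGS = {"Button": "Button", "Text": "TextView", "Image": "ImageView"}

def assemble_layout(components: list[DetectedComponent]) -> str:
    """Emit a skeletal Android layout from classified components."""
    lines = ['<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"',
             '    android:layout_width="match_parent" android:layout_height="match_parent">']
    for c in components:
        tag = WIDGET_TAGS.get(c.kind, "View")
        lines.append(f'    <{tag} android:layout_width="{c.w}px" '
                     f'android:layout_height="{c.h}px" '
                     f'android:layout_marginLeft="{c.x}px" '
                     f'android:layout_marginTop="{c.y}px" />')
    lines.append('</FrameLayout>')
    return "\n".join(lines)

print(assemble_layout([DetectedComponent("Button", 40, 900, 200, 64)]))
```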