Understanding and improving subjective measures in human-computer interaction
In Human-Computer Interaction (HCI), research has shifted from a focus on usability and performance towards the holistic notion of User Experience (UX). Research into UX places special emphasis on concepts from psychology, such as emotion, trust, and motivation. Under this paradigm, elaborate methods are needed to capture the richness and diversity of subjective experiences. Although psychology offers a long-standing tradition of developing self-report scales, it is currently undergoing radical changes in research and reporting practice. Hence, UX research faces several challenges, such as the widespread use of ad-hoc questionnaires with unknown or unsatisfactory psychometric properties, and a lack of replication and transparency. This thesis therefore addresses several gaps in the research by developing and validating self-report scales in the domains of user motivation (manuscript 1), perceived user interface language quality (manuscript 2), and user trust (manuscript 3). Furthermore, issues of online research and practical considerations to ensure data quality are examined empirically (manuscript 4). Overall, this thesis provides well-documented templates for scale development and may help improve scientific rigor in HCI.
How to Measure the Game Experience? Analysis of the Factor Structure of Two Questionnaires
We describe and report the analysis of two widely used questionnaires for measuring player experience in digital games. To contribute to the further validation and meaningful application of the PENS and the GEQ, we examined the underlying factorial structure of both questionnaires. Four hundred and forty-seven participants played two different games and rated them on a set of variables including the PENS and GEQ. Consistent with previous research, we gained additional insight into how both measures can be optimized: while the factor structure of the PENS appears to be consistent and invariant across the two games, the GEQ shows weaknesses in fulfilling these requirements.
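For readers who want to run a similar check on their own data, the sketch below fits a confirmatory factor model per game and compares fit indices using the semopy package. The factor structure, item names, and the file pens_ratings.csv are illustrative assumptions, not the questionnaires' actual specifications.

```python
import pandas as pd
import semopy

# Illustrative 3-factor structure in lavaan-style syntax; the real PENS
# specification and item names differ.
DESC = """
competence  =~ item1 + item2 + item3
autonomy    =~ item4 + item5 + item6
relatedness =~ item7 + item8 + item9
"""

# hypothetical data: one row per participant, item columns plus a 'game' column
ratings = pd.read_csv("pens_ratings.csv")

for game, df in ratings.groupby("game"):
    model = semopy.Model(DESC)
    model.fit(df)                       # estimate loadings for this game's data
    fit = semopy.calc_stats(model)
    print(game, fit[["CFI", "TLI", "RMSEA"]].round(3))
```

Comparing fit indices across the two games in this way is only a rough proxy for a formal measurement-invariance test (e.g., a multi-group CFA with equality constraints), but it makes large discrepancies visible quickly.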
Measuring user rated language quality: Development and validation of the user interface Language Quality Survey (LQS)
Written text plays a special role in user interfaces. Key information in interaction elements and content is mostly conveyed through text. The global context, in which software has to run in multiple geographical and cultural regions, requires software developers to translate their interfaces into many different languages. This translation process is prone to errors, so the question of how language quality can be measured is important. This paper presents the development of a questionnaire to measure user interface language quality (LQS). After a first validation of the instrument with 843 participants, a final set of 10 items remained, which was tested again. The survey showed a high internal consistency (Cronbach's α = .82), acceptable discriminatory power coefficients (.34–.47), and a moderate average homogeneity of .36. The LQS also showed a moderate correlation with the UMUX, an established usability metric (convergent validity), and it successfully distinguished high from low language quality (discriminative validity). Application to three different products (YouTube, Google Analytics, Google AdWords) revealed similar key statistics, providing evidence that the survey is product-independent. The survey has since been translated into and applied in more than 60 languages.
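The reliability statistics reported here (Cronbach's α, corrected item-total correlations) are straightforward to compute. Below is a minimal sketch in plain pandas; the file lqs_responses.csv and the column names stand in for a hypothetical item-level data set.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def corrected_item_total(items: pd.DataFrame) -> pd.Series:
    """Discriminatory power: each item's correlation with the sum of the remaining items."""
    return pd.Series({col: items[col].corr(items.drop(columns=col).sum(axis=1)) for col in items})

# hypothetical responses: columns lqs01 ... lqs10, one row per respondent
lqs = pd.read_csv("lqs_responses.csv")
print(f"alpha = {cronbach_alpha(lqs):.2f}")
print(corrected_item_total(lqs).round(2))
```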
Breaking immersion: A theoretical framework of alienated play to facilitate critical reflection on interactive media
There is a growing interest in understanding how to best represent complexity using interactive digital narratives (IDNs). We conceptualize this as the aim to make players of such IDNs reflect critically on the complexity being represented. We argue that current understandings of player experience do not lend themselves to this aim. Research on interactive media has assumed immersion to be universally positive for the player experience. In this article, however, we argue that immersion into the Magic Circle of an IDN can be antagonistic to a critical experience, because immersion persuades players to suspend their disbelief rather than facilitating critical reflection. On the basis of the Epic Theater, we instead propose an alternative form of play called alienated play: a form of play in which the player is playing while also observing themselves play. This form of play should allow players to benefit from the enjoyable nature of play while simultaneously remaining at a critical distance. To illustrate our theory, we design two models, one for immersed play and one for alienated play. Furthermore, we present examples of designing for alienation in commercial video games, as well as hypotheses to test our theory in future research. This work thus contributes an initial, theoretically and practically informed form of play specifically designed to facilitate critical reflection on IDNs representing complexity.
Online Playtesting With Crowdsourcing: Advantages and Challenges
Answering important design questions and delivering actionable insights within a couple of days is invaluable. Traditional playtests are often time-consuming, expensive, and deliver insights based on only a small sample of participants. Crowdsourced playtests may deliver feedback of comparable quality with fewer resources. However, several aspects have to be considered in order to obtain meaningful and actionable results. Based on our experience, we provide five recommendations to ensure data quality and prevent fraud. Taken together, this suggests that crowdsourced playtesting is a promising alternative for indie, non-profit, and academic Games User Research.
The quality of data collected online: An investigation of careless responding in a crowdsourced sample
Despite recent concerns about data quality, various academic fields rely increasingly on crowdsourced samples. The goal of this study was therefore to systematically assess carelessness in a crowdsourced sample (N = 394) by applying various measures and detection methods. A Latent Profile Analysis revealed that 45.9% of the participants showed some form of careless behavior. Excluding these participants increased the effect size in an experiment included in the survey. Based on our findings, we give several recommendations for easy-to-apply measures to assess data quality.
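Two of the simplest careless-responding indicators used in this line of work, invariant (longstring) responding and implausibly fast completion, can be screened for with a few lines of pandas. The thresholds, column names, and survey.csv below are illustrative assumptions, not the cut-offs used in the study.

```python
import pandas as pd

def longstring(row: pd.Series) -> int:
    """Length of the longest run of identical consecutive answers in one response row."""
    best = run = 1
    values = row.tolist()
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

responses = pd.read_csv("survey.csv")          # hypothetical: item columns plus duration_sec
items = responses.filter(like="item")

flags = pd.DataFrame({
    "longstring": items.apply(longstring, axis=1),
    "too_fast": responses["duration_sec"] < 2 * items.shape[1],   # < ~2 s per item, illustrative
})
# flag respondents who trip either indicator; thresholds are for illustration only
careless = flags["too_fast"] | (flags["longstring"] >= items.shape[1] // 2)
print(f"{careless.mean():.1%} of respondents flagged as potentially careless")
```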
Certification Labels for Trustworthy AI: Insights From an Empirical Mixed-Method Study
Auditing plays a pivotal role in the development of trustworthy AI. However, current research primarily focuses on creating auditable AI documentation, which is intended for regulators and experts rather than end-users affected by AI decisions. How to communicate to members of the public that an AI has been audited and considered trustworthy remains an open challenge. This study empirically investigated certification labels as a promising solution. Through interviews (N = 12) and a census-representative survey (N = 302), we investigated end-users' attitudes toward certification labels and their effectiveness in communicating trustworthiness in low- and high-stakes AI scenarios. Based on the survey results, we demonstrate that labels can significantly increase end-users' trust and willingness to use AI in both low- and high-stakes scenarios. However, end-users' preferences for certification labels and their effect on trust and willingness to use AI were more pronounced in high-stakes scenarios. Qualitative content analysis of the interviews revealed opportunities and limitations of certification labels, as well as facilitators and inhibitors for the effective use of labels in the context of AI. For example, while certification labels can mitigate data-related concerns expressed by end-users (e.g., privacy and data protection), other concerns (e.g., model performance) are more challenging to address. Our study provides valuable insights and recommendations for designing and implementing certification labels as a promising constituent within the trustworthy AI ecosystem.
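As a rough illustration of the kind of within-subject comparison behind the "labels increase trust" result, the sketch below runs a Wilcoxon signed-rank test per scenario type. The data file, column names, and design details are assumptions for illustration, not the study's actual analysis.

```python
import pandas as pd
from scipy.stats import wilcoxon

# hypothetical data: one row per participant with trust ratings for the same
# scenario shown with and without a certification label, plus a stakes column
survey = pd.read_csv("label_survey.csv")

for stakes, group in survey.groupby("stakes"):
    stat, p = wilcoxon(group["trust_with_label"], group["trust_without_label"])
    print(f"{stakes}: W = {stat:.1f}, p = {p:.4f}")
```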
To Trust or Distrust Trust Measures: Validating Questionnaires for Trust in AI
Despite the importance of trust in human-AI interactions, researchers must adopt questionnaires from other disciplines that lack validation in the AI context. Motivated by the need for reliable and valid measures, we investigated the psychometric quality of two trust questionnaires, the Trust between People and Automation scale (TPA) by Jian et al. (2000) and the Trust Scale for the AI Context (TAI) by Hoffman et al. (2023). In a pre-registered online experiment (N = 1485), participants observed interactions with trustworthy and untrustworthy AI (autonomous vehicle and chatbot). Results support the psychometric quality of the TAI while revealing opportunities to improve the TPA, which we outline in our recommendations for using the two questionnaires. Furthermore, our findings provide additional empirical evidence of trust and distrust as two distinct constructs that may coexist independently. Building on our findings, we highlight the opportunities and added value of measuring both trust and distrust in human-AI research and advocate for further work on both constructs.
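One way to probe whether trust and distrust behave as one construct or two is to fit competing factor models to item-level data and compare their fit, for example with semopy. The item names, model specifications, and tpa_items.csv below are hypothetical; they illustrate the model-comparison logic rather than the paper's actual analysis.

```python
import pandas as pd
import semopy

data = pd.read_csv("tpa_items.csv")   # hypothetical item-level responses, columns tpa1 ... tpa6

MODELS = {
    # all items load on a single trust factor
    "one-factor": "trust =~ tpa1 + tpa2 + tpa3 + tpa4 + tpa5 + tpa6",
    # trust and distrust as separate, correlated factors
    "two-factor": """
        trust    =~ tpa1 + tpa2 + tpa3
        distrust =~ tpa4 + tpa5 + tpa6
        trust ~~ distrust
    """,
}

for name, desc in MODELS.items():
    model = semopy.Model(desc)
    model.fit(data)
    fit = semopy.calc_stats(model)
    print(name, fit[["CFI", "RMSEA", "AIC", "BIC"]].round(3))
```

If the two-factor model fits clearly better, that is consistent with trust and distrust being distinct, potentially coexisting constructs.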
Exploring the effects of human-centered AI explanations on trust and reliance
Transparency is widely regarded as crucial for the responsible real-world deployment of artificial intelligence (AI) and is considered an essential prerequisite for establishing trust in AI. There are several approaches to enabling transparency, with one promising attempt being human-centered explanations. However, there is little research into the effectiveness of human-centered explanations on end-users' trust. What complicates the comparison of existing empirical work is that trust is measured in different ways: some researchers measure subjective trust using questionnaires, while others measure objective trust-related behavior such as reliance. To bridge these gaps, we investigated the effects of two promising human-centered post-hoc explanations, feature importance and counterfactuals, on trust and reliance. We compared these two explanations with a control condition in a decision-making experiment (N = 380). Results showed that human-centered explanations can significantly increase reliance, but the type of decision-making (increasing a price vs. decreasing a price) had an even greater influence. This challenges the presumed importance of transparency over other factors in human decision-making involving AI, such as potential heuristics and biases. We conclude that trust does not necessarily equate to reliance and emphasize the importance of appropriate, validated, and agreed-upon metrics to design and evaluate human-centered AI.
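The distinction between subjective trust and behavioral reliance can be made concrete with a small analysis sketch: reliance as the share of trials in which a participant followed the AI recommendation, correlated with a questionnaire-based trust score. The file and column names are hypothetical.

```python
import pandas as pd

trials = pd.read_csv("decision_trials.csv")   # hypothetical: participant, followed_ai (0/1), trust_score

# Behavioral reliance per participant: proportion of decisions that followed the AI
reliance = trials.groupby("participant")["followed_ai"].mean()
# Subjective trust per participant (assumed constant across that participant's trials)
trust = trials.groupby("participant")["trust_score"].first()

# If subjective trust equated to reliance, this correlation would be near 1
print(f"r(trust, reliance) = {trust.corr(reliance):.2f}")
```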
Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability
Replication studies in psychological science sometimes fail to reproduce prior findings. If these studies use methods that are unfaithful to the original study or ineffective in eliciting the phenomenon of interest, then a failure to replicate may be a failure of the protocol rather than a challenge to the original finding. Formal pre-data-collection peer review by experts may address shortcomings and increase replicability rates. We selected 10 replication studies from the Reproducibility Project: Psychology (RP:P; Open Science Collaboration, 2015) for which the original authors had expressed concerns about the replication designs before data collection; only one of these studies had yielded a statistically significant effect (p < .05). Commenters suggested that lack of adherence to expert review and low-powered tests were the reasons that most of these RP:P studies failed to replicate the original effects. We revised the replication protocols and received formal peer review prior to conducting new replication studies. We administered the RP:P and revised protocols in multiple laboratories (median number of laboratories per original study = 6.5, range = 3–9; median total sample = 1,279.5, range = 276–3,512) for high-powered tests of each original finding with both protocols. Overall, following the preregistered analysis plan, we found that the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = .002 or .014, depending on analytic approach). The median effect size for the revised protocols (r = .05) was similar to that of the RP:P protocols (r = .04) and the original RP:P replications (r = .11), and smaller than that of the original studies (r = .37). Analysis of the cumulative evidence across the original studies and the corresponding three replication attempts provided very precise estimates of the 10 tested effects and indicated that their effect sizes (median r = .07, range = .00–.15) were 78% smaller, on average, than the original effect sizes (median r = .37, range = .19–.50).
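The cumulative-evidence analysis described here pools correlations across labs. A minimal version of that pooling, a fixed-effect inverse-variance average of Fisher-z-transformed correlations, looks like the sketch below; the numbers plugged in are illustrative, not the Many Labs 5 data.

```python
import numpy as np

def pool_correlations(rs, ns):
    """Fixed-effect pooled correlation: inverse-variance average in Fisher-z space."""
    rs, ns = np.asarray(rs, dtype=float), np.asarray(ns, dtype=float)
    z = np.arctanh(rs)          # Fisher z-transform of each correlation
    w = ns - 3                  # weights = 1 / Var(z), with Var(z) = 1 / (n - 3)
    return np.tanh((w * z).sum() / w.sum())

# illustrative sample correlations and sample sizes only
print(f"pooled r = {pool_correlations([0.04, 0.05, 0.11], [320, 280, 150]):.3f}")
```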