Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming
Code-recommendation systems, such as Copilot and CodeWhisperer, have the
potential to improve programmer productivity by suggesting and auto-completing
code. However, to fully realize their potential, we must understand how
programmers interact with these systems and identify ways to improve that
interaction. To make progress, we studied GitHub Copilot, a code-recommendation
system used by millions of programmers daily. We developed CUPS, a taxonomy of
common programmer activities when interacting with Copilot. Our study of 21
programmers, who completed coding tasks and retrospectively labeled their
sessions with CUPS, showed that the taxonomy helps explain how programmers
interact with code-recommendation systems, revealing inefficiencies and time
costs. Our insights into programmers' interactions with Copilot motivate new
interface designs and metrics.
When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming
AI-powered code-recommendation systems, such as Copilot and CodeWhisperer,
provide code suggestions inside a programmer's environment (e.g., an IDE) with
the aim of improving productivity. Because programmers accept and reject
suggestions in these scenarios, such a system should ideally use this feedback
to further that goal. In this work, we leverage prior data of programmers
interacting with Copilot to develop interventions that can save programmer
time. We propose a utility-theory framework that models this interaction with
programmers and decides when and which suggestions to display. Our framework,
Conditional suggestion Display from Human Feedback (CDHF), is based on
predictive models of programmer actions. Using data from 535 programmers, we
build models that predict the likelihood of suggestion acceptance. In a
retrospective evaluation on real-world programming tasks solved with
AI-assisted programming, we find that CDHF can achieve favorable tradeoffs. Our
findings show the promise of integrating human feedback to improve interaction
with large language models in scenarios such as programming and possibly
writing tasks.
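The gating idea CDHF describes can be sketched as a simple threshold rule. A minimal illustration, assuming a predictive model that scores each candidate suggestion with an acceptance probability (the model, scores, and threshold below are all hypothetical, not the paper's exact formulation):

```python
# Hypothetical sketch of a CDHF-style display rule: a predictive model
# estimates the probability a programmer will accept a suggestion, and
# suggestions below a threshold are withheld to save the programmer time.

def should_display(p_accept: float, threshold: float = 0.3) -> bool:
    """Show the suggestion only if predicted acceptance is high enough."""
    return p_accept >= threshold

def filter_suggestions(suggestions, accept_model, threshold=0.3):
    """Keep only suggestions whose predicted acceptance clears the threshold."""
    return [s for s in suggestions if should_display(accept_model(s), threshold)]

# Toy acceptance model (an assumption for illustration): longer suggestions
# are scored as less likely to be accepted.
toy_model = lambda s: 1.0 / (1 + len(s) / 40)

shown = filter_suggestions(["x += 1", "a" * 200], toy_model, threshold=0.3)
```

In practice, the acceptance predictor would be trained on interaction logs such as the 535-programmer dataset the abstract mentions; the threshold trades off suppressed suggestions against time saved.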
Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork
AI practitioners typically strive to develop the most accurate systems,
making an implicit assumption that the AI system will function autonomously.
However, in practice, AI systems often are used to provide advice to people in
domains ranging from criminal justice and finance to healthcare. In such
AI-advised decision making, humans and machines form a team, where the human is
responsible for making final decisions. But is the most accurate AI the best
teammate? We argue "No" -- predictable performance may be worth a slight
sacrifice in AI accuracy. Instead, we argue that AI systems should be trained
in a human-centered manner, directly optimized for team performance. We study
this proposal for a specific type of human-AI teaming, where the human overseer
chooses to either accept the AI recommendation or solve the task themselves. To
optimize the team performance for this setting we maximize the team's expected
utility, expressed in terms of the quality of the final decision, cost of
verifying, and individual accuracies of people and machines. Our experiments
with linear and non-linear models on real-world, high-stakes datasets show that
the most accurate AI may not lead to the highest team performance, and show the
benefit of modeling teamwork during training through improvements in expected
team utility across datasets, considering parameters such as human skill and
the cost of mistakes. We discuss the shortcomings of current optimization
approaches beyond well-studied loss functions such as log-loss, and encourage
future work on AI optimization problems motivated by human-AI collaboration.
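The objective described above can be made concrete with a toy expected-utility calculation. A minimal sketch, assuming a particular parameterization (the reward, cost, and delegation-probability values are illustrative assumptions, not the paper's formulation):

```python
# Illustrative expected team utility for AI-advised decision making: the human
# overseer either accepts the AI recommendation or solves the task themselves.

def expected_team_utility(p_delegate, acc_ai, acc_human,
                          reward_correct=1.0, cost_mistake=5.0,
                          cost_solving=0.5):
    """Expected utility per task for a human-AI team.

    p_delegate: probability the human accepts the AI recommendation
    acc_ai, acc_human: accuracies of the AI and the human
    cost_solving: extra effort cost when the human solves the task alone
    """
    u_ai = acc_ai * reward_correct - (1 - acc_ai) * cost_mistake
    u_human = (acc_human * reward_correct
               - (1 - acc_human) * cost_mistake - cost_solving)
    return p_delegate * u_ai + (1 - p_delegate) * u_human

# A highly accurate AI that the human delegates to only half the time can
# yield lower team utility than a slightly less accurate AI that earns more
# delegation (all numbers assumed for illustration).
u_accurate = expected_team_utility(p_delegate=0.5, acc_ai=0.95, acc_human=0.9)
u_predictable = expected_team_utility(p_delegate=0.9, acc_ai=0.92, acc_human=0.9)
```

The point of the sketch is that team utility depends jointly on accuracy, delegation behavior, and costs, which is why optimizing AI accuracy alone can be the wrong target.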
Generation Probabilities Are Not Enough: Exploring the Effectiveness of Uncertainty Highlighting in AI-Powered Code Completions
Large-scale generative models enabled the development of AI-powered code
completion tools to assist programmers in writing code. However, much like
other AI-powered tools, AI-powered code completions are not always accurate,
potentially introducing bugs or even security vulnerabilities into code if not
properly detected and corrected by a human programmer. One technique that has
been proposed and implemented to help programmers identify potential errors is
to highlight uncertain tokens. However, there have been no empirical studies
exploring the effectiveness of this technique -- nor investigating the different
and not-yet-agreed-upon notions of uncertainty in the context of generative
models. We explore the question of whether conveying information about
uncertainty enables programmers to more quickly and accurately produce code
when collaborating with an AI-powered code completion tool, and if so, what
measure of uncertainty best fits programmers' needs. Through a mixed-methods
study with 30 programmers, we compare three conditions: providing the AI
system's code completion alone, highlighting tokens with the lowest likelihood
of being generated by the underlying generative model, and highlighting tokens
with the highest predicted likelihood of being edited by a programmer. We find
that highlighting tokens with the highest predicted likelihood of being edited
leads to faster task completion and more targeted edits, and is subjectively
preferred by study participants. In contrast, highlighting tokens according to
their probability of being generated does not provide any benefit over the
baseline with no highlighting. We further explore the design space of how to
convey uncertainty in AI-powered code completion tools, and find that
programmers prefer highlights that are granular, informative, interpretable,
and not overwhelming.
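The two highlighting criteria the study compares can be sketched as token-selection rules over per-token scores. A minimal illustration, assuming hypothetical score vectors and a fixed highlight budget k (the actual study used model-derived probabilities):

```python
# Hypothetical sketch of the two highlighting conditions: (1) tokens with the
# lowest generation probability under the model, versus (2) tokens with the
# highest predicted probability of being edited by the programmer.

def highlight_low_generation_prob(tokens, gen_probs, k=2):
    """Indices of the k tokens the model was least confident generating."""
    order = sorted(range(len(tokens)), key=lambda i: gen_probs[i])
    return sorted(order[:k])

def highlight_high_edit_prob(tokens, edit_probs, k=2):
    """Indices of the k tokens most likely to be edited by the programmer."""
    order = sorted(range(len(tokens)), key=lambda i: -edit_probs[i])
    return sorted(order[:k])

tokens = ["for", "i", "in", "range", "(", "n", ")"]
gen_probs = [0.99, 0.6, 0.98, 0.9, 0.99, 0.4, 0.99]    # assumed model probabilities
edit_probs = [0.01, 0.2, 0.02, 0.8, 0.01, 0.6, 0.01]   # assumed edit-likelihood scores

low_gen = highlight_low_generation_prob(tokens, gen_probs)
high_edit = highlight_high_edit_prob(tokens, edit_probs)
```

Note that the two criteria can select different tokens, which is exactly why the study treats them as distinct notions of uncertainty rather than interchangeable ones.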
Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Large language models have demonstrated great potential to assist programmers
in generating code. For such human-AI pair programming scenarios, we
empirically demonstrate that while generated code is most often evaluated in
terms of its functional correctness (i.e., whether generations pass available
unit tests), correctness does not fully capture (e.g., may underestimate) the
productivity gains these models may provide. Through a user study with N = 49
experienced programmers, we show that while correctness captures high-value
generations, programmers still rate code that fails unit tests as valuable if
it reduces the overall effort needed to complete a coding task. Finally, we
propose a hybrid metric that combines functional correctness and syntactic
similarity and show that it achieves a 14% stronger correlation with value and
can therefore better represent real-world gains when evaluating and comparing
models.
Comment: Accepted at ACL 2023 (Findings).
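A hybrid metric of this shape can be sketched as a weighted mix of the two signals. A minimal illustration, assuming a particular similarity measure and weight (the paper's exact similarity function and weighting are not specified here, so both are assumptions):

```python
# Illustrative hybrid value metric: combine binary functional correctness with
# a syntactic-similarity score against a reference solution. The weight alpha
# and the use of difflib's ratio are assumptions for illustration.
from difflib import SequenceMatcher

def syntactic_similarity(generated: str, reference: str) -> float:
    """Crude character-level similarity in [0, 1]; the paper's measure may differ."""
    return SequenceMatcher(None, generated, reference).ratio()

def hybrid_value(passes_tests: bool, generated: str, reference: str,
                 alpha: float = 0.5) -> float:
    """Weighted mix of correctness and similarity, in [0, 1]."""
    return (alpha * float(passes_tests)
            + (1 - alpha) * syntactic_similarity(generated, reference))

# A failing generation that is close to the reference still scores above zero,
# reflecting the effort it may save the programmer.
score = hybrid_value(False,
                     "def add(a, b): return a - b",
                     "def add(a, b): return a + b")
```

This captures the abstract's observation: code that fails unit tests can still carry value when it reduces the remaining effort, which a correctness-only metric scores as zero.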
Comparative Evaluation of Modified Furlow Palatoplasty and Intravelar Veloplasty in Cleft Palate Repair
Introduction: The purpose of this study was to comparatively assess two techniques of cleft palate repair, i.e., Kriens intravelar veloplasty (IVV) and modified Furlow palatoplasty (MFP), for post-operative fistula formation, wound dehiscence at the suture line, nasal regurgitation, velopharyngeal insufficiency, soft palate lengthening, and speech. Method: This prospective study was conducted on 60 patients with primary cleft palate. They were randomly assigned to either the IVV group or the MFP group, so that each group consisted of 30 patients. Both groups were operated on under general anesthesia. Measurements at the time of operation were made with a soft ruler and a Castroviejo caliper. Patients were followed up at 1 week, 1 month, 3 months, and 6 months, and any complications were noted. Five years post-operatively, speech was recorded and assessed by a speech-language pathologist. Post-operative nasoendoscopy was also performed to assess velopharyngeal insufficiency. Result: The MFP group showed a greater percentage elongation of the soft palate and a lower incidence of post-operative palatal fistula formation than the IVV group. Total speech scores were superior in MFP patients, but the differences were less robust. Velopharyngeal incompetence was present in both groups but was less severe in the MFP group than in the IVV group. Conclusion: The MFP group showed comparatively superior results to the IVV group but required increased surgical time. Therefore, MFP can be used as an alternative technique for cleft palate repair.
Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems
Explainable artificially intelligent (XAI) systems form part of
sociotechnical systems, e.g., human+AI teams tasked with making decisions. Yet,
current XAI systems are rarely evaluated by measuring the performance of
human+AI teams on actual decision-making tasks. We conducted two online
experiments and one in-person think-aloud study to evaluate two currently
common techniques for evaluating XAI systems: (1) using proxy, artificial tasks
such as how well humans predict the AI's decision from the given explanations,
and (2) using subjective measures of trust and preference as predictors of
actual performance. The results of our experiments demonstrate that evaluations
with proxy tasks did not predict the results of the evaluations with the actual
decision-making tasks. Further, the subjective measures on evaluations with
actual decision-making tasks did not predict the objective performance on those
same tasks. Our results suggest that by employing misleading evaluation
methods, our field may be inadvertently slowing its progress toward developing
human+AI teams that can reliably perform better than humans or AIs alone.