1,647 research outputs found
Semantic bottleneck for computer vision tasks
This paper introduces a novel method for representing images that is
semantic by nature, addressing the question of computational intelligibility in
computer vision tasks. More specifically, we propose to introduce what
we call a semantic bottleneck in the processing pipeline: a crossing
point at which the representation of the image is entirely expressed in
natural language, while retaining the efficiency of numerical representations.
We show that our approach is able to generate semantic representations that
give state-of-the-art results on semantic content-based image retrieval and
also perform very well on image classification tasks. Intelligibility is
evaluated through user-centered experiments for failure detection.
Investigating the Quality Aspects of Crowd-Sourced Developer Forum: A Case Study of Stack Overflow
Technical question and answer (Q&A) websites have changed how developers seek information on the web and have become popular due to the shortcomings of official documentation and alternative knowledge-sharing resources. Stack Overflow (SO) is one of the largest and most popular online Q&A websites for developers, where they can share knowledge by answering questions and learn new skills by asking questions. Unfortunately, a large number of questions (up to 29%) are not answered at all, which might hurt the quality or purpose of this community-oriented knowledge base. In this thesis, we first attempt to detect potentially unanswered questions at submission time using machine learning models. We compare unanswered and answered questions quantitatively and qualitatively. The quantitative analysis suggests that the topics discussed in a question, the experience of the question submitter, and the readability of the question text could often determine whether a question would be answered or not. Our qualitative study also reveals why questions remain unanswered, which could guide novice users to improve their questions. While analyzing the questions of SO, we see that many of them remain unanswered and unresolved because they contain code segments with potential programming issues (e.g., errors, unexpected behavior); unfortunately, these issues could not always be reproduced by other users. This irreproducibility of issues might prevent questions on SO from getting answers or appropriate answers. In our second study, we thus conduct an exploratory study on the reproducibility of the issues discussed in questions and the correlation between issue reproducibility status (of questions) and corresponding answer metadata, such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than a question with irreproducible issues.
However, users can improve the quality of questions and answers by editing. Unfortunately, such edits may be rejected (i.e., rolled back) due to undesired modifications and ambiguities. We thus offer a comprehensive overview of the reasons for and ambiguities in SO rollback edits. We identify 14 reasons for rollback edits and eight ambiguities that are often present in those edits. We also develop algorithms to detect ambiguities automatically. During the above studies, we find that about half of the questions that received working solutions have negative scores. About 18% of the accepted answers also do not score the maximum votes. Furthermore, many users complain about the downvotes cast on their questions and answers. All these findings cast serious doubts on the reliability of the evaluation mechanism employed at SO. We thus concentrate on the assessment mechanism of SO to ensure a non-biased, reliable quality assessment mechanism. This study compares the subjective assessment of questions with their objective assessment using 2.5 million questions and ten text analysis metrics. We also develop machine learning models to classify promoted and discouraged questions and predict them at submission time.
We believe that the findings from our studies and the proposed techniques have the potential to (1) help users ask better questions with appropriate code examples, and (2) improve the editing and assessment mechanisms of SO to promote better content quality.
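The thesis above predicts answerability partly from the readability of question texts. The specific readability metric is not named in the abstract; as an illustrative stand-in, a common choice is the Flesch reading ease score, which could serve as one of the text analysis features:

```python
import re

def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text.
    Syllable counting uses a rough vowel-group heuristic, which is
    a common approximation rather than a dictionary-based count."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Approximate syllables as runs of vowels (at least one per word).
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
```

A feature like this, alongside submitter experience and topic signals, could feed the kind of classifier the abstract describes for flagging questions at submission time.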
How do you feel? Measuring User-Perceived Value for Rejecting Machine Decisions in Hate Speech Detection
Hate speech moderation remains a challenging task for social media platforms.
Human-AI collaborative systems offer the potential to combine the
reliability of humans with the scalability of machine learning to tackle this
issue effectively. While methods for task handover in human-AI collaboration
exist that consider the costs of incorrect predictions, insufficient attention
has been paid to accurately estimating these costs. In this work, we propose a
value-sensitive rejection mechanism that automatically rejects machine
decisions for human moderation based on users' value perceptions regarding
machine decisions. We conduct a crowdsourced survey study with 160 participants
to evaluate their perception of correct and incorrect machine decisions in the
domain of hate speech detection, as well as occurrences where the system
rejects making a prediction. Here, we introduce Magnitude Estimation, an
unbounded scale, as the preferred method for measuring user (dis)agreement with
machine decisions. Our results show that Magnitude Estimation can provide a
reliable measurement of participants' perception of machine decisions. By
integrating user-perceived value into human-AI collaboration, we further show
that it can guide us in 1) determining when to accept or reject machine
decisions to obtain the optimal total value a model can deliver and 2)
selecting better classification models as compared to the more widely used
target of model accuracy.

Comment: To appear at AIES '23. Philippe Lammerts, Philip Lippmann, Yen-Chia
Hsu, Fabio Casati, and Jie Yang. 2023. How do you feel? Measuring
User-Perceived Value for Rejecting Machine Decisions in Hate Speech
Detection. In AAAI/ACM Conference on AI, Ethics, and Society (AIES '23),
August 8-10, 2023, Montreal, QC, Canada. ACM, New York, NY, USA. 11 pages.
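The value-sensitive rejection mechanism described above can be sketched as an expected-value comparison: hand an item to a human moderator whenever the expected user-perceived value of letting the machine decide falls below the value users assign to rejection. The value terms and the decision rule below are illustrative stand-ins for the Magnitude Estimation scores elicited in the study, not the authors' implementation:

```python
def route_decision(p_correct, v_correct, v_incorrect, v_reject):
    """Value-sensitive rejection (illustrative sketch).

    p_correct:   model's estimated probability its decision is correct
    v_correct:   user-perceived value of a correct machine decision
    v_incorrect: user-perceived value (typically negative) of an
                 incorrect machine decision
    v_reject:    user-perceived value of rejecting, i.e. deferring
                 to a human moderator
    """
    expected_machine_value = (p_correct * v_correct
                              + (1 - p_correct) * v_incorrect)
    # Defer to a human when rejection is worth more than acting.
    return "machine" if expected_machine_value >= v_reject else "human"
```

Because users in the study perceived incorrect decisions as far more costly than correct ones are beneficial, a rule like this rejects at moderate confidence levels where a pure accuracy target would still let the machine act.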
Conformal Language Modeling
We propose a novel approach to conformal prediction for generative language
models (LMs). Standard conformal prediction produces prediction sets -- in
place of single predictions -- that have rigorous, statistical performance
guarantees. LM responses are typically sampled from the model's predicted
distribution over the large, combinatorial output space of natural language.
Translating this process to conformal prediction, we calibrate a stopping rule
for sampling different outputs from the LM that get added to a growing set of
candidates until we are confident that the output set is sufficient. Since some
samples may be low-quality, we also simultaneously calibrate and apply a
rejection rule for removing candidates from the output set to reduce noise.
Similar to conformal prediction, we prove that the sampled set returned by our
procedure contains at least one acceptable answer with high probability, while
still being empirically precise (i.e., small) on average. Furthermore, within
this set of candidate responses, we show that we can also accurately identify
subsets of individual components -- such as phrases or sentences -- that are
each independently correct (e.g., that are not "hallucinations"), again with
statistical guarantees. We demonstrate the promise of our approach on multiple
tasks in open-domain question answering, text summarization, and radiology
report generation using different LM variants.
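The sampling procedure described above — grow a candidate set until a calibrated stopping rule fires, while a calibrated rejection rule filters low-quality samples — can be sketched as follows. The `quality_fn`, `confidence_fn`, and lambda thresholds are placeholders for the calibrated components in the paper, not the authors' actual implementation:

```python
def conformal_sample_set(sample_fn, quality_fn, confidence_fn,
                         lambda_reject, lambda_stop, k_max=20):
    """Illustrative sketch of conformal set construction for an LM.

    sample_fn:     draws one response from the LM's distribution
    quality_fn:    scores a single sample (rejection rule input)
    confidence_fn: scores the set so far (stopping rule input)
    The lambda thresholds stand in for values calibrated to give the
    coverage guarantee (>= 1 acceptable answer with high probability).
    """
    candidates = []
    for _ in range(k_max):
        y = sample_fn()
        # Rejection rule: drop low-quality samples to keep the set small.
        if quality_fn(y) >= lambda_reject:
            candidates.append(y)
        # Stopping rule: stop once set-level confidence is sufficient.
        if confidence_fn(candidates) >= lambda_stop:
            break
    return candidates
```

The guarantee in the paper comes from how the thresholds are calibrated on held-out data, not from the loop itself; this sketch only shows the control flow of growing and pruning the candidate set.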