ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
We present ATOMIC, an atlas of everyday commonsense reasoning, organized
through 877k textual descriptions of inferential knowledge. Compared to
existing resources that center around taxonomic knowledge, ATOMIC focuses on
inferential knowledge organized as typed if-then relations with variables
(e.g., "if X pays Y a compliment, then Y will likely return the compliment").
We propose nine if-then relation types to distinguish causes vs. effects,
agents vs. themes, voluntary vs. involuntary events, and actions vs. mental
states. By generatively training on the rich inferential knowledge described in
ATOMIC, we show that neural models can acquire simple commonsense capabilities
and reason about previously unseen events. Experimental results demonstrate
that multitask models that incorporate the hierarchical structure of if-then
relation types lead to more accurate inference compared to models trained in
isolation, as measured by both automatic and human evaluation. Comment: AAAI 2019.
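To make the structure concrete, here is a minimal Python sketch of ATOMIC-style knowledge: one event linked to inferences through the paper's nine relation types. The relation glosses and the example inferences are illustrative, not taken from the released atlas.

```python
# A minimal sketch of ATOMIC-style if-then knowledge as typed relations.
# The nine relation types follow the paper's taxonomy; the event and the
# inferences below are invented for illustration.

ATOMIC_RELATIONS = {
    # agent ("PersonX") dimensions
    "xIntent": "because PersonX wanted",      # cause: intent
    "xNeed":   "before, PersonX needed",      # cause: prerequisite
    "xAttr":   "PersonX is seen as",          # stative: attribute
    "xEffect": "as a result, PersonX",        # effect on agent: event
    "xWant":   "as a result, PersonX wants",  # effect on agent: mental
    "xReact":  "as a result, PersonX feels",  # effect on agent: mental
    # theme ("PersonY" / others) dimensions
    "oEffect": "as a result, PersonY",        # effect on others: event
    "oWant":   "as a result, PersonY wants",  # effect on others: mental
    "oReact":  "as a result, PersonY feels",  # effect on others: mental
}

# One event with a few hypothetical inferences per relation type.
event = "PersonX pays PersonY a compliment"
knowledge = {
    "xIntent": ["to be nice"],
    "xAttr": ["friendly", "polite"],
    "oReact": ["flattered"],
    "oWant": ["to return the compliment"],
}

for relation, inferences in knowledge.items():
    for inference in inferences:
        print(f"{event} -> {ATOMIC_RELATIONS[relation]} {inference}")
```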
Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
As machine learning systems move from computer-science laboratories into the
open world, their accountability becomes a high-priority problem.
Accountability requires deep understanding of system behavior and its failures.
Current evaluation methods such as single-score error metrics and confusion
matrices provide aggregate views of system performance that hide important
shortcomings. Understanding the details of failures is important for identifying
pathways for refinement, for communicating the reliability of systems in
different settings, and for specifying appropriate human oversight and engagement.
Characterization of failures and shortcomings is particularly complex for
systems composed of multiple machine learned components. For such systems,
existing evaluation methods have limited expressiveness in describing and
explaining the relationship among input content, the internal states of system
components, and final output quality. We present Pandora, a set of hybrid
human-machine methods and tools for describing and explaining system failures.
Pandora leverages both human and system-generated observations to summarize
conditions of system malfunction with respect to the input content and system
architecture. We share results of a case study with a machine learning pipeline
for image captioning that show how detailed performance views can be beneficial
for analysis and debugging.
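The step from a single aggregate score to conditional failure views can be sketched in a few lines. The toy example below is our own illustration, not Pandora's implementation; the input tags, component names, and logs are all hypothetical.

```python
# An illustrative sketch of conditional failure views: instead of one
# aggregate error metric, failure rates are summarized per input
# condition and per pipeline component.
from collections import defaultdict

# Hypothetical per-example logs from an image captioning pipeline:
# (input tag, component blamed for the error, caption was wrong?)
logs = [
    ("outdoor", "detector", False),
    ("outdoor", "language_model", True),
    ("indoor",  "detector", True),
    ("indoor",  "detector", True),
    ("crowded", "language_model", True),
]

errors = defaultdict(lambda: [0, 0])  # (tag, component) -> [failures, total]
for tag, component, failed in logs:
    errors[(tag, component)][1] += 1
    errors[(tag, component)][0] += int(failed)

for (tag, component), (failed, total) in sorted(errors.items()):
    print(f"{tag:8s} {component:15s} failure rate {failed/total:.0%} (n={total})")
```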
Learning Visual Importance for Graphic Designs and Data Visualizations
Knowing where people look and click on visual designs can provide clues about
how the designs are perceived, and where the most important or relevant content
lies. The most important content of a visual design can be used for effective
summarization or to facilitate retrieval from a database. We present automated
models that predict the relative importance of different elements in data
visualizations and graphic designs. Our models are neural networks trained on
human clicks and importance annotations on hundreds of designs. We collected a
new dataset of crowdsourced importance, and analyzed the predictions of our
models with respect to ground truth importance and human eye movements. We
demonstrate how such predictions of importance can be used for automatic design
retargeting and thumbnailing. User studies with hundreds of MTurk participants
validate that, with limited post-processing, our importance-driven applications
are on par with, or outperform, current state-of-the-art methods, including
natural image saliency. We also provide a demonstration of how our importance
predictions can be built into interactive design tools to offer immediate
feedback during the design process.
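As a rough sketch of one application named above, importance-driven thumbnailing can be reduced to cropping around high-importance pixels. The importance map below is random noise standing in for a model prediction; the paper's models and post-processing are more involved.

```python
# A minimal sketch of importance-driven thumbnailing: crop a design to
# the tightest box covering pixels whose predicted importance exceeds a
# threshold. Random arrays stand in for a real design and a real
# model-predicted importance map.
import numpy as np

rng = np.random.default_rng(0)
design = rng.random((240, 320, 3))   # stand-in for a design image
importance = rng.random((240, 320))  # stand-in for a predicted map

def importance_crop(image, imp_map, threshold=0.9):
    """Return the crop around the most important content."""
    ys, xs = np.nonzero(imp_map >= threshold)
    if len(ys) == 0:                 # nothing salient: keep full image
        return image
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

thumb = importance_crop(design, importance)
print("full:", design.shape, "-> thumbnail:", thumb.shape)
```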
Do humans and machines have the same eyes? Human-machine perceptual differences on image classification
Trained computer vision models are assumed to solve vision tasks by imitating
human behavior learned from training labels. Most efforts in recent vision
research focus on measuring the model task performance using standardized
benchmarks. Limited work has been done to understand the perceptual difference
between humans and machines. To fill this gap, our study first quantifies and
analyzes the statistical distributions of mistakes from the two sources. We
then explore human vs. machine expertise after ranking tasks by difficulty
levels. Even when humans and machines have similar overall accuracies, the
distribution of answers may vary. Leveraging the perceptual difference between
humans and machines, we empirically demonstrate a post-hoc human-machine
collaboration that outperforms humans or machines alone. Comment: Paper under review.
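One simple way to instantiate such a post-hoc collaboration, sketched below on synthetic data, is confidence-based deferral: accept the machine's answer when its confidence is high and fall back to the human otherwise. The paper's actual policy may differ.

```python
# A hedged sketch of confidence-based human-machine deferral on
# synthetic labels. Confidence is correlated with correctness by
# construction, so the combined policy beats either source alone.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
truth = rng.integers(0, 10, n)
machine_pred = np.where(rng.random(n) < 0.80, truth, rng.integers(0, 10, n))
machine_conf = np.where(machine_pred == truth,
                        rng.uniform(0.6, 1.0, n),   # confident when right
                        rng.uniform(0.0, 0.7, n))   # less so when wrong
human_pred = np.where(rng.random(n) < 0.80, truth, rng.integers(0, 10, n))

# Defer to the human whenever machine confidence is below the cutoff.
combined = np.where(machine_conf >= 0.65, machine_pred, human_pred)

for name, pred in [("machine", machine_pred), ("human", human_pred),
                   ("combined", combined)]:
    print(f"{name:8s} accuracy: {(pred == truth).mean():.3f}")
```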
R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason
Recent studies have revealed that reading comprehension (RC) systems learn to
exploit annotation artifacts and other biases in current datasets. This
prevents the community from reliably measuring the progress of RC systems. To
address this issue, we introduce R4C, a new task for evaluating RC systems'
internal reasoning. R4C requires giving not only answers but also derivations:
explanations that justify predicted answers. We present a reliable,
crowdsourced framework for scalably annotating RC datasets with derivations. We
create and publicly release the R4C dataset, the first quality-assured dataset
consisting of 4.6k questions, each of which is annotated with 3 reference
derivations (i.e. 13.8k derivations). Experiments show that our automatic
evaluation metrics using multiple reference derivations are reliable, and that
R4C assesses different skills from an existing benchmark. Comment: Accepted by ACL 2020. See https://naoya-i.github.io/r4c/ for more information.
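The idea of scoring a predicted derivation against multiple references can be sketched as follows. This is a simplification: facts are matched exactly and scored with set F1, whereas the official R4C metrics are more nuanced, and the derivation facts shown are hypothetical.

```python
# A minimal sketch of multi-reference derivation scoring: a predicted
# derivation (a set of facts) is scored against each reference
# derivation, and the best score is kept.

def f1(predicted, reference):
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def multi_reference_f1(predicted, references):
    return max(f1(predicted, ref) for ref in references)

# Hypothetical derivation facts for one question (3 references, as in R4C).
references = [
    [("Oberoi", "is part of", "Oberoi Group"),
     ("Oberoi Group", "is", "a hotel company")],
    [("Oberoi", "belongs to", "Oberoi Group"),
     ("Oberoi Group", "is", "a hotel company")],
    [("Oberoi Group", "is", "a hotel company")],
]
prediction = [("Oberoi Group", "is", "a hotel company")]
print(f"best F1 over references: {multi_reference_f1(prediction, references):.2f}")
```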
Agreement Between Experts and an Untrained Crowd for Identifying Dermoscopic Features Using a Gamified App: Reader Feasibility Study
Background
Dermoscopy is commonly used for the evaluation of pigmented lesions, but agreement between experts for identification of dermoscopic structures is known to be relatively poor. Expert labeling of medical data is a bottleneck in the development of machine learning (ML) tools, and crowdsourcing has been demonstrated as a cost- and time-efficient method for the annotation of medical images.
Objective
The aim of this study is to demonstrate that crowdsourcing can be used to label basic dermoscopic structures from images of pigmented lesions with similar reliability to a group of experts.
Methods
First, we obtained labels of 248 images of melanocytic lesions with 31 dermoscopic "subfeatures" labeled by 20 dermoscopy experts. These were then collapsed into 6 dermoscopic "superfeatures" based on structural similarity, due to low interrater reliability (IRR): dots, globules, lines, network structures, regression structures, and vessels. These images were then used as the gold standard for the crowd study. The commercial platform DiagnosUs was used to obtain annotations from a nonexpert crowd for the presence or absence of the 6 superfeatures in each of the 248 images. We replicated this methodology with a group of 7 dermatologists to allow direct comparison with the nonexpert crowd. The Cohen κ value was used to measure agreement across raters.
Results
In total, we obtained 139,731 ratings of the 6 dermoscopic superfeatures from the crowd. There was relatively lower agreement for the identification of dots and globules (the median κ values were 0.526 and 0.395, respectively), whereas network structures and vessels showed the highest agreement (the median κ values were 0.581 and 0.798, respectively). This pattern was also seen among the expert raters, who had median κ values of 0.483 and 0.517 for dots and globules, respectively, and 0.758 and 0.790 for network structures and vessels. The median κ values between nonexperts and thresholded average-expert readers were 0.709 for dots, 0.719 for globules, 0.714 for lines, 0.838 for network structures, 0.818 for regression structures, and 0.728 for vessels.
Conclusions
This study confirmed that IRR for different dermoscopic features varied among a group of experts; a similar pattern was observed in a nonexpert crowd. There was good or excellent agreement for each of the 6 superfeatures between the crowd and the experts, highlighting the similar reliability of the crowd for labeling dermoscopic images. This confirms the feasibility and dependability of using crowdsourcing as a scalable solution to annotate large sets of dermoscopic images, with several potential clinical and educational applications, including the development of novel, explainable ML tools.
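For readers unfamiliar with the agreement statistic, below is a small, self-contained worked example of Cohen κ for two raters judging the presence or absence of one superfeature; the ratings are made up for illustration.

```python
# A worked example of Cohen's kappa for two raters and binary
# presence/absence ratings of one dermoscopic superfeature.

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    expected = 0.0
    for label in set(rater_a) | set(rater_b):
        p_a = rater_a.count(label) / n
        p_b = rater_b.count(label) / n
        expected += p_a * p_b
    return (observed - expected) / (1 - expected)

crowd  = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # 1 = feature present
expert = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(f"kappa = {cohens_kappa(crowd, expert):.3f}")  # 0.583 here
```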
EXP-Crowd: A Gamified Crowdsourcing Framework for Explainability
The spread of AI and black-box machine learning models has made it necessary to explain their behavior. Consequently, the research field of Explainable AI was born. The main objective of an Explainable AI system is to be understood by a human as the final beneficiary of the model. In our research, we frame the explainability problem from the crowd's point of view and engage both users and AI researchers through a gamified crowdsourcing framework. We investigate whether it is possible to improve the crowd's understanding of black-box models and the quality of the crowdsourced content by engaging users in a set of gamified activities within a framework named EXP-Crowd. While users engage in such activities, AI researchers organize and share AI- and explainability-related knowledge to educate users. We present the preliminary design of a game with a purpose (G.W.A.P.) to collect features describing real-world entities, which can be used for explainability purposes. Future work will concretize and improve the current design of the framework to cover specific explainability-related needs.
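One plausible mechanic for such a game, sketched below as an assumption rather than the published EXP-Crowd design, is ESP-game-style agreement: features proposed independently by enough players are accepted as descriptive of the entity.

```python
# An illustrative sketch of agreement-based feature validation in a
# game with a purpose: independent players propose features for an
# entity, and features proposed by enough players are accepted.
from collections import Counter

rounds = {
    "zebra": [
        {"stripes", "four legs", "tail"},   # player 1
        {"stripes", "hooves"},              # player 2
        {"stripes", "four legs", "mane"},   # player 3
    ],
}

MIN_AGREEMENT = 2  # feature must come from at least this many players

for entity, proposals in rounds.items():
    counts = Counter(f for player in proposals for f in player)
    accepted = [f for f, c in counts.items() if c >= MIN_AGREEMENT]
    print(entity, "->", sorted(accepted))
```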
Visual Entailment: A Novel Task for Fine-Grained Image Understanding
Existing visual reasoning datasets, such as Visual Question Answering (VQA),
often suffer from biases conditioned on the question, image or answer
distributions. The recently proposed CLEVR dataset addresses these limitations
and requires fine-grained reasoning, but it is synthetic and uses similar
objects and sentence structures throughout.
In this paper, we introduce a new inference task, Visual Entailment (VE),
consisting of image-sentence pairs in which the premise is defined by an image,
rather than a natural language sentence as in traditional Textual Entailment
tasks. The goal of a trained VE model is to predict whether the image
semantically entails the text. To realize this task, we build a dataset, SNLI-VE,
based on the Stanford Natural Language Inference corpus and the Flickr30k dataset.
We evaluate various existing VQA baselines and build a model, the Explainable
Visual Entailment (EVE) system, to address the VE task. EVE achieves up to 71%
accuracy and outperforms several other state-of-the-art VQA-based models.
Finally, we demonstrate the explainability of EVE through cross-modal attention
visualizations. The SNLI-VE dataset is publicly available at
https://github.com/necla-ml/SNLI-VE.
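Because SNLI premises are Flickr30k captions, an SNLI-VE example can be assembled by swapping the textual premise for the photo it describes. The sketch below shows this pairing; the image ID, caption, and hypothesis are hypothetical.

```python
# A minimal sketch of assembling one SNLI-VE example: the textual SNLI
# premise is replaced by its source Flickr30k image, leaving an
# (image, hypothesis, label) triple. IDs and strings are made up.
from dataclasses import dataclass

@dataclass
class VEExample:
    image_id: str    # Flickr30k photo replacing the textual premise
    hypothesis: str  # natural-language sentence to verify
    label: str       # entailment | neutral | contradiction

caption_to_image = {"Two men are playing chess in a park.": "4805078.jpg"}

snli_pair = {
    "premise": "Two men are playing chess in a park.",
    "hypothesis": "People are playing a board game outdoors.",
    "label": "entailment",
}

example = VEExample(
    image_id=caption_to_image[snli_pair["premise"]],
    hypothesis=snli_pair["hypothesis"],
    label=snli_pair["label"],
)
print(example)
```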
Impact of Data Collection on ML Models: Analyzing Differences of Biases Between Low- vs. High-Skilled Annotators
Labeled data is crucial for the success of machine learning-based artificial intelligence. However, companies often face a choice between collecting few annotations from high- or low-skilled annotators, two groups that possibly exhibit different biases. This study investigates differences in biases between datasets labeled by said annotator groups and their impact on machine learning models. To this end, we created high- and low-skilled annotated datasets, measured the contained biases through entropy, and trained different machine learning models to examine bias-inheritance effects. Our findings on text sentiment annotations show that both groups exhibit a considerable amount of bias in their annotations, although there is a significant difference in the error types commonly encountered. Models trained on biased annotations produce significantly different predictions, indicating bias propagation, and tend to make more extreme errors than humans. As partial mitigation, we propose, and show the efficiency of, a hybrid approach in which data is labeled by both low-skilled and high-skilled workers.
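The two ingredients described above, entropy as a disagreement measure and hybrid labeling, can be sketched as follows on synthetic sentiment labels; the study's exact entropy formulation and routing rule may differ.

```python
# A hedged sketch: measure per-item label entropy, accept the majority
# label for low-entropy items, and escalate high-entropy items from
# low-skilled annotators to high-skilled ones. Data is synthetic.
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (bits) of a label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical sentiment labels from 3 low-skilled annotators per item.
items = {
    "item1": ["pos", "pos", "pos"],  # unanimous: keep majority label
    "item2": ["pos", "neg", "neu"],  # high entropy: escalate to expert
    "item3": ["neg", "neg", "pos"],
}

ENTROPY_CUTOFF = 1.0  # bits; tunable
for item, labels in items.items():
    h = label_entropy(labels)
    if h > ENTROPY_CUTOFF:
        print(f"{item}: entropy {h:.2f} -> route to high-skilled annotator")
    else:
        majority = Counter(labels).most_common(1)[0][0]
        print(f"{item}: entropy {h:.2f} -> accept '{majority}'")
```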