Design and Empirical Evaluation of Interactive and Interpretable Machine Learning
Machine learning is ubiquitous in making predictions that affect people's decisions. While most of the research in machine learning focuses on improving the performance of the models on held-out data sets, this is not enough to convince end-users that these models are trustworthy or reliable in the wild. To address this problem, a new line of research has emerged that focuses on developing interpretable machine learning methods and helping end-users make informed decisions. Despite the growing body of research in developing interpretable models, there is still no consensus on the definition and quantification of interpretability. We argue that to understand interpretability, we need to bring humans into the loop and run human-subject experiments to understand the effect of interpretability on human behavior. This thesis approaches the problem of interpretability from an interdisciplinary perspective which builds on decades of research in psychology, cognitive science, and social science to understand human behavior and trust. Through controlled user experiments, we manipulate various design factors in supervised models that are commonly thought to make models more or less interpretable and measure their influence on user behavior, performance, and trust. Additionally, we develop interpretable and interactive machine-learning-based systems that exploit unsupervised machine learning models to bring humans into the loop and help them complete real-world tasks. By bringing humans and machines together, we can empower humans to understand and organize large document collections better and faster. Our findings and insights from these experiments can guide the development of next-generation machine learning models that can be used effectively and trusted by humans.
Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Large language models have demonstrated great potential to assist programmers
in generating code. For such human-AI pair programming scenarios, we
empirically demonstrate that while generated code is most often evaluated in
terms of its functional correctness (i.e., whether generations pass available
unit tests), correctness does not fully capture (e.g., may underestimate) the
productivity gains these models may provide. Through a user study with N = 49
experienced programmers, we show that while correctness captures high-value
generations, programmers still rate code that fails unit tests as valuable if
it reduces the overall effort needed to complete a coding task. Finally, we
propose a hybrid metric that combines functional correctness and syntactic
similarity and show that it achieves a 14% stronger correlation with value and
can therefore better represent real-world gains when evaluating and comparing
models.
Comment: Accepted at ACL 2023 (Findings)
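The abstract does not say how the hybrid metric combines its two parts, so the following is only a minimal sketch of one plausible reading, not the authors' formula. It assumes a linear blend of a pass/fail correctness signal with a character-level similarity to a reference solution. The function names, the `weight` parameter, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def syntactic_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, generated, reference).ratio()


def hybrid_score(passed_tests: bool, generated: str, reference: str,
                 weight: float = 0.5) -> float:
    """Blend functional correctness with syntactic similarity.

    `weight` is a hypothetical mixing parameter: 1.0 means correctness only,
    0.0 means similarity only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    correctness = 1.0 if passed_tests else 0.0
    similarity = syntactic_similarity(generated, reference)
    return weight * correctness + (1.0 - weight) * similarity
```

Under this sketch, a generation that fails its unit tests but closely resembles the reference still receives partial credit, which mirrors the abstract's observation that programmers can value failing code when it reduces the effort needed to finish the task.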
Uncertainty in current and future health wearables
Expect inherent uncertainties in health-wearables data to complicate future decision making concerning user health.
A data-driven analysis of workers' earnings on Amazon Mechanical Turk
A growing number of people are engaged in online crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.
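To illustrate why the treatment of unpaid work matters for the wage figures above, here is a minimal sketch of an hourly-wage calculation. It is not the study's actual method. The function names and the sample numbers are hypothetical, and unpaid time is assumed to be simply added to the denominator.

```python
from statistics import median


def hourly_wage(earnings_usd: float, paid_seconds: float,
                unpaid_seconds: float = 0.0) -> float:
    """Earnings divided by total hours worked, paid plus unpaid."""
    total_hours = (paid_seconds + unpaid_seconds) / 3600.0
    if total_hours <= 0:
        raise ValueError("total working time must be positive")
    return earnings_usd / total_hours


def median_hourly_wage(workers: list[tuple[float, float, float]]) -> float:
    """Median wage over (earnings_usd, paid_seconds, unpaid_seconds) tuples."""
    return median(hourly_wage(e, p, u) for e, p, u in workers)


# Hypothetical worker: 2 USD earned in 30 paid minutes, plus 30 minutes
# spent searching for tasks or on work that was rejected or never submitted.
wage_ignoring_unpaid = hourly_wage(2.0, 1800)
wage_counting_unpaid = hourly_wage(2.0, 1800, 1800)
```

In this made-up case, counting the unpaid half hour halves the effective wage, which is the kind of sensitivity the abstract points to.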
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark