9 research outputs found

Design and Empirical Evaluation of Interactive and Interpretable Machine Learning

Author: Poursabzi-Sangdeh Forough
Publication venue: University of Colorado Boulder
Publication date: 01/01/2018

Machine learning is ubiquitous in making predictions that affect people's decisions. While most of the research in machine learning focuses on improving the performance of the models on held-out data sets, this is not enough to convince end-users that these models are trustworthy or reliable in the wild. To address this problem, a new line of research has emerged that focuses on developing interpretable machine learning methods and helping end-users make informed decisions. Despite the growing body of research in developing interpretable models, there is still no consensus on the definition and quantification of interpretability. We argue that to understand interpretability, we need to bring humans into the loop and run human-subject experiments to understand the effect of interpretability on human behavior. This thesis approaches the problem of interpretability from an interdisciplinary perspective which builds on decades of research in psychology, cognitive science, and social science to understand human behavior and trust. Through controlled user experiments, we manipulate various design factors in supervised models that are commonly thought to make models more or less interpretable and measure their influence on user behavior, performance, and trust. Additionally, we develop interpretable and interactive machine-learning-based systems that exploit unsupervised machine learning models to bring humans into the loop and help them complete real-world tasks. By bringing humans and machines together, we can empower humans to understand and organize large document collections better and faster. Our findings and insights from these experiments can guide the development of next-generation machine learning models that can be used effectively and trusted by humans.

CU Scholar Institutional Repository

Aligning Offline Metrics and Human Judgments of Value for Code Generation Models

Author: Amershi Saleema
Bansal Gagan
Dibia Victor
Fourney Adam
Liu Han
Poursabzi-Sangdeh Forough
Publication date: 13/06/2023

Large language models have demonstrated great potential to assist programmers in generating code. For such human-AI pair programming scenarios, we empirically demonstrate that while generated code is most often evaluated in terms of its functional correctness (i.e., whether generations pass available unit tests), correctness does not fully capture (e.g., may underestimate) the productivity gains these models may provide. Through a user study with N = 49 experienced programmers, we show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task. Finally, we propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value and can therefore better represent real-world gains when evaluating and comparing models. Comment: Accepted at ACL 2023 (Findings).
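
The abstract above does not say how the hybrid metric combines its two signals. As a minimal sketch only, assuming a simple weighted sum of a pass/fail correctness indicator and a character-level similarity ratio from Python's standard `difflib`, the idea might look as follows. The function names, the `weight` parameter, and the choice of similarity measure are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def syntactic_similarity(generated: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between two code strings.

    Assumption: difflib's ratio stands in for whatever syntactic
    similarity measure the paper actually uses.
    """
    return SequenceMatcher(None, generated, reference).ratio()


def hybrid_score(passed_tests: bool, generated: str, reference: str,
                 weight: float = 0.5) -> float:
    """Combine correctness and similarity as a weighted sum.

    Assumption: the weighted-sum form and the default weight of 0.5
    are illustrative choices, not the authors' metric.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    correctness = 1.0 if passed_tests else 0.0
    similarity = syntactic_similarity(generated, reference)
    return weight * correctness + (1.0 - weight) * similarity
```

Under this sketch, a generation that fails its unit tests but closely resembles the reference solution still receives partial credit, which mirrors the abstract's observation that such code can reduce the effort needed to finish a task.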

arXiv.org e-Print Archive

Uncertainty in current and future health wearables

Author: Alabi Halimat
Knowles Brandin Hanson
Lu Di
Poursabzi-Sangdeh Forough
Smith-Renner Alison
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/12/2018

Expect inherent uncertainties in health-wearables data to complicate future decision making concerning user health.

Lancaster E-Prints

A data-driven analysis of workers' earnings on Amazon Mechanical Turk

Author: Blei David M
Brault Matthew W.
Callison-Burch Chris
Chuang Jason
Harris Seth D.
Hitlin Paul
James Gareth
Juan
Kaufmann Nicolas
Marcadent Philippe
Poursabzi-Sangdeh Forough
Rehurek Radim
Publication venue: Association for Computing Machinery (ACM)
Publication date: 28/12/2017

A growing number of people are working as part of online crowd work. Crowd work is often thought to be low-wage work. However, we know little about the wage distribution in practice and what causes low/high earnings in this setting. We recorded 2,676 workers performing 3.8 million tasks on Amazon Mechanical Turk. Our task-level analysis revealed that workers earned a median hourly wage of only ~2 USD/h, and only 4% earned more than 7.25 USD/h. While the average requester pays more than 11 USD/h, lower-paying requesters post much more work. Our wage calculations are influenced by how unpaid work is accounted for, e.g., time spent searching for tasks, working on tasks that are rejected, and working on tasks that are ultimately not submitted. We further explore the characteristics of tasks and working patterns that yield higher hourly wages. Our analysis informs platform design and worker tools to create a more positive future for crowd work.
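
To illustrate why the accounting of unpaid work matters for the wage figures in the abstract above, here is a minimal sketch that computes a median hourly wage with and without unpaid time. The record fields (`earned_usd`, `paid_seconds`, `unpaid_seconds`) and both function names are hypothetical; the paper's actual data schema and calculation procedure are not described in this abstract.

```python
from statistics import median


def hourly_wage(earned_usd: float, paid_seconds: float,
                unpaid_seconds: float = 0.0) -> float:
    """Earnings divided by total hours, optionally counting unpaid time."""
    total_seconds = paid_seconds + unpaid_seconds
    if total_seconds <= 0:
        raise ValueError("total working time must be positive")
    return earned_usd / (total_seconds / 3600.0)


def median_hourly_wage(workers: list[dict], include_unpaid: bool = True) -> float:
    """Median of per-worker hourly wages.

    Assumption: each worker is a dict with the hypothetical keys
    'earned_usd', 'paid_seconds', and 'unpaid_seconds'.
    """
    wages = [
        hourly_wage(
            w["earned_usd"],
            w["paid_seconds"],
            w["unpaid_seconds"] if include_unpaid else 0.0,
        )
        for w in workers
    ]
    return median(wages)
```

Calling `median_hourly_wage(workers, include_unpaid=False)` on the same records gives a figure at least as high as the default call, since time spent searching or on rejected tasks only enlarges the denominator.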

arXiv.org e-Print Archive

Crossref

Institutional Knowledge at Singapore Management University

Oxford University Research Archive

Introducing v0.5 of the AI Safety Benchmark from MLCommons

Author: Agrawal Adarsh
Akinwande Victor
Al-Nuaimi Namir
Alfaraj Najla
Alhajjar Elie
Aroyo Lora
Bavalatti Trupti
Blili-Hamelin Borhane
Bollacker Kurt
Bommasani Rishi
Boston Marisa Ferrara
Campos Siméon
Chakra Kal
Chen Canyu
Coleman Cody
Coudert Zacharie Delpierre
Derczynski Leon
Dutta Debojyoti
Eisenberg Ian
Ezick James
Frase Heather
Fuller Brian
Gandikota Ram
Gangavarapu Agasthya
Gangavarapu Ananya
Gealy James
Ghosh Rajat
Goel James
Gohar Usman
Goswami Sujata
Hale Scott A.
Hutiri Wiebke
Imperial Joseph Marvin
Jandial Surgan
Judd Nick
Juefei-Xu Felix
Kailkhura Bhavya
Khomh Foutse
Kirk Hannah Rose
Klyman Kevin
Knotz Chris
Kuchnik Michael
Kumar Shachi H.
Lengerich Chris
Liang Percy
Liao Zeyi
Long Eileen Peters
Lu Victor
Mai Yifan
Mammen Priyanka Mary
Manyeki Kelvin
Mattson Peter
McGregor Sean
Mehta Virendra
Mohammed Shafee
Moss Emanuel
Nachman Lama
Naganna Dinesh Jinenhally
Nikanjam Amin
Nushi Besmira
Oala Luis
Orr Iftach
Parrish Alicia
Patlak Cigdem
Pietri William
Poursabzi-Sangdeh Forough
Presani Eleonora
Puletti Fabrizio
Röttger Paul
Sahay Saurav
Santos Tim
Scherrer Nino
Schramowski Patrick
Sebag Alice Schoenauer
Shahbazi Abolfazl
Sharma Vin
Shen Xudong
Sistla Vamsi
Tang Leonard
Testuggine Davide
Thangarasa Vithursan
Vanschoren Joaquin
Vidgen Bertie
Watkins Elizabeth Anne
Weiss Rebecca
Welty Chris
Wilbers Tyler
Williams Adina
Wu Carole-Jean
Yadav Poonam
Yang Xianjun
Zeng Yi
Zhang Wenhui
Zhdanov Fedor
Zhu Jiacheng
Publication venue: Center for Open Science
Publication date: 18/04/2024

This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.

OPUS