Stable Bias: Analyzing Societal Representations in Diffusion Models
As machine learning-enabled Text-to-Image (TTI) systems are becoming
increasingly prevalent and seeing growing adoption as commercial services,
characterizing the social biases they exhibit is a necessary first step to
lowering their risk of discriminatory outcomes. This evaluation, however, is
made more difficult by the synthetic nature of these systems' outputs: common
definitions of diversity are grounded in social categories of people living in
the world, whereas the artificial depictions of fictive humans created by these
systems have no inherent gender or ethnicity. To address this gap, we propose
a new method for exploring the social biases in TTI systems. Our approach
relies on characterizing the variation in generated images triggered by
enumerating gender and ethnicity markers in the prompts, and comparing it to
the variation engendered by spanning different professions. This allows us to
(1) identify specific bias trends, (2) provide targeted scores to directly
compare models in terms of diversity and representation, and (3) jointly model
interdependent social variables to support a multidimensional analysis. We
leverage this method to analyze images generated by three popular TTI systems
(DALL-E 2, Stable Diffusion v1.4, and Stable Diffusion v2) and find that while all of their
outputs show correlations with US labor demographics, they also consistently
under-represent marginalized identities to different extents. We also release
the datasets and low-code interactive bias exploration platforms developed for
this work, as well as the necessary tools to similarly evaluate additional TTI
systems.
Comment: Accepted to NeurIPS Datasets and Benchmarks 2023 (spotlight)
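The prompt-enumeration step described above can be sketched as follows. The specific identity markers and professions are illustrative assumptions, not the paper's exact vocabulary:

```python
# Sketch of the prompt grid described above: identity markers are crossed
# with professions, and profession-only prompts serve as an unmarked
# baseline against which output variation can be compared.
from itertools import product

gender_markers = ["woman", "man", "non-binary person"]
ethnicity_markers = ["Black", "East Asian", "Latinx", "White"]
professions = ["doctor", "janitor", "CEO"]

def build_prompts():
    """Cross identity markers with professions to probe a TTI system."""
    prompts = [
        f"Photo portrait of a {ethnicity} {gender} who is a {job}"
        for ethnicity, gender, job in product(ethnicity_markers, gender_markers, professions)
    ]
    # Profession-only prompts: the unmarked baseline.
    baseline = [f"Photo portrait of a {job}" for job in professions]
    return prompts, baseline

prompts, baseline = build_prompts()
print(len(prompts))  # 4 * 3 * 3 = 36 marked prompts
```

Images generated from each prompt would then be compared along the marker axis versus the profession axis, as the abstract describes.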
Towards Openness Beyond Open Access: User Journeys through 3 Open AI Collaboratives
Open source Artificial Intelligence (AI) collaboratives offer
alternative pathways for how AI can be developed beyond well-resourced
technology companies, and for who can be a part of the process. To understand how
and why they work and what additionality they bring to the landscape, we focus
on three such communities, each focused on a different kind of activity around
AI: building models (BigScience workshop), tools and ways of working (The
Turing Way), and ecosystems (Mozilla Festival's Building Trustworthy AI Working
Group). First, we document the community structures that facilitate these
distributed, volunteer-led teams, comparing the collaboration styles that drive
each group towards their specific goals. Through interviews with community
leaders, we map user journeys for how members discover, join, contribute, and
participate. Ultimately, this paper aims to highlight the diversity of AI work
and workers that have come forth through these collaborations and how they
offer a broader practice of openness to the AI space.
Comment: Presented at the 2022 NeurIPS Workshop on Broadening Research Collaborations in ML
BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model
The BigScience Workshop was a value-driven initiative that spanned one and
a half years of interdisciplinary research and culminated in the creation of
ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the
largest multilingual language models to date. In addition to the technical
outcomes and artifacts, the workshop fostered multidisciplinary collaborations
around large models, datasets, and their analysis. This in turn led to a wide
range of research publications spanning topics from ethics to law, data
governance, modeling choices and distributed training. This paper focuses on
the collaborative research aspects of BigScience and takes a step back to look
at the challenges of large-scale participatory research, with respect to
participant diversity and the tasks required to successfully carry out such a
project. Our main goal is to share the lessons we learned from this experience,
what we could have done better and what we did well. We show how the impact of
such a social approach to scientific research goes well beyond the technical
artifacts that were the basis of its inception.
Comment: Presented at the 2022 NeurIPS Workshop on Broadening Research Collaborations in ML
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Noticing the urgent need to provide tools for fast and user-friendly
qualitative analysis of the large-scale textual corpora of modern NLP, we
propose to turn to the mature and well-tested methods from the domain of
Information Retrieval (IR), a research field with a long history of tackling
TB-scale document collections. We discuss how Pyserini, a widely used toolkit
for reproducible IR research, can be integrated with the Hugging Face ecosystem
of open-source AI libraries and artifacts. We leverage the existing
functionalities of both platforms while proposing novel features further
facilitating their integration. Our goal is to give NLP researchers tools that
will allow them to develop retrieval-based instrumentation for their data
analytics needs with ease and agility. We include a Jupyter Notebook-based walk
through the core interoperability features, available on GitHub at
https://github.com/huggingface/gaia. We then demonstrate how the ideas we
present can be operationalized to create a powerful tool for qualitative data
analysis in NLP. We present GAIA Search - a search engine built following
previously laid out principles, giving access to four popular large-scale text
collections. GAIA serves a dual purpose of illustrating the potential of
methodologies we discuss but also as a standalone qualitative analysis tool
that can be leveraged by NLP researchers aiming to understand datasets prior to
using them in training. GAIA is hosted live on Hugging Face Spaces -
https://huggingface.co/spaces/spacerini/gaia
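Pyserini itself wraps Lucene and requires a Java backend, so as a self-contained illustration of the retrieval-based exploration described above, here is a toy implementation of BM25 (the ranking function Lucene uses by default) over an invented three-document corpus; it is a sketch of the idea, not GAIA's actual code:

```python
# Toy BM25 ranking: score each document for a query, as a Lucene-backed
# searcher would do at scale over a TB-sized corpus.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()  # document frequency of each term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

docs = [
    "a large multilingual language model trained on web text",
    "an index for searching terabyte scale document collections",
    "a compact language model",
]
scores = bm25_scores("language model", docs)
print(scores.index(max(scores)))  # 2: the short doc with both query terms
```

In GAIA itself this scoring is delegated to Pyserini's Lucene indexes; the point here is only the shape of the query-to-ranked-documents workflow.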
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
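The few-shot behavior mentioned above comes down to prompt construction: demonstrations are concatenated ahead of the query and the model continues the text. A minimal sketch, assuming a sentiment-labeling format that is not from the paper; the `transformers` calls are left as comments because the 176B checkpoint cannot run here:

```python
# Build a few-shot prompt from (input, label) demonstrations plus a query.
def build_few_shot_prompt(demonstrations, query):
    """Concatenate labeled examples, then the unlabeled query."""
    lines = [f"Review: {text}\nSentiment: {label}"
             for text, label in demonstrations]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("I loved this film.", "positive"),
         ("Terrible pacing.", "negative")]
prompt = build_few_shot_prompt(demos, "A delightful surprise.")

# With sufficient hardware, one could then continue the prompt (untested):
# from transformers import AutoTokenizer, AutoModelForCausalLM
# tok = AutoTokenizer.from_pretrained("bigscience/bloom")
# model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")
# out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=3)
```

The model's continuation of the final "Sentiment:" line is taken as its answer.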
Toward a Musical Sentiment (MuSe) Dataset for Affective Distant Hearing
In this short paper we present work in progress that leverages crowdsourced music metadata
and crowdsourced affective word norms to create a comprehensive dataset of music emotions, which
can be used for sentiment analysis in the music domain. We combine a mixture of different data
sources to create a new dataset of 90,408 songs with their associated embeddings in Russell's model
of affect, with the dimensions valence, dominance, and arousal. In addition, we provide a Spotify ID
for each song, which can be used to add more metadata to the dataset via the Spotify API.
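As a sketch of how such a dataset could be queried, the snippet below filters records by their position in Russell's affect space. The field names mirror the dimensions named above, but the rows themselves are invented for illustration:

```python
# Hypothetical MuSe-style records: each song carries a (valence, arousal,
# dominance) triple plus a Spotify ID for metadata enrichment.
songs = [
    {"track": "song_a", "spotify_id": "id_a", "valence": 0.9, "arousal": 0.2, "dominance": 0.5},
    {"track": "song_b", "spotify_id": "id_b", "valence": 0.3, "arousal": 0.8, "dominance": 0.6},
    {"track": "song_c", "spotify_id": "id_c", "valence": 0.8, "arousal": 0.3, "dominance": 0.4},
]

def calm_positive(rows, min_valence=0.7, max_arousal=0.4):
    """Select songs that are high-valence and low-arousal ("calm, positive")."""
    return [r for r in rows
            if r["valence"] >= min_valence and r["arousal"] <= max_arousal]

selected = calm_positive(songs)
print([r["track"] for r in selected])  # ['song_a', 'song_c']
```

Each selected record's `spotify_id` could then be passed to the Spotify API to pull further metadata, as the abstract notes.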
The effect of local chain stiffness on the mechanism of crystal nucleation in an oligomer melt
While the process by which a polymer crystal nucleates from the melt has been extensively studied via molecular simulation, differences in polymer models and simulated crystallization conditions have led to contradictory results. We take steps to resolve this controversy by computing low-temperature phase diagrams of oligomer melts using Wang-Landau Monte Carlo simulations. Two qualitatively different crystallization mechanisms are possible depending on the local bending stiffness potential. Polymers with a discrete bending potential crystallize via a single-step mechanism, whereas polymers with a continuous bending potential can crystallize via a two-step mechanism that includes an intermediate nematic phase. Other model differences can be quantitatively accounted for using an effective volume fraction and a temperature scaled by the bending stiffness. These results suggest that at least two universality classes of nucleation exist for melts and that local chain stiffness is a key determining factor in the mechanism of nucleation.
The effect of local chain stiffness on oligomer crystallization from a melt
While the process by which a polymer crystal nucleates from the melt has been extensively studied via molecular simulation, differences in polymer models and simulated crystallization conditions have led to seemingly contradictory results. We take steps to resolve this controversy by computing low-temperature phase diagrams of oligomer melts using Wang-Landau Monte Carlo simulations. Two qualitatively different crystallization mechanisms are possible depending on the local bending stiffness potential. Polymers with a discrete bending potential crystallize via a single-step mechanism, whereas polymers with a continuous bending potential can crystallize via a two-step mechanism that includes an intermediate nematic phase. Other model differences can be quantitatively accounted for using an effective volume fraction and a temperature scaled by the bending stiffness. These results suggest that at least two universality classes of nucleation exist for melts and that local chain stiffness is a key determining factor in the mechanism of nucleation.
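Both abstracts rely on Wang-Landau Monte Carlo, which estimates the density of states g(E) by a random walk biased by 1/g(E); g is updated on the fly, and the modification factor is tightened each time the energy histogram becomes flat. A toy sketch for independent binary spins, where the exact answer is binomial; this is purely illustrative and far simpler than an oligomer melt:

```python
# Wang-Landau density-of-states estimation for a toy system: n independent
# binary spins, with energy E = number of up spins, so exactly g(E) = C(n, E).
import math
import random

def wang_landau(n_spins=4, flatness=0.8, ln_f_min=1e-4, seed=0):
    """Return log g(E) for E = 0..n_spins, estimated by Wang-Landau sampling."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0                                    # current E
    log_g = {e: 0.0 for e in range(n_spins + 1)}  # running estimate of ln g
    hist = {e: 0 for e in range(n_spins + 1)}     # visit histogram
    ln_f = 1.0                                    # modification factor ln f
    while ln_f > ln_f_min:
        for _ in range(1000):
            i = rng.randrange(n_spins)
            new_energy = energy + (1 if spins[i] == 0 else -1)
            # Accept the flip with probability min(1, g(E_old) / g(E_new)).
            diff = log_g[energy] - log_g[new_energy]
            if diff >= 0 or rng.random() < math.exp(diff):
                spins[i] ^= 1
                energy = new_energy
            log_g[energy] += ln_f
            hist[energy] += 1
        # Once the histogram is roughly flat, reset it and tighten ln f.
        if min(hist.values()) >= flatness * sum(hist.values()) / len(hist):
            hist = {e: 0 for e in hist}
            ln_f /= 2.0
    return log_g

log_g = wang_landau()
ratio = math.exp(log_g[2] - log_g[0])  # exact value is C(4,2)/C(4,0) = 6
```

Knowing g(E) gives the free energy at any temperature, which is how the papers' low-temperature phase diagrams are constructed for far richer oligomer-melt models.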