5,528 research outputs found
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
A growing body of work shows that many problems in fairness, accountability,
transparency, and ethics in machine learning systems are rooted in decisions
surrounding the data collection and annotation process. In spite of its
fundamental nature, however, data collection remains an overlooked part of the
machine learning (ML) pipeline. In this paper, we argue that a new
specialization should be formed within ML that is focused on methodologies for
data collection and annotation: efforts that require institutional frameworks
and procedures. Specifically for sociocultural data, parallels can be drawn
from archives and libraries. Archives are the longest-standing communal effort
to gather human information, and archive scholars have already developed the
language and procedures to address and discuss many challenges pertaining to
data collection such as consent, power, inclusivity, transparency, and ethics &
privacy. We discuss these five key approaches in document collection practices
in archives that can inform data collection in sociocultural ML. By showing
data collection practices from another field, we encourage ML research to be
more cognizant and systematic in data collection and draw from
interdisciplinary expertise.
Comment: To be published in Conference on Fairness, Accountability, and
Transparency FAT* '20, January 27-30, 2020, Barcelona, Spain. ACM, New York,
NY, USA, 11 pages
Lost in Translation: Large Language Models in Non-English Content Analysis
In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA,
Google's PaLM) have become the dominant approach for building AI systems to
analyze and generate language online. However, the automated systems that
increasingly mediate our interactions online -- such as chatbots, content
moderation systems, and search engines -- are primarily designed for and work
far more effectively in English than in the world's other 7,000 languages.
Recently, researchers and technology companies have attempted to extend the
capabilities of large language models into languages other than English by
building what are called multilingual language models.
In this paper, we explain how these multilingual language models work and
explore their capabilities and limits. Part I provides a simple technical
explanation of how large language models work, why there is a gap in available
data between English and other languages, and how multilingual language models
attempt to bridge that gap. Part II accounts for the challenges of doing
content analysis with large language models in general and multilingual
language models in particular. Part III offers recommendations for companies,
researchers, and policymakers to keep in mind when considering researching,
developing and deploying large and multilingual language models.
Comment: 50 pages, 4 figures
Digital Language Death
Of the approximately 7,000 languages spoken today, some 2,500 are generally considered endangered. Here we argue that this consensus figure vastly underestimates the danger of digital language death, in that less than 5% of all languages can still ascend to the digital realm. We present evidence of a massive die-off caused by the digital divide
- …