Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning

Author: Buolamwini Joy
Duggan Maeve
Grimm Tracy B.
Horton Valerie
Koehn Philipp
MacNeil Heather
Magazine Life
SAA.
Tripathi Aditya
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/12/2019

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature, however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest-standing communal effort to gather human information, and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection, such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and to draw from interdisciplinary expertise.

Comment: To be published in Conference on Fairness, Accountability, and Transparency FAT* '20, January 27-30, 2020, Barcelona, Spain. ACM, New York, NY, USA, 11 pages

arXiv.org e-Print Archive

Lost in Translation: Large Language Models in Non-English Content Analysis

Author: Bhatia Aliya
Nicholas Gabriel
Publication date: 12/06/2023

In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for and work far more effectively in English than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II accounts for the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when considering researching, developing, and deploying large and multilingual language models.

Comment: 50 pages, 4 figures

arXiv.org e-Print Archive

Communicating across cultures in cyberspace

Author: Doff Sabine
Macfadyen Leah P.
Roche Jörg
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2004

Digital Language Death

Author: A Hiddinga
A Senghas
A Zséder
András Kornai
Eduardo G. Altmann
J Kegl
L Bloomfield
L Breiman
M Krauss
M Prensky
MP Lewis
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Of the approximately 7,000 languages spoken today, some 2,500 are generally considered endangered. Here we argue that this consensus figure vastly underestimates the danger of digital language death, in that less than 5% of all languages can still ascend to the digital realm. We present evidence of a massive die-off caused by the digital divide.

SZTAKI Publication Repository

Directory of Open Access Journals

Twitter and society

Publication venue: 'New York Botanical Garden'
Publication date: 01/01/2014

The role of partecipant discourse in online community formation

Author: Rasulo Margherita
Publication date: 01/01/2007

Public procurement and innovation: is defence different?

Author: Kundu Oishee
Publication date: 01/08/2022