Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media.
Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19).
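The post-stratification correction described above can be sketched with a toy example. The demographic cells, counts, and function below are illustrative assumptions, not the paper's actual data or implementation: each user is assigned to an inferred demographic stratum, and per-stratum sample means are reweighted by ground-truth population counts.

```python
from collections import defaultdict

def poststratified_mean(samples, population_counts):
    """Reweight per-cell sample means by ground-truth cell proportions.

    samples: list of (cell, value) pairs, where cell is an inferred
             demographic stratum such as (age_band, gender)
    population_counts: dict mapping cell -> ground-truth population count
    """
    sums, ns = defaultdict(float), defaultdict(int)
    for cell, value in samples:
        sums[cell] += value
        ns[cell] += 1

    # Restrict the population total to cells actually observed in the sample
    total = sum(population_counts[c] for c in ns)
    return sum(population_counts[c] / total * sums[c] / ns[c] for c in ns)

# Toy example: young users are over-represented in the social media sample
samples = [(("18-29", "f"), 1.0)] * 8 + [(("60+", "f"), 0.0)] * 2
pop = {("18-29", "f"): 100, ("60+", "f"): 300}
print(poststratified_mean(samples, pop))  # 0.25 (naive sample mean is 0.8)
```

The naive mean (0.8) reflects the skewed sample; reweighting by the true cell sizes recovers an estimate closer to the population quantity.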
Zero-Shot Robustification of Zero-Shot Models With Foundation Models
Zero-shot inference is a powerful paradigm that enables the use of large
pretrained models for downstream classification tasks without further training.
However, these models are vulnerable to inherited biases that can impact their
performance. The traditional solution is fine-tuning, but this undermines the
key advantage of pretrained models, which is their ability to be used
out-of-the-box. We propose RoboShot, a method that improves the robustness of
pretrained model embeddings in a fully zero-shot fashion. First, we use
zero-shot language models (LMs) to obtain useful insights from task
descriptions. These insights are embedded and used to remove harmful and boost
useful components in embeddings -- without any supervision. Theoretically, we
provide a simple and tractable model for biases in zero-shot embeddings and
give a result characterizing under what conditions our approach can boost
performance. Empirically, we evaluate RoboShot on nine image and NLP
classification tasks and show an average improvement of 15.98% over several
zero-shot baselines. Additionally, we demonstrate that RoboShot is compatible
with a variety of pretrained models and language models.
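The removal-and-boosting step can be illustrated as vector projection in embedding space. This is a minimal sketch of the idea rather than the authors' implementation, and the direction vectors below are invented for illustration: a harmful component identified from an LM insight is projected out of the embedding, while a useful one is amplified.

```python
import numpy as np

def reject(z, v):
    """Remove the component of embedding z along direction v."""
    v = v / np.linalg.norm(v)
    return z - (z @ v) * v

def boost(z, v, alpha=1.0):
    """Amplify the component of z along a useful direction v."""
    v = v / np.linalg.norm(v)
    return z + alpha * (z @ v) * v

# Toy 3-d example: strip a spurious "background" direction from an embedding
z = np.array([2.0, 1.0, 0.5])
harmful = np.array([1.0, 0.0, 0.0])
z_clean = reject(z, harmful)
print(z_clean)  # the harmful first component is zeroed
```

Both operations require no labels or training, which is what keeps the procedure fully zero-shot.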
Slovene and Croatian word embeddings in terms of gender occupational analogies
In recent years, the use of deep neural networks and dense vector embeddings for text representation has led to excellent results in the field of computational understanding of natural language. It has also been shown that word embeddings often capture gender, racial, and other types of bias. The article focuses on evaluating Slovene and Croatian word embeddings in terms of gender bias using word-analogy calculations. We compiled a list of masculine and feminine nouns for occupations in Slovene and evaluated the gender bias of fastText, word2vec, and ELMo embeddings with different configurations and different approaches to analogy calculation. The lowest occupational gender bias was observed with the fastText embeddings. Similarly, we compared different fastText embeddings on Croatian occupational analogies.
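The word-analogy calculation underlying such evaluations can be sketched with the common 3CosAdd formulation (the article may use a different variant). The toy vocabulary and vectors below are invented for illustration; a real evaluation would use trained fastText or word2vec vectors for Slovene or Croatian occupation nouns.

```python
import numpy as np

def analogy(emb, a, b, c):
    """3CosAdd: return the word maximizing cos(w, b - a + c), excluding a, b, c."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy embedding probing "man : doctor :: woman : ?" (all vectors illustrative)
emb = {
    "man":      np.array([1.0, 0.0, 0.0]),
    "woman":    np.array([0.0, 1.0, 0.0]),
    "doctor":   np.array([1.0, 0.1, 0.9]),
    "doctorka": np.array([0.1, 1.0, 0.9]),  # hypothetical feminine form
    "nurse":    np.array([0.2, 0.9, 0.3]),
}
print(analogy(emb, "man", "doctor", "woman"))  # prints "doctorka"
```

Bias is then quantified by how often (or how strongly) the analogy deviates from the correct feminine occupational form toward a stereotyped word.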
Model and Evaluation: Towards Fairness in Multilingual Text Classification
Recently, more and more research has focused on addressing bias in text
classification models. However, existing research mainly focuses on the
fairness of monolingual text classification models, and research on fairness
for multilingual text classification is still very limited. In this paper, we
focus on the task of multilingual text classification and propose a debiasing
framework for multilingual text classification based on contrastive learning.
Our proposed method does not rely on any external language resources and can be
extended to other languages. The model contains four modules: multilingual
text representation module, language fusion module, text debiasing module, and
text classification module. The multilingual text representation module uses a
multilingual pre-trained language model to represent the text, the language
fusion module makes the semantic spaces of different languages tend to be
consistent through contrastive learning, and the text debiasing module uses
contrastive learning to make the model unable to identify sensitive attributes'
information. The text classification module completes the basic tasks of
multilingual text classification. In addition, existing research on the fairness of multilingual text classification uses a relatively simple evaluation mode: fairness is measured with the same equality-difference metric as in the monolingual setting, i.e., evaluation is performed on one language at a time. We propose a multi-dimensional fairness evaluation framework for multilingual text classification that evaluates the model's monolingual equality difference, multilingual equality difference, multilingual equality performance difference, and the destructiveness of the fairness strategy. We hope that our work can provide a more general debiasing method and a more comprehensive evaluation framework for multilingual text fairness tasks.
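The language-fusion idea of pulling parallel texts' representations together via contrastive learning can be sketched with a generic InfoNCE objective. The function name, temperature, and random data below are illustrative assumptions, not the paper's actual model: each anchor's positive is the same text encoded in another language, and the other pairs in the batch serve as negatives.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss over a batch: row i of `positives` is the translation of
    row i of `anchors`; all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = matched pairs

rng = np.random.default_rng(0)
en = rng.normal(size=(4, 8))
# Well-aligned "translations" (small perturbation) vs. unrelated vectors
aligned = info_nce(en, en + 0.01 * rng.normal(size=(4, 8)))
shuffled = info_nce(en, rng.normal(size=(4, 8)))
print(aligned, shuffled)
```

Minimizing this loss makes the semantic spaces of the languages tend toward consistency, which is the stated goal of the language fusion module; the debiasing module would use an analogous contrastive signal to make sensitive-attribute information unrecoverable.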
Analysis of the Effects of Cross-Lingual Transfer Learning, Compression, and Whitening in Natural Language Encoding
Tohoku University doctoral (Information Science) thesis