Co-occurrence patterns in diagnostic data
We demonstrate how graph decomposition techniques can be employed for the visualization of hierarchical co-occurrence patterns between medical data items. Our research is based on Gaifman graphs (a mathematical concept introduced in logic), on specific variants of this concept, and on existing graph decomposition notions, specifically graph modules and the clan decomposition of so-called 2-structures. The construction of the Gaifman graph from a dataset is based on co-occurrence, or lack of it, of items in the dataset. We may select a discretization of the edge labels to aim at one of several Gaifman graph variants. The decomposition of the graph may then provide visual information about the data co-occurrences, after which one can proceed to more traditional statistical analysis. Partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement ERC-2014-CoG 648276 (AUTAR); by grant TIN2017-89244-R from the Ministerio de Economía, Industria y Competitividad; and by Conacyt (México). We acknowledge unfunded recognition 2017SGR-856 (MACDA) from AGAUR (Generalitat de Catalunya).
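The co-occurrence construction described above can be sketched in a few lines: vertices are data items, edge labels count joint occurrences, and a discretization of the labels selects a Gaifman graph variant. The function names and the single-threshold discretization below are illustrative assumptions, not the paper's actual method:

```python
from itertools import combinations
from collections import Counter

def gaifman_graph(transactions):
    """Build a co-occurrence (Gaifman-style) graph: vertices are items,
    edge labels count how many records contain both endpoints."""
    edges = Counter()
    vertices = set()
    for record in transactions:
        items = sorted(set(record))
        vertices.update(items)
        for a, b in combinations(items, 2):
            edges[(a, b)] += 1
    return vertices, dict(edges)

def discretize(edges, threshold=1):
    """One possible discretization of edge labels: keep only edges whose
    co-occurrence count reaches the threshold (a plain-graph variant)."""
    return {e: c for e, c in edges.items() if c >= threshold}

# Toy diagnostic records (illustrative data, not from the paper).
data = [["fever", "cough"], ["fever", "cough", "fatigue"], ["fatigue"]]
V, E = gaifman_graph(data)
strong = discretize(E, threshold=2)
```

The discretized graph would then be handed to a module/clan decomposition routine; that step is outside this sketch.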
Non-empirical problems in fair machine learning
The problem of fair machine learning has drawn much attention over the last few years, and the bulk of offered solutions are, in principle, empirical. However, algorithmic fairness also raises important conceptual issues that would fail to be addressed if one relies entirely on empirical considerations. Herein, I will argue that the current debate has developed an empirical framework that has brought important contributions to the development of algorithmic decision-making, such as new techniques to discover and prevent discrimination, additional assessment criteria, and analyses of the interaction between fairness and predictive accuracy. However, the same framework has also suggested higher-order issues regarding the translation of fairness into metrics and quantifiable trade-offs. Although the (empirical) tools which have been developed so far are essential to address discrimination encoded in data and algorithms, their integration into society elicits key (conceptual) questions such as: What kind of assumptions and decisions underlie the empirical framework? How do the results of the empirical approach penetrate public debate? What kind of reflection and deliberation should stakeholders have over available fairness metrics? I will outline the empirical approach to fair machine learning, i.e. how the problem is framed and addressed, and suggest that there are important non-empirical issues that should be tackled. While this work focuses on the problem of algorithmic fairness, the lesson can extend to other conceptual problems in the analysis of algorithmic decision-making, such as privacy and explainability.
Advances in Intelligent Data Analysis XVII: 17th International Symposium, IDA 2018, ’s-Hertogenbosch, The Netherlands, October 24–26, 2018, Proceedings
Longitudinal data is ubiquitous in research and is often complemented by broad collections of static background information. There is, however, a shortage of general-purpose statistical tools for studying the temporal dynamics of complex and stochastic dynamical systems, especially when data is scarce and the underlying mechanisms that generate the observations are poorly understood. Contemporary microbiome research provides a topical example, where vast cross-sectional and longitudinal collections of taxonomic profiling data from the human body and other environments are now being collected in research laboratories worldwide. Many classical algorithms rely on long and densely sampled time series, whereas human microbiome studies typically have more limited sample sizes, short time spans, sparse sampling intervals, lack of replicates, and high levels of unaccounted technical and biological variation. We demonstrate how non-parametric models can help to quantify key properties of a dynamical system when the actual data-generating mechanisms are largely unknown. Such properties include the locations of stable states, the resilience of the system, and the levels of stochastic fluctuations. Moreover, we show how limited data availability can be compensated by pooling statistical evidence across multiple individuals or studies, and by incorporating prior information into the models. In particular, we derive and implement a hierarchical Bayesian variant of Ornstein-Uhlenbeck driven t-processes, which can be used to characterize universal dynamics in univariate, unimodal, mean-reverting systems based on multiple short time series. We validate the model with simulated data and investigate its applicability in characterizing the temporal dynamics of the human gut microbiome.
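As an illustration of the quantities such a model targets, here is a minimal (non-hierarchical, Gaussian-noise) Ornstein-Uhlenbeck simulation, where the stable state location, resilience, and fluctuation level appear directly as parameters. This is a sketch under simplified assumptions, not the paper's hierarchical Bayesian t-process; the function and parameter names are hypothetical:

```python
import math
import random

def simulate_ou(mu, lam, sigma, x0, dt, n, seed=0):
    """Euler-Maruyama simulation of dX = lam*(mu - X) dt + sigma dW.
    mu:    location of the stable state (long-run mean)
    lam:   resilience, i.e. the mean-reversion rate
    sigma: level of stochastic fluctuations"""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n):
        x += lam * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0, 1)
        path.append(x)
    return path

# Start far from the stable state and watch the process revert to mu.
path = simulate_ou(mu=0.0, lam=2.0, sigma=0.3, x0=3.0, dt=0.01, n=2000)
```

In the inference setting the direction is reversed: given several short observed series, one estimates mu, lam, and sigma, pooling evidence across individuals via a hierarchical prior.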
Advances in Intelligent Data Analysis XVII: 17th International Symposium, IDA 2018, ’s-Hertogenbosch, The Netherlands, October 24–26, 2018, Proceedings
The increasing openness of data, methods, and collaboration networks has created new opportunities for research, citizen science, and industry. Whereas openly licensed scientific, governmental, and institutional data sets can now be accessed through programmatic interfaces, compressed archives, and downloadable spreadsheets, realizing the full potential of open data streams depends critically on the availability of targeted data analytical methods, and on user communities that can derive value from these digital resources. Interoperable software libraries have become a central element in modern statistical data analysis, bridging the gap between theory and practice, while open developer communities have emerged as a powerful driver of research software development. Drawing insights from a decade of community engagement, I propose the concept of open data science, which refers to the new forms of research enabled by open data, open methods, and open collaboration.
Survey on Sociodemographic Bias in Natural Language Processing
Deep neural networks often learn unintended biases during training, which might have harmful effects when deployed in real-world settings. This paper surveys 209 papers on bias in NLP models, most of which address sociodemographic bias. To better understand the distinction between bias and real-world harm, we turn to ideas from psychology and behavioral economics to propose a definition for sociodemographic bias. We identify three main categories of NLP bias research: types of bias, quantifying bias, and debiasing. We conclude that current approaches to quantifying bias face reliability issues, that many of the bias metrics do not relate to real-world biases, and that current debiasing techniques are superficial and hide bias rather than removing it. Finally, we provide recommendations for future work.
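To make "quantifying bias" concrete, here is a minimal sketch of one widely used fairness metric, the demographic parity gap (the difference in positive-prediction rates across groups). The helper name and the toy data are illustrative, not taken from the survey:

```python
def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest positive-prediction
    rates across groups; 0 means equal rates (demographic parity)."""
    rates = {}
    for g in set(groups):
        preds_g = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(preds_g) / len(preds_g)
    return max(rates.values()) - min(rates.values())

# Toy binary classifier outputs for two demographic groups.
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_gap(preds, groups)  # 3/4 - 1/4 = 0.5
```

Metrics of this kind are exactly what the survey flags as contested: a small gap does not by itself establish the absence of real-world harm.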