27 research outputs found

    Survey on Sociodemographic Bias in Natural Language Processing

    Full text link
    Deep neural networks often learn unintended biases during training, which can have harmful effects when the models are deployed in real-world settings. This paper surveys 209 papers on bias in NLP models, most of which address sociodemographic bias. To better understand the distinction between bias and real-world harm, we turn to ideas from psychology and behavioral economics to propose a definition for sociodemographic bias. We identify three main categories of NLP bias research: types of bias, quantifying bias, and debiasing. We conclude that current approaches to quantifying bias face reliability issues, that many of the bias metrics do not relate to real-world biases, and that current debiasing techniques are superficial and hide bias rather than remove it. Finally, we provide recommendations for future work. Comment: 23 pages, 1 figure.

    Advances in Intelligent Data Analysis XVII: 17th International Symposium, IDA 2018, ’s-Hertogenbosch, The Netherlands, October 24–26, 2018, Proceedings

    No full text
    Longitudinal data is ubiquitous in research, and often complemented by broad collections of static background information. There is, however, a shortage of general-purpose statistical tools for studying the temporal dynamics of complex and stochastic dynamical systems, especially when data are scarce and the underlying mechanisms that generate the observations are poorly understood. Contemporary microbiome research provides a topical example, where vast cross-sectional and longitudinal collections of taxonomic profiling data from the human body and other environments are now being collected in research laboratories worldwide. Many classical algorithms rely on long and densely sampled time series, whereas human microbiome studies typically have more limited sample sizes, short time spans, sparse sampling intervals, a lack of replicates, and high levels of unaccounted technical and biological variation. We demonstrate how non-parametric models can help to quantify key properties of a dynamical system when the actual data-generating mechanisms are largely unknown. Such properties include the locations of stable states, the resilience of the system, and the levels of stochastic fluctuation. Moreover, we show how limited data availability can be compensated for by pooling statistical evidence across multiple individuals or studies, and by incorporating prior information in the models. In particular, we derive and implement a hierarchical Bayesian variant of Ornstein-Uhlenbeck driven t-processes. This can be used to characterize universal dynamics in univariate, unimodal, and mean-reverting systems based on multiple short time series. We validate the model with simulated data and investigate its applicability in characterizing the temporal dynamics of the human gut microbiome.
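
    As a rough illustration of the modelling idea, the sketch below simulates several short Ornstein-Uhlenbeck series and pools them in a single regression to recover the stable state, the mean-reversion rate (a proxy for resilience), and the level of stochastic fluctuation. It is a minimal, non-hierarchical stand-in for the Bayesian OU-driven t-process described above, using Gaussian noise and least squares; the function names and parameter values are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def simulate_ou(mu, lam, sigma, x0, n_steps, dt, rng):
            """Euler-Maruyama simulation of dX = lam*(mu - X) dt + sigma dW."""
            x = np.empty(n_steps)
            x[0] = x0
            for t in range(1, n_steps):
                x[t] = x[t-1] + lam * (mu - x[t-1]) * dt + sigma * np.sqrt(dt) * rng.normal()
            return x

        def pooled_ou_estimate(series, dt):
            """Estimate (mu, lam, sigma) by pooling the AR(1) regression X_{t+1} ~ a + b*X_t
            across several short series; lam = -log(b)/dt under the exact OU transition."""
            xs = np.concatenate([s[:-1] for s in series])
            ys = np.concatenate([s[1:] for s in series])
            b, a = np.polyfit(xs, ys, 1)                 # slope, intercept
            lam = -np.log(b) / dt                        # mean-reversion rate (resilience)
            mu = a / (1.0 - b)                           # stable state (long-run mean)
            resid = ys - (a + b * xs)
            sigma = resid.std(ddof=2) * np.sqrt(2.0 * lam / (1.0 - b**2))  # diffusion level
            return mu, lam, sigma

        rng = np.random.default_rng(0)
        # Ten short series from ten "individuals", mimicking sparse longitudinal sampling.
        series = [simulate_ou(mu=1.0, lam=0.8, sigma=0.3, x0=rng.normal(1, 0.5),
                              n_steps=25, dt=0.5, rng=rng) for _ in range(10)]
        print(pooled_ou_estimate(series, dt=0.5))

    Pooling the transitions from all individuals into one regression is what compensates for the short length of each individual series, which is the point the abstract makes about combining evidence across subjects or studies.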

    Advances in Intelligent Data Analysis XVII: 17th International Symposium, IDA 2018, ’s-Hertogenbosch, The Netherlands, October 24–26, 2018, Proceedings

    No full text
    The increasing openness of data, methods, and collaboration networks has created new opportunities for research, citizen science, and industry. Whereas openly licensed scientific, governmental, and institutional data sets can now be accessed through programmatic interfaces, compressed archives, and downloadable spreadsheets, realizing the full potential of open data streams depends critically on the availability of targeted data analytical methods, and on user communities that can derive value from these digital resources. Interoperable software libraries have become a central element in modern statistical data analysis, bridging the gap between theory and practice, while open developer communities have emerged as a powerful driver of research software development. Drawing insights from a decade of community engagement, I propose the concept of open data science, which refers to the new forms of research enabled by open data, open methods, and open collaboration.

    Early Modern Privacy

    Get PDF
    An examination of instances, experiences, and spaces of early modern privacy. It opens new avenues to understanding the structures and dynamics that shape early modern societies through a wide array of sources, discourses, practices, and spatial programmes. Readership: because of its comprehensive disciplinary scope, this volume is of interest to scholars and students of early modern culture in all its facets. Keywords: early modern, intimacy, legal history, religious history, history of art, history of architecture, secrecy, theology, ego-documents, history of science, literary studies, China, Europe, private life, privacy, Jewish history, theory

    Non-empirical problems in fair machine learning

    Get PDF
    The problem of fair machine learning has drawn much attention over the last few years, and the bulk of the solutions on offer are, in principle, empirical. However, algorithmic fairness also raises important conceptual issues that would go unaddressed if one relied entirely on empirical considerations. Herein, I will argue that the current debate has developed an empirical framework that has made important contributions to the development of algorithmic decision-making, such as new techniques to discover and prevent discrimination, additional assessment criteria, and analyses of the interaction between fairness and predictive accuracy. However, the same framework has also raised higher-order issues regarding the translation of fairness into metrics and quantifiable trade-offs. Although the (empirical) tools developed so far are essential for addressing discrimination encoded in data and algorithms, their integration into society elicits key (conceptual) questions such as: What kinds of assumptions and decisions underlie the empirical framework? How do the results of the empirical approach penetrate public debate? What kind of reflection and deliberation should stakeholders have over the available fairness metrics? I will outline the empirical approach to fair machine learning, i.e. how the problem is framed and addressed, and suggest that there are important non-empirical issues that should be tackled. While this work focuses on the problem of algorithmic fairness, the lessons extend to other conceptual problems in the analysis of algorithmic decision-making, such as privacy and explainability.
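
    To make concrete what "translating fairness into metrics" looks like in the empirical framework discussed above, the sketch below computes two standard group-fairness quantities, a demographic-parity gap and a true-positive-rate gap, on a tiny synthetic example. These are common metrics from the fairness literature rather than measures proposed in this paper, and the data and function names are invented for illustration.

        import numpy as np

        def demographic_parity_gap(y_pred, group):
            """Absolute difference in positive-prediction rates between two groups."""
            y_pred, group = np.asarray(y_pred), np.asarray(group)
            return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

        def true_positive_rate_gap(y_true, y_pred, group):
            """Absolute difference in true-positive rates (one ingredient of equalized odds)."""
            y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
            tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
            return abs(tpr(0) - tpr(1))

        # Tiny synthetic example: 0/1 labels, predictions, and a binary group attribute.
        y_true = [1, 0, 1, 1, 0, 1, 0, 0]
        y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
        group  = [0, 0, 0, 0, 1, 1, 1, 1]
        print(demographic_parity_gap(y_pred, group), true_positive_rate_gap(y_true, y_pred, group))

    Even this toy case shows the conceptual gap the paper points to: a single scalar per metric says nothing by itself about which assumptions, thresholds, or trade-offs stakeholders should accept.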

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    Get PDF
    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped describe their essential properties and how they vary in space. However, when the manifold evolves through time, joint spatio-temporal modelling is needed to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
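
    As a hedged illustration of propagating a spatial probabilistic model across adjacent time steps, the sketch below fits a Gaussian mixture to each snapshot of a drifting point cloud and warm-starts every fit from the previous one, so each model depends only on its immediate predecessor. This is a simplified stand-in for the first-order Markovian construction described above, not the authors' method; the snapshot data and component count are assumptions.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def fit_snapshots(snapshots, n_components=3, seed=0):
            """Fit a GMM to each temporal snapshot, initialising every fit from the
            previous model so the estimate at time t depends only on time t-1."""
            models, prev = [], None
            for points in snapshots:
                if prev is None:
                    gmm = GaussianMixture(n_components=n_components, random_state=seed)
                else:
                    gmm = GaussianMixture(
                        n_components=n_components,
                        weights_init=prev.weights_,
                        means_init=prev.means_,
                        precisions_init=np.linalg.inv(prev.covariances_),
                        random_state=seed,
                    )
                gmm.fit(points)
                models.append(gmm)
                prev = gmm
            return models

        # Synthetic "particle" snapshots: a 2-D cloud drifting over three time steps.
        rng = np.random.default_rng(1)
        snapshots = [rng.normal(loc=[t, 0.5 * t], scale=1.0, size=(500, 2)) for t in range(3)]
        models = fit_snapshots(snapshots)
        print([m.means_.round(2).tolist() for m in models])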

    Data analysis through graph decomposition

    Get PDF
    This work is developed within the field of data mining and data visualization. Under the premise that many algorithms produce amounts of data too large for users to handle, we work with the decomposition of Gaifman graphs and their variants as an option for data visualization. Specifically, we apply the decomposition method based on the so-called 2-structures. This decomposition method had been developed theoretically but had no practical application in this field until now; providing one is part of our contribution.

    From a dataset we construct the Gaifman graph (and possible variants of it) that represents information about co-occurrence patterns. More precisely, the construction of the Gaifman graph from the dataset is based on the co-occurrence, or lack of it, of items in the dataset: pairs of items that appear together in some transaction are connected, and items that never appear together are disconnected. We may naturally complete the graph by adding the absent edges as a different kind of edge; in this way we obtain a complete graph with two equivalence classes on its edges. Now consider the graph where the kind of each edge is determined by its multiplicity, that is, by the number of transactions containing the pair of items that the edge connects. In this case we have as many equivalence classes as different multiplicities, and we may apply discretization methods to them to obtain different variants of the graph. All these variants can be seen as 2-structures.

    Applying the 2-structure decomposition method produces a hierarchical visualization of the co-occurrences in the data. The decomposition method is based on clan decomposition. Given a 2-structure defined on U, a set of vertices C, with C a subset of U, is a clan if no z outside C distinguishes among the elements of C, that is, z is connected to all elements of C by edges of the same equivalence class. We connect this decomposition with an associated closure space, developing this intuition by introducing a construction of implication sets, named clan implications. Based on the definition of a clan, let x and y be elements of any clan C; if there is a z such that the edges (x,z) and (y,z) are in different equivalence classes, then z must be in C. This is equivalent to saying that C logically entails the implication "x and y imply z".

    Throughout the thesis, in order to present our work constructively, we first treat the case of only two equivalence classes, with its corresponding nomenclature (modules instead of clans), and then extend the theory to more equivalence classes. Our main contributions are: an algorithm (with a full implementation) for the clan decomposition method; the theorems that support our approach; and examples of its application that demonstrate its usability.
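
    The sketch below illustrates, under simplifying assumptions, the two ingredients described above: building the edge multiplicities of a completed Gaifman graph from transaction co-occurrences, and testing whether a candidate vertex set is a clan, i.e. whether every outside vertex sees all its members through edges of the same equivalence class. The function names and toy transactions are illustrative; this is not the thesis implementation.

        from itertools import combinations
        from collections import Counter

        def gaifman_multiplicities(transactions):
            """For every pair of items, count the transactions containing both.
            Pairs with count 0 form the 'absent edge' class of the completed graph."""
            items = sorted({i for t in transactions for i in t})
            counts = Counter()
            for t in transactions:
                for a, b in combinations(sorted(set(t)), 2):
                    counts[(a, b)] += 1
            return items, {pair: counts.get(pair, 0) for pair in combinations(items, 2)}

        def is_clan(candidate, items, multiplicity):
            """C is a clan if every outside vertex z reaches all members of C through
            edges of the same equivalence class (here: the same multiplicity label)."""
            C = set(candidate)
            for z in items:
                if z in C:
                    continue
                labels = {multiplicity[tuple(sorted((z, x)))] for x in C}
                if len(labels) > 1:
                    return False
            return True

        transactions = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"a", "b", "d"}]
        items, mult = gaifman_multiplicities(transactions)
        print(mult)
        print(is_clan({"a", "b"}, items, mult))

    In this toy example the set {a, b} is a clan because every other item co-occurs with a and with b the same number of times, which is exactly the condition the clan implications formalize.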

    Co-occurrence patterns in diagnostic data

    Get PDF
    We demonstrate how graph decomposition techniques can be employed for the visualization of hierarchical co-occurrence patterns between medical data items. Our research is based on Gaifman graphs (a mathematical concept introduced in logic), on specific variants of this concept, and on existing graph decomposition notions, specifically graph modules and the clan decomposition of so-called 2-structures. The construction of the Gaifman graph from a dataset is based on the co-occurrence, or lack of it, of items in the dataset. We may select a discretization on the edge labels to aim at one among several Gaifman graph variants. Then, the decomposition of the graph may provide us with visual information about the data co-occurrences, after which one can proceed to more traditional statistical analysis. Partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement ERC-2014-CoG 648276 (AUTAR); by grant TIN2017-89244-R from Ministerio de Economia, Industria y Competitividad; and by Conacyt (México). We acknowledge unfunded recognition 2017SGR-856 (MACDA) from AGAUR (Generalitat de Catalunya). Peer reviewed. Postprint (published version).
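
    The abstract mentions selecting a discretization on the edge labels to obtain one of several Gaifman graph variants. The sketch below shows one possible discretization, quantile binning of the positive co-occurrence counts with a separate class reserved for absent edges; the binning scheme and the toy multiplicities are assumptions for illustration, not the scheme used in the paper.

        import numpy as np

        def discretize_edge_labels(multiplicity, n_bins=3):
            """Map raw co-occurrence counts to a small number of equivalence classes
            (class 0 reserved for absent edges), yielding one Gaifman graph variant."""
            counts = np.array([c for c in multiplicity.values() if c > 0])
            cuts = np.quantile(counts, np.linspace(0, 1, n_bins + 1))[1:-1]  # inner cut points
            return {pair: 0 if c == 0 else 1 + int(np.searchsorted(cuts, c, side="right"))
                    for pair, c in multiplicity.items()}

        # Toy multiplicities for pairs of diagnostic codes (hypothetical values).
        mult = {("dx1", "dx2"): 12, ("dx1", "dx3"): 0, ("dx2", "dx3"): 3, ("dx1", "dx4"): 1}
        print(discretize_edge_labels(mult))

    Coarser binning merges more multiplicities into the same equivalence class, which tends to produce larger clans and therefore a simpler hierarchical visualization.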