1,372 research outputs found

    Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering

    We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods that produce clusters favouring diversity with respect to a set of protected attributes such as race, sex, or age. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles the attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a preprocessing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and discuss the relation between diversity, fairness, and cluster structure. Funding: Ministerio de Economía y Competencia and FEDER (grant MTM2017-86061-C2-1-P); Junta de Castilla y León (grants VA005P17 and VA002G18); Basque Government through the BERC 2018-2021 programme; Ministerio de Ciencia, Innovación y Universidades (BCAM Severo Ochoa accreditation SEV-2017-0718). Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Actuación: 20007-CL - Apoyo Consorcio BUCL

    Comparing connected structures in ensemble of random fields

    Different simulation methods or parameter sets may produce very different connectivity patterns, and therefore different flow properties. This paper proposes a systematic method to compare ensembles of categorical simulations from a static connectivity point of view. Differences in static connectivity cannot always be distinguished using two-point statistics. In addition, multiple-point histograms only provide a statistical comparison of patterns, regardless of connectivity. We therefore propose to characterize static connectivity with a set of 12 indicators based on the connected components of the realizations. Some indicators describe the spatial distribution of the connected components, others their global shape or their topology through the component skeletons. We also combine all the indicators into dissimilarity values so that hundreds of realizations can be compared easily. Heat maps and multidimensional scaling then facilitate the dissimilarity analysis. The application to a synthetic case highlights the impact of the grid size on the connectivity and the indicators. This impact disappears when comparing samples of the realizations that have the same size. The method is then able to rank realizations with respect to a reference model based on their static connectivity. This application also yields practical advice. Multidimensional scaling is a powerful visualization tool, but it also misrepresents some dissimilarities: it should always be interpreted cautiously, with a look at the confidence in each point's position. The heat map displays the real dissimilarities and is more appropriate for a detailed analysis. The comparison with a multiple-point histogram method shows the benefit of the connected components: the large-scale connectivity seems better characterized by our indicators, especially the skeleton indicators.
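As a rough illustration of the idea in the abstract above, the following sketch labels the connected components of one category in a small 2D grid, derives two simple indicators from them, and compares two realizations by the distance between their indicator vectors. The two indicators (`n_components`, `largest_fraction`), the 4-neighbour connectivity, and the unscaled Euclidean distance are assumptions chosen for illustration; they are not the 12 indicators or the dissimilarity used in the paper.

```python
from collections import deque


def component_sizes(grid, category):
    """Sizes of the 4-connected components made of cells equal to `category`."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != category or seen[r][c]:
                continue
            # Breadth-first search over one component.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == category):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def indicators(grid, category=1):
    """Two hypothetical connectivity indicators for one realization."""
    sizes = component_sizes(grid, category)
    total = sum(sizes)
    return {
        "n_components": len(sizes),
        "largest_fraction": max(sizes) / total if total else 0.0,
    }


def dissimilarity(ind_a, ind_b):
    """Euclidean distance between indicator vectors (no rescaling; an assumption)."""
    return sum((ind_a[k] - ind_b[k]) ** 2 for k in ind_a) ** 0.5


# Two toy realizations: the same category forms two bodies in `a`, one in `b`.
a = [[1, 1, 0],
     [0, 0, 0],
     [0, 1, 1]]
b = [[1, 1, 0],
     [0, 1, 0],
     [0, 1, 1]]
print(indicators(a), indicators(b), dissimilarity(indicators(a), indicators(b)))
```

In practice the indicators would need to be put on comparable scales before being combined, since a raw count and a fraction are not commensurable.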

    A Survey of Dimension Reduction Methods for High-dimensional Data Analysis and Visualization

    Dimension reduction is commonly defined as the process of mapping high-dimensional data to a lower-dimensional embedding. Applications of dimension reduction include, but are not limited to, filtering, compression, regression, classification, feature analysis, and visualization. We review methods that compute a point-based visual representation of high-dimensional data sets to aid in exploratory data analysis. The aim is not to be exhaustive but to provide an overview of basic approaches, as well as to review selected state-of-the-art methods. Our survey paper is an introduction to dimension reduction from a visualization point of view. Subsequently, a comparison of state-of-the-art methods outlines relations and shared research foci.

    BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

    A novel discrete mathematical approach is proposed as an additional tool for molecular systematics that does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms that generate mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual (graph) characteristics. The binary-encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to obtain an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
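The last step described above, going from descriptor vectors to a distance matrix and then to a tree by UPGMA, can be sketched as follows. The toy binary vectors and the Hamming distance are stand-ins chosen for illustration; the abstract does not specify how ICF descriptors are computed or which distances are used, so nothing here reproduces BOOL-AN itself.

```python
def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering; returns the topology as nested tuples.

    `dist` is a symmetric matrix given as a list of lists.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for k in clusters:
            d_ak = d[(min(a, k), max(a, k))]
            d_bk = d[(min(b, k), max(b, k))]
            merged[k] = (size_a * d_ak + size_b * d_bk) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for k, v in merged.items():
            d[(k, next_id)] = v  # existing ids are always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical binary descriptors for four taxa.
vectors = {"A": [0, 0, 0, 0], "B": [0, 0, 0, 1], "C": [1, 1, 1, 0], "D": [1, 1, 1, 1]}
names = list(vectors)
matrix = [[hamming(vectors[p], vectors[q]) for q in names] for p in names]
tree = upgma(names, matrix)
print(tree)
```

This version returns only the tree topology; branch lengths could be recorded by storing the merge distance alongside each subtree.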

    Data-Driven Supervised Learning for Life Science Data

    Life science data are often encoded in a non-standard way by means of alphanumeric sequences, graph representations, numerical vectors of variable length, or other formats. Domain-specific or data-driven similarity measures such as alignment functions have been employed with great success. However, the vast majority of more complex data analysis algorithms require fixed-length vectorial input data, which demands substantial preprocessing of life science data, and data-driven measures are widely ignored in favor of simple encodings. These preprocessing steps are neither always easy to perform nor particularly effective, and they risk a loss of information and interpretability. We present some strategies and concepts for employing data-driven similarity measures in the life science context and for other complex biological systems. In particular, we show how to use data-driven similarity measures effectively in standard learning algorithms.

    Identifying the evolution of stock markets stochastic structure after the euro

    Previous studies have investigated the comovements of international equity markets by using correlation, cointegration, common factor analysis, and other approaches. In this paper, we investigate the stochastic structure of major euro and non-euro area stock market series from 1994 to 2006 by using cluster analysis techniques for time series. We use an interpolated-periodogram-based metric for level and squared returns in order to compute distances between the stock markets. This method captures the stochastic dependence structure of the time series and overcomes the problem of unequal sample sizes across countries. The clusters of countries are identified from the dendrogram and the principal coordinates associated with the sample spectrum, for both the series of returns and the volatilities. The empirical results suggest that the cross-country groups have become considerably more homogeneous with the introduction of the euro as an electronic currency. For reference, we also explore the pairwise correlations among the series.
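A minimal sketch of a periodogram-based distance between two series of unequal length is given below. It assumes one common construction: the periodogram of the longer series is linearly interpolated at the Fourier frequencies of the shorter one, and the distance is the Euclidean norm of the difference between variance-normalized ordinates. The abstract does not state the exact metric, so the normalization, the interpolation rule, and all function names are assumptions.

```python
import cmath
import math


def periodogram(x):
    """Variance-normalized periodogram at frequencies 2*pi*k/n, k = 1..n//2.

    Assumes the series is not constant (non-zero sample variance).
    """
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    variance = sum(v * v for v in centered) / n
    points = []
    for k in range(1, n // 2 + 1):
        w = 2 * math.pi * k / n
        s = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(centered))
        points.append((w, abs(s) ** 2 / n / variance))
    return points


def interpolate(points, w):
    """Linear interpolation of (frequency, value) pairs at w, clamped at both ends."""
    if w <= points[0][0]:
        return points[0][1]
    for (w0, p0), (w1, p1) in zip(points, points[1:]):
        if w <= w1:
            return p0 + (p1 - p0) * (w - w0) / (w1 - w0)
    return points[-1][1]


def periodogram_distance(x, y):
    """Euclidean distance between periodograms on the shorter series' frequency grid."""
    short, long_ = sorted((periodogram(x), periodogram(y)), key=len)
    return math.sqrt(sum((p - interpolate(long_, w)) ** 2 for w, p in short))


# Toy series: x and y share a period of 8 but differ in length; z has period 4.
x = [math.sin(2 * math.pi * t / 8) for t in range(64)]
y = [math.sin(2 * math.pi * t / 8) for t in range(32)]
z = [math.sin(2 * math.pi * t / 4) for t in range(32)]
print(periodogram_distance(x, y), periodogram_distance(x, z))
```

A matrix of such pairwise distances, computed on returns and on squared returns, could then be passed to a hierarchical clustering routine to obtain a dendrogram of markets.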