
    Studying bias in visual features through the lens of optimal transport

    Computer vision systems are employed in a variety of high-impact applications. However, making them trustworthy requires methods for the detection of potential biases in their training data, before models learn to harm already disadvantaged groups in downstream applications. Image data are typically represented via extracted features, which can be hand-crafted or pre-trained neural network embeddings. In this work, we introduce a framework for bias discovery given such features that is based on optimal transport theory; it uses the (quadratic) Wasserstein distance to quantify the disparity between the feature distributions of two demographic groups (e.g., women vs. men). In this context, we show that the Kantorovich potentials of the images, which are a byproduct of computing the Wasserstein distance and act as "transportation prices", can serve as bias scores by indicating which images might exhibit distinct biased characteristics. We thus introduce a visual dataset exploration pipeline that helps auditors identify common characteristics across high- or low-scored images as potential sources of bias. We conduct a case study to identify prospective gender biases and demonstrate theoretically derived properties with experiments on the CelebA and Biased MNIST datasets.
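
    The abstract does not include the computation itself. As a rough sketch of how the dual (Kantorovich) potentials fall out of a discrete Wasserstein computation, the following uses the POT library (`ot` on PyPI); the function and variable names are illustrative, not the paper's code.

```python
# A minimal sketch of Wasserstein-based bias scoring with the POT library
# (pip install pot); illustrative only, not the paper's pipeline.
import numpy as np
import ot

def bias_scores(feats_a, feats_b):
    """Score images in group A via their Kantorovich potentials.

    feats_a, feats_b: (n, d) and (m, d) arrays of image features
    for two demographic groups.
    """
    n, m = len(feats_a), len(feats_b)
    a = np.full(n, 1.0 / n)           # uniform weights on group A
    b = np.full(m, 1.0 / m)           # uniform weights on group B
    M = ot.dist(feats_a, feats_b)     # squared Euclidean cost matrix
    # Solve the exact OT problem; the log holds the dual potentials.
    _, log = ot.emd(a, b, M, log=True)
    w2_squared = log['cost']          # squared 2-Wasserstein distance
    return log['u'], w2_squared       # per-image "transportation prices"

# Images with extreme potentials are candidates for manual auditing:
rng = np.random.default_rng(0)
u, w2 = bias_scores(rng.normal(size=(200, 32)),
                    rng.normal(1.0, 1.0, size=(150, 32)))
suspects = np.argsort(u)[-10:]        # ten highest-scored images in group A
```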

    Preference relations based unsupervised rank aggregation for metasearch

    Rank aggregation mechanisms have been used to solve problems in various domains, such as bioinformatics, natural language processing, and information retrieval. Metasearch is one such application, where a user gives a query to the metasearch engine, which forwards the query to multiple individual search engines. The results or rankings returned by these individual search engines are combined using rank aggregation algorithms to produce the final result displayed to the user. We identify a few aspects that should be kept in mind when designing any rank aggregation algorithm for metasearch. For example, equal importance is generally given to the input rankings while performing the aggregation. However, depending on the individual search engines' indexed sets of web pages, the features they consider for ranking, the ranking functions they use, and so on, the individual rankings may be of different qualities. The aggregation algorithm should therefore give more weight to the better rankings and less weight to the others. Also, since the aggregation is performed while the user is waiting for a response, the operations performed by the algorithm need to be lightweight. Moreover, obtaining supervised data for the rank aggregation problem is often difficult. In this paper, we present an unsupervised rank aggregation algorithm that is suitable for metasearch and addresses the aspects mentioned above. We also perform a detailed experimental evaluation of the proposed algorithm on four different benchmark datasets with ground-truth information. Apart from the unsupervised Kendall-tau distance measure, several supervised evaluation measures are used for performance comparison. Experimental results demonstrate the efficacy of the proposed algorithm over baseline methods in terms of supervised evaluation metrics. Through these experiments, we also show that the Kendall-tau distance metric may not be suitable for evaluating rank aggregation algorithms for metasearch.
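
    The paper's preference-relation algorithm is not reproduced in the abstract. As a minimal illustration of the core idea that better input rankings should receive more weight, here is a hypothetical weighted Borda-style aggregation; in an unsupervised setting the weights could, for example, be estimated from each ranking's agreement with the others.

```python
# A hypothetical weighted Borda-style aggregation, illustrating the idea of
# weighting input rankings by estimated quality; this is NOT the paper's
# algorithm, whose preference-relation machinery is more involved.
from collections import defaultdict

def weighted_borda(rankings, weights):
    """rankings: lists of document ids, best first.
    weights: one non-negative quality weight per input ranking."""
    scores = defaultdict(float)
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] += w * (n - pos)   # higher positions earn more points
    return sorted(scores, key=scores.get, reverse=True)

# The third ranking disagrees with the other two, so it gets a lower weight:
rankings = [["a", "b", "c"], ["a", "c", "b"], ["c", "b", "a"]]
print(weighted_borda(rankings, weights=[1.0, 1.0, 0.5]))  # -> ['a', 'c', 'b']
```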

    Ranking Median Regression: Learning to Order through Local Consensus

    This article is devoted to the problem of predicting the value taken by a random permutation Σ, describing the preferences of an individual over a set of numbered items {1, …, n} say, based on the observation of an input/explanatory r.v. X (e.g., characteristics of the individual), when error is measured by the Kendall τ distance. In the probabilistic formulation of the 'Learning to Order' problem we propose, which extends the framework for statistical Kemeny ranking aggregation developed in CKS17, this boils down to recovering conditional Kemeny medians of Σ given X from i.i.d. training examples (X_1, Σ_1), …, (X_N, Σ_N). For this reason, this statistical learning problem is referred to as ranking median regression here. Our contribution is twofold. We first propose a probabilistic theory of ranking median regression: the set of optimal elements is characterized, the performance of empirical risk minimizers is investigated in this context, and situations where fast learning rates can be achieved are also exhibited. Next, we introduce the concept of local consensus/median in order to derive efficient methods for ranking median regression. The major advantage of this local learning approach lies in its close connection with the widely studied Kemeny aggregation problem. From an algorithmic perspective, this permits building predictive rules for ranking median regression by implementing efficient techniques for (approximate) Kemeny median computation at a local level in a tractable manner. In particular, versions of k-nearest neighbor and tree-based methods, tailored to ranking median regression, are investigated. Accuracy of piecewise constant ranking median regression rules is studied under a specific smoothness assumption on Σ's conditional distribution given X.
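
    As a toy illustration of the local consensus idea (not the paper's exact procedure), the sketch below predicts the ranking at a query point x as the brute-force Kemeny median of the rankings observed at its k nearest neighbors. Brute force enumerates all n! permutations, so n must stay very small.

```python
# k-NN ranking median regression, brute-force Kemeny median at the query
# point; illustrative only, and tractable only for small numbers of items.
from itertools import permutations
import numpy as np

def kendall_tau_distance(s, t):
    """Number of item pairs on which permutations s and t disagree."""
    n = len(s)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (s.index(i) - s.index(j)) * (t.index(i) - t.index(j)) < 0)

def knn_ranking_median(X_train, perms_train, x, k=5):
    """Kemeny median of the rankings at the k nearest training points."""
    idx = np.argsort(np.linalg.norm(np.asarray(X_train) - x, axis=1))[:k]
    local = [perms_train[i] for i in idx]
    n = len(local[0])
    # Exhaustively score every candidate consensus permutation.
    return min(permutations(range(n)),
               key=lambda c: sum(kendall_tau_distance(c, s) for s in local))
```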

    Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery

    Recent advances in biomedical data collection allow the assembly of massive datasets measuring thousands of features in thousands to millions of individual cells. These data have the potential to advance our understanding of biological mechanisms at a previously impossible resolution. However, there are few methods to understand data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, there is still much work to be done in making them useful for discovery in data whose supervision is more difficult to represent. The flexibility and expressiveness of neural networks is sometimes a hindrance in these less supervised domains, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is more common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models to understand such data. Encoding geometric priors into neural network and graph models allows us to characterize the models’ solutions as they relate to the fields of graph signal processing and optimal transport. These links allow us to understand and interpret this data type. We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate. For this, we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data. We apply these methods to molecular graphs, images, single-cell sequencing, and health-record data.
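
    The abstract does not detail the diffusion-based approximation. The sketch below conveys the general flavor under stated assumptions: build a k-NN graph over the joint point cloud, diffuse the two empirical distributions with the random-walk operator, and accumulate their L1 differences across dyadic scales. All specific choices here (affinity, scales, scale weights) are illustrative, not the thesis's construction.

```python
# A rough sketch of comparing two samples over a shared k-NN "cell graph"
# via multiscale diffusion; illustrative only.
import numpy as np
from scipy.spatial import cKDTree

def knn_affinity(X, k=10):
    """Symmetric Gaussian k-NN affinity matrix over the joint point cloud X."""
    tree = cKDTree(X)
    dist, idx = tree.query(X, k=k + 1)        # column 0 is the point itself
    n = len(X)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, idx[i, 1:]] = np.exp(-dist[i, 1:] ** 2 / dist[i, 1:].mean() ** 2)
    return np.maximum(A, A.T)

def diffusion_compare(X, mask_a, mask_b, scales=(1, 2, 4, 8)):
    """Accumulate L1 differences between the diffused distributions of the
    two samples (boolean masks over rows of X) at dyadic time scales."""
    A = knn_affinity(X)
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic random walk
    mu = mask_a.astype(float); mu /= mu.sum() # empirical distribution, group A
    nu = mask_b.astype(float); nu /= nu.sum() # empirical distribution, group B
    diff, total, step = mu - nu, 0.0, 0
    for t in scales:
        for _ in range(t - step):             # diffuse up to t steps in total
            diff = P.T @ diff
        step = t
        total += np.abs(diff).sum()
    return total
```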

    On the Evaluation of Generative Models in Distributed Learning Tasks

    The evaluation of deep generative models, including generative adversarial networks (GANs) and diffusion models, has been extensively studied in the literature. While existing evaluation methods mainly target a centralized learning problem with training data stored by a single client, many applications of generative models concern distributed learning settings, e.g., the federated learning scenario, where training data are collected by and distributed among several clients. In this paper, we study the evaluation of generative models in distributed learning tasks with heterogeneous data distributions. First, we focus on the Fréchet inception distance (FID) and consider the following FID-based aggregate scores over the clients: 1) FID-avg as the mean of clients' individual FID scores, and 2) FID-all as the FID distance of the trained model to the collective dataset containing all clients' data. We prove that the model rankings according to the FID-all and FID-avg scores can be inconsistent, which can lead to different optimal generative models according to the two aggregate scores. Next, we consider the kernel inception distance (KID) and similarly define the KID-avg and KID-all aggregations. Unlike the FID case, we prove that KID-all and KID-avg result in the same rankings of generative models. We perform several numerical experiments on standard image datasets and training schemes to support our theoretical findings on the evaluation of generative models in distributed learning problems.
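
    For concreteness, here is a minimal sketch of the two aggregate scores under the standard Gaussian-feature assumption behind FID; the function names are illustrative, not from the paper's code. With heterogeneous clients, these two functions can rank models differently, which is the inconsistency the paper proves.

```python
# FID-avg vs. FID-all over client feature sets, assuming Gaussian summaries
# of inception features; illustrative sketch only.
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, cov1, mu2, cov2):
    """Frechet distance between the Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):          # clip tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2 * covmean))

def gaussian_stats(feats):
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def fid_avg(model_feats, client_feats):
    """Mean of the per-client FID scores."""
    m = gaussian_stats(model_feats)
    return np.mean([fid(*m, *gaussian_stats(c)) for c in client_feats])

def fid_all(model_feats, client_feats):
    """FID against the pooled dataset of all clients."""
    m = gaussian_stats(model_feats)
    return fid(*m, *gaussian_stats(np.vstack(client_feats)))
```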

    Assessing taxonomic metagenome profilers with OPAL

    Meyer F, Bremges A, Belmann P, Janssen S, McHardy AC, Koslicki D. Assessing taxonomic metagenome profilers with OPAL. Genome Biology. 2019;20(1):51.
    The explosive growth in taxonomic metagenome profiling methods over the past years has created a need for systematic comparisons using relevant performance criteria. The Open-community Profiling Assessment tooL (OPAL) implements commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment of Metagenome Interpretation (CAMI), together with convenient visualizations. In addition, we perform in-depth performance comparisons with seven profilers on datasets of CAMI and the Human Microbiome Project. OPAL is freely available at https://github.com/CAMI-challenge/OPAL
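
    As a flavor of the kind of metric such profiler assessments report, the following is a minimal sketch of an L1 abundance error between a predicted and a gold-standard profile at one taxonomic rank; taxa and numbers are made up, and the actual metric definitions live in the OPAL repository.

```python
# L1 error between two relative-abundance profiles at a fixed taxonomic
# rank; a minimal sketch, not OPAL's implementation.
def l1_error(predicted, gold):
    """predicted, gold: dicts mapping taxon -> relative abundance."""
    taxa = set(predicted) | set(gold)
    return sum(abs(predicted.get(t, 0.0) - gold.get(t, 0.0)) for t in taxa)

gold = {"Bacteroides": 0.6, "Escherichia": 0.4}
pred = {"Bacteroides": 0.5, "Escherichia": 0.3, "Clostridium": 0.2}
print(l1_error(pred, gold))  # 0.4
```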

    2022-2 A Task-Based Theory of Occupations with Multidimensional Heterogeneity

    I develop an assignment model of occupations with multidimensional heterogeneity in production tasks and worker skills. Tasks are distributed continuously in the skill space, whereas workers have a discrete distribution with a finite number of types. Occupations arise endogenously as bundles of tasks optimally assigned to a type of worker. The model allows us to study how occupations respond to changes in the economic environment, making it useful for analyzing the implications of automation, skill-biased technical change, offshoring, and worker training. Using the model, I characterize how wages, the marginal product of workers, the substitutability between worker types, and the labor share depend on the assignment of tasks to workers. I introduce automation as the choice of the optimal size and location of a mass of identical robots in the task space. Automation displaces workers by replacing them in the performance of tasks, generating a cascading effect on other workers as the boundaries of occupations are redrawn
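
    Schematically, the assignment structure can be written as a planner's problem; the notation below is illustrative, not the paper's.

```latex
% Illustrative planner's problem: tasks t lie in a skill space T with
% density g(t); workers come in K discrete types with skills z_k and
% supplies L_k; y(t,k) is the measure of type-k labor assigned to task t,
% and f(t, z_k) its productivity on that task.
\begin{align*}
  \max_{y \ge 0} \quad & \sum_{k=1}^{K} \int_{T} f(t, z_k)\, y(t, k)\, dt \\
  \text{s.t.} \quad    & \sum_{k=1}^{K} y(t, k) = g(t) \quad \forall t \in T
                         && \text{(every task is performed)} \\
                       & \int_{T} y(t, k)\, dt \le L_k \quad \forall k
                         && \text{(labor supply of each type)}
\end{align*}
% Occupations emerge as the regions of T assigned to each type k; automation
% adds a "robot type" whose mass and location in T are chosen optimally.
```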

    Designing the Liver Allocation Hierarchy: Incorporating Equity and Uncertainty

    Liver transplantation is the only available therapy for any acute or chronic condition resulting in irreversible liver dysfunction. The liver allocation system in the U.S. is administered by the United Network for Organ Sharing (UNOS), a scientific and educational nonprofit organization. The main components of the organ procurement and transplant network are Organ Procurement Organizations (OPOs), which are collections of transplant centers responsible for maintaining local waiting lists, harvesting donated organs, and carrying out transplants. Currently in the U.S., OPOs are grouped into 11 regions to facilitate organ allocation, and a three-tier mechanism is utilized that aims to reduce organ preservation time and transport distance to maintain organ quality, while giving sicker patients higher priority. Livers are scarce and perishable resources that rapidly lose viability, which makes their transport distance a crucial factor in transplant outcomes. When a liver becomes available, it is matched with patients on the waiting list according to a complex mechanism that gives priority to patients within the harvesting OPO and region. Transplants at the regional level have accounted for more than 50% of all transplants since 2000. This dissertation focuses on the design of regions for the liver allocation hierarchy, and includes optimization models that incorporate geographic equity as well as uncertainty throughout the analysis. We employ multi-objective optimization algorithms that involve solving parametric integer programs to balance two possibly conflicting objectives in the system: maximizing efficiency, as measured by the number of viability-adjusted transplants, and maximizing geographic equity, as measured by the minimum rate of organ flow into individual OPOs from outside of their own local area. Our results show that efficiency improvements of up to 6% or equity gains of about 70% can be achieved, compared to the current performance of the system, by redesigning the regional configuration of the national liver allocation hierarchy. We also introduce a stochastic programming framework to capture the uncertainty of the system by considering scenarios that correspond to different snapshots of the national waiting list, and we maximize the expected benefit from liver transplants under this stochastic view of the system. We explore many algorithmic and computational strategies, including sampling methods, column generation strategies, and branching and integer-solution generation procedures, to aid the solution process of the resulting large-scale integer programs. We also explore an OPO-based extension to our two-stage stochastic programming framework that lends itself to more extensive computational testing. The regional configurations obtained using these models are estimated to increase the expected lifetime gained per transplant operation by up to 7% when compared to the current system. This dissertation also addresses the general question of designing efficient algorithms that combine column and cut generation to solve large-scale two-stage stochastic linear programs. We introduce a flexible method to combine column generation and the L-shaped method for two-stage stochastic linear programming. We explore the performance of various algorithm designs that employ stabilization subroutines for strengthening both column and cut generation to effectively avoid degeneracy. We study two-stage stochastic versions of the cutting stock and multi-commodity network flow problems to analyze the performance of the algorithms in this context.
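
    As a toy illustration of the L-shaped (Benders) mechanics on a two-stage stochastic LP, far simpler than the dissertation's large-scale integer programs with column generation, the following solves a small newsvendor-style instance using scipy's HiGHS duals; all data are made up.

```python
# L-shaped method for a tiny two-stage stochastic LP: choose an order
# quantity x, then sell up to min(x, demand) in each scenario.
import numpy as np
from scipy.optimize import linprog

demands, probs = np.array([20.0, 50.0, 90.0]), np.array([1/3, 1/3, 1/3])
c_order, price = 1.0, 5.0            # first-stage cost, second-stage revenue

def recourse(x, d):
    """Q_s(x) = min -price*y  s.t.  y <= x, y <= d, y >= 0; value and duals."""
    res = linprog(c=[-price], A_ub=[[1.0], [1.0]], b_ub=[x, d],
                  bounds=[(0, None)], method="highs")
    return res.fun, res.ineqlin.marginals   # duals of y <= x and y <= d

cuts = []                            # each optimality cut: theta >= a*x + g
for it in range(20):
    # Master: min x + theta subject to the cuts collected so far.
    A = [[a, -1.0] for a, _ in cuts] or None
    b = [-g for _, g in cuts] or None
    m = linprog(c=[c_order, 1.0], A_ub=A, b_ub=b,
                bounds=[(0, 100), (-1e4, None)], method="highs")
    x, theta = m.x
    # Scenario subproblems: expected recourse value and an aggregated cut.
    vals, duals = zip(*(recourse(x, d) for d in demands))
    q = float(np.dot(probs, vals))
    if theta >= q - 1e-6:            # no violated cut: (x, theta) is optimal
        break
    a = float(np.dot(probs, [lam[0] for lam in duals]))   # coefficient on x
    g = float(np.dot(probs, [lam[1] * d for lam, d in zip(duals, demands)]))
    cuts.append((a, g))

print(f"order x = {x:.1f}, total expected cost = {c_order * x + q:.2f}")
```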