Studying bias in visual features through the lens of optimal transport
Computer vision systems are employed in a variety of high-impact applications. However, making them trustworthy requires methods for the detection of potential biases in their training data, before models learn to harm already disadvantaged groups in downstream applications. Image data are typically represented via extracted features, which can be hand-crafted or pre-trained neural network embeddings. In this work, we introduce a framework for bias discovery given such features that is based on optimal transport theory; it uses the (quadratic) Wasserstein distance to quantify disparity between the feature distributions of two demographic groups (e.g., women vs men). In this context, we show that the Kantorovich potentials of the images, which are a byproduct of computing the Wasserstein distance and act as “transportation prices”, can serve as bias scores by indicating which images might exhibit distinct biased characteristics. We thus introduce a visual dataset exploration pipeline that helps auditors identify common characteristics across high- or low-scored images as potential sources of bias. We conduct a case study to identify prospective gender biases and demonstrate theoretically derived properties with experiments on the CelebA and Biased MNIST datasets.
Preference relations based unsupervised rank aggregation for metasearch
Rank aggregation mechanisms have been used in solving problems from various domains such as bioinformatics, natural language processing, information retrieval, etc. Metasearch is one such application where a user gives a query to the metasearch engine, and the metasearch engine forwards the query to multiple individual search engines. Results or rankings returned by these individual search engines are combined using rank aggregation algorithms to produce the final result to be displayed to the user. We identify a few aspects that should be kept in mind when designing any rank aggregation algorithm for metasearch. For example, equal importance is generally given to the input rankings while performing the aggregation. However, depending on the indexed set of web pages, the features considered for ranking, the ranking functions used, etc. by the individual search engines, the individual rankings may be of different qualities. So, the aggregation algorithm should give more weight to the better rankings while giving less weight to others. Also, since the aggregation is performed while the user is waiting for a response, the operations performed in the algorithm need to be lightweight. Moreover, getting supervised data for the rank aggregation problem is often difficult. In this paper, we present an unsupervised rank aggregation algorithm that is suitable for metasearch and addresses the aspects mentioned above.
We also perform a detailed experimental evaluation of the proposed algorithm on four different benchmark datasets having ground truth information. Apart from the unsupervised Kendall-Tau distance measure, several supervised evaluation measures are used for performance comparison. Experimental results demonstrate the efficacy of the proposed algorithm over baseline methods in terms of supervised evaluation metrics. Through these experiments we also show that the Kendall-Tau distance metric may not be suitable for evaluating rank aggregation algorithms for metasearch.
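As a rough illustration of the quantities this abstract mentions, the sketch below computes the Kendall-Tau distance between rankings and aggregates them with a weighted Borda count. Borda count is a generic unsupervised baseline chosen here for illustration; it is not the algorithm proposed in the paper, and the function names, weights and toy rankings are all assumptions.

```python
from itertools import combinations


def kendall_tau_distance(rank_a, rank_b):
    """Count item pairs that the two rankings order differently.

    Each ranking lists the same items, best first.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def borda_aggregate(rankings, weights=None):
    """Weighted Borda count (illustrative baseline, not the paper's method).

    A larger weight lets a ranking that is believed to be better
    contribute more to the consensus.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        n = len(ranking)
        for i, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + weight * (n - i)
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from three search engines.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
consensus = borda_aggregate(rankings)
distances = [kendall_tau_distance(consensus, r) for r in rankings]
```

Both functions use only dictionary lookups and a single sort, which is in the spirit of the lightweight operations the abstract asks for, although the pairwise distance is quadratic in the list length.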
Ranking Median Regression: Learning to Order through Local Consensus
This article is devoted to the problem of predicting the value taken by a
random permutation, describing the preferences of an individual over a
set of numbered items, based on the observation of
an input/explanatory r.v. (e.g. characteristics of the individual), when
error is measured by the Kendall tau distance. In the probabilistic
formulation of the 'Learning to Order' problem we propose, which extends the
framework for statistical Kemeny ranking aggregation developed in
\citet{CKS17}, this boils down to recovering conditional Kemeny medians of
the permutation given the input from i.i.d. training examples. For this
reason, this statistical learning problem is
referred to as ranking median regression here. Our contribution is
twofold. We first propose a probabilistic theory of ranking median regression:
the set of optimal elements is characterized, the performance of empirical risk
minimizers is investigated in this context and situations where fast learning
rates can be achieved are also exhibited. Next we introduce the concept of
local consensus/median, in order to derive efficient methods for ranking median
regression. The major advantage of this local learning approach lies in its
close connection with the widely studied Kemeny aggregation problem. From an
algorithmic perspective, this makes it possible to build predictive rules for
ranking median regression by implementing efficient techniques for (approximate)
Kemeny median computations at a local level in a tractable manner. In particular,
versions of k-nearest neighbor and tree-based methods, tailored to ranking
median regression, are investigated. Accuracy of piecewise constant ranking
median regression rules is studied under a specific smoothness assumption for
the permutation's conditional distribution given the input.
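The local consensus idea can be sketched as a nearest-neighbor rule that predicts the Kemeny median of the rankings observed near the query point. The code below is a minimal illustration under strong assumptions: inputs are scalars, the Kemeny median is found by exhaustive search (feasible only for a handful of items), and all names and data are hypothetical rather than taken from the article.

```python
from itertools import combinations, permutations


def kendall_tau(p, q):
    """Number of item pairs ordered differently by rankings p and q."""
    pos_p = {v: i for i, v in enumerate(p)}
    pos_q = {v: i for i, v in enumerate(q)}
    return sum(
        1
        for a, b in combinations(p, 2)
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
    )


def kemeny_median(rankings):
    """Ranking minimizing the summed Kendall tau distance to the inputs.

    Exhaustive search over all permutations, so only usable for very
    small item sets.
    """
    items = sorted(rankings[0])
    return min(
        permutations(items),
        key=lambda cand: sum(kendall_tau(cand, r) for r in rankings),
    )


def knn_ranking_median(x_query, xs, rankings, k):
    """Predict the Kemeny median of the k training rankings nearest to x_query."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_query))[:k]
    return kemeny_median([rankings[i] for i in nearest])


# Toy training set: scalar inputs paired with rankings of items 1, 2, 3.
xs = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
ys = [(1, 2, 3), (1, 2, 3), (2, 1, 3), (3, 2, 1), (3, 2, 1), (3, 1, 2)]

pred_low = knn_ranking_median(0.05, xs, ys, k=3)
pred_high = knn_ranking_median(1.0, xs, ys, k=3)
```

The resulting rule is piecewise constant in the input, since the prediction only changes when the set of nearest neighbors changes.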
Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery
Recent advances in biomedical data collection allow the collection of massive datasets measuring thousands of features in thousands to millions of individual cells. This data has the potential to advance our understanding of biological mechanisms at a previously impossible resolution. However, there are few methods to understand data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, there is still much work to be done in making them useful for discovery in data whose supervision is more difficult to represent. The flexibility and expressiveness of neural networks are sometimes a hindrance in these less supervised domains, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is more common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models to understand this data. Encoding geometric priors into neural network and graph models allows us to characterize the models’ solutions as they relate to the fields of graph signal processing and optimal transport. These links allow us to understand and interpret this datatype. We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate. For this we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data.
We apply these methods to molecular graphs, images, single-cell sequencing, and health record data.
On the Evaluation of Generative Models in Distributed Learning Tasks
The evaluation of deep generative models including generative adversarial
networks (GANs) and diffusion models has been extensively studied in the
literature. While the existing evaluation methods mainly target a centralized
learning problem with training data stored by a single client, many
applications of generative models concern distributed learning settings, e.g.
the federated learning scenario, where training data are collected by and
distributed among several clients. In this paper, we study the evaluation of
generative models in distributed learning tasks with heterogeneous data
distributions. First, we focus on the Fréchet inception distance (FID) and
consider the following FID-based aggregate scores over the clients: 1) FID-avg
as the mean of clients' individual FID scores, 2) FID-all as the FID distance
of the trained model to the collective dataset containing all clients' data. We
prove that the model rankings according to the FID-all and FID-avg scores could
be inconsistent, which can lead to different optimal generative models
according to the two aggregate scores. Next, we consider the kernel inception
distance (KID) and similarly define the KID-avg and KID-all aggregations.
Unlike the FID case, we prove that KID-all and KID-avg result in the same
rankings of generative models. We perform several numerical experiments on
standard image datasets and training schemes to support our theoretical
findings on the evaluation of generative models in distributed learning
problems.
Comment: 17 pages, 10 figures
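To make the two FID aggregations concrete, here is a small constructed example, not taken from the paper. It assumes one-dimensional Gaussian "features", for which the Fréchet distance between N(m1, s1^2) and N(m2, s2^2) reduces to (m1 - m2)^2 + (s1 - s2)^2, and two equally sized clients; the client and model parameters are hypothetical.

```python
import math


def fid_1d(mean_a, std_a, mean_b, std_b):
    """Frechet distance between two 1-D Gaussians given by (mean, std)."""
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2


def pooled_moments(clients):
    """Mean and std of the equal-weight mixture of the clients' data."""
    n = len(clients)
    mean = sum(m for m, _ in clients) / n
    second_moment = sum(s ** 2 + m ** 2 for m, s in clients) / n
    return mean, math.sqrt(second_moment - mean ** 2)


def fid_avg(model, clients):
    """Mean of the model's FID to each client's own data."""
    return sum(fid_1d(*model, *c) for c in clients) / len(clients)


def fid_all(model, clients):
    """FID of the model to the pooled data of all clients."""
    return fid_1d(*model, *pooled_moments(clients))


# Two heterogeneous clients, each described by (mean, std) of its features.
clients = [(-1.0, 0.1), (1.0, 0.1)]

model_a = pooled_moments(clients)  # matches the pooled mean and spread
model_b = (0.0, 0.1)               # matches each client's narrow spread

scores = {
    "A": (fid_all(model_a, clients), fid_avg(model_a, clients)),
    "B": (fid_all(model_b, clients), fid_avg(model_b, clients)),
}
```

In this toy setting model A has the lower FID-all while model B has the lower FID-avg, so the two aggregate scores rank the models in opposite orders, which is the kind of inconsistency the abstract describes.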
Assessing taxonomic metagenome profilers with OPAL
Meyer F, Bremges A, Belmann P, Janssen S, McHardy AC, Koslicki D. Assessing taxonomic metagenome profilers with OPAL. Genome Biology. 2019;20(1): 51.
The explosive growth in taxonomic metagenome profiling methods over the past years has created a need for systematic comparisons using relevant performance criteria. The Open-community Profiling Assessment tooL (OPAL) implements commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment of Metagenome Interpretation (CAMI), together with convenient visualizations. In addition, we perform in-depth performance comparisons with seven profilers on datasets of CAMI and the Human Microbiome Project. OPAL is freely available at https://github.com/CAMI-challenge/OPAL
2022-2 A Task-Based Theory of Occupations with Multidimensional Heterogeneity
I develop an assignment model of occupations with multidimensional heterogeneity in production tasks and worker skills. Tasks are distributed continuously in the skill space, whereas workers have a discrete distribution with a finite number of types. Occupations arise endogenously as bundles of tasks optimally assigned to a type of worker. The model allows us to study how occupations respond to changes in the economic environment, making it useful for analyzing the implications of automation, skill-biased technical change, offshoring, and worker training. Using the model, I characterize how wages, the marginal product of workers, the substitutability between worker types, and the labor share depend on the assignment of tasks to workers. I introduce automation as the choice of the optimal size and location of a mass of identical robots in the task space. Automation displaces workers by replacing them in the performance of tasks, generating a cascading effect on other workers as the boundaries of occupations are redrawn.
Designing the Liver Allocation Hierarchy: Incorporating Equity and Uncertainty
Liver transplantation is the only available therapy for any acute or chronic condition resulting in irreversible liver dysfunction. The liver allocation system in the U.S. is administered by the United Network for Organ Sharing (UNOS), a scientific and educational nonprofit organization. The main components of the organ procurement and transplant network are Organ Procurement Organizations (OPOs), which are collections of transplant centers responsible for maintaining local waiting lists, harvesting donated organs and carrying out transplants. Currently in the U.S., OPOs are grouped into 11 regions to facilitate organ allocation, and a three-tier mechanism is utilized that aims to reduce organ preservation time and transport distance to maintain organ quality, while giving sicker patients higher priority. Livers are scarce and perishable resources that rapidly lose viability, which makes their transport distance a crucial factor in transplant outcomes. When a liver becomes available, it is matched with patients on the waiting list according to a complex mechanism that gives priority to patients within the harvesting OPO and region. Transplants at the regional level accounted for more than 50% of all transplants since 2000. This dissertation focuses on the design of regions for the liver allocation hierarchy, and includes optimization models that incorporate geographic equity as well as uncertainty throughout the analysis. We employ multi-objective optimization algorithms that involve solving parametric integer programs to balance two possibly conflicting objectives in the system: maximizing efficiency, as measured by the number of viability-adjusted transplants, and maximizing geographic equity, as measured by the minimum rate of organ flow into individual OPOs from outside of their own local area.
Our results show that efficiency improvements of up to 6% or equity gains of about 70% can be achieved when compared to the current performance of the system by redesigning the regional configuration for the national liver allocation hierarchy. We also introduce a stochastic programming framework to capture the uncertainty of the system by considering scenarios that correspond to different snapshots of the national waiting list and maximize the expected benefit from liver transplants under this stochastic view of the system. We explore many algorithmic and computational strategies, including sampling methods, column generation strategies, branching and integer-solution generation procedures, to aid the solution process of the resulting large-scale integer programs. We also explore an OPO-based extension to our two-stage stochastic programming framework that lends itself to more extensive computational testing. The regional configurations obtained using these models are estimated to increase the expected lifetime gained per transplant operation by up to 7% when compared to the current system. This dissertation also focuses on the general question of designing efficient algorithms that combine column and cut generation to solve large-scale two-stage stochastic linear programs. We introduce a flexible method to combine column generation and the L-shaped method for two-stage stochastic linear programming. We explore the performance of various algorithm designs that employ stabilization subroutines for strengthening both column and cut generation to effectively avoid degeneracy. We study two-stage stochastic versions of the cutting stock and multi-commodity network flow problems to analyze the performance of the algorithms in this context.