Learning transport cost from subset correspondence
Learning to align multiple datasets is an important problem with many
applications, and it is especially useful when we need to integrate multiple
experiments or correct for confounding. Optimal transport (OT) is a principled
approach to align datasets, but a key challenge in applying OT is that we need
to specify a transport cost function that accurately captures how the two
datasets are related. Reliable cost functions are typically not available and
practitioners often resort to hand-crafted or Euclidean costs even when these
may not be appropriate. In this work, we investigate how to learn the cost
function using a small amount of side information which is often available. The
side information we consider captures subset correspondence, i.e., certain
subsets of points in the two datasets are known to be related. For example, we
may have some images labeled as cars in both datasets; or we may have a common
annotated cell type in single-cell data from two batches. We develop an
end-to-end optimizer (OT-SI) that differentiates through the Sinkhorn algorithm
and effectively learns a suitable cost function from side information. In
systematic experiments on images, marriage matching, and single-cell RNA-seq,
our method substantially outperforms state-of-the-art benchmarks.
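For orientation, the Sinkhorn iteration mentioned above can be sketched as follows. This is a minimal NumPy illustration of standard entropy-regularized OT, not the authors' OT-SI code; the function name `sinkhorn`, the regularization strength `eps`, the iteration count, and the toy data are all assumptions made for the example.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=500):
    """Entropy-regularized OT plan between histograms a and b (illustrative sketch)."""
    K = np.exp(-cost / eps)           # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)               # rescale rows toward marginal a
        v = b / (K.T @ u)             # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Toy data (assumed): two small point clouds with a squared Euclidean cost.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(6, 2))
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()              # normalize so exp(-cost / eps) stays well scaled
a = np.full(5, 1 / 5)
b = np.full(6, 1 / 6)
plan = sinkhorn(cost, a, b)
```

Every step of the loop is a differentiable matrix operation, which is what makes it possible in principle to backpropagate from the plan to a parameterized cost with an automatic-differentiation framework; that learning step is omitted here.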
Learning Cost Functions for Optimal Transport
Inverse optimal transport (OT) refers to the problem of learning the cost
function for OT from an observed transport plan or samples from it. In this paper, we
derive an unconstrained convex optimization formulation of the inverse OT
problem, which can be further augmented by any customizable regularization. We
provide a comprehensive characterization of the properties of inverse OT,
including uniqueness of solutions. We also develop two numerical algorithms:
one is a fast matrix scaling method based on the Sinkhorn-Knopp algorithm for
discrete OT, and the other is a learning-based algorithm that parameterizes
the cost function as a deep neural network for continuous OT. The novel
framework proposed in this work avoids repeatedly solving a forward OT problem
in each iteration, which has been a computational bottleneck for the bi-level
optimization in existing inverse OT approaches. Numerical results demonstrate
promising efficiency and accuracy advantages of the proposed algorithms over
existing state-of-the-art methods.
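To make the inverse problem concrete, the sketch below assumes an entropy-regularized discrete setting (an assumption for illustration, not the paper's formulation or its algorithms). Under that assumption, a plan produced by matrix scaling has the form P = diag(u) K diag(v) with K = exp(-C / eps), so -eps * log(P) is a cost that reproduces the plan, and it differs from the original cost only by additive row and column terms. The names `sinkhorn`, `eps`, and the toy cost are hypothetical.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=1.0, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn-Knopp matrix scaling (sketch)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

eps = 1.0
rng = np.random.default_rng(1)
true_cost = rng.uniform(size=(4, 4))      # assumed ground-truth cost
a = np.full(4, 0.25)
b = np.full(4, 0.25)
observed_plan = sinkhorn(true_cost, a, b, eps=eps)

# One cost consistent with the observed plan under the entropic model.
recovered_cost = -eps * np.log(observed_plan)
replayed_plan = sinkhorn(recovered_cost, a, b, eps=eps)

# The recovered and true costs differ by f_i + g_j; double centering removes it.
diff = recovered_cost - true_cost
centered = (diff - diff.mean(axis=1, keepdims=True)
            - diff.mean(axis=0, keepdims=True) + diff.mean())
```

Here `replayed_plan` matches `observed_plan` and `centered` vanishes up to floating-point error, which illustrates why uniqueness and regularization are central questions for inverse OT: the plan alone determines the cost only up to such row and column shifts in this setting.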
Diffusion Earth Mover's Distance and Distribution Embeddings
We propose a new fast method of measuring distances between large numbers of
related high-dimensional datasets, called the Diffusion Earth Mover's Distance
(EMD). We model the datasets as distributions supported on a common data graph
that is derived from the affinity matrix computed on the combined data. In
cases where the graph is a discretization of an underlying closed Riemannian
manifold, we prove that Diffusion EMD is topologically equivalent to the
standard EMD with a geodesic ground distance. Diffusion EMD can be computed in
$\tilde{O}(n)$ time and is more accurate than similarly fast algorithms such as
tree-based EMDs. We also show Diffusion EMD is fully differentiable, making it
amenable to future uses in gradient-descent frameworks such as deep neural
networks. Finally, we demonstrate an application of Diffusion EMD to
single-cell data collected from 210 COVID-19 patient samples at Yale New Haven
Hospital. Here, Diffusion EMD can derive distances between patients on the
manifold of cells at least two orders of magnitude faster than equally accurate
methods. This distance matrix between patients can be embedded into a
higher-level patient manifold, which uncovers structure and heterogeneity in patients.
More generally, Diffusion EMD is applicable to all datasets that are massively
collected in parallel in many medical and biological systems.
Comment: Presented at ICML 202
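The general idea of comparing datasets as distributions diffused over a shared affinity graph can be sketched as below. This is a simplified dense illustration only, not the paper's definition of Diffusion EMD and not its fast algorithm: the Gaussian kernel, the dyadic scales, the per-scale weights, and the name `multiscale_diffusion_distance` are all assumptions made for the example.

```python
import numpy as np

def multiscale_diffusion_distance(points, labels, n_scales=4, sigma=1.0):
    """L1 distances between diffused dataset distributions, summed over scales."""
    # Gaussian affinity matrix computed on the combined data.
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-sq / (2 * sigma ** 2))
    # Column-stochastic diffusion operator, so diffusion preserves total mass.
    op = affinity / affinity.sum(axis=0, keepdims=True)
    n_sets = labels.max() + 1
    # One uniform distribution per dataset, stored as columns.
    mu = np.stack([(labels == k) / (labels == k).sum() for k in range(n_sets)], axis=1)
    dist = np.zeros((n_sets, n_sets))
    for k in range(n_scales):
        diffused = op @ mu                              # 2**k diffusion steps
        weight = 2.0 ** (-(n_scales - 1 - k) / 2)       # assumed weighting of scales
        dist += weight * np.abs(diffused[:, :, None] - diffused[:, None, :]).sum(axis=0)
        op = op @ op                                    # square to reach the next dyadic scale
    return dist

# Toy data (assumed): datasets 0 and 1 overlap, dataset 2 lies far away.
rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(loc=0.0, size=(20, 2)),
    rng.normal(loc=0.5, size=(20, 2)),
    rng.normal(loc=5.0, size=(20, 2)),
])
labels = np.repeat(np.arange(3), 20)
D = multiscale_diffusion_distance(points, labels)
```

The resulting matrix `D` is symmetric with a zero diagonal, and the overlapping datasets come out closer to each other than to the distant one. The dense matrix products here cost far more than a fast method would; they are used only to keep the sketch short.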