144 research outputs found

    LIPIcs, Volume 251, ITCS 2023, Complete Volume


    Doubly Robust Proximal Causal Learning for Continuous Treatments

    Proximal causal learning is a promising framework for identifying causal effects in the presence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments is the delta function in the original DR estimator, which makes causal effect estimation infeasible and introduces a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that handles continuous treatments well. Owing to its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve for the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean squared error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications. Comment: Preprint, under review.
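
    As a rough sketch of the kernel-smoothing idea only (a plain Nadaraya-Watson style average, not the authors' doubly robust proximal estimator; the bandwidth and toy data below are invented):

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h, used in place of the delta function delta(A - a)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kernel_smoothed_mean(y, a_obs, a_target, h=0.5):
    """Kernel-weighted average of outcomes y around a continuous treatment level a_target.

    Shown only to illustrate how a smooth kernel replaces the indicator/delta
    used for binary treatments; this is not the doubly robust proximal estimator.
    """
    w = gaussian_kernel(a_obs - a_target, h)
    return np.sum(w * y) / np.sum(w)

# Toy data: outcome depends smoothly on a continuous treatment.
rng = np.random.default_rng(0)
a_obs = rng.uniform(0, 2, size=500)
y = np.sin(a_obs) + 0.1 * rng.standard_normal(500)
print(kernel_smoothed_mean(y, a_obs, a_target=1.0))  # close to sin(1.0) ~= 0.84
```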

    Generalising weighted model counting

    Given a formula in propositional or (finite-domain) first-order logic and some non-negative weights, weighted model counting (WMC) is a function problem that asks to compute the sum of the weights of the models of the formula. Originally used as a flexible way of performing probabilistic inference on graphical models, WMC has found many applications across artificial intelligence (AI), machine learning, and other domains. Areas of AI that rely on WMC include explainable AI, neural-symbolic AI, probabilistic programming, and statistical relational AI. WMC also has applications in bioinformatics, data mining, natural language processing, prognostics, and robotics. In this work, we are interested in revisiting the foundations of WMC and considering generalisations of some of the key definitions in the interest of conceptual clarity and practical efficiency. We begin by developing a measure-theoretic perspective on WMC, which suggests a new and more general way of defining the weights of an instance. This new representation can be as succinct as standard WMC but can also expand as needed to represent less-structured probability distributions. We demonstrate the performance benefits of the new format by developing a novel WMC encoding for Bayesian networks. We then show how existing WMC encodings for Bayesian networks can be transformed into this more general format and what conditions ensure that the transformation is correct (i.e., preserves the answer). Combining the strengths of the more flexible representation with the tricks used in existing encodings yields further efficiency improvements in Bayesian network probabilistic inference. Next, we turn our attention to the first-order setting. Here, we argue that the capabilities of practical model counting algorithms are severely limited by their inability to perform arbitrary recursive computations. To enable arbitrary recursion, we relax the restrictions that typically accompany domain recursion and generalise circuits (used to express a solution to a model counting problem) to graphs that are allowed to have cycles. These improvements enable us to find efficient solutions to counting fundamental structures such as injections and bijections that were previously unsolvable by any available algorithm. The second strand of this work is concerned with synthetic data generation. Testing algorithms across a wide range of problem instances is crucial to ensure the validity of any claim about one algorithm’s superiority over another. However, benchmarks are often limited and fail to reveal differences among the algorithms. First, we show how random instances of probabilistic logic programs (that typically use WMC algorithms for inference) can be generated using constraint programming. We also introduce a new constraint to control the independence structure of the underlying probability distribution and provide a combinatorial argument for the correctness of the constraint model. This model allows us to, for the first time, experimentally investigate inference algorithms on more than just a handful of instances. Second, we introduce a random model for WMC instances with a parameter that influences primal treewidth—the parameter most commonly used to characterise the difficulty of an instance. We show that the easy-hard-easy pattern with respect to clause density is different for algorithms based on dynamic programming and algebraic decision diagrams than for all other solvers. 
We also demonstrate that all WMC algorithms scale exponentially with respect to primal treewidth, although at differing rates.
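
    For intuition, a brute-force sketch of propositional WMC on a tiny CNF instance; the formula, the uniform literal weights, and the helper names are invented for illustration, and practical solvers of course avoid full enumeration:

```python
from itertools import product

# CNF as a list of clauses; a literal is (variable, polarity).
# Example formula: (x1 or x2) and (not x1 or x3)
cnf = [[(1, True), (2, True)], [(1, False), (3, True)]]
variables = [1, 2, 3]

# Weight of each literal; with all weights 0.5 the WMC equals the probability
# that a uniformly random assignment satisfies the formula.
w = {(v, True): 0.5 for v in variables}
w.update({(v, False): 0.5 for v in variables})

def wmc(cnf, variables, w):
    """Sum, over all satisfying assignments, of the product of literal weights."""
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(any(assignment[v] == pol for v, pol in clause) for clause in cnf):
            weight = 1.0
            for v in variables:
                weight *= w[(v, assignment[v])]
            total += weight
    return total

print(wmc(cnf, variables, w))  # 4 models x 0.125 = 0.5
```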

    Bounded-depth Frege complexity of Tseitin formulas for all graphs

    We prove that there is a constant K such that Tseitin formulas for a connected graph G require proofs of size 2^{tw(G)^{Ω(1/d)}} in depth-d Frege systems for d ≤ K·log n / log log n, where tw(G) is the treewidth of G. This extends Håstad's recent lower bound from grid graphs to all graphs. Furthermore, we prove tightness of our bound up to a multiplicative constant in the top exponent. Namely, we show that if a Tseitin formula for a graph G has size s, then for all large enough d, it has a depth-d Frege proof of size 2^{tw(G)^{O(1/d)}}·poly(s). Through this result we settle the question posed by M. Alekhnovich and A. Razborov of showing that the class of Tseitin formulas is quasi-automatizable for resolution.
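
    To make the formulas concrete, here is a small sketch that builds the clauses of a Tseitin formula from a graph and a charge function (one parity constraint per vertex over its incident edge variables); the triangle example and helper names are ours, not from the paper:

```python
from itertools import product

def tseitin_cnf(edges, charge):
    """Clauses of the Tseitin formula: for each vertex v, the XOR of its
    incident edge variables must equal charge[v].  Edges serve as variables;
    a literal is (edge, polarity)."""
    incident = {}
    for e in edges:
        u, v = e
        incident.setdefault(u, []).append(e)
        incident.setdefault(v, []).append(e)
    clauses = []
    for v, es in incident.items():
        # Standard CNF encoding of a parity constraint: forbid every assignment
        # of the incident edges whose parity differs from charge[v].
        for bits in product([0, 1], repeat=len(es)):
            if sum(bits) % 2 != charge[v]:
                clauses.append([(e, bool(1 - b)) for e, b in zip(es, bits)])
    return clauses

# Triangle with total charge 1 (odd), hence unsatisfiable on a connected graph.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
charge = {"a": 1, "b": 0, "c": 0}
print(len(tseitin_cnf(edges, charge)))  # 6 clauses: 2 per degree-2 vertex
```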

    Recent Advances in Optimal Transport for Machine Learning

    Optimal Transport has recently been proposed as a probabilistic framework in Machine Learning for comparing and manipulating probability distributions. This is rooted in its rich history and theory, and has offered new solutions to different problems in machine learning, such as generative modeling and transfer learning. In this survey we explore contributions of Optimal Transport for Machine Learning over the period 2012-2022, focusing on four sub-fields of Machine Learning: supervised, unsupervised, transfer and reinforcement learning. We further highlight recent developments in computational Optimal Transport and its interplay with Machine Learning practice. Comment: 20 pages, 5 figures, under review.
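
    As a minimal illustration of computational Optimal Transport, a plain Sinkhorn iteration for entropy-regularized OT between two small histograms; the toy data, regularization strength, and iteration count are arbitrary choices of ours:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT between histograms a and b with cost matrix C.

    Returns the transport plan via the standard Sinkhorn fixed-point iteration,
    shown here only to indicate the kind of solver used in computational OT.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Two small histograms on the real line and a squared-distance cost.
x = np.linspace(0, 1, 5)
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
b = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
C = (x[:, None] - x[None, :]) ** 2
P = sinkhorn(a, b, C)
print(P.sum(axis=1))   # approximately recovers a (row marginals)
print((P * C).sum())   # regularized transport cost
```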

    Advancing and Leveraging Tractable Likelihood Models

    The past decade has seen a remarkable improvement in a variety of machine learning applications thanks to numerous advances in deep neural networks (DNN). These models are now the de facto standard in fields ranging from image/speech recognition to driverless cars and have begun to permeate aspects of modern science and everyday life. The deep learning revolution has also resulted in highly effective generative models such as score matching models, diffusion models, VAEs, GANs, and tractable likelihood models. These models are best known for their ability to create novel samples of impressive quality but are usually limited to highly structured data modalities. Expanding the capabilities and applications of likelihood models beyond conventional data formats and generative applications can increase functionality, interpretability, and intuition compared to conventional methods. This dissertation addresses shortcomings in likelihood models over less structured data and explores methods to exploit a learned density as part of a larger application. We begin by advancing the performance of likelihood models outside the standard, ordered data regime by developing methods that are applicable to sets, e.g., point clouds. Many data sources contain instances that are a collection of unordered points, such as points on the surface of scans from human organs, sets of images from a web page, or LiDAR observations commonly used in driverless cars or (hyper-spectral) aerial surveys. We then explore several applications of density models. First, we consider a generative process over neural networks themselves and show that training over ensembles of these sampled models can lead to improved robustness to adversarial attacks. Next, we demonstrate how to use the transformative portion of a normalizing flow as a feature extractor in conjunction with a downstream task to estimate expectations over model performance in local and global regions. Finally, we propose a learnable, continuous parameterization of mixture models directly on the input space to improve model interpretability while simultaneously allowing for arbitrary marginalization or conditioning without the need to train new models or develop complex masking mechanisms.
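
    To illustrate why a mixture model parameterized directly on the input space supports arbitrary marginalization without masking (a generic property of diagonal-covariance Gaussian mixtures rather than the dissertation's specific construction), dropping coordinates from every component gives the exact marginal density:

```python
import numpy as np
from scipy.stats import norm

# A 3-component Gaussian mixture over 4 input dimensions (diagonal covariances).
weights = np.array([0.5, 0.3, 0.2])
means = np.random.default_rng(0).normal(size=(3, 4))
scales = np.full((3, 4), 0.8)

def marginal_log_density(x, dims):
    """Exact log-density of the marginal over the coordinates listed in `dims`.

    For diagonal-covariance mixtures, marginalizing the remaining coordinates
    amounts to dropping them from every component -- no masking or retraining.
    """
    comp = np.array([
        np.log(w) + norm.logpdf(x, loc=mu[dims], scale=sd[dims]).sum()
        for w, mu, sd in zip(weights, means, scales)
    ])
    top = comp.max()
    return top + np.log(np.exp(comp - top).sum())  # log-sum-exp over components

print(marginal_log_density(np.array([0.1, -0.3]), dims=[0, 2]))  # marginal over dims 0 and 2
```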

    Convergence of flow-based generative models via proximal gradient descent in Wasserstein space

    Flow-based generative models enjoy certain advantages in data generation and likelihood computation, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both the forward (data-to-noise) and reverse (noise-to-data) directions, remains sparse. In this paper, we provide a theoretical guarantee for generating the data distribution with a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of proximal gradient descent (GD) in Wasserstein space, we prove a Kullback-Leibler (KL) guarantee of O(ε^2) for data generation by a JKO flow model when using N ≲ log(1/ε) JKO steps (N residual blocks in the flow), where ε is the error in the per-step first-order condition. The only assumption on the data density is a finite second moment, and the theory extends to data distributions without density and to the case of inversion errors in the reverse process, where we obtain KL-W_2 mixed error guarantees. The non-asymptotic convergence rate of the JKO-type W_2-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which may be of independent interest.
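
    For reference, the classical JKO step that each such block implements is the Wasserstein-2 proximal (backward Euler) step for the KL objective; the step size h and target distribution π below are generic textbook notation, not necessarily the paper's:

    \[
      \rho_{k+1} \;=\; \underset{\rho}{\arg\min}\;\; \mathrm{KL}(\rho \,\|\, \pi) \;+\; \frac{1}{2h}\, W_2^2(\rho, \rho_k),
    \]

    so that running N such steps performs proximal gradient descent on the KL divergence in Wasserstein space, which is the source of the exponential convergence used in the analysis.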

    Neural distribution estimation as a two-part problem

    Given a dataset of examples, distribution estimation is the task of approximating the assumed underlying probability distribution from which those samples were drawn. Neural distribution estimation relies on the powerful function approximation capabilities of deep neural networks to build models for this purpose, and excels when data is high-dimensional and exhibits complex, nonlinear dependencies. In this thesis, we explore several approaches to neural distribution estimation, and present a unified perspective for these methods based on a two-part design principle. In particular, we examine how many models iteratively break down the task of distribution estimation into a series of tractable sub-tasks, before fitting a multi-step generative process which combines solutions to these sub-tasks in order to approximate the data distribution of interest. Framing distribution estimation as a two-part problem provides a shared language in which to compare and contrast prevalent models in the literature, and also allows for discussion of alternative approaches which do not follow this structure. We first present the Autoregressive Energy Machine, an energy-based model which is trained by approximate maximum likelihood through an autoregressive decomposition. The method demonstrates the flexibility of an energy-based model over an explicitly normalized model, and the novel application of autoregressive importance sampling highlights the benefit of an autoregressive approach to distribution estimation which recursively transforms the problem into a series of univariate tasks. Next, we present Neural Spline Flows, a class of normalizing flow models based on monotonic spline transformations which admit both an explicit inverse and a tractable Jacobian determinant. Normalizing flows tackle distribution estimation by searching for an invertible map between the data distribution and a more tractable base distribution, and this map is typically constructed as the composition of a series of invertible building blocks. We demonstrate that spline flows can be used to enhance density estimation of tabular data, variational inference in latent variable models, and generative modeling of natural images. The third chapter presents Maximum Likelihood Training of Score-Based Diffusion Models. Generative models based on estimation of the gradient of the logarithm of the probability density---or score function---have recently gained traction as a powerful modeling paradigm, in which the data distribution is gradually transformed toward a tractable base distribution by means of a stochastic process. The paper illustrates how this class of models can be trained by maximum likelihood, resulting in a model which is functionally equivalent to a continuous normalizing flow, and which bridges the gap between two branches of the literature. We also discuss latent-variable generative models more broadly, of which diffusion models are a structured special case. Finally, we present On Contrastive Learning for Likelihood-Free Inference, a unifying perspective for likelihood-free inference methods which perform Bayesian inference using either density estimation or density-ratio estimation. Likelihood-free inference focuses on inference in stochastic simulator models where the likelihood of parameters given observations is computationally intractable, and traditional inference methods fall short. 
In addition to illustrating the power of normalizing flows as generic tools for density estimation, this chapter also gives us the opportunity to discuss likelihood-free models more broadly. These so-called implicit generative models form a large part of the distribution estimation literature under the umbrella of generative adversarial networks, and are distinct in how they treat distribution estimation as a one-part problem.
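
    Two of the standard decompositions referenced above can each be written in one line (generic notation, not the thesis's):

    \[
      p(\mathbf{x}) \;=\; \prod_{i=1}^{D} p(x_i \mid x_{<i}),
      \qquad
      \log p_X(\mathbf{x}) \;=\; \log p_Z\big(f(\mathbf{x})\big) + \log\left|\det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|,
    \]

    the first underlying autoregressive decompositions into a series of univariate tasks, and the second the change of variables used by normalizing flows, where f is the invertible map to the base distribution.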

    Simulation-free Schrödinger bridges via score and flow matching

    We present simulation-free score and flow matching ([SF]²M), a simulation-free objective for inferring stochastic dynamics given unpaired source and target samples drawn from arbitrary distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]²M interprets continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]²M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]²M to the problem of learning cell dynamics from snapshot data. Notably, [SF]²M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Comment: A version of this paper appeared in the New Frontiers in Learning, Control, and Dynamical Systems workshop at ICML 2023. Code: https://github.com/atong01/conditional-flow-matchin
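
    The two losses that [SF]²M generalizes have the familiar regression forms below (written generically; the exact conditional targets built from the entropic OT coupling are the paper's contribution and are not reproduced here):

    \[
      \mathcal{L}_{\mathrm{FM}}(\theta) \;=\; \mathbb{E}_{t,\,x_t}\,\big\| v_\theta(t, x_t) - u_t(x_t) \big\|^2,
      \qquad
      \mathcal{L}_{\mathrm{SM}}(\theta) \;=\; \mathbb{E}_{t,\,x_t}\,\big\| s_\theta(t, x_t) - \nabla_{x_t} \log p_t(x_t) \big\|^2,
    \]

    where u_t is the drift of the target probability path and ∇ log p_t its score; learning both lets the method parameterize the stochastic dynamics of the Schrödinger bridge without simulating the learned process.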

    Lifting to parity decision trees via Stifling

    We show that the deterministic decision tree complexity of a (partial) function or relation f lifts to the deterministic parity decision tree (PDT) size complexity of the composed function/relation f ◦ g as long as the gadget g satisfies a property that we call stifling. We observe that several simple gadgets of constant size, like Indexing on 3 input bits, Inner Product on 4 input bits, Majority on 3 input bits, and random functions, satisfy this property. It can be shown that existing randomized communication lifting theorems ([Göös, Pitassi, Watson. SICOMP'20], [Chattopadhyay et al. SICOMP'21]) imply PDT-size lifting. However, there are two shortcomings of this approach: first, they lift the randomized decision tree complexity of f, which could be exponentially smaller than its deterministic counterpart when f is a partial function or even a total search problem. Second, the size of the gadgets in such lifting theorems is logarithmic in the size of the input to f. Reducing the gadget size to a constant is an important open problem at the frontier of current research. Our result shows that even a random constant-size gadget does enable lifting to PDT size. Further, it also yields the first systematic way of turning lower bounds on the width of tree-like resolution proofs of the unsatisfiability of constant-width CNF formulas into lower bounds on the size of tree-like proofs in the resolution with parity system, i.e., Res(⊕), of the unsatisfiability of closely related constant-width CNF formulas.
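
    A brute-force check of the stifling property as we read it: for every single input coordinate and every target bit b, the remaining coordinates can be fixed so that the gadget outputs b regardless of that coordinate. The definition paraphrase and the gadget encodings below are our own reconstruction, not the paper's formal statement:

```python
from itertools import product

def is_stifling(g, m):
    """For each coordinate i and bit b, look for a fixing of the other m-1
    coordinates that forces g to output b no matter what coordinate i is."""
    for i in range(m):
        for b in (0, 1):
            found = False
            for rest in product((0, 1), repeat=m - 1):
                outputs = set()
                for xi in (0, 1):
                    bits = list(rest)
                    bits.insert(i, xi)
                    outputs.add(g(tuple(bits)))
                if outputs == {b}:
                    found = True
                    break
            if not found:
                return False
    return True

indexing = lambda x: x[1 + x[0]]                          # Indexing on 3 bits (1 address, 2 data bits)
majority = lambda x: int(sum(x) >= 2)                     # Majority on 3 bits
inner_product = lambda x: (x[0] & x[1]) ^ (x[2] & x[3])   # Inner Product on 4 bits
parity = lambda x: x[0] ^ x[1] ^ x[2]                     # Parity: not stifling under this reading

print(is_stifling(indexing, 3), is_stifling(majority, 3),
      is_stifling(inner_product, 4), is_stifling(parity, 3))
# Under the reading above: True True True False
```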