Distributionally Robust Optimization and Robust Statistics
We review distributionally robust optimization (DRO), a principled approach
for constructing statistical estimators that hedge against the impact of
deviations in the expected loss between the training and deployment
environments. Many well-known estimators in statistics and machine learning
(e.g. AdaBoost, LASSO, ridge regression, dropout training, etc.) are
distributionally robust in a precise sense. We hope that by discussing the DRO
interpretation of well-known estimators, statisticians who are less familiar
with DRO can access the DRO literature through the bridge between classical
results and their DRO equivalents. On the
other hand, the topic of robustness in statistics has a rich tradition
associated with removing the impact of contamination. Thus, another objective
of this paper is to clarify the difference between DRO and classical
statistical robustness. As we will see, these are two fundamentally different
philosophies leading to completely different types of estimators. In DRO, the
statistician hedges against an environment shift that occurs after the decision
is made; thus DRO estimators tend to be pessimistic in an adversarial setting,
leading to a min-max type formulation. In classical robust statistics, the
statistician seeks to correct contamination that occurred before a decision is
made; thus robust statistical estimators tend to be optimistic, leading to a
min-min type formulation.
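To make the contrast concrete, here is a schematic comparison (notation ours, not taken from the paper): writing $P_n$ for the empirical measure, DRO hedges against a post-decision shift via
\[
\min_{\theta}\ \max_{P \in \mathcal{U}(P_n)}\ \mathbb{E}_{P}\big[\ell(\theta; X)\big],
\]
whereas classical robust statistics optimistically corrects pre-decision contamination via
\[
\min_{\theta}\ \min_{\tilde{P} \in \mathcal{C}(P_n)}\ \mathbb{E}_{\tilde{P}}\big[\ell(\theta; X)\big],
\]
where $\mathcal{U}(P_n)$ is an ambiguity set of possible deployment distributions and $\mathcal{C}(P_n)$ is a set of plausible decontaminated distributions.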
Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls
The optimal design of experiments typically involves solving an NP-hard
combinatorial optimization problem. In this paper, we aim to develop a globally
convergent and practically efficient optimization algorithm. Specifically, we
consider a setting where the pre-treatment outcome data is available and the
synthetic control estimator is invoked. The average treatment effect is
estimated via the difference between the weighted average outcomes of the
treated and control units, where the weights are learned from the observed
data. Under this setting, we make the surprising observation that the optimal
experimental design problem can be reduced to a so-called \textit{phase
synchronization} problem. We solve this problem via a normalized variant of
the generalized power method with spectral initialization. On the theoretical
side, we establish the first global optimality guarantee for experiment design
when pre-treatment data is sampled from certain data-generating processes.
Empirically, we conduct extensive experiments to demonstrate the effectiveness
of our method on both the US Bureau of Labor Statistics and the
Abadie-Diamond-Hainmueller California Smoking Data. In terms of the root mean
square error, our algorithm surpasses the random design by a large margin.
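As a rough sketch of the algorithmic template named above (spectral initialization followed by normalized generalized power iterations for phase synchronization): the data matrix C below is an illustrative Hermitian input, this is not the authors' exact procedure, and for a binary treatment assignment the entrywise projection would be a sign operation instead of a projection to the unit circle.

import numpy as np

def phase_sync_gpm(C, iters=200, tol=1e-9):
    """Generalized power method for phase synchronization:
    approximately maximize x* C x subject to |x_i| = 1 for all i.
    C is an (n, n) Hermitian data matrix (illustrative input)."""
    n = C.shape[0]
    # Spectral initialization: entrywise-normalized leading eigenvector of C.
    _, eigvecs = np.linalg.eigh(C)
    x = eigvecs[:, -1].astype(complex)
    x /= np.maximum(np.abs(x), 1e-12)
    for _ in range(iters):
        y = C @ x
        # Normalized variant: project each entry back to the unit circle.
        x_new = y / np.maximum(np.abs(y), 1e-12)
        if np.linalg.norm(x_new - x) <= tol * np.sqrt(n):
            return x_new
        x = x_new
    return x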
Nonsmooth Composite Nonconvex-Concave Minimax Optimization
Nonconvex-concave minimax optimization has received intense interest in
machine learning, with applications including distributionally robust
learning, learning with non-decomposable losses, and adversarial learning.
Nevertheless, most existing works focus on the gradient-descent-ascent (GDA)
variants that can only be applied in smooth settings. In this paper, we
consider a family of minimax problems whose objective function has a
nonsmooth composite structure in the minimization variable and is concave in
the maximization variables. By fully exploiting the composite structure, we
propose a smoothed proximal linear descent ascent (\textit{smoothed} PLDA)
algorithm and further establish its $\mathcal{O}(\epsilon^{-4})$ iteration
complexity, which matches that of smoothed GDA~\cite{zhang2020single} under
smooth settings. Moreover, under the mild assumption that the objective
function satisfies the one-sided Kurdyka-\L{}ojasiewicz condition with exponent
$\theta \in (0,1)$, we can further improve the iteration complexity to
$\mathcal{O}(\epsilon^{-2\max\{2\theta,1\}})$. To the best of our knowledge,
this is the first provably efficient algorithm for nonsmooth nonconvex-concave
problems that can achieve the optimal iteration complexity
$\mathcal{O}(\epsilon^{-2})$ if $\theta \in (0,1/2]$. As a byproduct, we
discuss different stationarity concepts and clarify their relationships
quantitatively, which could be of independent interest. Empirically, we
illustrate the effectiveness of the proposed smoothed PLDA on
variation-regularized Wasserstein distributionally robust optimization problems.
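As a rough sketch of the smoothing device behind this family of methods (our paraphrase of the Moreau-type smoothing used in smoothed GDA, not necessarily the paper's exact scheme): for a nonsmooth composite objective $F(x,y)$, one works with the auxiliary function
\[
F_{\mu}(x, z, y) \;=\; F(x, y) \;+\; \frac{\mu}{2}\,\|x - z\|^{2},
\]
alternating a proximal-linear descent step in $x$, an ascent step in $y$, and a slow averaging update of the anchor $z \leftarrow z + \beta\,(x - z)$. The quadratic term stabilizes the primal iterates, which is what drives the improved iteration complexity.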
ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest.
Next-generation sequencing (NGS) technology enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove variants with poor quality, as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses information on sequencing quality, such as sequencing depth, genotyping quality, and GC content, to predict whether a particular variant is likely to be a false positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets, one consisting of related individuals from families and the other of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants, such as VQSR of GATK, by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and a filtering approach is a practical way to perform quality control on genetic variants from sequencing data.
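A minimal sketch of the hybrid filter-plus-classifier idea described above, using scikit-learn; the feature set, label convention, and decision threshold here are illustrative stand-ins, not ForestQC's actual features or defaults.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hybrid_variant_qc(labeled_features, labels, unlabeled_features):
    """Hybrid QC sketch: rule-based filters supply confident good/bad labels;
    a random forest trained on those labels scores the remaining variants.
    Features per variant (illustrative): depth, genotyping quality, GC content."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(labeled_features, labels)  # labels: 1 = likely false positive, 0 = good
    bad_col = list(clf.classes_).index(1)
    prob_bad = clf.predict_proba(unlabeled_features)[:, bad_col]
    return prob_bad > 0.5  # flag variants predicted to be false positives

# Toy usage with synthetic numbers standing in for real QC metrics.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 3)), rng.integers(0, 2, 200)
flags = hybrid_variant_qc(X_train, y_train, rng.random((10, 3)))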
An Efficient Linear Mixed Model Framework for Meta-Analytic Association Studies Across Multiple Contexts
Linear mixed models (LMMs) can be applied in meta-analyses of responses from individuals across multiple contexts, increasing power to detect associations while accounting for confounding effects arising from within-individual variation. However, traditional approaches to fitting these models can be computationally intractable. Here, we describe an efficient and exact method for fitting a multiple-context linear mixed model. Whereas existing exact methods may be cubic in their time complexity with respect to the number of individuals, our approach for multiple-context LMMs (mcLMM) is linear. These improvements allow for large-scale analyses requiring orders of magnitude less computing time and memory than existing methods. As examples, we apply our approach to identify expression quantitative trait loci from large-scale gene expression data measured across multiple tissues, as well as to joint analyses of multiple phenotypes in genome-wide association studies at biobank scale.
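One way to write the model class in question (our notation; the paper's exact parameterization may differ): for individual $i$ with response measured in context $c$,
\[
y_{ic} = x_{ic}^{\top}\beta_{c} + u_{i} + \varepsilon_{ic},
\qquad u_{i} \sim \mathcal{N}(0, \sigma_u^{2}),
\quad \varepsilon_{ic} \sim \mathcal{N}(0, \sigma_c^{2}),
\]
where the shared random effect $u_i$ captures within-individual correlation across contexts. Exact maximum-likelihood fitting naively costs cubic time in the number of individuals, which is the cost the abstract says mcLMM reduces to linear.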
Tikhonov Regularization is Optimal Transport Robust under Martingale Constraints
Distributionally robust optimization has been shown to offer a principled way
to regularize learning models. In this paper, we find that Tikhonov
regularization is distributionally robust in an optimal transport sense (i.e.,
if an adversary chooses distributions in a suitable optimal transport
neighborhood of the empirical measure), provided that suitable martingale
constraints are also imposed. Further, we introduce a relaxation of the
martingale constraints which not only provides a unified viewpoint to a class
of existing robust methods but also leads to new regularization tools. To
realize these novel tools, tractable computational algorithms are proposed. As
a byproduct, the strong duality theorem proved in this paper can be potentially
applied to other problems of independent interest.
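Schematically (our notation; the precise transport cost, radius scaling, and martingale condition are as in the paper), the result concerns worst-case objectives of the form
\[
\sup_{P \,:\, W_c(P, P_n) \le \delta,\ P \text{ martingale-consistent}} \mathbb{E}_{P}\big[\ell(\beta; X)\big],
\]
where $W_c$ is an optimal transport distance around the empirical measure $P_n$, and says that this worst case collapses to a Tikhonov-regularized empirical objective of the form $\mathbb{E}_{P_n}[\ell(\beta; X)] + \lambda(\delta)\,\|\beta\|_2^{2}$.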
Outlier-Robust Gromov-Wasserstein for Graph Data
Gromov-Wasserstein (GW) distance is a powerful tool for comparing and
aligning probability distributions supported on different metric spaces.
Recently, GW has become the main modeling technique for aligning heterogeneous
data for a wide range of graph learning tasks. However, the GW distance is
known to be highly sensitive to outliers, which can result in large
inaccuracies if the outliers are given the same weight as other samples in the
objective function. To mitigate this issue, we introduce a new and robust
version of the GW distance called RGW. RGW features optimistically perturbed
marginal constraints within a Kullback-Leibler divergence-based ambiguity set.
To make the benefits of RGW more accessible in practice, we develop a
computationally efficient and theoretically provable procedure based on a
Bregman proximal alternating linearized minimization algorithm. Through extensive
experimentation, we validate our theoretical results and demonstrate the
effectiveness of RGW on real-world graph learning tasks, such as subgraph
matching and partial shape correspondence.
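In schematic form (our notation; the exact placement of the KL-based ambiguity set follows the paper), RGW optimistically perturbs the marginals before aligning:
\[
\mathrm{RGW}(\mu, \nu) \;=\; \min_{\tilde{\mu}, \tilde{\nu}}\ \mathrm{GW}(\tilde{\mu}, \tilde{\nu})
\quad \text{s.t.} \quad \mathrm{KL}(\tilde{\mu} \,\|\, \mu) \le \rho_1,\ \ \mathrm{KL}(\tilde{\nu} \,\|\, \nu) \le \rho_2,
\]
so mass sitting on outliers can be down-weighted before the alignment is computed, rather than forcing the transport plan to match contaminated marginals exactly.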
Multiplexed entanglement swapping with atomic-ensemble-based quantum memories in the single excitation regime
Entanglement swapping (ES) between memory repeater links is critical for
establishing quantum networks via quantum repeaters. So far, ES with
atomic-ensemble-based memories has not been achieved. Here, we experimentally
demonstrate ES between two entangled pairs of spin-wave memories via the
Duan-Lukin-Cirac-Zoller (DLCZ) scheme. With a cloud of cold atoms inserted in a
cavity, we produce non-classically correlated spin-wave-photon pairs in 12
spatial modes and then prepare two entangled pairs of spin-wave memories via a
multiplexed scheme. Via a single-photon Bell measurement on the fields
retrieved from two of the memories, we project the two remaining memories,
which were never entangled previously, into an entangled state with a measured
concurrence of C = 0.0124(0.003). The success probability of ES in our scheme
is increased threefold compared with that of a non-multiplexed scheme. Our work
shows that generating entanglement (C > 0) between the remaining memory
ensembles requires the average cross-correlation function of the
spin-wave-photon pairs to exceed 30.
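In the single-excitation regime the protocol can be sketched as follows (our simplified notation, ignoring phases, losses, and higher-order excitations): the two memory pairs share the states
\[
|\Psi\rangle_{AB} = \tfrac{1}{\sqrt{2}}\big(S_A^{\dagger} + S_B^{\dagger}\big)|\mathrm{vac}\rangle,
\qquad
|\Psi\rangle_{CD} = \tfrac{1}{\sqrt{2}}\big(S_C^{\dagger} + S_D^{\dagger}\big)|\mathrm{vac}\rangle,
\]
where $S_X^{\dagger}$ creates a spin-wave excitation in ensemble $X$. Retrieving the spin waves in $B$ and $C$ into photons, interfering them on a beam splitter, and detecting a single photon erases which-path information and projects the unmeasured ensembles into
\[
|\Psi\rangle_{AD} \approx \tfrac{1}{\sqrt{2}}\big(S_A^{\dagger} \pm S_D^{\dagger}\big)|\mathrm{vac}\rangle,
\]
i.e., the single-photon Bell measurement swaps the entanglement onto the two memories that never interacted.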