Discussion of ``2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization'' by V. Koltchinskii

Author: Shen Xiaotong
Wang Lifeng
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/08/2007

[arXiv:0708.0083] Comment: Published at http://dx.doi.org/10.1214/009053606000001055 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref

Generalization error for multi-class margin classification

Author: Shen Xiaotong
Wang Lifeng
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2007

In this article, we study rates of convergence of the generalization error of multi-class margin classifiers. In particular, we develop an upper bound theory quantifying the generalization error of various large margin classifiers. The theory permits a treatment of general margin losses, convex or nonconvex, in presence or absence of a dominating class. Three main results are established. First, for any fixed margin loss, there may be a trade-off between the ideal and actual generalization performances with respect to the choice of the class of candidate decision functions, which is governed by the trade-off between the approximation and estimation errors. In fact, different margin losses lead to different ideal or actual performances in specific cases. Second, we demonstrate, in a problem of linear learning, that the convergence rate can be arbitrarily fast in the sample size $n$ depending on the joint distribution of the input/output pair. This goes beyond the anticipated rate $O(n^{-1})$. Third, we establish rates of convergence of several margin classifiers in feature selection with the number of candidate variables $p$ allowed to greatly exceed the sample size $n$ but no faster than $\exp(n)$. Comment: Published at http://dx.doi.org/10.1214/07-EJS069 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org
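The trade-off between approximation and estimation errors mentioned in this abstract is often written as the textbook excess-risk decomposition below. The notation is an assumption made here for illustration and is not taken from the paper: $R$ is the generalization error, $\mathcal{F}$ the class of candidate decision functions, $\hat f$ the fitted classifier, and $f^*$ the Bayes rule.

```latex
\[
R(\hat f) - R(f^*)
  = \underbrace{R(\hat f) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}}
  + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R(f^*)}_{\text{approximation error}}
\]
```

Enlarging $\mathcal{F}$ typically shrinks the second term and inflates the first, which is the sense in which the choice of candidate class governs the trade-off.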

arXiv.org e-Print Archive

Crossref

Conversions between barycentric, RKFUN, and Newton representations of rational interpolants

Author: Wei Pan
Xiaotong Shen
Zhiyuan Xu
Publication date: 02/11/2017

We derive explicit formulas for converting between rational interpolants in barycentric, rational Krylov (RKFUN), and Newton form. We show applications of these conversions when working with rational approximants produced by the AAA algorithm [Y. Nakatsukasa, O. Sète, L. N. Trefethen, arXiv preprint 1612.00337, 2016] within the Rational Krylov Toolbox and for the solution of nonlinear eigenvalue problems.
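For orientation, the barycentric form named in this abstract represents a rational function as a ratio of two partial-fraction sums over support points. The sketch below only evaluates such a form; it does not reproduce the paper's conversion formulas. The function name `barycentric_eval` and the sample data are hypothetical, chosen here for illustration.

```python
def barycentric_eval(z, nodes, values, weights):
    """Evaluate r(z) = [sum_j w_j f_j / (z - z_j)] / [sum_j w_j / (z - z_j)]."""
    num = 0.0
    den = 0.0
    for zj, fj, wj in zip(nodes, values, weights):
        if z == zj:
            # At a support point the formula reads 0/0; for a nonzero
            # weight the interpolant takes the value f_j there.
            return fj
        c = wj / (z - zj)
        num += c * fj
        den += c
    return num / den


# Illustrative data (an assumption, not from the paper): samples of x**2.
nodes = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 4.0]
# With w_j = 1 / prod_{k != j} (z_j - z_k) the form reduces to the
# polynomial interpolant, so x**2 is reproduced exactly.
weights = [0.5, -1.0, 0.5]

r_at_3 = barycentric_eval(3.0, nodes, values, weights)
r_at_node = barycentric_eval(1.0, nodes, values, weights)
```

Other choices of weights give genuinely rational interpolants through the same data, which is the freedom that algorithms such as AAA exploit.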

arXiv.org e-Print Archive

FigShare

Boosting Data Analytics With Synthetic Volume Expansion

Author: Liu Yifei
Shen Rex
Shen Xiaotong
Publication date: 10/03/2024

Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more prevalent, concerns emerge regarding the accuracy of statistical methods when applied to synthetic data in contrast to raw data. This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data. Regarding effectiveness, we present the Synthetic Data Generation for Analytics framework. This framework applies statistical approaches to high-quality synthetic data produced by generative models like tabular diffusion models, which, initially trained on raw data, benefit from insights from pertinent studies through transfer learning. A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize. This phenomenon, stemming from the challenge of accurately mirroring raw data distributions, highlights a "reflection point," an ideal volume of synthetic data defined by specific error metrics. Through three case studies, sentiment analysis, predictive modeling of structured data, and inference in tabular data, we validate the superior performance of this framework compared to conventional approaches. On privacy, synthetic data imposes lower risks while supporting the differential privacy standard. These studies underscore synthetic data's untapped potential in redefining data science's landscape.
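One simple reading of the "reflection point" in this abstract is the synthetic sample size at which a chosen error metric is smallest. The sketch below assumes that reading. The helper `reflection_point` is hypothetical, and the error values are made-up placeholders, not results from the paper.

```python
def reflection_point(sizes, errors):
    """Return the synthetic sample size with the smallest measured error."""
    if not sizes or len(sizes) != len(errors):
        raise ValueError("sizes and errors must be non-empty and equal in length")
    best = min(range(len(sizes)), key=lambda i: errors[i])
    return sizes[best]


# Placeholder numbers for illustration only: the error falls as synthetic
# data is added, then rises slightly and levels off.
sizes = [1000, 5000, 10000, 50000, 100000]
errors = [0.31, 0.24, 0.21, 0.23, 0.23]

best_size = reflection_point(sizes, errors)
```

In practice the error at each candidate size would be estimated on held-out raw data, and the metric itself is a modeling choice.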

arXiv.org e-Print Archive

Perturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification

Author: Liu Yifei
Shen Rex
Shen Xiaotong
Publication date: 29/05/2023

This paper introduces a novel generator called Perturbation-Assisted Sample Synthesis (PASS), designed for drawing reliable conclusions from complex data, especially when using advanced modeling techniques like deep neural networks. PASS utilizes perturbation to generate synthetic data that closely mirrors the distribution of raw data, encompassing numerical and unstructured data types such as gene expression, images, and text. By estimating the data-generating distribution and leveraging large pre-trained generative models, PASS enhances estimation accuracy, providing an estimated distribution of any statistic through Monte Carlo experiments. Building on PASS, we propose a generative inference framework called Perturbation-Assisted Inference (PAI), which offers a statistical guarantee of validity. In pivotal inference, PAI enables accurate conclusions without knowing a pivotal's distribution as in simulations, even with limited data. In non-pivotal situations, we train PASS using an independent holdout sample, resulting in credible conclusions. To showcase PAI's capability in tackling complex problems, we highlight its applications in three domains: image synthesis inference, sentiment word inference, and multimodal inference via stable diffusion.
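The idea of estimating the distribution of a statistic through Monte Carlo experiments can be sketched generically as follows. This is not PASS: a fitted Gaussian sampler (in effect a parametric bootstrap) stands in for the generator, and every name and number below is an assumption made for illustration.

```python
import random
import statistics


def monte_carlo_statistic(generator, statistic, n, num_draws, seed=0):
    """Draw num_draws synthetic samples of size n and return the sorted statistics."""
    rng = random.Random(seed)
    return sorted(statistic(generator(rng, n)) for _ in range(num_draws))


def gaussian_generator(mu, sigma):
    """Stand-in generator: i.i.d. Gaussian draws (a placeholder for a learned model)."""
    def draw(rng, n):
        return [rng.gauss(mu, sigma) for _ in range(n)]
    return draw


# Made-up raw sample, for illustration only.
raw = [2.1, 1.9, 2.4, 2.0, 1.7, 2.3, 2.2, 1.8]
gen = gaussian_generator(statistics.mean(raw), statistics.stdev(raw))

# Empirical distribution of the sample mean under the fitted generator.
draws = monte_carlo_statistic(gen, statistics.mean, n=len(raw), num_draws=2000)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
```

The pair `(lower, upper)` gives a rough 95% range for the statistic under the stand-in generator; any validity guarantee depends on how well the generator matches the data-generating distribution.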

arXiv.org e-Print Archive