Approximate co-sufficient sampling with regularization
In this work, we consider the problem of goodness-of-fit (GoF) testing for
parametric models -- for example, testing whether observed data follows a
logistic regression model. This testing problem involves a composite null
hypothesis, due to the unknown values of the model parameters. In some special
cases, co-sufficient sampling (CSS) can remove the influence of these unknown
parameters via conditioning on a sufficient statistic -- often, the maximum
likelihood estimator (MLE) of the unknown parameters. However, many common
parametric settings (including logistic regression) do not permit this
approach, since conditioning on a sufficient statistic leads to a powerless
test. The recent approximate co-sufficient sampling (aCSS) framework of Barber
and Janson (2022) offers an alternative, replacing sufficiency with an
approximately sufficient statistic (namely, a noisy version of the MLE). This
approach recovers power in a range of settings where CSS cannot be applied, but
can only be applied in settings where the unconstrained MLE is well-defined and
well-behaved, which implicitly assumes a low-dimensional regime. In this work,
we extend aCSS to the setting of constrained and penalized maximum likelihood
estimation, so that more complex estimation problems can now be handled within
the aCSS framework, including examples such as mixtures-of-Gaussians (where the
unconstrained MLE is not well-defined due to degeneracy) and high-dimensional
Gaussian linear models (where the MLE can perform well under regularization,
such as an ℓ1 penalty or a shape constraint).
Towards Collaborative Generative AI for Vision-and-Language Studies
In recent years, the field of vision-and-language studies has witnessed significant advancements, aiming to bridge the gap between visual perception and linguistic understanding. These studies have explored various approaches to enhance the capabilities of AI systems in generating natural language or visual content, understanding multimodal scenarios, and conducting commonsense reasoning. Despite these advancements, there remains a crucial need for further progress to enable more collaborative and comprehensive interactions between vision and language modalities. This dissertation addresses this need through three primary contributions: First, I introduce the concept of machine imagination for natural language processing studies. Specifically, I present the use of visual information generated by machines for the automatic evaluation of natural language generation, natural language understanding, and natural language generation. Second, I explore the utilization of large language models (LLMs) to enhance the performance of vision and multimodal tasks. In particular, I examine the effectiveness of applying LLMs for prompt editing in text-to-image generation, compositional layout planning and generation, and vision-and-language navigation. Third, I outline my contributions to publicly available open-source vision-and-language research. Specifically, we introduce Multimodal C4, a large-scale multimodal dataset containing interleaved images and text, which we used to train the large-scale multimodal model OpenFlamingo. Additionally, we introduce VisIT-Bench, a public benchmark for evaluating instruction-following vision-language models in real-world applications. This dissertation aims to push the boundaries of vision-and-language integration, providing new insights and tools for developing more sophisticated AI systems capable of seamless multimodal interactions.
Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality
Stochastic Gradient Descent (SGD) is one of the simplest and most popular
algorithms in modern statistical and machine learning due to its computational
and memory efficiency. Various averaging schemes have been proposed to
accelerate the convergence of SGD in different settings. In this paper, we
explore a general averaging scheme for SGD. Specifically, we establish the
asymptotic normality of a broad range of weighted averaged SGD solutions and
provide asymptotically valid online inference approaches. Furthermore, we
propose an adaptive averaging scheme that exhibits both optimal statistical
rate and favorable non-asymptotic convergence, drawing insights from the
optimal weight for the linear model in terms of non-asymptotic mean squared
error (MSE).
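As a rough illustration of what a weighted average of SGD iterates looks like, the sketch below runs one pass of SGD on a scalar least-squares problem and keeps a running weighted average of the iterates. Everything in it is an assumption made for illustration: the function names, the polynomial weights `t ** power`, the step size schedule `lr0 * t ** (-decay)`, and the synthetic data stream. It is not the adaptive scheme proposed in the paper, and it does not include the inference procedure.

```python
import random


def weighted_averaged_sgd(stream, lr0=0.2, decay=0.6, power=1.0):
    """One pass of SGD on scalar least squares with a running weighted average.

    Iterate t receives weight t ** power (an illustrative choice);
    power=0 gives the ordinary uniform average of the iterates.
    """
    theta = 0.0       # current SGD iterate
    avg = 0.0         # weighted average of the iterates seen so far
    weight_sum = 0.0  # running sum of the weights
    for t, (x, y) in enumerate(stream, start=1):
        grad = (theta * x - y) * x              # gradient of 0.5 * (theta * x - y) ** 2
        theta -= lr0 * t ** (-decay) * grad     # decaying step size
        w = float(t) ** power
        weight_sum += w
        avg += (w / weight_sum) * (theta - avg)  # online weighted-mean update
    return theta, avg


def linear_stream(theta_star, n, noise_sd=0.1, seed=0):
    """Hypothetical data source: y = theta_star * x + Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        yield x, theta_star * x + rng.gauss(0.0, noise_sd)


last_iterate, averaged = weighted_averaged_sgd(linear_stream(2.0, 20000))
```

The online update avoids storing the iterates: after step t, `avg` equals the sum of `w_s * theta_s` for s up to t, divided by the sum of the weights.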
GaitRef: Gait Recognition with Refined Sequential Skeletons
Identifying humans with their walking sequences, known as gait recognition,
is a useful biometric understanding task as it can be observed from a long
distance and does not require cooperation from the subject. Two common
modalities used for representing the walking sequence of a person are
silhouettes and joint skeletons. Silhouette sequences, which record the
boundary of the walking person in each frame, may suffer from appearance
variations caused by carried objects and the person's clothes. Framewise joint
detections are noisy and introduce some jitters that are not consistent with
sequential detections. In this paper, we combine the silhouettes and skeletons
and refine the framewise joint predictions for gait recognition. With temporal
information from the silhouette sequences, we show that the refined skeletons
can improve gait recognition performance without extra annotations. We compare
our methods on four public datasets, CASIA-B, OUMVLP, Gait3D and GREW, and show
state-of-the-art performance. Comment: IJCB 2023. Code is available at
https://github.com/haidongz-usc/GaitRe
ShARc: Shape and Appearance Recognition for Person Identification In-the-wild
Identifying individuals in unconstrained video settings is a valuable yet
challenging task in biometric analysis due to variations in appearances,
environments, degradations, and occlusions. In this paper, we present ShARc, a
multimodal approach for video-based person identification in uncontrolled
environments that emphasizes 3-D body shape, pose, and appearance. We introduce
two encoders: a Pose and Shape Encoder (PSE) and an Aggregated Appearance
Encoder (AAE). PSE encodes the body shape via binarized silhouettes, skeleton
motions, and 3-D body shape, while AAE provides two levels of temporal
appearance feature aggregation: attention-based feature aggregation and
averaging aggregation. For attention-based feature aggregation, we employ
spatial and temporal attention to focus on key areas for person distinction.
For averaging aggregation, we introduce a novel flattening layer after
averaging to extract more distinguishable information and reduce overfitting of
attention. We utilize centroid feature averaging for gallery registration. We
demonstrate significant improvements over existing state-of-the-art methods on
public datasets, including CCVID, MEVID, and BRIAR. Comment: WACV 202
High Confidence Level Inference is Almost Free using Parallel Stochastic Optimization
Uncertainty quantification for estimation through stochastic optimization
solutions in an online setting has gained popularity recently. This paper
introduces a novel inference method focused on constructing confidence
intervals with efficient computation and fast convergence to the nominal level.
Specifically, we propose to use a small number of independent multi-runs to
acquire distribution information and construct a t-based confidence interval.
Our method requires minimal additional computation and memory beyond the
standard updating of estimates, making the inference process almost cost-free.
We provide a rigorous theoretical guarantee for the confidence interval,
demonstrating that the coverage is approximately exact with an explicit
convergence rate and allowing for high confidence level inference. In
particular, a new Gaussian approximation result is developed for the online
estimators to characterize the coverage properties of our confidence intervals
in terms of relative errors. Additionally, our method allows for
leveraging parallel computing to further accelerate calculations using multiple
cores. It is easy to implement and can be integrated with existing stochastic
algorithms without the need for complicated modifications.
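The idea of building a t-based interval from a few independent runs can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the helper names, the toy mean-estimation problem, the step size schedule, and the choice of K = 5 runs are all hypothetical, and the runs execute one after another here rather than in parallel. The critical value 2.776 is the standard 97.5% quantile of Student's t distribution with 4 degrees of freedom, which corresponds to a 95% interval from 5 runs.

```python
import math
import random
import statistics


def sgd_mean_estimate(n, true_mean, seed, lr0=1.0, decay=0.6):
    """Averaged SGD estimate of a mean from n noisy observations (toy problem)."""
    rng = random.Random(seed)
    theta, avg = 0.0, 0.0
    for t in range(1, n + 1):
        obs = rng.gauss(true_mean, 1.0)
        theta -= lr0 * t ** (-decay) * (theta - obs)  # SGD step on 0.5 * (theta - obs) ** 2
        avg += (theta - avg) / t                      # running uniform average of iterates
    return avg


def t_interval(estimates, t_crit):
    """Interval: mean of the runs +/- t_crit * sample std / sqrt(number of runs)."""
    k = len(estimates)
    center = statistics.mean(estimates)
    half_width = t_crit * statistics.stdev(estimates) / math.sqrt(k)
    return center - half_width, center + half_width


# K = 5 independent runs, each with its own seed (its own data stream).
runs = [sgd_mean_estimate(5000, true_mean=3.0, seed=s) for s in range(5)]
low, high = t_interval(runs, t_crit=2.776)  # df = K - 1 = 4, 95% level
```

Because each run only needs its own running estimate, the extra cost over a single run is the K final values and one mean and standard deviation computation, which is what makes this style of interval cheap to attach to an existing stochastic algorithm.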
End-to-end Dense Video Captioning as Sequence Generation
Dense video captioning aims to identify the events of interest in an input
video, and generate descriptive captions for each event. Previous approaches
usually follow a two-stage generative process, which first proposes a segment
for each event, then renders a caption for each identified segment. Recent
advances in large-scale sequence generation pretraining have seen great success
in unifying task formulation for a great variety of tasks, but so far, more
complex tasks such as dense video captioning are not able to fully utilize this
powerful paradigm. In this work, we show how to model the two subtasks of dense
video captioning jointly as one sequence generation task, and simultaneously
predict the events and the corresponding descriptions. Experiments on YouCook2
and ViTT show encouraging results and indicate the feasibility of training
complex tasks such as end-to-end dense video captioning integrated into
large-scale pre-trained models.
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Automatic evaluations for natural language generation (NLG) conventionally
rely on token-level or embedding-level comparisons with text references. This
differs from human language processing, for which visual imagination often
improves comprehension. In this work, we propose ImaginE, an imagination-based
automatic evaluation metric for natural language generation. With the help of
StableDiffusion, a state-of-the-art text-to-image generator, we automatically
generate an image as the embodied imagination for the text snippet and compute
the imagination similarity using contextual embeddings. Experiments spanning
several text generation tasks demonstrate that adding machine-generated images
with our ImaginE displays great potential in introducing multi-modal
information into NLG evaluation, and improves existing automatic metrics'
correlations with human similarity judgments in both reference-based and
reference-free evaluation scenarios. Comment: EACL 202