SnAG: Scalable and Accurate Video Grounding
Temporal grounding of text descriptions in videos is a central problem in
vision-language learning and video understanding. Existing methods often
prioritize accuracy over scalability -- they have been optimized for grounding
only a few text queries within short videos, and fail to scale up to long
videos with hundreds of queries. In this paper, we study the effect of
cross-modal fusion on the scalability of video grounding models. Our analysis
establishes late fusion as a more cost-effective fusion scheme for long-form
videos with many text queries. Moreover, it leads us to a novel, video-centric
sampling scheme for efficient training. Based on these findings, we present
SnAG, a simple baseline for scalable and accurate video grounding. Without
bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a
state-of-the-art method for long-form video grounding, on the challenging MAD
dataset, while achieving highly competitive results on short videos.
Comment: Accepted to CVPR 2024. Code available at
https://github.com/fmu2/snag_releas
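The scalability argument hinges on where fusion happens: with late fusion, the video is encoded once and each text query is fused through a cheap final operation, so the expensive video encoder never reruns per query. Below is a minimal sketch of this contrast, assuming dot-product similarity as the late-fusion step and a toy per-query joint encoder for early fusion; the shapes and encoders are illustrative, not SnAG's actual architecture.

```python
# Minimal sketch (not SnAG's implementation) contrasting early vs. late
# cross-modal fusion costs. Shapes, the dot-product fusion, and the toy
# joint encoder are illustrative assumptions.
import numpy as np

T, Q, D = 4096, 200, 256                  # video clips, text queries, feat dim
video_feats = np.random.randn(T, D)       # stand-in for a video encoder output
query_feats = np.random.randn(Q, D)       # stand-in for a text encoder output

def late_fusion_scores(video, queries):
    # Encode the video once; fuse each query with a cheap similarity.
    # The per-query cost is only the O(T * D) fusion step.
    return queries @ video.T              # (Q, T) clip-relevance scores

def toy_joint_encoder(video, q):
    # Stand-in for an expensive cross-modal encoder run per query.
    return (video * q).sum(axis=1)        # (T,) scores for one query

def early_fusion_scores(video, queries, joint_encoder):
    # Early fusion re-runs the joint encoder once per query, which
    # dominates the cost for long videos with hundreds of queries.
    return np.stack([joint_encoder(video, q) for q in queries])

print(late_fusion_scores(video_feats, query_feats).shape)   # (200, 4096)
```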
Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning
Foundation models have emerged as a powerful tool for many AI problems.
Despite the tremendous success of foundation models, effective adaptation to
new tasks, particularly those with limited labels, remains an open question and
lacks theoretical understanding. An emerging solution with recent success in
vision and NLP involves finetuning a foundation model on a selection of
relevant tasks, before its adaptation to a target task with limited labeled
samples. In this paper, we study the theoretical justification of this
multitask finetuning approach. Our theoretical analysis reveals that with a
diverse set of related tasks, this multitask finetuning leads to reduced error
in the target task, in comparison to directly adapting the same pretrained
model. We quantify the relationship between finetuning tasks and target tasks
by diversity and consistency metrics, and further propose a practical task
selection algorithm. We substantiate our theoretical claims with extensive
empirical evidence. Further, we present results confirming that our task
selection algorithm adeptly chooses related finetuning tasks, improving model
performance on target tasks. We believe our study sheds new light on the
effective adaptation of foundation models to new tasks that lack abundant
labels. Our code is available at
https://github.com/OliverXUZY/Foudation-Model_Multitask.
Comment: Published at ICLR 2024. 54 pages.
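As one illustration of how a selection heuristic in this spirit might look, the sketch below picks the candidate finetuning tasks whose embeddings are most similar to the target task, a rough stand-in for the paper's consistency notion; the cosine scoring and the task embeddings are assumptions, not the paper's exact metrics or algorithm.

```python
# Hypothetical similarity-based task selection, loosely inspired by the
# paper's consistency idea; the cosine score and embeddings are assumptions.
import numpy as np

def select_tasks(task_embs, target_emb, k=3):
    # task_embs: (N, D) embeddings of candidate finetuning tasks
    # target_emb: (D,) embedding of the target task
    sims = task_embs @ target_emb / (
        np.linalg.norm(task_embs, axis=1) * np.linalg.norm(target_emb) + 1e-8)
    return np.argsort(-sims)[:k]          # k candidates most aligned w/ target

rng = np.random.default_rng(0)
candidates = rng.normal(size=(10, 64))    # 10 candidate tasks
target = rng.normal(size=64)
print(select_tasks(candidates, target))   # indices of 3 selected tasks
```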
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
Recent approaches such as ControlNet offer users fine-grained spatial control
over text-to-image (T2I) diffusion models. However, auxiliary modules have to
be trained for each type of spatial condition, model architecture, and
checkpoint, putting them at odds with the diverse intents and preferences a
human designer would like to convey to the AI models during the content
creation process. In this work, we present FreeControl, a training-free
approach for controllable T2I generation that supports multiple conditions,
architectures, and checkpoints simultaneously. FreeControl introduces
structure guidance to facilitate structure alignment with a guidance image,
and appearance guidance to enable appearance sharing between images generated
using the same seed. Extensive qualitative and quantitative experiments
demonstrate the superior performance of FreeControl across a variety of
pre-trained T2I models. In particular, FreeControl facilitates convenient
training-free control over many different architectures and checkpoints,
handles challenging input conditions on which most existing training-free
methods fail, and achieves synthesis quality competitive with training-based
approaches.
Comment: Project Page: https://genforce.github.io/freecontrol
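A rough way to picture training-free guidance: at each denoising step, gradients of two energy functions (one for structure alignment, one for appearance sharing) are added to the model's noise prediction. The sketch below is a minimal illustration under that assumption; the energy terms, weights, and plain update rule are placeholders, not FreeControl's actual procedure.

```python
# Illustrative (not FreeControl's actual) guided denoising update: two
# energy gradients steer the noise prediction, one for structure, one for
# appearance. Weights, energies, and the Euler-style step are placeholders.
import numpy as np

def guided_step(x_t, eps_pred, struct_grad, appear_grad,
                w_struct=1.0, w_appear=0.5, step=0.1):
    # Add guidance gradients to the model's noise estimate, then take one
    # toy denoising step on the latent x_t.
    eps_guided = eps_pred + w_struct * struct_grad(x_t) \
                          + w_appear * appear_grad(x_t)
    return x_t - step * eps_guided

x = np.random.randn(4, 64, 64)            # latent stand-in
eps = np.random.randn(*x.shape)           # stand-in noise prediction
# Placeholder gradients; real ones would come from feature-space energies.
x = guided_step(x, eps, lambda z: 0.01 * z, lambda z: -0.01 * z)
```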
Towards 3D Vision with Low-Cost Single-Photon Cameras
We present a method for reconstructing the 3D shape of arbitrary Lambertian
objects based on measurements by miniature, energy-efficient, low-cost
single-photon cameras. These cameras, operating as time resolved image sensors,
illuminate the scene with a very fast pulse of diffuse light and record the
shape of that pulse as it returns from the scene at a high temporal
resolution. We propose to model this image formation process, account for its
non-idealities, and adapt neural rendering to reconstruct 3D geometry from a
set of spatially distributed sensors with known poses. We show that our
approach can successfully recover complex 3D shapes from simulated data. We
further demonstrate 3D object reconstruction from real-world captures,
utilizing measurements from a commodity proximity sensor. Our work draws a
connection between image-based modeling and active range scanning and is a step
towards 3D vision with single-photon cameras.
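To make the image formation model concrete, the toy forward model below simulates a per-pixel transient: a Gaussian-shaped pulse centered at the round-trip time of flight, plus ambient light, corrupted by Poisson photon noise. The pulse shape, bin width, and noise levels are illustrative assumptions about the kind of non-idealities such a model must capture, not the paper's calibrated model.

```python
# Toy transient image formation for a single-photon pixel: a Gaussian pulse
# centered at the round-trip time of flight plus ambient light, with Poisson
# shot noise. All constants are illustrative assumptions.
import numpy as np

C = 3e8                                   # speed of light (m/s)
BIN = 100e-12                             # 100 ps histogram bins

def transient(depth_m, n_bins=256, pulse_sigma=3.0, signal=50.0, ambient=0.2):
    tof_bins = 2 * depth_m / C / BIN      # round-trip time of flight, in bins
    t = np.arange(n_bins)
    ideal = signal * np.exp(-0.5 * ((t - tof_bins) / pulse_sigma) ** 2)
    return np.random.poisson(ideal + ambient)

hist = transient(depth_m=1.5)
print(np.argmax(hist) * BIN * C / 2)      # depth recovered from the peak bin
```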
Projective Quasiparticle Interference of a Single Scatterer to Analyze the Electronic Band Structure of ZrSiS
Quasiparticle interference (QPI) of the electronic states has been widely
applied in scanning tunneling microscopy (STM) to analyze the electronic band
structure of materials. Single-defect induced QPI reveals defect-dependent
interaction between a single atomic defect and electronic states, which
deserves special attention. Due to the weak signal of single-defect-induced
QPI, the signal-to-noise ratio (SNR) is relatively low in a standard
two-dimensional QPI measurement. In this paper, we introduce a projective
quasiparticle interference (PQPI) method, in which a one-dimensional
measurement is taken along high-symmetry directions centered on a specified
defect. We apply the PQPI method to a topological nodal-line semimetal ZrSiS.
We focus on two special types of atomic defects that scatter the surface and
bulk electronic bands. With enhanced SNR in PQPI, the energy dispersions are
clearly resolved along high-symmetry directions. We discuss the
defect-dependent scattering of bulk bands with the non-symmorphic
symmetry-enforced selection rules. Furthermore, an energy shift of the surface
floating band is observed and a new branch of energy dispersion (q6) is
resolved. This PQPI method can be applied to other complex materials to explore
defect-dependent interactions in the future.
Comment: 21 pages, 5 figures, supplementary 3 pages, 2 figures.
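The core of the projective measurement can be pictured as follows: rather than Fourier-transforming a full 2D map, one samples 1D line cuts through a single defect along high-symmetry directions and averages symmetry-equivalent cuts to raise the SNR before extracting scattering vectors. The sketch below illustrates that idea on synthetic data; the array shapes and plain FFT are assumptions, not the paper's processing pipeline.

```python
# Illustrative line-cut extraction: sample a dI/dV map along rays through a
# defect, average symmetry-equivalent directions to boost SNR, then Fourier
# transform to get scattering vectors. Synthetic data, not the paper's pipeline.
import numpy as np

def pqpi_linecut(didv_map, defect_rc, direction, length):
    r0, c0 = defect_rc
    dr, dc = direction
    steps = np.arange(length)
    return didv_map[(r0 + steps * dr).astype(int),
                    (c0 + steps * dc).astype(int)]

didv = np.random.randn(256, 256)          # stand-in for one energy layer
cuts = [pqpi_linecut(didv, (128, 128), d, 100)
        for d in [(0, 1), (0, -1), (1, 0), (-1, 0)]]  # 4 equivalent rays
profile = np.mean(cuts, axis=0)           # averaging raises the SNR
q_spectrum = np.abs(np.fft.rfft(profile)) # intensity vs. scattering vector q
```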
Learned Compressive Representations for Single-Photon 3D Imaging
Single-photon 3D cameras can record the time-of-arrival of billions of photons per second with picosecond accuracy. One common approach to summarize the photon data stream is to build a per-pixel timestamp histogram, resulting in a 3D histogram tensor that encodes distances along the time axis. As the spatio-temporal resolution of the histogram tensor increases, the in-pixel memory requirements and output data rates can quickly become impractical. To overcome this limitation, we propose a family of linear compressive representations of histogram tensors that can be computed efficiently, in an online fashion, as a matrix operation. We design practical lightweight compressive representations that are amenable to an in-pixel implementation and consider the spatio-temporal information of each timestamp. Furthermore, we implement our proposed framework as the first layer of a neural network, which enables the joint end-to-end optimization of the compressive representations and a downstream SPAD data processing model. We find that a well-designed compressive representation can reduce in-sensor memory and data rates by up to two orders of magnitude without significantly reducing 3D imaging quality. Finally, we analyze the power consumption implications through an on-chip implementation.
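Since the abstract describes the compression as a matrix operation computed online, a minimal sketch is natural: each detected photon timestamp adds one column of a fixed coding matrix to a small per-pixel code, so the full histogram is never stored. The Fourier-style coding matrix below is one simple hand-designed choice; the paper's learned representations would take its place.

```python
# Minimal compressive histogram: each photon timestamp adds a column of a
# fixed (K x B) coding matrix to a K-dim code, so the B-bin histogram is
# never stored. The Fourier-style matrix is one hand-designed choice; the
# paper's learned matrices would replace it.
import numpy as np

B, K = 1024, 32                           # histogram bins vs. code size
bins = np.arange(B)
freqs = np.arange(1, K // 2 + 1)[:, None]
C = np.concatenate([np.cos(2 * np.pi * freqs * bins / B),
                    np.sin(2 * np.pi * freqs * bins / B)])   # (K, B)

rng = np.random.default_rng(0)
timestamps = rng.integers(0, B, size=5000)   # simulated photon arrivals

code = np.zeros(K)
for ts in timestamps:                     # online, in-pixel-friendly update
    code += C[:, ts]

hist = np.bincount(timestamps, minlength=B)
assert np.allclose(code, C @ hist)        # equivalently: a single matrix op
```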
Zero-1-to-3: Domain-level Zero-shot Cognitive Diagnosis via One Batch of Early-bird Students towards Three Diagnostic Objectives
Cognitive diagnosis seeks to estimate the cognitive states of students by
exploring their logged practice quiz data. It plays a pivotal role in
personalized learning guidance within intelligent education systems. In this
paper, we focus on an important, practical, yet often underexplored task:
domain-level zero-shot cognitive diagnosis (DZCD), which arises due to the
absence of student practice logs in newly launched domains. Recent
cross-domain diagnostic models have proven to be a promising strategy for DZCD.
These methods primarily focus on how to transfer student states across domains.
However, they might inadvertently incorporate non-transferable information into
student representations, thereby limiting the efficacy of knowledge transfer.
To tackle this, we propose Zero-1-to-3, a domain-level zero-shot cognitive
diagnosis framework via one batch of early-bird students towards three
diagnostic objectives. Our approach initiates with pre-training a diagnosis
model with dual regularizers, which decouples student states into domain-shared
and domain-specific parts. The shared cognitive signals can be transferred to
the target domain, enriching the cognitive priors for the new domain, which
ensures the cognitive state propagation objective. Subsequently, we devise a
strategy to generate simulated practice logs for cold-start students by
analyzing the behavioral patterns of early-bird students, fulfilling the
domain-adaptation goal. Consequently, we refine the cognitive states of
cold-start students as diagnostic outcomes via virtual data, aligning with the
diagnosis-oriented goal. Finally, extensive experiments on six real-world
datasets highlight the efficacy of our model for DZCD and its practical
application in question recommendation. The code is publicly available at
https://github.com/bigdata-ustc/Zero-1-to-3.
Comment: Accepted by AAAI 2024.
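As a hedged illustration of the dual-regularizer idea, the sketch below penalizes (i) disagreement between a student's domain-shared states across two domains and (ii) overlap between the shared and domain-specific parts, one plausible way to enforce the described decoupling; both penalty forms and weights are assumptions, not the paper's exact objectives.

```python
# Hypothetical "dual regularizers" decoupling student states: an alignment
# term pulls a student's shared states together across domains, and an
# orthogonality term pushes shared and domain-specific parts apart. Penalty
# forms and weights are assumptions, not the paper's exact objectives.
import numpy as np

def dual_regularizers(shared_a, shared_b, specific_a, lam1=1.0, lam2=0.1):
    # shared_a, shared_b: (N, D) shared states of N students in two domains
    # specific_a: (N, D) domain-specific states in domain A
    align = np.mean(np.sum((shared_a - shared_b) ** 2, axis=1))
    decouple = np.mean(np.sum(shared_a * specific_a, axis=1) ** 2)
    return lam1 * align + lam2 * decouple

rng = np.random.default_rng(0)
s_a, s_b, p_a = (rng.normal(size=(16, 8)) for _ in range(3))
print(dual_regularizers(s_a, s_b, p_a))
```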