AI-Generated Images as Data Source: The Dawn of Synthetic Era
The advancement of visual intelligence is intrinsically tethered to the
availability of large-scale data. In parallel, generative Artificial
Intelligence (AI) has unlocked the potential to create synthetic images that
closely resemble real-world photographs. This prompts a compelling inquiry: how
much could visual intelligence benefit from advances in generative AI? This
paper explores the innovative concept of harnessing these AI-generated images
as new data sources, reshaping traditional modeling paradigms in visual
intelligence. In contrast to real data, AI-generated data exhibit remarkable
advantages, including unmatched abundance and scalability, the rapid generation
of vast datasets, and the effortless simulation of edge cases. Built on the
success of generative AI models, we examine the potential of their generated
data in a range of applications, from training machine learning models to
simulating scenarios for computational modeling, testing, and validation. We
probe the technological foundations that support this groundbreaking use of
generative AI, engaging in an in-depth discussion on the ethical, legal, and
practical considerations that accompany this transformative paradigm shift.
Through an exhaustive survey of current technologies and applications, this
paper presents a comprehensive view of the synthetic era in visual
intelligence. A project associated with this paper can be found at
https://github.com/mwxely/AIGS.
Comment: 20 pages, 11 figures
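
The abstract's core idea, using a text-to-image model to synthesize training data, can be sketched briefly. The snippet below is a minimal illustration rather than the paper's pipeline; the Stable Diffusion checkpoint, the prompts, and the class list are all placeholder assumptions, using the Hugging Face diffusers API.

```python
# Minimal sketch (assumed setup, not the paper's method): synthesizing
# labeled training images with a text-to-image diffusion model.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

# Hypothetical checkpoint choice; any text-to-image model would do.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Hypothetical label set; note how an edge case such as driving in
# heavy fog is simulated with nothing more than a text prompt.
class_prompts = [
    "a photo of a cat",
    "a photo of a dog",
    "a car driving in heavy fog at night",
]

out_dir = Path("synthetic_data")
out_dir.mkdir(exist_ok=True)

for label_id, prompt in enumerate(class_prompts):
    for i in range(4):  # a handful of samples per class, for illustration
        image = pipe(prompt, num_inference_steps=30).images[0]
        image.save(out_dir / f"class{label_id}_{i}.png")
```

The generated images can then feed an ordinary supervised training loop, with labels coming for free from the prompts that produced them.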
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing the activities that cause distraction in real-world driving
scenarios is critical for ensuring the safety of both drivers
and pedestrians on the roadways. Conventional computer vision techniques are
typically data-intensive and require a large volume of annotated training data
to detect and classify various distracted driving behaviors, thereby limiting
their efficiency and scalability. We aim to develop a generalized framework
that showcases robust performance with access to limited or no annotated
training data. Recently, vision-language models have offered large-scale
visual-textual pretraining that can be adapted to task-specific learning like
distracted driving activity recognition. Vision-language pretraining models,
such as CLIP, have shown significant promise in learning natural
language-guided visual representations. This paper proposes a CLIP-based driver
activity recognition approach that identifies driver distraction from
naturalistic driving images and videos. CLIP's vision embedding supports both
zero-shot transfer and task-based fine-tuning, enabling classification of
distracted activities from driving video data. Our results show that this
framework achieves state-of-the-art performance in both zero-shot transfer and
video-based CLIP prediction of the driver's state on two public datasets. We
propose both frame-based and video-based frameworks built on top of CLIP's
visual representation for the distracted driving detection and classification
task, and report the results.
Comment: 15 pages, 10 figures
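
The zero-shot mechanism the abstract describes can be illustrated in a few lines. This is a hedged sketch of generic CLIP zero-shot classification, not the authors' implementation: the checkpoint, the driver-state prompts, and the frame path are assumptions, using the Hugging Face transformers CLIP API.

```python
# Minimal sketch (assumed setup, not the authors' code): zero-shot
# classification of a driver's state from a single video frame with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical natural-language prompts describing driver states.
prompts = [
    "a photo of a driver paying attention to the road",
    "a photo of a driver texting on a phone",
    "a photo of a driver drinking while driving",
]

frame = Image.open("frame.jpg")  # one frame from a naturalistic driving video

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the frame's similarity to each text prompt;
# softmax turns these into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

A video-based variant could, for instance, aggregate per-frame image embeddings over a clip before comparing them with the text embeddings; the abstract does not specify the exact temporal aggregation used.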