Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation
The recent advances in deep learning have made it possible to generate
photo-realistic images by using neural networks and even to extrapolate video
frames from an input video clip. In this paper, to further this line of work
and to pursue a realistic application, we study image-to-video translation,
focusing in particular on videos of facial expressions. This problem challenges
deep neural networks with an additional temporal dimension compared to
image-to-image translation. Moreover, because only a single image is available
as input, most existing video generation methods, which rely on recurrent
models, fail. We propose a user-controllable approach to generate
video clips of various lengths from a single face image. The lengths and types
of the expressions are controlled by users. To this end, we design a novel
neural network architecture that can incorporate the user input into its skip
connections and propose several improvements to the adversarial training method
for the neural network. Experiments and user studies verify the effectiveness
of our approach. In particular, even for face images in the wild (downloaded
from the Web, together with the authors' own photos), our model generates
high-quality facial expression videos, about 50\% of which are labeled as real
by Amazon Mechanical Turk workers.
Comment: 10 pages
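A minimal sketch of the core architectural idea, conditioning skip connections on a user control code, could look as follows. This is an illustrative toy encoder-decoder, not the paper's exact model: the layer sizes, the 8-dimensional control vector, and the class name are all assumptions.

```python
import torch
import torch.nn as nn

class ControlledSkipGenerator(nn.Module):
    """Toy encoder-decoder whose skip connection carries a user control code
    (e.g., expression type and target frame index) alongside encoder features."""
    def __init__(self, ctrl_dim=8):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, 4, 2, 1)    # 64x64 -> 32x32
        self.enc2 = nn.Conv2d(32, 64, 4, 2, 1)   # 32x32 -> 16x16
        self.dec2 = nn.ConvTranspose2d(64, 32, 4, 2, 1)
        # decoder input = upsampled features + skip features + control code
        self.dec1 = nn.ConvTranspose2d(32 + 32 + ctrl_dim, 3, 4, 2, 1)

    def forward(self, img, ctrl):
        e1 = torch.relu(self.enc1(img))
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        # tile the control vector spatially and inject it into the skip path
        c = ctrl[:, :, None, None].expand(-1, -1, d2.shape[2], d2.shape[3])
        return torch.tanh(self.dec1(torch.cat([d2, e1, c], dim=1)))

frame = ControlledSkipGenerator()(torch.randn(1, 3, 64, 64), torch.randn(1, 8))
```

Varying the control vector while keeping the input face fixed would then steer the generated frame, which is the mechanism that lets users pick the expression type and clip length.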
Making the Invisible Visible: Action Recognition Through Walls and Occlusions
Understanding people's actions and interactions typically depends on seeing
them. Automating the process of action recognition from visual data has been
the topic of much research in the computer vision community. But what if it is
too dark, or if the person is occluded or behind a wall? In this paper, we
introduce a neural network model that can detect human actions through walls
and occlusions, and in poor lighting conditions. Our model takes radio
frequency (RF) signals as input, generates 3D human skeletons as an
intermediate representation, and recognizes actions and interactions of
multiple people over time. By translating the input to an intermediate
skeleton-based representation, our model can learn from both vision-based and
RF-based datasets, allowing the two tasks to help each other. We show that our
model achieves comparable accuracy to vision-based action recognition systems
in visible scenarios, yet continues to work accurately when people are not
visible, hence addressing scenarios that are beyond the limit of today's
vision-based action recognition.
Comment: ICCV 2019. The first two authors contributed equally to this paper.
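The two-stage design described above, where modality-specific encoders map either RF signals or video to a shared skeleton representation consumed by a single action head, can be sketched as follows. The encoders, layer sizes, and feature dimensions here are illustrative stand-ins, not the paper's actual networks.

```python
import torch
import torch.nn as nn

class SkeletonActionModel(nn.Module):
    """Sketch: RF and video branches both emit a skeleton sequence; one shared
    temporal action head classifies it, so either data source can train it."""
    def __init__(self, n_joints=14, n_actions=10):
        super().__init__()
        skel_dim = n_joints * 3                      # 3D coordinates per joint
        self.rf_encoder = nn.Linear(128, skel_dim)   # stand-in for the RF branch
        self.vid_encoder = nn.Linear(256, skel_dim)  # stand-in for a pose estimator
        self.temporal = nn.GRU(skel_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64, n_actions)

    def forward(self, feats, modality):
        # feats: (batch, time, 128) for RF, or (batch, time, 256) for video
        skel = self.rf_encoder(feats) if modality == "rf" else self.vid_encoder(feats)
        _, h = self.temporal(skel)                   # h: (1, batch, 64)
        return self.classifier(h[-1])                # action logits

model = SkeletonActionModel()
rf_logits = model(torch.randn(2, 30, 128), "rf")
vid_logits = model(torch.randn(2, 30, 256), "video")
```

Because both branches produce the same skeleton format, gradients from vision-labeled data also improve the head used at inference time on RF input, which is the cross-modal transfer the abstract describes.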
Improving CLIP Training with Language Rewrites
Contrastive Language-Image Pre-training (CLIP) stands as one of the most
effective and scalable methods for training transferable vision models using
paired image and text data. CLIP models are trained using contrastive loss,
which typically relies on data augmentations to prevent overfitting and
shortcuts. However, in the CLIP training paradigm, data augmentations are
exclusively applied to image inputs, while language inputs remain unchanged
throughout the entire training process, limiting the exposure of diverse texts
to the same image. In this paper, we introduce Language augmented CLIP
(LaCLIP), a simple yet highly effective approach to enhance CLIP training
through language rewrites. Leveraging the in-context learning capability of
large language models, we rewrite the text descriptions associated with each
image. These rewritten texts exhibit diversity in sentence structure and
vocabulary while preserving the original key concepts and meanings. During
training, LaCLIP randomly selects either the original texts or the rewritten
versions as text augmentations for each image. Extensive experiments on CC3M,
CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with
language rewrites significantly improves the transfer performance without
computation or memory overhead during training. Specifically for ImageNet
zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on
LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP
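The text-augmentation step, randomly choosing between the original caption and one of its LLM rewrites for each image at each training step, is simple to sketch. The class name, pool layout, and example captions below are illustrative assumptions, not LaCLIP's actual code.

```python
import random

class LaCLIPTextSampler:
    """Caption sampler sketch: for each image, index 0 holds the original
    caption and the rest hold LLM rewrites; each call draws one uniformly,
    so the same image is paired with diverse texts across epochs."""
    def __init__(self, caption_pools, seed=0):
        self.pools = caption_pools          # image_id -> list of captions
        self.rng = random.Random(seed)

    def __call__(self, image_id):
        return self.rng.choice(self.pools[image_id])

pools = {
    "img_001": [
        "a dog runs on the beach",                  # original caption
        "a dog sprinting along a sandy shoreline",  # hypothetical rewrite 1
        "on the beach, a dog is running",           # hypothetical rewrite 2
    ]
}
sampler = LaCLIPTextSampler(pools)
caption = sampler("img_001")
```

Since the image pipeline is untouched, this augmentation adds no training-time compute or memory beyond storing the pre-generated rewrites, consistent with the overhead claim in the abstract.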
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
We investigate the potential of learning visual representations using
synthetic images generated by text-to-image models. This is a natural question
in light of the excellent performance of such models in generating high-quality
images. We specifically consider Stable Diffusion, one of the leading
open-source text-to-image models. We show that (1) when the generative model is
configured with a proper classifier-free guidance scale, training
self-supervised methods on synthetic images can match or beat the real image
counterpart; (2) by treating the multiple images generated from the same text
prompt as positives for each other, we develop a multi-positive contrastive
learning method, which we call StableRep. With solely synthetic images, the
representations learned by StableRep surpass the performance of representations
learned by SimCLR and CLIP using the same set of text prompts and corresponding
real images, on large scale datasets. When we further add language supervision,
StableRep trained with 20M synthetic images achieves better accuracy than CLIP
trained with 50M real images.
Comment: code is available at:
https://github.com/google-research/syn-rep-lear
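The multi-positive objective, treating all images generated from the same text prompt as positives for each other, can be sketched as a cross-entropy between the softmax similarity distribution and a normalized same-prompt indicator. This is a simplified assumption about the loss, not StableRep's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(z, prompt_ids, tau=0.1):
    """Multi-positive contrastive loss sketch: z is (N, D) embeddings,
    prompt_ids is (N,) giving the text prompt each image was generated from.
    Target = row-normalized same-prompt indicator (self excluded); loss =
    cross-entropy between softmax similarities and that target."""
    z = F.normalize(z, dim=1)                         # unit-norm embeddings
    logits = z @ z.t() / tau                          # pairwise similarities
    self_mask = torch.eye(z.shape[0], dtype=torch.bool)
    logits = logits.masked_fill(self_mask, -1e9)      # exclude self-similarity
    match = (prompt_ids[:, None] == prompt_ids[None, :]) & ~self_mask
    target = match.float() / match.sum(1, keepdim=True).clamp(min=1)
    return -(target * F.log_softmax(logits, dim=1)).sum(1).mean()

# 6 embeddings from 3 prompts, 2 synthetic images per prompt
z = torch.randn(6, 32)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = multi_positive_contrastive_loss(z, ids)
```

Unlike SimCLR's single-positive InfoNCE, the target row here spreads probability mass over every other image from the same prompt, which is what makes cheap prompt-level grouping of synthetic images useful as a supervision signal.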
Identifying Product Defects from User Complaints: A Probabilistic Defect Model
The recent surge in social media use has created a massive amount of unstructured textual complaints about products and services. However, discovering potential product defects in large amounts of unstructured text is a nontrivial task. In this paper, we develop a probabilistic defect model (PDM) that simultaneously identifies the most critical product issues and the corresponding product attributes. We leverage domain-oriented key attributes of a product (e.g., product model, year of production, defective components, symptoms) to identify and extract comprehensive information about defects. We conduct comprehensive quantitative and qualitative evaluations to ensure the quality of the discovered information. Experimental results demonstrate that our proposed model outperforms an existing unsupervised method (K-means clustering) and finds more valuable information. Our research has significant managerial implications for managers, manufacturers, and policy makers.
Learning Longterm Representations for Person Re-Identification Using Radio Signals
Person Re-Identification (ReID) aims to recognize a person-of-interest across
different places and times. Existing ReID methods rely on images or videos
collected using RGB cameras. They extract appearance features like clothes,
shoes, and hair. Such features, however, can change drastically from one day
to the next, making it difficult to identify people over extended time
periods. In this paper, we introduce RF-ReID, a novel approach that harnesses
radio frequency (RF) signals for long-term person ReID. RF signals traverse
clothes and reflect off the human body; thus they can be used to extract more
persistent human-identifying features like body size and shape. We evaluate the
performance of RF-ReID on longitudinal datasets that span days and weeks, where
the person may wear different clothes across days. Our experiments demonstrate
that RF-ReID outperforms state-of-the-art RGB-based ReID approaches for
long-term person ReID. Our results also reveal two interesting properties.
First, since RF signals work in the presence of occlusions and poor lighting,
RF-ReID enables person ReID in such scenarios. Second, unlike photos and
videos, which reveal personal and private information, RF signals are more
privacy-preserving, and can hence help extend person ReID to privacy-sensitive
domains, like healthcare.
Comment: CVPR 2020. The first three authors contributed equally to this paper.
Analytical Modeling of a Doubly Clamped Flexible Piezoelectric Energy Harvester with Axial Excitation and Its Experimental Characterization
With the rapid development of wearable electronics, novel power solutions that conform to flexible surfaces are required for widespread application, so flexible energy harvesters have been extensively studied for their flexibility and stretchability. However, poor power output and insufficient sensitivity to environmental changes limit their widespread use in engineering practice. A doubly clamped flexible piezoelectric energy harvester (FPEH) with axial excitation is therefore proposed for higher power output in a low-frequency vibration environment. Combining Euler–Bernoulli beam theory and the D’Alembert principle, the differential dynamic equation of the doubly clamped harvester is derived, in which an axial load with pre-deformation is taken as the excitation mode. A numerical solution for the voltage amplitude and average power is obtained using the Rayleigh–Ritz method. A frequency sweep yields an output power of 22.5 μW at 27.1 Hz with an optimal load resistance of 1 MΩ. To power electronic devices, the converted alternating electric energy must be rectified into direct current. Using the MDA2500 standard rectifier bridge, the rectified DC output voltage across the 1 MΩ load resistor is characterized to be 2.39 V. To further validate the electromechanical dynamic model of the doubly clamped FPEH, its output performance, including the frequency response and load-resistance matching, is experimentally characterized. In the experiments, the maximum output power is 1.38 μW with a load resistance of 5.7 MΩ at 27 Hz, and the rectified DC output voltage reaches 1.84 V, which agrees with the simulation results and proves sufficient for powering LED electronics.
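The load-resistance matching idea behind the sweep above can be illustrated with a lumped Thevenin-equivalent source. This is only a resistive-matching sketch under assumed values (the paper uses a distributed-parameter Euler–Bernoulli/Rayleigh–Ritz model, and a real piezoelectric source impedance is largely capacitive); the 10 V open-circuit voltage and 1 MΩ source resistance are illustrative assumptions.

```python
import numpy as np

def load_power(v_oc, r_source, r_load):
    """Average power delivered to a resistive load by a Thevenin source:
    P = V_oc^2 * R_load / (R_source + R_load)^2, maximized at R_load = R_source."""
    return v_oc**2 * r_load / (r_source + r_load) ** 2

r_loads = np.logspace(4, 8, 401)           # sweep 10 kOhm .. 100 MOhm
powers = load_power(10.0, 1e6, r_loads)    # assumed V_oc and source resistance
r_best = r_loads[np.argmax(powers)]        # peaks at the source resistance

# Sanity check on the abstract's rectified output: 2.39 V DC across 1 MOhm
# dissipates P = V^2 / R = 2.39**2 / 1e6, i.e. about 5.7 microwatts.
p_dc = 2.39**2 / 1e6
```

The sweep recovers the familiar matched-load result, which is why both the simulation and the experiment report an optimal load resistance alongside the peak power.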