
    Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation

    Recent advances in deep learning have made it possible to generate photo-realistic images with neural networks and even to extrapolate video frames from an input video clip. In this paper, to further this exploration and pursue a realistic application of our own interest, we study image-to-video translation, focusing in particular on videos of facial expressions. This problem challenges deep neural networks with an additional temporal dimension compared to image-to-image translation. Moreover, its single input image defeats most existing video generation methods, which rely on recurrent models. We propose a user-controllable approach that generates video clips of various lengths from a single face image, with the lengths and types of the expressions controlled by the user. To this end, we design a novel neural network architecture that incorporates the user input into its skip connections, and we propose several improvements to the adversarial training of the network. Experiments and user studies verify the effectiveness of our approach. In particular, even for face images in the wild (downloaded from the Web and from the authors' own photos), our model generates high-quality facial expression videos of which about 50% are labeled as real by Amazon Mechanical Turk workers. Comment: 10 pages
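
    The abstract describes a generator that injects the user's control signal into its skip connections. Below is a minimal, hypothetical PyTorch sketch of that general idea; the module sizes, the control-vector dimensionality, and the concatenation scheme are illustrative assumptions, not the authors' actual architecture.

    # Hypothetical sketch: an encoder-decoder generator whose skip connection is
    # conditioned on a user control vector (e.g., expression type and desired length).
    import torch
    import torch.nn as nn

    class ControlledSkipGenerator(nn.Module):
        def __init__(self, ctrl_dim=8, ch=64):
            super().__init__()
            self.enc1 = nn.Conv2d(3, ch, 4, stride=2, padding=1)        # 128 -> 64
            self.enc2 = nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)   # 64 -> 32
            self.dec2 = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
            # The skip carries encoder features concatenated with the broadcast control code.
            self.dec1 = nn.ConvTranspose2d(ch + ch + ctrl_dim, 3, 4, stride=2, padding=1)

        def forward(self, img, ctrl):
            e1 = torch.relu(self.enc1(img))
            e2 = torch.relu(self.enc2(e1))
            d2 = torch.relu(self.dec2(e2))
            # Broadcast the control vector over the spatial grid and fuse it into the skip.
            c = ctrl[:, :, None, None].expand(-1, -1, e1.shape[2], e1.shape[3])
            skip = torch.cat([e1, c], dim=1)
            return torch.tanh(self.dec1(torch.cat([d2, skip], dim=1)))

    frame = ControlledSkipGenerator()(torch.randn(1, 3, 128, 128), torch.randn(1, 8))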

    Making the Invisible Visible: Action Recognition Through Walls and Occlusions

    Understanding people's actions and interactions typically depends on seeing them. Automating action recognition from visual data has been the topic of much research in the computer vision community. But what if it is too dark, or if the person is occluded or behind a wall? In this paper, we introduce a neural network model that can detect human actions through walls and occlusions, and in poor lighting conditions. Our model takes radio frequency (RF) signals as input, generates 3D human skeletons as an intermediate representation, and recognizes actions and interactions of multiple people over time. By translating the input to an intermediate skeleton-based representation, our model can learn from both vision-based and RF-based datasets, allowing the two tasks to help each other. We show that our model achieves accuracy comparable to vision-based action recognition systems in visible scenarios, yet continues to work accurately when people are not visible, hence addressing scenarios beyond the limits of today's vision-based action recognition. Comment: ICCV 2019. The first two authors contributed equally to this paper.
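
    A rough illustration of the two-stage pipeline described above (RF input to 3D skeletons, then skeleton sequences to action labels), written as a hypothetical PyTorch sketch; the layer choices, tensor shapes, and joint count are assumptions for illustration only.

    # Illustrative two-stage pipeline: RF frames -> 3D skeletons -> action logits.
    import torch
    import torch.nn as nn

    class RFToSkeleton(nn.Module):
        def __init__(self, rf_dim=256, joints=14):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(rf_dim, 512), nn.ReLU(),
                                     nn.Linear(512, joints * 3))
            self.joints = joints

        def forward(self, rf):                 # rf: (batch, time, rf_dim)
            return self.net(rf).view(rf.shape[0], rf.shape[1], self.joints, 3)

    class SkeletonActionClassifier(nn.Module):
        def __init__(self, joints=14, n_actions=10):
            super().__init__()
            self.gru = nn.GRU(joints * 3, 128, batch_first=True)
            self.head = nn.Linear(128, n_actions)

        def forward(self, skel):               # skel: (batch, time, joints, 3)
            seq = skel.flatten(2)              # (batch, time, joints*3)
            _, h = self.gru(seq)
            return self.head(h[-1])            # action logits

    skel = RFToSkeleton()(torch.randn(2, 30, 256))
    logits = SkeletonActionClassifier()(skel)  # the skeleton stage lets RF and vision data share this classifier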

    Improving CLIP Training with Language Rewrites

    Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are applied exclusively to image inputs, while language inputs remain unchanged throughout training, limiting the exposure of the same image to diverse texts. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on the CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without computation or memory overhead during training. Specifically, for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP
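
    The text-augmentation step lends itself to a very small sketch: for each image, randomly pick either the original caption or one of its LLM-rewritten variants before computing the usual contrastive loss. The function and loop names below are placeholders, not LaCLIP's actual code.

    # Minimal sketch of language rewrites as text augmentation.
    import random

    def sample_caption(original, rewrites):
        """Return one text view per image: the original caption or a random rewrite."""
        candidates = [original] + list(rewrites)
        return random.choice(candidates)

    # Hypothetical usage inside a CLIP-style training loop:
    # for image, caption, rewrites in loader:
    #     text = sample_caption(caption, rewrites)
    #     loss = clip_contrastive_loss(image_encoder(image), text_encoder(text))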

    StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

    We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat training on their real-image counterparts; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and the corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images. Comment: code is available at https://github.com/google-research/syn-rep-learn
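
    The multi-positive idea can be sketched as a contrastive loss whose target distribution spreads probability uniformly over all images generated from the same prompt. The following is a generic formulation of such a loss in PyTorch, not necessarily StableRep's exact implementation.

    # Generic multi-positive contrastive loss: images sharing a prompt are mutual positives.
    import torch
    import torch.nn.functional as F

    def multi_positive_contrastive_loss(features, prompt_ids, temperature=0.1):
        """features: (N, D) embeddings; prompt_ids: (N,) prompt index for each image."""
        z = F.normalize(features, dim=1)
        logits = z @ z.t() / temperature                       # (N, N) cosine similarities
        eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
        logits = logits.masked_fill(eye, -1e9)                 # exclude self-similarity
        pos = (prompt_ids[:, None] == prompt_ids[None, :]).float()
        pos = pos.masked_fill(eye, 0)                          # positives: same prompt, not self
        target = pos / pos.sum(dim=1, keepdim=True).clamp(min=1)
        return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    loss = multi_positive_contrastive_loss(torch.randn(8, 128),
                                           torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))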

    Combined key-frame extraction and object-based video segmentation

    Identifying Product Defects from User Complaints: A Probabilistic Defect Model

    The recent surge in social media use has created a massive amount of unstructured textual complaints about products and services. However, discovering potential product defects from large amounts of unstructured text is a nontrivial task. In this paper, we develop a probabilistic defect model (PDM) that simultaneously identifies the most critical product issues and the corresponding product attributes. We leverage domain-oriented key attributes of a product (e.g., product model, year of production, defective components, symptoms) to identify defects and acquire a complete picture of each. We conduct comprehensive quantitative and qualitative evaluations to ensure the quality of the discovered information. Experimental results demonstrate that our proposed model outperforms an existing unsupervised method (K-Means clustering) and finds more valuable information. Our research has significant managerial implications for managers, manufacturers, and policy makers.
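
    The paper's PDM is its own probabilistic model; as a rough stand-in that merely illustrates mining latent defect topics and attribute terms from unstructured complaints, the sketch below runs a standard LDA topic model (scikit-learn) over a few toy complaint strings.

    # Swapped-in illustration only: LDA over toy complaints, not the paper's PDM.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    complaints = [
        "2015 model transmission slips when shifting gears",
        "engine stalls at low speed on the 2016 sedan",
        "brake pedal feels soft and squeaks on the 2015 model",
    ]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(complaints)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-4:][::-1]]
        print(f"defect topic {k}: {top}")   # candidate defect issues and attribute terms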

    Learning Longterm Representations for Person Re-Identification Using Radio Signals

    Person Re-Identification (ReID) aims to recognize a person of interest across different places and times. Existing ReID methods rely on images or videos collected with RGB cameras. They extract appearance features such as clothes, shoes, and hair. Such features, however, can change drastically from one day to the next, leading to an inability to identify people over extended time periods. In this paper, we introduce RF-ReID, a novel approach that harnesses radio frequency (RF) signals for long-term person ReID. RF signals traverse clothes and reflect off the human body; thus they can be used to extract more persistent human-identifying features like body size and shape. We evaluate the performance of RF-ReID on longitudinal datasets that span days and weeks, where a person may wear different clothes on different days. Our experiments demonstrate that RF-ReID outperforms state-of-the-art RGB-based ReID approaches for long-term person ReID. Our results also reveal two interesting features: first, since RF signals work in the presence of occlusions and poor lighting, RF-ReID allows for person ReID in such scenarios; second, unlike photos and videos, which reveal personal and private information, RF signals are more privacy-preserving and hence can help extend person ReID to privacy-concerned domains, like healthcare. Comment: CVPR 2020. The first three authors contributed equally to this paper.
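
    The retrieval step implied above can be sketched as aggregating per-frame RF features into one long-term track descriptor and ranking gallery identities by cosine similarity. The feature extractor itself is out of scope here; the shapes and names below are assumptions for illustration.

    # Hedged sketch of long-term ReID matching from per-frame RF features.
    import torch
    import torch.nn.functional as F

    def track_descriptor(frame_features):
        """frame_features: (T, D) per-frame RF features -> (D,) normalized track descriptor."""
        return F.normalize(frame_features.mean(dim=0), dim=0)

    def rank_gallery(query_track, gallery_tracks):
        """Return gallery indices sorted from best to worst match by cosine similarity."""
        q = track_descriptor(query_track)
        g = torch.stack([track_descriptor(t) for t in gallery_tracks])   # (N, D)
        scores = g @ q
        return torch.argsort(scores, descending=True)

    order = rank_gallery(torch.randn(25, 256), [torch.randn(30, 256) for _ in range(5)])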

    Analytical Modeling of a Doubly Clamped Flexible Piezoelectric Energy Harvester with Axial Excitation and Its Experimental Characterization

    With the rapid development of wearable electronics, novel power solutions that adapt to flexible surfaces are required for widespread applications; flexible energy harvesters have therefore been extensively studied for their flexibility and stretchability. However, poor power output and insufficient sensitivity to environmental changes limit their widespread use in engineering practice. A doubly clamped flexible piezoelectric energy harvester (FPEH) with axial excitation is therefore proposed for higher power output in a low-frequency vibration environment. Combining Euler–Bernoulli beam theory and the D’Alembert principle, the differential dynamic equation of the doubly clamped energy harvester is derived, in which the excitation mode of an axial load with pre-deformation is considered. A numerical solution for the voltage amplitude and average power is obtained using the Rayleigh–Ritz method. A frequency-sweep analysis determines an output power of 22.5 μW at 27.1 Hz with an optimal load resistance of 1 MΩ. To power electronic devices, the converted alternating electric energy must be rectified into direct current. Connected to an MDA2500 standard rectifier bridge, the rectified DC output voltage across the 1 MΩ load resistor is characterized as 2.39 V. To further validate the mechanical-electrical dynamical model of the doubly clamped flexible piezoelectric energy harvester, its output performance, including both the frequency response and load-resistance matching, is experimentally characterized. From the experimental results, the maximum output power is 1.38 μW with a load resistance of 5.7 MΩ at 27 Hz, and the rectified DC output voltage reaches 1.84 V, which agrees with the simulation results and proves sufficient to power LED electronics.
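
    As a quick sanity check on the reported figures, the average power delivered to a resistive load relates to the RMS voltage by P = V_rms^2 / R. Assuming the quoted powers are average power into a purely resistive load, the implied RMS voltages work out to roughly 4.7 V (simulation) and 2.8 V (experiment), as the small sketch below computes.

    # Consistency check using P = V_rms**2 / R for a resistive load
    # (assumes the reported values are average power across the stated resistance).
    from math import sqrt

    def v_rms(power_w, load_ohm):
        # Implied RMS voltage for average power P into resistance R: V = sqrt(P * R)
        return sqrt(power_w * load_ohm)

    print(v_rms(22.5e-6, 1e6))    # simulated optimum: ~4.7 V across 1 MOhm
    print(v_rms(1.38e-6, 5.7e6))  # measured optimum:  ~2.8 V across 5.7 MOhm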