Video Synthesis from the StyleGAN Latent Space

Author: Zhang Lei
Publication venue: SJSU ScholarWorks
Publication date: 20/05/2020

Generative models have shown impressive results in generating synthetic images. However, video synthesis remains difficult even for these models: the best videos they can currently produce are a few seconds long, distorted, and low resolution. For this project, I propose and implement a model that synthesizes videos of human facial expressions at 1024x1024x32 resolution, using static images generated by a Generative Adversarial Network trained on human facial images. To the best of my knowledge, this is the first work that generates realistic videos larger than 256x256 resolution from single starting images. The model improves on two state-of-the-art models, TGAN and MoCoGAN, both quantitatively and qualitatively. In the quantitative comparison, this project reaches a best Average Content Distance (ACD) score of 0.167, compared to 0.305 for TGAN and 0.201 for MoCoGAN.
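The abstract does not define ACD, so the following is only a minimal sketch of one common reading: for face videos, the metric is usually described as the average pairwise L2 distance between per-frame identity feature vectors, with lower values indicating more consistent content across frames. The function name `average_content_distance` and the toy feature vectors are hypothetical, and the feature extractor (for example, a face recognition network) is assumed to exist elsewhere.

```python
import math
from itertools import combinations


def average_content_distance(frame_features):
    """Mean pairwise Euclidean (L2) distance between per-frame feature vectors.

    frame_features: a sequence of equal-length numeric vectors, one per frame.
    How the vectors are extracted is an assumption left outside this sketch.
    """
    pairs = list(combinations(frame_features, 2))
    if not pairs:
        raise ValueError("need at least two frames to compare")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy usage with made-up 2-D "features" for a three-frame clip.
clip = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
acd = average_content_distance(clip)
print(round(acd, 3))
```

A score for a whole model would then be the mean of this value over many generated clips; that averaging step is likewise an assumption, not something the abstract states.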

Multimodal Content Analysis for Effective Advertisements on YouTube

Authors: Gupta Harsh; Johnson Joseph; Lee Hyunhwan; Ogihara Mitsunori; Parthasarathy Srinivasan; Ren Gang; Sun Wei; Vedula Nikhita
Publication date: 12/09/2017

The rapid advances in e-commerce and Web 2.0 technologies have greatly increased the impact of commercial advertisements on the general public. As a key enabling technology, a multitude of recommender systems exist that analyze user features and browsing patterns to recommend appealing advertisements to users. In this work, we study the attributes that characterize an effective advertisement and recommend a useful set of features to aid the design and production of commercial advertisements. We analyze the temporal patterns in the multimedia content of advertisement videos, including auditory, visual and textual components, and study their individual roles and synergies in the success of an advertisement. The objective of this work is to measure the effectiveness of an advertisement and to recommend a useful set of features to advertisement designers to make it more successful and approachable to users. Our proposed framework employs the signal processing technique of cross-modality feature learning, in which data streams from different components are used to train separate neural network models and are then fused to learn a shared representation. Subsequently, a neural network model trained on this joint feature embedding is used as a classifier to predict advertisement effectiveness. We validate our approach using subjective ratings from a dedicated user study, the sentiment strength of online viewer comments, and a viewer opinion metric given by the ratio of Likes to Views received by each advertisement on an online platform. Comment: 11 pages, 5 figures, ICDM 201

arXiv.org e-Print Archive

Crossref

The Warburg Dance Movement Library-The WADAMO Library: A Validation Study

Authors: Christensen J. F.; Lambrechts A.; Tsakiris M.
Publication venue: SAGE Publications
Publication date: 01/01/2019

The Warburg Dance Movement Library is a validated set of 234 video clips of dance movements for empirical research in the fields of cognitive science and neuroscience of action perception, affect perception and neuroaesthetics. The library contains two categories of video clips of dance movement sequences, arranged in pairs. Of each pair, one version of the movement sequence is emotionally expressive (Clip a), while the other version of the same sequence (Clip b) is not expressive but is as technically correct as the expressive version. We sought to complement previous dance video stimuli libraries. Facial information, colour and music have been removed, and each clip has been faded in and out. We equalised stimulus length (6 seconds, 8 counts in dance theory), the dancers’ clothing and the video background, included both male and female dancers, and controlled for technical correctness of movement execution. The Warburg Dance Movement Library contains both contemporary and ballet movements. Two online surveys (N = 160) confirmed the classification into the two categories of expressivity. Four additional online surveys (N = 80) provided beauty and liking ratings for each clip. A correlation matrix illustrates all variables of this norming study (technical correctness, expressivity, beauty, liking, luminance, motion energy).

City Research Online

MPG.PuRe

Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos

Authors: Acar Esra; Albayrak Sahin; Hopfgartner Frank
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/06/2015

When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
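The four target categories mentioned above can be illustrated with a small sketch that maps a (valence, arousal) rating pair to its quadrant. This is not the authors' labeling code. It assumes ratings on a 1 to 9 scale split at a midpoint of 5, with ties counted as "high"; the function name `va_quadrant` and both of those choices are hypothetical.

```python
def va_quadrant(valence, arousal, midpoint=5.0):
    """Return the Valence-Arousal quadrant label for one rating pair.

    Assumption: a rating at or above `midpoint` counts as "high".
    """
    v = "high" if valence >= midpoint else "low"
    a = "high" if arousal >= midpoint else "low"
    return f"{v} valence, {a} arousal"


# Toy usage with made-up ratings for four clips.
for ratings in [(7.2, 8.0), (2.5, 7.1), (3.0, 2.0), (6.4, 1.5)]:
    print(ratings, "->", va_quadrant(*ratings))
```

A classifier such as a multi-class SVM would then be trained to predict these four labels from the extracted features.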

Crossref

Enlighten