Hallucinating Pose-Compatible Scenes
What does human pose tell us about a scene? We propose a task to answer this
question: given human pose as input, hallucinate a compatible scene. Subtle
cues captured by human pose -- action semantics, environment affordances,
object interactions -- provide surprising insight into which scenes are
compatible. We present a large-scale generative adversarial network for
pose-conditioned scene generation. We significantly scale the size and
complexity of training data, curating a massive meta-dataset containing over 19
million frames of humans in everyday environments. We double the capacity of
our model with respect to StyleGAN2 to handle such complex data, and design a
pose conditioning mechanism that drives our model to learn the nuanced
relationship between pose and scene. We leverage our trained model for various
applications: hallucinating pose-compatible scene(s) with or without humans,
visualizing incompatible scenes and poses, placing a person from one generated
image into another scene, and animating pose. Our model produces diverse
samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix baselines in
terms of accurate human placement (percent of correct keypoints) and image
quality (Fréchet inception distance).
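The abstract above names percent of correct keypoints (PCK) as its placement metric without defining it. As a minimal sketch of how PCK is commonly computed, a keypoint counts as correct when its predicted location lies within a fraction `alpha` of a reference length from the ground truth. The function name `pck`, the parameters `ref_size` and `alpha`, and the default threshold are assumptions for illustration; the abstract does not state which normalization or threshold the authors use.

```python
import numpy as np


def pck(pred, gt, ref_size, alpha=0.5, visible=None):
    """Fraction of keypoints within alpha * ref_size of the ground truth.

    pred, gt : arrays of shape (K, 2) holding (x, y) keypoint coordinates.
    ref_size : reference length (e.g. a head or torso size); an assumption,
               since the source does not say which normalization is used.
    alpha    : threshold as a fraction of ref_size (hypothetical default).
    visible  : optional boolean mask of shape (K,); hidden keypoints are skipped.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Euclidean distance between each predicted and ground-truth keypoint.
    dist = np.linalg.norm(pred - gt, axis=-1)
    correct = dist <= alpha * ref_size

    if visible is None:
        return float(correct.mean())

    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0  # no keypoints to score
    return float(correct[visible].mean())


# Toy example: three keypoints land within the threshold, one does not.
pred = [[0, 0], [10, 0], [0, 10], [30, 30]]
gt = [[0, 1], [10, 0], [0, 10], [0, 0]]
score = pck(pred, gt, ref_size=10.0, alpha=0.5)
```

With a threshold of 5 pixels, the first three keypoints are within range and the fourth is not, so `score` is 0.75 in this toy case.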
DIY Human Action Data Set Generation
The recent successes in applying deep learning techniques to solve standard
computer vision problems have inspired researchers to propose new computer vision
problems in different domains. As previously established in the field, training
data itself plays a significant role in the machine learning process,
especially for deep learning approaches, which are data hungry. In order to solve
each new problem and achieve decent performance, a large amount of data needs to
be captured, which may in many cases pose logistical difficulties. Therefore,
the ability to generate de novo data or expand an existing data set, however
small, in order to satisfy the data requirements of current networks may be
invaluable. Herein, we introduce a novel way to partition an action video clip
into action, subject and context. Each part is manipulated separately and
reassembled with our proposed video generation technique. Furthermore, our
novel human skeleton trajectory generation, along with our proposed video
generation technique, enables us to generate unlimited action recognition
training data. These techniques enable us to generate video action clips from
a small set without costly and time-consuming data acquisition. Lastly, we
demonstrate through an extensive set of experiments on two small human action
recognition data sets that this new data generation technique can improve the
performance of current action recognition neural nets.
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The goal of this work is to reconstruct speech from a silent talking face
video. Recent studies have shown impressive performance on synthesizing speech
from silent talking face videos. However, they have not explicitly considered
the varying identity characteristics of different speakers, which pose a
challenge in video-to-speech synthesis, and this becomes more critical in
unseen-speaker settings. Our approach is to separate the speech content and the
visage-style from a given silent talking face video. By guiding the model to
independently focus on modeling the two representations, we can obtain
speech of high intelligibility from the model even when the input video of an
unseen subject is given. To this end, we introduce speech-visage selection that
separates the speech content and the speaker identity from the visual features
of the input video. The disentangled representations are jointly incorporated
to synthesize speech through a visage-style-based synthesizer, which generates
speech by coating the visage-styles while maintaining the speech content. Thus,
the proposed framework brings the advantage of synthesizing speech
containing the right content even with the silent talking face video of an
unseen subject. We validate the effectiveness of the proposed framework on the
GRID, TCD-TIMIT volunteer, and LRW datasets. Comment: Accepted by ECCV 202
Detection of Machine-Generated Text: Literature Survey
Since language models produce fake text quickly and easily, there is an
oversupply of such content in the public domain. The degree of sophistication
and writing style has reached a point where differentiating between
human-authored and machine-generated content is nearly impossible. As a result, works
generated by language models rather than human authors have gained significant
media attention and stirred controversy. Concerns regarding the possible
influence of advanced language models on society have also arisen, calling for a
fuller understanding of these processes. Natural language generation (NLG) and
generative pre-trained transformer (GPT) models have revolutionized a variety
of sectors: their reach has not only permeated journalism and customer
service but also extended to academia. To mitigate the hazardous implications that
may arise from the use of these models, preventative measures must be
implemented, such as providing human agents with the capacity to distinguish
between artificially made and human-composed texts using automated systems
and possibly reverse-engineered language models. Furthermore, to ensure a
balanced and responsible approach, it is critical to have a full grasp of the
socio-technological ramifications of these breakthroughs. This literature
survey aims to compile and synthesize accomplishments and developments in the
aforementioned work, while also identifying future prospects. It also gives an
overview of machine-generated text trends and explores the larger societal
implications. Ultimately, this survey intends to contribute to the development
of robust and effective approaches for resolving the issues connected with the
usage and detection of machine-generated text by exploring the interplay
between the capabilities of language models and their possible implications.
- …