Adversarial Training for Multi-Channel Sign Language Production
Sign Languages are rich multi-channel languages, requiring articulation of
both manual (hands) and non-manual (face and body) features in a precise,
intricate manner. Sign Language Production (SLP), the automatic translation
from spoken to sign languages, must embody this full sign morphology to be
truly understandable by the Deaf community. Previous work has mainly focused on
manual feature production, with an under-articulated output caused by
regression to the mean.
In this paper, we propose an Adversarial Multi-Channel approach to SLP. We
frame sign production as a minimax game between a transformer-based Generator
and a conditional Discriminator. Our adversarial discriminator evaluates the
realism of sign production conditioned on the source text, pushing the
generator towards a realistic and articulate output. Additionally, we fully
encapsulate sign articulators with the inclusion of non-manual features,
producing facial features and mouthing patterns.
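
As a rough illustration of the minimax framing above, a conditional adversarial objective is commonly written in the following form. This is the generic conditional-GAN formulation, not necessarily the exact loss used in the paper; the symbols $x$ (source text), $y$ (ground-truth sign pose sequence), $G$ (Generator) and $D$ (Discriminator) are notation assumed here for the sketch.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{(x, y)} \big[ \log D(y \mid x) \big]
  + \mathbb{E}_{x} \big[ \log \big( 1 - D(G(x) \mid x) \big) \big]
```

Under this reading, the Discriminator is rewarded for scoring real sequences high and generated ones low given the same text, while the Generator is rewarded for producing sequences the Discriminator accepts as real.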
We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T)
dataset, and report state-of-the-art SLP back-translation performance for
manual production. We set new benchmarks for the production of multi-channel
sign to underpin future research into realistic SLP.
ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit
Dance and music are two highly correlated artistic forms. Synthesizing dance
motions has attracted much attention recently. Most previous works conduct
music-to-dance synthesis by mapping music directly to human skeleton
keypoints. In contrast, human choreographers design dance motions from music in a
two-stage manner: they first devise multiple choreographic action units
(CAUs), each with a series of dance motions, and then arrange the CAU sequence
according to the rhythm, melody and emotion of the music. Inspired by this, we
systematically study this two-stage choreography approach and construct a
dataset that incorporates such choreography knowledge. Based on the constructed
dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to
imitate the human choreography procedure. Our framework first devises a CAU
prediction model to learn the mapping relationship between music and CAU
sequences. Afterwards, we devise a spatial-temporal inpainting model to convert
the CAU sequence into continuous dance motions. Experimental results
demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in
terms of CAU BLEU score and 1.59 in terms of user study score).
Comment: 10 pages, 5 figures, Accepted by ACM MM 202
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Sign languages are multi-channel visual languages, where signers use a
continuous 3D space to communicate. Sign Language Production (SLP), the
automatic translation from spoken to sign languages, must embody both the
continuous articulation and full morphology of sign to be truly understandable
by the Deaf community. Previous deep learning-based SLP works have produced
only a concatenation of isolated signs focusing primarily on the manual
features, leading to a robotic and non-expressive production.
In this work, we propose a novel Progressive Transformer architecture, the
first SLP model to translate from spoken language sentences to continuous 3D
multi-channel sign pose sequences in an end-to-end manner. Our transformer
network architecture introduces a counter decoding technique that enables variable-length
continuous sequence generation by tracking the production progress over time
and predicting the end of sequence. We present extensive data augmentation
techniques to reduce prediction drift, alongside an adversarial training regime
and a Mixture Density Network (MDN) formulation to produce realistic and
expressive sign pose sequences.
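
The counter-based stopping rule described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the decoder emits, per frame, a pose vector plus a progress counter that rises towards 1.0, and that generation stops once the counter reaches 1.0. The names `counter_decode`, `toy_step` and `max_frames` are hypothetical, and `toy_step` is a stand-in for a trained model.

```python
from typing import Callable, List, Tuple

Pose = List[float]
# A step function maps the frames and counters produced so far
# to the next (pose, counter) pair.
StepFn = Callable[[List[Pose], List[float]], Tuple[Pose, float]]


def counter_decode(step: StepFn, max_frames: int = 300) -> List[Pose]:
    """Generate frames until the predicted counter reaches 1.0.

    Assumption: a counter value >= 1.0 signals end of sequence.
    max_frames is a safety cap in case the counter never gets there.
    """
    poses: List[Pose] = []
    counters: List[float] = []
    for _ in range(max_frames):
        pose, counter = step(poses, counters)
        poses.append(pose)
        counters.append(counter)
        if counter >= 1.0:
            break
    return poses


def toy_step(poses: List[Pose], counters: List[float]) -> Tuple[Pose, float]:
    """Hypothetical stand-in for a decoder: progress advances 0.25 per frame."""
    frame_index = len(poses) + 1
    return [0.0, 0.0, 0.0], frame_index * 0.25


frames = counter_decode(toy_step)
print(len(frames))  # number of frames generated before the counter hit 1.0
```

Because the sequence length is determined by the counter rather than a fixed horizon, the same loop handles short and long productions; the cap only guards against a counter that never reaches 1.0.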
We propose a back translation evaluation mechanism for SLP, presenting
benchmark quantitative results on the challenging PHOENIX14T dataset and
setting baselines for future research. We further provide a user evaluation of
our SLP model, to understand the Deaf reception of our sign pose productions.