Adversarial Training for Multi-Channel Sign Language Production
Sign Languages are rich multi-channel languages, requiring articulation of
both manual (hands) and non-manual (face and body) features in a precise,
intricate manner. Sign Language Production (SLP), the automatic translation
from spoken to sign languages, must embody this full sign morphology to be
truly understandable by the Deaf community. Previous work has mainly focused on
manual feature production, with an under-articulated output caused by
regression to the mean.
In this paper, we propose an Adversarial Multi-Channel approach to SLP. We
frame sign production as a minimax game between a transformer-based Generator
and a conditional Discriminator. Our adversarial discriminator evaluates the
realism of sign production conditioned on the source text, pushing the
generator towards a realistic and articulate output. Additionally, we fully
encapsulate sign articulators with the inclusion of non-manual features,
producing facial features and mouthing patterns.
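To make the minimax framing concrete, the following is a minimal sketch of a conditional adversarial training step; it is our illustration rather than the authors' implementation, and `generator`, `discriminator`, and the tensor shapes are assumed interfaces.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, text_emb, real_poses):
    """One minimax step: the discriminator learns to separate real from
    generated (text, pose) pairs; the generator learns to fool it.
    All interfaces here are illustrative assumptions."""
    fake_poses = generator(text_emb)

    # Discriminator update: real pairs -> 1, generated pairs -> 0.
    d_real = discriminator(text_emb, real_poses)
    d_fake = discriminator(text_emb, fake_poses.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator update: fool the conditional discriminator, plus a
    # regression term that keeps the output close to the ground truth.
    g_adv = F.binary_cross_entropy_with_logits(
        discriminator(text_emb, fake_poses), torch.ones_like(d_fake))
    g_loss = g_adv + F.mse_loss(fake_poses, real_poses)
    return d_loss, g_loss
```

Conditioning the discriminator on the source text is what pushes the generator beyond mean poses: an under-articulated but plausible sequence can still be rejected for not matching the sentence.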
We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T)
dataset, and report state-of-the-art SLP back-translation performance for
manual production. We set new benchmarks for the production of multi-channel
sign to underpin future research into realistic SLP.
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Sign languages are multi-channel visual languages, where signers use a
continuous 3D space to communicate. Sign Language Production (SLP), the
automatic translation from spoken to sign languages, must embody both the
continuous articulation and full morphology of sign to be truly understandable
by the Deaf community. Previous deep learning-based SLP works have produced
only a concatenation of isolated signs, focusing primarily on the manual
features, leading to a robotic and non-expressive production.
In this work, we propose a novel Progressive Transformer architecture, the
first SLP model to translate from spoken language sentences to continuous 3D
multi-channel sign pose sequences in an end-to-end manner. Our transformer
network architecture introduces a counter decoding that enables variable length
continuous sequence generation by tracking the production progress over time
and predicting the end of sequence. We present extensive data augmentation
techniques to reduce prediction drift, alongside an adversarial training regime
and a Mixture Density Network (MDN) formulation to produce realistic and
expressive sign pose sequences.
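As a rough illustration of the MDN formulation, the sketch below shows a decoder head that parameterises a Gaussian mixture over the next pose frame instead of regressing a single point; sizes and shapes are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Predicts a K-component Gaussian mixture over a pose vector.
    Layer sizes and shapes are illustrative, not the paper's."""
    def __init__(self, hidden_dim, pose_dim, n_mix=5):
        super().__init__()
        self.n_mix, self.pose_dim = n_mix, pose_dim
        self.pi = nn.Linear(hidden_dim, n_mix)                    # mixture weights
        self.mu = nn.Linear(hidden_dim, n_mix * pose_dim)         # component means
        self.log_sigma = nn.Linear(hidden_dim, n_mix * pose_dim)  # component spreads

    def forward(self, h):
        B = h.shape[0]
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(B, self.n_mix, self.pose_dim)
        sigma = self.log_sigma(h).view(B, self.n_mix, self.pose_dim).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, target):
    """Negative log-likelihood of the target pose under the mixture."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(target.unsqueeze(1)).sum(-1)  # (B, K)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

Sampling from the mixture, rather than taking a single regressed mean, is what lets the model produce expressive rather than averaged poses.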
We propose a back translation evaluation mechanism for SLP, presenting
benchmark quantitative results on the challenging PHOENIX14T dataset and
setting baselines for future research. We further provide a user evaluation of
our SLP model, to understand the Deaf reception of our sign pose productions.
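The back-translation evaluation loop can be sketched in a few lines: produced pose sequences are fed to a pre-trained sign language translation model and the recovered text is scored against the reference. The `generate`/`translate` interfaces below are assumptions; `sacrebleu` is one common choice of scorer.

```python
import sacrebleu

def back_translation_bleu(slp_model, slt_model, source_texts, reference_texts):
    """Score an SLP model by translating its pose output back to text.
    `slp_model.generate` and `slt_model.translate` are assumed interfaces."""
    hypotheses = []
    for text in source_texts:
        poses = slp_model.generate(text)               # text -> sign pose sequence
        hypotheses.append(slt_model.translate(poses))  # poses -> back to spoken text
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
```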
TensorLayer: A Versatile Library for Efficient Deep Learning Development
Deep learning has enabled major advances in the fields of computer vision,
natural language processing, and multimedia among many others. Developing a
deep learning system is arduous and complex, as it involves constructing neural
network architectures, managing training/trained models, tuning the
optimization process, preprocessing and organizing data, and so on.
TensorLayer is a versatile
Python library that aims at helping researchers and engineers efficiently
develop deep learning systems. It offers rich abstractions for neural networks,
model and data management, and parallel workflow mechanism. While boosting
efficiency, TensorLayer maintains both performance and scalability. TensorLayer
was released in September 2016 on GitHub, and has helped people from academia
and industry develop real-world applications of deep learning.Comment: ACM Multimedia 201
Progressive Transformers for End-to-End Sign Language Production
The goal of automatic Sign Language Production (SLP) is to translate spoken
language to a continuous stream of sign language video at a level comparable to
a human translator. If this were achievable, it would revolutionise
Deaf-hearing communication. Previous work, predominantly on isolated SLP, has
shown the need for architectures that are better suited to the continuous
domain of full sign sequences.
In this paper, we propose Progressive Transformers, a novel architecture that
can translate from discrete spoken language sentences to continuous 3D skeleton
pose outputs representing sign language. We present two model configurations,
an end-to-end network that produces sign directly from text and a stacked network
that utilises a gloss intermediary.
Our transformer network architecture introduces a counter that enables
continuous sequence generation at training and inference. We also provide
several data augmentation processes to overcome the problem of drift and
improve the performance of SLP models. We propose a back translation evaluation
mechanism for SLP, presenting benchmark quantitative results on the challenging
RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for
future research.
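To make the counter idea concrete, here is a hedged sketch of counter-based decoding: each step emits a pose frame plus a progress value in [0, 1], and generation stops once the counter saturates. The decoder interface and start frame are assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def counter_decode(decoder, text_memory, max_len=300, eos_threshold=0.99):
    """Autoregressive pose generation driven by a progress counter.
    `decoder` is assumed to return (pose_frame, counter) at each step."""
    poses = []
    prev = torch.zeros(1, decoder.pose_dim)  # assumed zero start frame
    for _ in range(max_len):
        prev, counter = decoder(prev, text_memory)
        poses.append(prev)
        if counter >= eos_threshold:  # counter near 1 marks end of sequence
            break
    return torch.cat(poses, dim=0)  # (T, pose_dim) continuous sign sequence
```

Because the counter, rather than a fixed output length, terminates decoding, the same network can produce variable-length continuous sequences at both training and inference time.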
Towards Diverse and Natural Image Descriptions via a Conditional GAN
Despite the substantial progress in recent years, image captioning
techniques are still far from perfect. Sentences produced by existing
methods, e.g. those based on RNNs, are often overly rigid and lacking in
variability. This issue is related to a learning principle widely used in
practice, that is, to maximize the likelihood of training samples. This
principle encourages high resemblance to the "ground-truth" captions while
suppressing other reasonable descriptions. Conventional evaluation metrics,
e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we
explore an alternative approach, with the aim to improve the naturalness and
diversity -- two essential properties of human expression. Specifically, we
propose a new framework based on Conditional Generative Adversarial Networks
(CGAN), which jointly learns a generator to produce descriptions conditioned on
images and an evaluator to assess how well a description fits the visual
content. It is noteworthy that training a sequence generator is nontrivial. We
overcome the difficulty by Policy Gradient, a strategy stemming from
Reinforcement Learning, which allows the generator to receive early feedback
along the way. We tested our method on two large datasets, where it performed
competitively against real people in our user study and outperformed other
methods on various tasks.
Comment: accepted in ICCV 2017 as an oral paper
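The policy-gradient trick can be sketched as a REINFORCE-style update in which the evaluator's score on a sampled caption acts as the reward for its tokens; the interfaces below are our assumptions, not the paper's code.

```python
import torch

def policy_gradient_step(generator, evaluator, images):
    """REINFORCE-style update for a sequence generator.
    `generator.sample` is assumed to return sampled token ids and their
    log-probabilities; `evaluator` scores (image, caption) pairs."""
    captions, log_probs = generator.sample(images)  # (B, T) ids, (B, T) log-probs
    with torch.no_grad():
        rewards = evaluator(images, captions)       # (B,) fit-to-image scores
    # Weight each caption's log-likelihood by its reward; minimising this
    # loss increases the probability of captions the evaluator rates highly.
    loss = -(rewards.unsqueeze(1) * log_probs).sum(dim=1).mean()
    return loss
```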
Advancing Text-to-GLOSS Neural Translation Using a Novel Hyper-parameter Optimization Technique
In this paper, we investigate the use of transformers for Neural Machine
Translation of text-to-GLOSS for Deaf and Hard-of-Hearing communication. Due to
the scarcity of available data and limited resources for text-to-GLOSS
translation, we treat the problem as a low-resource language task. We use our
novel hyper-parameter exploration technique to explore a variety of
architectural parameters and build an optimal transformer-based architecture
specifically tailored for text-to-GLOSS translation. The study aims to improve
the accuracy and fluency of the GLOSS generated by Neural Machine Translation. This is
achieved by examining various architectural parameters including layer count,
attention heads, embedding dimension, dropout, and label smoothing to identify
the optimal architecture for improving text-to-GLOSS translation performance.
The experiments conducted on the PHOENIX14T dataset reveal that the optimal
transformer architecture outperforms previous work on the same dataset. The
best model reaches a ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
score of 55.18% and a BLEU-1 (BiLingual Evaluation Understudy 1) score of
63.6%, outperforming state-of-the-art results on BLEU-1 and ROUGE by 8.42 and
0.63 points, respectively.
Comment: 8 pages, 5 figures
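The search space described above might be encoded as a simple grid like the sketch below; the candidate values are illustrative, not the ones used in the paper.

```python
from itertools import product

# Illustrative grid over the architectural parameters named in the abstract.
SEARCH_SPACE = {
    "num_layers": [2, 4, 6],
    "num_heads": [2, 4, 8],
    "embed_dim": [128, 256, 512],
    "dropout": [0.1, 0.3],
    "label_smoothing": [0.0, 0.1],
}

def candidate_configs(space):
    """Yield every configuration in the grid as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

if __name__ == "__main__":
    for cfg in candidate_configs(SEARCH_SPACE):
        # In practice: train a transformer with cfg, score BLEU/ROUGE on dev.
        print(cfg)
```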
Data and methods for a visual understanding of sign languages
Signed languages are complete and natural languages used as the first or preferred mode of communication by millions of people worldwide. Unfortunately, however, they continue to be marginalized languages. Designing, building, and evaluating models that work on sign languages presents compelling research challenges and requires interdisciplinary and collaborative efforts. Recent advances in Machine Learning (ML) and Artificial Intelligence (AI) have the potential to enable better accessibility for sign language users and to narrow the existing communication barrier between the Deaf community and non-sign language users. However, recent AI-powered technologies still do not account for sign language in their pipelines, mainly because sign languages are visual languages that use manual and non-manual features to convey information and do not have a standard written form. The goal of this thesis is therefore to contribute to the development of technologies that account for sign language by creating large-scale multimodal resources suitable for training modern data-hungry machine learning models, and by developing automatic systems for computer vision tasks that aim at a better visual understanding of sign languages.
In Part I, we introduce the How2Sign dataset, a large-scale collection of multimodal and multiview sign language videos in American Sign Language. In Part II, we contribute to the development of technologies that account for sign languages. In Chapter 4 we present Spot-Align, a framework based on sign spotting methods for automatically annotating sign instances in continuous sign language; we show the benefits of this framework and establish a baseline for the sign language recognition task on the How2Sign dataset. In Chapter 5 we exploit the different annotations and modalities of How2Sign to explore sign language video retrieval by learning cross-modal embeddings. Finally, in Chapter 6 we explore sign language video generation by applying Generative Adversarial Networks to the sign language domain, and assess whether and how well sign language users can understand automatically generated sign language videos, proposing an evaluation protocol based on How2Sign topics and English translation.
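As a flavour of the cross-modal retrieval idea in Chapter 5, the sketch below shows the standard InfoNCE-style contrastive objective over paired video and text embeddings; it is a generic illustration, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired (video, text) embeddings.
    Matching pairs sit on the diagonal of the similarity matrix."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (B, B) scaled cosine similarities
    targets = torch.arange(v.shape[0], device=v.device)  # i-th video <-> i-th text
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```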