
16,432 research outputs found

Cross-modality Data Augmentation for End-to-End Sign Language Translation

Author: Jiao Wenxiang
Tu Zhaopeng
Wang Xing
Xiong Hui
Ye Jinhui
Publication date: 18/10/2023

End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly, without intermediate representations. It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data. To tackle these challenges, we propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation (i.e., video-to-text) by exploiting pseudo gloss-text pairs from the sign gloss translation model. Specifically, XmDA consists of two key components: cross-modality mix-up and cross-modality knowledge distillation. The former explicitly encourages the alignment between sign video features and gloss embeddings to bridge the modality gap. The latter utilizes the generation knowledge from gloss-to-text teacher models to guide the spoken language text generation. Experimental results on two widely used SLT datasets, PHOENIX-2014T and CSL-Daily, demonstrate that the proposed XmDA framework significantly and consistently outperforms the baseline models. Extensive analyses confirm our claim that XmDA enhances spoken language text generation by reducing the representation distance between videos and texts, as well as improving the processing of low-frequency words and long sentences.
Comment: Accepted to Findings EMNLP 202

arXiv.org e-Print Archive

SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

Author: Birdal Tolga
Cheng Yongkang
Huang Shaoli
Yu Zhengdi
Publication date: 31/10/2023

In this paper, we present SignAvatars, the first large-scale multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for hearing-impaired individuals. While there has been an exponentially growing body of research on digital communication, the majority of existing communication technologies primarily cater to spoken or written languages instead of SL, the essential communication method for hearing-impaired communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D because annotating 3D models and avatars for SL is usually an entirely manual and labor-intensive process conducted by SL experts, often resulting in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically-valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step forward towards bringing the digital world to the hearing-impaired communities. Our project page is at https://signavatars.github.io/
Comment: 9 pages; Project page available at https://signavatars.github.io

arXiv.org e-Print Archive

Artificial Intelligence for Multimedia Signal Processing

Publication venue: MDPI AG
Publication date: 16/09/2022

Artificial intelligence technologies are also actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in fields such as content creation, transmission, and security, and over the past two to three years there have been attempts to improve the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including computer vision, speech/sound/text processing, and content analysis/information mining.

Directory of Open Access Books (DOAB)

Collaborative agents for task-oriented dialogue systems

Author: Pei J.
Publication date: 01/01/2022

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Data Augmentation Techniques for Natural Language Processing Using Deep Learning-Based Generative Models

Author: 유강민
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2020

Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. 이상구.
Recent advances in the generation capability of deep learning models have spurred interest in utilizing deep generative models for unsupervised generative data augmentation (GDA). Generative data augmentation aims to improve the performance of a downstream machine learning model by augmenting the original dataset with samples generated from a deep latent variable model. This data augmentation approach is attractive to the natural language processing community because (1) there is a shortage of text augmentation techniques that require little supervision and (2) resource scarcity is prevalent. In this dissertation, we explore the feasibility of exploiting deep latent variable models for data augmentation on three NLP tasks: sentence classification, spoken language understanding (SLU), and dialogue state tracking (DST). These represent NLP tasks of various complexities and properties: SLU requires multi-task learning of text classification and sequence tagging, while DST requires the understanding of hierarchical and recurrent data structures. For each of the three tasks, we propose a task-specific latent variable model based on conditional, hierarchical, and sequential variational autoencoders (VAE) for multi-modal joint modeling of linguistic features and the relevant annotations. We conduct extensive experiments to statistically justify our hypothesis that deep generative data augmentation is beneficial for all subject tasks. Our experiments show that deep generative data augmentation is effective for the selected tasks, supporting the idea that the technique can potentially be utilized for a wider range of NLP tasks. Ablation and qualitative studies reveal deeper insight into the underlying mechanisms of generative data augmentation.
As a secondary contribution, we also shed light on the recurring posterior collapse phenomenon in autoregressive VAEs and, subsequently, propose novel techniques to reduce the model risk, which is crucial for proper training of complex VAE models, enabling them to synthesize better samples for data augmentation. In summary, this work intends to demonstrate and analyze the effectiveness of unsupervised generative data augmentation in NLP. Ultimately, our approach enables standardized adoption of generative data augmentation, which can be applied orthogonally to existing regularization techniques.
Rapid recent progress in deep learning-based generative models has raised expectations for the feasibility of generative data augmentation (GDA) built on them. GDA refers to improving performance on an associated task by adding samples generated from a deep latent variable model to the original dataset. GDA can therefore be regarded as a form of regularization carried out in data space. This new use of deep generative models is especially important in natural language processing because (1) general-purpose text data augmentation techniques are lacking and (2) an alternative is needed to overcome the scarcity of text data. To cover a range of problem complexities and characteristics, this dissertation examines the validity of data augmentation with deep generative models on three problems: text classification; spoken language understanding (SLU), which requires sequence labeling and multi-task learning; and dialogue state tracking (DST), which requires handling hierarchical and recurrent data structures. Building on conditional, hierarchical, and sequential variational autoencoders (VAE), we present specialized deep generative models that jointly generate text and its associated annotations for each NLP problem, and we statistically demonstrate the effect of deep generative data augmentation through broad experiments covering various downstream models and datasets. In secondary work, we investigate the posterior collapse problem that frequently occurs in autoregressive VAEs and propose new methods to mitigate it. We verify that applying these methods to the complex VAE models needed for generative data augmentation improves generation quality, which can in turn benefit the augmentation effect.
Through this dissertation, we can expect the standardization of unsupervised data augmentation techniques in NLP that can be applied alongside existing regularization techniques.

1 Introduction
1.1 Motivation
1.2 Dissertation Overview
2 Background and Related Work
2.1 Deep Latent Variable Models
2.1.1 Variational Autoencoder (VAE)
2.1.2 Deep Generative Models and Text Generation
2.2 Data Augmentation
2.2.1 General Description
2.2.2 Categorization of Data Augmentation
2.2.3 Theoretical Explanations
2.3 Summary
3 Basic Task: Text Classification
3.1 Introduction
3.2 Our Approach
3.2.1 Proposed Models
3.2.2 Training with I-VAE
3.3 Experiments
3.3.1 Datasets
3.3.2 Experimental Settings
3.3.3 Implementation Details
3.3.4 Data Augmentation Results
3.3.5 Ablation Studies
3.3.6 Qualitative Analysis
3.4 Summary
4 Multi-task Learning: Spoken Language Understanding
4.1 Introduction
4.2 Related Work
4.3 Model Description
4.3.1 Framework Formulation
4.3.2 Joint Generative Model
4.4 Experiments
4.4.1 Datasets
4.4.2 Experimental Settings
4.4.3 Generative Data Augmentation Results
4.4.4 Comparison to Other State-of-the-art Results
4.4.5 Ablation Studies
4.5 Summary
5 Complex Data: Dialogue State Tracking
5.1 Introduction
5.2 Background and Related Work
5.2.1 Task-oriented Dialogue
5.2.2 Dialogue State Tracking
5.2.3 Conversation Modeling
5.3 Variational Hierarchical Dialogue Autoencoder (VHDA)
5.3.1 Notations
5.3.2 Variational Hierarchical Conversational RNN
5.3.3 Proposed Model
5.3.4 Posterior Collapse
5.4 Experimental Results
5.4.1 Experimental Settings
5.4.2 Data Augmentation Results
5.4.3 Intrinsic Evaluation - Language Evaluation
5.4.4 Qualitative Results
5.5 Summary
6 Conclusion
6.1 Summary
6.2 Limitations
6.3 Future Work

SNU Open Repository and Archive