1,087 research outputs found

    Text data augmentation using generative adversarial networks – a systematic review

    Get PDF
    Insufficient data is one of the main drawbacks in natural language processing tasks, and the most prevalent solution is to collect an amount of data sufficient for the optimisation of the model. However, recent research directions are strategically moving towards increasing the number of training examples, owing to the data-hungry nature of neural models. Data augmentation is an emerging area that aims to increase the diversity of the data, without collecting new data, in order to boost a model's performance. Limitations in data augmentation, especially for textual data, stem mainly from the nature of language, which is inherently discrete. Generative Adversarial Networks (GANs) were initially introduced for computer vision applications, aiming to generate highly realistic images by learning image representations. Recent research has focused on using GANs for text generation and augmentation. This systematic review presents the theoretical background of GANs and their use for text augmentation, alongside a review of recent textual data augmentation applications such as sentiment analysis, low-resource language generation, hate speech detection and fraud review analysis. Challenges in current research and future directions for GAN-based text augmentation are also discussed, to pave the way for researchers, especially those working with low-resource text.
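    As a rough illustration of how a GAN can be adapted to the discreteness of language that the abstract mentions, the sketch below trains a toy generator/discriminator pair with a Gumbel-softmax relaxation, one common workaround for the non-differentiability of sampled tokens. This is a minimal PyTorch sketch, not any specific system from the review; the vocabulary, layer sizes, and the stand-in "real" corpus are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, SEQ_LEN, BATCH = 5000, 64, 128, 20, 32  # illustrative sizes

class Generator(nn.Module):
    """Autoregressive generator that emits soft one-hot tokens via Gumbel-softmax."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, EMB, bias=False)  # embeds soft tokens
        self.cell = nn.GRUCell(EMB, HID)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, batch, tau=0.8):
        h = torch.zeros(batch, HID)
        tok = torch.zeros(batch, VOCAB)  # all-zero "start" token
        seq = []
        for _ in range(SEQ_LEN):
            h = self.cell(self.embed(tok), h)
            tok = F.gumbel_softmax(self.out(h), tau=tau)  # differentiable sample
            seq.append(tok)
        return torch.stack(seq, dim=1)  # (batch, SEQ_LEN, VOCAB)

class Discriminator(nn.Module):
    """Scores a (soft) one-hot token sequence as real (1) or generated (0)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, EMB, bias=False)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.score = nn.Linear(HID, 1)

    def forward(self, soft_seq):
        _, h = self.rnn(self.embed(soft_seq))
        return self.score(h[-1]).squeeze(-1)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Stand-in for a real tokenized corpus.
real = F.one_hot(torch.randint(VOCAB, (BATCH, SEQ_LEN)), VOCAB).float()

for step in range(200):
    # Discriminator step: real -> 1, generated -> 0.
    fake = G(BATCH).detach()
    d_loss = bce(D(real), torch.ones(BATCH)) + bce(D(fake), torch.zeros(BATCH))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator output 1 on generated text.
    g_loss = bce(D(G(BATCH)), torch.ones(BATCH))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Augmentation: discretize the soft tokens into token ids for new samples.
augmented_ids = G(BATCH).argmax(dim=-1)  # (BATCH, SEQ_LEN)
```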

    Toward Text Data Augmentation for Sentiment Analysis

    Get PDF
    A significant part of natural language processing (NLP) techniques for sentiment analysis is based on supervised methods, which are affected by the quality of the data. Sentiment analysis therefore needs to cope with data-quality issues such as class imbalance and a lack of labeled data. Data augmentation methods, widely adopted in image classification tasks, offer data-space solutions to the problem of limited data, enlarging and improving training datasets to yield better models. In this work, we study the advantages and drawbacks of text augmentation methods (easy data augmentation, back-translation, BART, and a pretrained data augmentor) combined with recent classification algorithms (long short-term memory, convolutional neural networks, bidirectional encoder representations from transformers, support vector machines, gated recurrent units, random forests, and enhanced language representation with informative entities) that have attracted sentiment-analysis researchers and industry applications. We explored seven sentiment-analysis datasets to create scenarios of imbalanced and limited data, discuss how a given classifier overcomes these problems, and provide insights into promising combinations of transformation, paraphrasing, and generation methods for sentence augmentation. The results revealed improvements from the augmented datasets, mainly for the reduced ones. Furthermore, balancing the datasets by augmenting the minority class improved their quality, leading to more robust classifiers. The contributions of this article include a taxonomy of NLP augmentation methods and an assessment of their efficiency over several classifiers drawn from recent research trends in sentiment analysis and related fields.
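    Of the methods compared above, easy data augmentation (EDA) is the simplest to illustrate. The sketch below implements three of its transformation operations; the tiny synonym table is a stand-in for a real lexicon such as WordNet, and the operation counts and probabilities are illustrative defaults.

```python
import random

# Hypothetical stand-in for a real synonym lexicon such as WordNet.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "bad": ["poor", "awful"]}

def synonym_replacement(tokens, n=1):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, n=1):
    """Swap the positions of two random tokens, n times."""
    out = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, never returning an empty sentence."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def eda(sentence, n_aug=4):
    """Produce n_aug augmented variants of one sentence."""
    tokens = sentence.split()
    ops = [synonym_replacement, random_swap, random_deletion]
    return [" ".join(random.choice(ops)(tokens)) for _ in range(n_aug)]

print(eda("the movie was good but the ending was bad"))
```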

    CUDA: Contradistinguisher for Unsupervised Domain Adaptation

    Full text link
    In this paper, we propose a simple model, referred to as the Contradistinguisher (CTDR), for unsupervised domain adaptation. Its objective is to jointly learn to contradistinguish on an unlabeled target domain in a fully unsupervised manner, along with prior knowledge acquired by supervised learning on an entirely different domain. Most recent works in domain adaptation take an indirect route: first align the source and target domain distributions, then learn a classifier on the labeled source domain to classify the target domain. This indirect way of addressing the real task of unlabeled target domain classification has three main drawbacks. (i) The sub-task of obtaining a perfect alignment of the domains may itself be impossible due to a large domain shift (e.g., between language domains). (ii) The use of multiple classifiers to align the distributions unnecessarily increases the complexity of the neural networks, leading to over-fitting in many cases. (iii) Due to distribution alignment, domain-specific information is lost as the domains get morphed. In this work, we propose a simple and direct approach that does not require domain alignment: we jointly train CTDR on both the source and target distributions, using a contradistinguish loss for the unlabeled target domain in conjunction with a supervised loss for the labeled source domain. Our experiments show that avoiding domain alignment and directly addressing the task of unlabeled target domain classification with CTDR achieves state-of-the-art results on eight visual and four language benchmark domain adaptation datasets.
    Comment: International Conference on Data Mining, ICDM 201
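    As a rough sketch of the joint objective described above, the code below trains one shared classifier with a supervised loss on the labeled source domain plus an unsupervised term on the unlabeled target domain. The target term shown (pushing the classifier to commit to its own most probable label) only approximates the spirit of the contradistinguish loss; the exact formulation is in the paper, and the feature dimensions, weights, and toy batches are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared classifier over pre-extracted 256-d features (illustrative).
classifier = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def joint_step(x_src, y_src, x_tgt, lam=0.1):
    # Supervised loss on the labeled source domain.
    src_loss = F.cross_entropy(classifier(x_src), y_src)
    # Unsupervised target term: commit to the model's own most probable
    # label on each unlabeled target example (an approximation, not the
    # paper's exact contradistinguish loss).
    tgt_logits = classifier(x_tgt)
    pseudo = tgt_logits.argmax(dim=1).detach()
    tgt_loss = F.cross_entropy(tgt_logits, pseudo)
    loss = src_loss + lam * tgt_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy batches standing in for source and target features.
x_s, y_s = torch.randn(32, 256), torch.randint(10, (32,))
x_t = torch.randn(32, 256)
joint_step(x_s, y_s, x_t)
```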

    Generic Black-Box End-to-End Attack Against State of the Art API Call Based Malware Classifiers

    Full text link
    In this paper, we present a black-box attack against API-call-based machine learning malware classifiers, focusing on generating adversarial sequences that combine API calls and static features (e.g., printable strings) and are misclassified by the classifier without affecting the malware's functionality. We show that this attack is effective against many classifiers, owing to the transferability principle between RNN variants, feed-forward DNNs, and traditional machine learning classifiers such as SVMs. We also implement GADGET, a software framework that uses the proposed attack to convert any malware binary into a binary undetected by malware classifiers, without access to the malware source code.
    Comment: Accepted as a conference paper at RAID 201
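    The core loop of such an attack can be sketched as follows: insert functionality-preserving no-op API calls into the trace until a locally trained surrogate classifier flips its decision, then rely on transferability to fool the real black-box target. This is an illustrative sketch, not the GADGET implementation; surrogate_predict, the no-op list, and the example trace are hypothetical stand-ins.

```python
import random

# Windows API calls that preserve malware behavior when inserted (illustrative).
NO_OP_CALLS = ["GetCurrentProcessId", "GetTickCount", "Sleep"]

def surrogate_predict(api_sequence):
    """Hypothetical stand-in for a locally trained surrogate classifier.

    Returns True if the sequence is classified as malware."""
    ratio = api_sequence.count("WriteProcessMemory") / max(len(api_sequence), 1)
    return ratio > 0.05

def evade(api_sequence, max_insertions=200):
    """Insert no-ops until the surrogate is fooled, within a query budget."""
    seq = list(api_sequence)
    for _ in range(max_insertions):
        if not surrogate_predict(seq):
            return seq  # surrogate fooled; rely on transferability to the target
        pos = random.randrange(len(seq) + 1)
        seq.insert(pos, random.choice(NO_OP_CALLS))  # behavior unchanged
    return None  # failed within the insertion budget

trace = ["OpenProcess", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"]
print(evade(trace))
```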

    Data Augmentation Techniques for Natural Language Processing Using Deep-Learning-Based Generative Models

    Get PDF
    Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Sang-goo Lee.
    Recent advances in the generation capability of deep learning models have spurred interest in utilizing deep generative models for unsupervised generative data augmentation (GDA). Generative data augmentation aims to improve the performance of a downstream machine learning model by augmenting the original dataset with samples generated from a deep latent variable model. This data augmentation approach is attractive to the natural language processing community because (1) there is a shortage of text augmentation techniques that require little supervision and (2) resource scarcity is prevalent. In this dissertation, we explore the feasibility of exploiting deep latent variable models for data augmentation on three NLP tasks: sentence classification, spoken language understanding (SLU) and dialogue state tracking (DST). These represent NLP tasks of various complexities and properties: SLU requires multi-task learning of text classification and sequence tagging, while DST requires the understanding of hierarchical and recurrent data structures. For each of the three tasks, we propose a task-specific latent variable model based on conditional, hierarchical and sequential variational autoencoders (VAE) for multi-modal joint modeling of linguistic features and the relevant annotations. We conduct extensive experiments to statistically justify our hypothesis that deep generative data augmentation is beneficial for all subject tasks. Our experiments show that deep generative data augmentation is effective for the selected tasks, supporting the idea that the technique can potentially be utilized for a wider range of NLP tasks. Ablation and qualitative studies reveal deeper insight into the underlying mechanisms of generative data augmentation. As a secondary contribution, we also shed light on the recurring posterior collapse phenomenon in autoregressive VAEs and, subsequently, propose novel techniques to reduce this model risk, which is crucial for the proper training of complex VAE models, enabling them to synthesize better samples for data augmentation. In summary, this work demonstrates and analyzes the effectiveness of unsupervised generative data augmentation in NLP. Ultimately, our approach enables the standardized adoption of generative data augmentation, which can be applied orthogonally to existing regularization techniques.
    Table of contents:
    1 Introduction
        1.1 Motivation
        1.2 Dissertation Overview
    2 Background and Related Work
        2.1 Deep Latent Variable Models
            2.1.1 Variational Autoencoder (VAE)
            2.1.2 Deep Generative Models and Text Generation
        2.2 Data Augmentation
            2.2.1 General Description
            2.2.2 Categorization of Data Augmentation
            2.2.3 Theoretical Explanations
        2.3 Summary
    3 Basic Task: Text Classification
        3.1 Introduction
        3.2 Our Approach
            3.2.1 Proposed Models
            3.2.2 Training with I-VAE
        3.3 Experiments
            3.3.1 Datasets
            3.3.2 Experimental Settings
            3.3.3 Implementation Details
            3.3.4 Data Augmentation Results
            3.3.5 Ablation Studies
            3.3.6 Qualitative Analysis
        3.4 Summary
    4 Multi-task Learning: Spoken Language Understanding
        4.1 Introduction
        4.2 Related Work
        4.3 Model Description
            4.3.1 Framework Formulation
            4.3.2 Joint Generative Model
        4.4 Experiments
            4.4.1 Datasets
            4.4.2 Experimental Settings
            4.4.3 Generative Data Augmentation Results
            4.4.4 Comparison to Other State-of-the-art Results
            4.4.5 Ablation Studies
        4.5 Summary
    5 Complex Data: Dialogue State Tracking
        5.1 Introduction
        5.2 Background and Related Work
            5.2.1 Task-oriented Dialogue
            5.2.2 Dialogue State Tracking
            5.2.3 Conversation Modeling
        5.3 Variational Hierarchical Dialogue Autoencoder (VHDA)
            5.3.1 Notations
            5.3.2 Variational Hierarchical Conversational RNN
            5.3.3 Proposed Model
            5.3.4 Posterior Collapse
        5.4 Experimental Results
            5.4.1 Experimental Settings
            5.4.2 Data Augmentation Results
            5.4.3 Intrinsic Evaluation: Language Evaluation
            5.4.4 Qualitative Results
        5.5 Summary
    6 Conclusion
        6.1 Summary
        6.2 Limitations
        6.3 Future Work

    Deep Learning -- A first Meta-Survey of selected Reviews across Scientific Disciplines, their Commonalities, Challenges and Research Impact

    Full text link
    Deep learning belongs to the field of artificial intelligence, where machines perform tasks that typically require some kind of human intelligence. A deep learning algorithm is built around an artificial neural network, whose basic structure resembles that of a biological brain. Mimicking how humans learn through their senses, deep learning networks are fed with (sensory) data such as texts, images, videos or sounds. These networks outperform previous state-of-the-art methods on many tasks and, because of this, the whole field has grown exponentially in recent years, resulting in well over 10,000 publications per year. For example, the search engine PubMed alone, which covers only a subset of all publications in the medical field, already returned over 11,000 results for the search term 'deep learning' in Q3 2020, around 90% of them from the last three years. Consequently, a complete overview of the field of deep learning is already impossible to obtain, and in the near future it may even become difficult to obtain an overview of a subfield. However, there are several review articles about deep learning that focus on specific scientific fields or applications, for example deep learning advances in computer vision or in specific tasks such as object detection. With these surveys as a foundation, the aim of this contribution is to provide a first high-level, categorized meta-survey of selected reviews on deep learning across different scientific disciplines. The categories (computer vision, language processing, medical informatics and additional works) have been chosen according to the underlying data sources (image, language, medical, mixed). In addition, we review the common architectures, methods, pros, cons, evaluations, challenges and future directions for every sub-category.
    Comment: 83 pages, 22 figures, 9 tables, 100 references