14,038 research outputs found

    Capsule Network based Contrastive Learning of Unsupervised Visual Representations

    Full text link
    Capsule Networks have advanced tremendously in the past decade, outperforming traditional CNNs on various tasks due to their equivariant properties. With vector inputs and outputs that encode both the magnitude and direction of an object or its parts, there is enormous potential for using Capsule Networks in unsupervised learning environments for visual representation tasks such as multi-class image classification. In this paper, we propose the Contrastive Capsule (CoCa) Model, a Siamese-style Capsule Network trained with a contrastive loss, together with a novel architecture and training and testing algorithm. We evaluate the model on unsupervised image classification on the CIFAR-10 dataset and achieve a top-1 test accuracy of 70.50% and a top-5 test accuracy of 98.10%. Thanks to its efficient architecture, our model has 31 times fewer parameters and 71 times fewer FLOPs than the current SOTA in both supervised and unsupervised learning.
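    The abstract names the ingredients (a Siamese-style capsule network trained with a contrastive loss) but not the exact objective. As an orientation aid only, here is a minimal PyTorch sketch of the classic pairwise contrastive loss (Hadsell et al., 2006) applied to the two branches' embeddings; the function name, the margin value, and the use of Euclidean distance are assumptions, not CoCa's published formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Pairwise contrastive loss on two embeddings (a sketch, not CoCa's).

    z1, z2: (batch, dim) output vectors from the two Siamese branches.
    same_class: (batch,) float tensor, 1.0 for positive pairs, 0.0 otherwise.
    """
    dist = F.pairwise_distance(z1, z2)                      # per-pair distance
    pos = same_class * dist.pow(2)                          # pull positives together
    neg = (1 - same_class) * F.relu(margin - dist).pow(2)   # push negatives apart
    return 0.5 * (pos + neg).mean()
```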

    Multi-Dataset Multi-Domain Multi-Task Network for Facial Expression Recognition and Age and Gender Estimation

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€,2019. 8. Cho, Nam Ik.์ปจ๋ณผ ๋ฃจ์…˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ (CNN)๋Š” ์–ผ๊ตด๊ณผ ๊ด€๋ จ๋œ ๋ฌธ์ œ๋ฅผ ํฌํ•จํ•˜์—ฌ ๋งŽ์€ ์ปดํ“จํ„ฐ ๋น„์ „ ์ž‘์—…์—์„œ ๋งค์šฐ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์—ฐ๋ น ์ถ”์ • ๋ฐ ์–ผ๊ตด ํ‘œ์ • ์ธ์‹ (FER)์˜ ๊ฒฝ์šฐ CNN์ด ์ œ๊ณต ํ•œ ์ •ํ™•๋„๋Š” ์—ฌ์ „ํžˆ ์‹ค์ œ ๋ฌธ์ œ์— ๋Œ€ํ•ด ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. CNN์€ ์–ผ๊ตด์˜ ์ฃผ๋ฆ„์˜ ๋‘๊ป˜์™€ ์–‘์˜ ๋ฏธ๋ฌ˜ํ•œ ์ฐจ์ด๋ฅผ ๋ฐœ๊ฒฌํ•˜์ง€ ๋ชปํ–ˆ์ง€๋งŒ, ์ด๊ฒƒ์€ ์—ฐ๋ น ์ถ”์ •๊ณผ FER์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ์‹ค์ œ ์„ธ๊ณ„์—์„œ์˜ ์–ผ๊ตด ์ด๋ฏธ์ง€๋Š” CNN์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ๊ฐ€๋Šฅํ•  ๋•Œ ํšŒ์ „ ๋œ ๋ฌผ์ฒด๋ฅผ ์ฐพ๋Š” ๋ฐ ๊ฐ•๊ฑดํ•˜์ง€ ์•Š์€ ํšŒ์ „ ๋ฐ ์กฐ๋ช…์œผ๋กœ ์ธํ•ด ๋งŽ์€ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ MTL (Multi Task Learning)์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ์ง€๊ฐ ์ž‘์—…์„ ๋™์‹œ์— ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ฒ”์  ์ธ MTL ๋ฐฉ๋ฒ•์—์„œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—…์— ๋Œ€ํ•œ ๋ชจ๋“  ๋ ˆ์ด๋ธ”์„ ํ•จ๊ป˜ ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์„ ๊ตฌ์„ฑํ•˜๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€์ƒ ์ž‘์—…์ด ๋‹ค๊ฐํ™”๋˜๊ณ  ๋ณต์žกํ•ด์ง€๋ฉด ๋” ๊ฐ•๋ ฅํ•œ ๋ ˆ์ด๋ธ”์„ ๊ฐ€์ง„ ๊ณผ๋„ํ•˜๊ฒŒ ํฐ ๋ฐ์ดํ„ฐ ์„ธํŠธ๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์›ํ•˜๋Š” ๋ผ๋ฒจ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋น„์šฉ์€ ์ข…์ข… ์žฅ์• ๋ฌผ์ด๋ฉฐ ํŠนํžˆ ๋‹ค์ค‘ ์ž‘์—… ํ•™์Šต์˜ ๊ฒฝ์šฐ ์žฅ์• ๊ฐ€๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” ๊ฐ€๋ฒ„ ํ•„ํ„ฐ์™€ ์บก์Š ๊ธฐ๋ฐ˜ ๋„คํŠธ์›Œํฌ (MTL) ๋ฐ ๋ฐ์ดํ„ฐ ์ฆ๋ฅ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœํ•˜๋Š” ๋‹ค์ค‘ ์ž‘์—… ํ•™์Šต์— ๊ธฐ๋ฐ˜ํ•œ ์ƒˆ๋กœ์šด ๋ฐ˜ ๊ฐ๋… ํ•™์Šต ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.The convolutional neural network (CNN) works very well in many computer vision tasks including the face-related problems. However, in the case of age estimation and facial expression recognition (FER), the accuracy provided by the CNN is still not good enough to be used for the real-world problems. It seems that the CNN does not well find the subtle differences in thickness and amount of wrinkles on the face, which are the essential features for the age estimation and FER. Also, the face images in the real world have many variations due to the face rotation and illumination, where the CNN is not robust in finding the rotated objects when not every possible variation is in the training data. Moreover, The Multi Task Learning (MTL) Based based methods can be much helpful to achieve the real-time visual understanding of a dynamic scene, as they are able to perform several different perceptual tasks simultaneously and efficiently. In the exemplary MTL methods, we need to consider constructing a dataset that contains all the labels for different tasks together. However, as the target task becomes multi-faceted and more complicated, sometimes unduly large dataset with stronger labels is required. Hence, the cost of generating desired labeled data for complicated learning tasks is often an obstacle, especially for multi-task learning. Therefore, first to alleviate these problems, we first propose few methods in order to improve single task baseline performance using gabor filters and Capsule Based Networks , Then We propose a new semi-supervised learning method on face-related tasks based on Multi-Task Learning (MTL) and data distillation.1 INTRODUCTION 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Age and Gender Estimation . . . . . . . . . . . . . . . . . . 4 1.2.2 Facial Expression Recognition (FER) . . . . . . . . . . . . . 
4 1.2.3 Capsule networks (CapsNet) . . . . . . . . . . . . . . . . . . 5 1.2.4 Semi-Supervised Learning. . . . . . . . . . . . . . . . . . . . 5 1.2.5 Multi-Task Learning. . . . . . . . . . . . . . . . . . . . . . . 6 1.2.6 Knowledge and data distillation. . . . . . . . . . . . . . . . . 6 1.2.7 Domain Adaptation. . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2. GF-CapsNet: Using Gabor Jet and Capsule Networks for Face-Related Tasks 10 2.1 Feeding CNN with Hand-Crafted Features . . . . . . . . . . . . . . . 10 2.1.1 Preparation of Input . . . . . . . . . . . . . . . . . . . . . . 10 2.1.2 Age and Gender Estimation using the Gabor Responses . . . . 13 2.2 GF-CapsNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Modification of CapsNet . . . . . . . . . . . . . . . . . 16 3. Distill-2MD-MTL: Data Distillation based on Multi-Dataset Multi-Domain Multi-Task Frame Work to Solve Face Related Tasks 20 3.1 MTL learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Data Distillation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4. Experiments and Results 25 4.1 Experiments on GF-CNN and GF-CapsNet . . . . . . . . . . . . . . 25 4.2 GF-CNN Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.2.1 GF-CapsNet Results . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Experiment on Distill-2MD-MTL . . . . . . . . . . . . . . . . . . . 33 4.3.1 Semi-Supervised MTL . . . . . . . . . . . . . . . . . . . . . 34 4.3.2 Cross Datasets Cross-Domain Evaluation . . . . . . . . . . . 36 5. Conclusion 38 Abstract (In Korean) 49Maste
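    The thesis feeds the network with Gabor responses as hand-crafted features (Chapter 2.1) but this abstract gives no filter parameters. The sketch below is only a generic OpenCV illustration of that idea, an orientation bank whose responses are stacked as input channels; every parameter value (kernel size, sigma, wavelength, the 8 orientations) is an assumption, not the thesis's configuration.

```python
import cv2
import numpy as np

def gabor_bank(ksize=31, sigma=4.0, lambd=10.0, gamma=0.5,
               thetas=np.arange(0, np.pi, np.pi / 8)):
    """Build a bank of Gabor kernels at 8 orientations (values assumed)."""
    return [cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma,
                               psi=0, ktype=cv2.CV_32F)
            for theta in thetas]

def gabor_responses(gray, kernels):
    """Filter a grayscale face image with each kernel and stack the
    responses as channels, giving a hand-crafted multi-channel input."""
    return np.stack([cv2.filter2D(gray, cv2.CV_32F, k) for k in kernels],
                    axis=-1)
```

    The stacked responses would then be fed to the CNN or CapsNet in place of, or alongside, the raw grayscale image.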

    Multi-labeled Relation Extraction with Attentive Capsule Network

    Full text link
    Disclosing multiple overlapped relations in a sentence remains challenging. Most current neural models assume that each sentence is explicitly mapped to a single relation label and thus cannot handle multiple relations properly, as the overlapped features of the relations are either ignored or very difficult to identify. To tackle this issue, we propose a novel approach for multi-labeled relation extraction with a capsule network, which performs considerably better than current convolutional or recurrent networks in identifying highly overlapped relations within an individual sentence. To better cluster the features and precisely extract the relations, we further devise an attention-based routing algorithm and a sliding-margin loss function and embed them into our capsule network. The experimental results show that the proposed approach can indeed extract the highly overlapped features and achieves significant performance improvement for relation extraction compared to state-of-the-art works. Comment: To be published in AAAI 201
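    The abstract does not define the sliding-margin loss. For orientation, the sketch below shows the fixed-margin capsule loss of Sabour et al. (2017) that it generalizes: multi-hot targets make it naturally multi-label, and the paper's "sliding" variant, as the name suggests, adapts the margin boundary rather than fixing it. The constants are the conventional defaults, not this paper's values.

```python
import torch

def capsule_margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Fixed-margin capsule loss (Sabour et al., 2017), multi-label form.

    lengths: (batch, num_relations) capsule vector norms in [0, 1].
    targets: (batch, num_relations) multi-hot relation labels.
    """
    pos = targets * torch.relu(m_pos - lengths).pow(2)        # present relations
    neg = lam * (1 - targets) * torch.relu(lengths - m_neg).pow(2)  # absent ones
    return (pos + neg).sum(dim=-1).mean()
```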

    Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction

    Full text link
    A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity. In this paper, we explore capsule networks for relation extraction in a multi-instance multi-label learning framework and propose a novel neural approach based on capsule networks with attention mechanisms. We evaluate our method on different benchmarks and demonstrate that it improves the precision of the predicted relations. In particular, we show that capsule networks improve relation extraction for multiple entity pairs. Comment: To be published in EMNLP 201
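    Since the abstract defines a capsule by its activity vector, the standard squash nonlinearity (Sabour et al., 2017) is worth recalling: it preserves a capsule's direction while mapping its length into [0, 1), so the length can be read as the probability that the entity is present. This is the generic formulation, not this paper's attention mechanism.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Scale capsule vector s so its norm lies in [0, 1) while its
    direction (the instantiation parameters) is unchanged."""
    sq_norm = (s * s).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)
```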

    Polyphonic Sound Event Detection by using Capsule Neural Networks

    Full text link
    Artificial sound event detection (SED) aims to mimic the human ability to perceive and understand what is happening in the surroundings. Deep Learning now offers valuable techniques for this goal, such as Convolutional Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture was recently introduced in the image processing field to overcome some known limitations of CNNs, specifically their limited robustness to affine transformations (i.e., perspective, size, orientation) and difficulty detecting overlapped images. This motivated the authors to employ CapsNets for the polyphonic-SED task, in which multiple sound events occur simultaneously. Specifically, we propose to exploit capsule units to represent a set of distinctive properties for each individual sound event. Capsule units are connected through so-called "dynamic routing", which encourages learning part-whole relationships and improves detection performance in a polyphonic context. This paper reports extensive evaluations carried out on three publicly available datasets, showing that the CapsNet-based algorithm not only outperforms standard CNNs but also achieves the best results with respect to state-of-the-art algorithms.
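    The "dynamic routing" the abstract refers to is the iterative routing-by-agreement procedure from the original CapsNet paper. A compact, generic PyTorch sketch (not the authors' SED-specific implementation) is:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return n2 / (1.0 + n2) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement (Sabour et al., 2017), simplified.

    u_hat: (batch, in_caps, out_caps, out_dim) prediction vectors from
           lower-level capsules to higher-level ones.
    """
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                    # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # weighted sum of predictions
        v = squash(s)                              # (batch, out_caps, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)   # reward agreeing inputs
    return v
```

    Predictions that agree with the emerging parent capsule get larger coupling coefficients on the next iteration, which is what encourages the part-whole relationships the abstract mentions.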

    VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection

    Full text link
    Although traffic sign detection has been studied for years and great progress has been made with the rise of deep learning techniques, many problems remain to be addressed. Complicated real-world traffic scenes pose two main challenges. First, traffic signs are usually small objects, which makes them more difficult to detect than large ones. Second, it is hard to distinguish false targets that resemble real traffic signs in complex street scenes without context information. To handle these problems, we propose a novel end-to-end deep learning method for traffic sign detection in complex environments. Our contributions are as follows: 1) we propose a multi-resolution feature fusion network architecture that exploits densely connected deconvolution layers with skip connections and can learn more effective features for small objects; 2) we frame traffic sign detection as a spatial sequence classification and regression task and propose a vertical spatial sequence attention (VSSA) module to gain more context information for better detection performance. To comprehensively evaluate the proposed method, we conduct experiments on several traffic sign datasets as well as a general object detection dataset, and the results show the effectiveness of the proposed method.
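    The abstract describes the fusion architecture only at a high level (densely connected deconvolution layers with skip connections). As a rough sketch of the general deconvolution-plus-skip fusion pattern, not VSSA-NET's actual design and with all layer shapes assumed, one fusion step might look like:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Upsample a coarse feature map with a deconvolution and fuse it with
    a finer map via a skip connection (generic sketch, shapes assumed)."""
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(coarse_ch, out_ch, kernel_size=4,
                                         stride=2, padding=1)  # 2x upsample
        self.skip = nn.Conv2d(fine_ch, out_ch, kernel_size=1)  # match channels
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, fine):
        up = self.deconv(coarse)   # bring the coarse map to the fine resolution
        return torch.relu(self.fuse(up + self.skip(fine)))
```

    Fusing upsampled deep features with shallow high-resolution ones is the standard way to keep the fine spatial detail that small objects like distant traffic signs need.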