118 research outputs found

    An original framework for understanding human actions and body language by using deep neural networks

    The evolution of the fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has enabled the development of efficient automatic systems for analysing people's behaviour. By studying hand movements it is possible to recognize gestures, which people often use to communicate information non-verbally; these gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest, due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields: the former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions from their poses and movements. Both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: the first proposes a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) for the recognition of sign language and semaphoric hand gestures; the second presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; the last provides a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs). The performance of LSTM-RNNs is explored in depth because of their ability to model the long-term contextual information of temporal sequences, which makes them well suited to analysing body movements. All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared to current literature methods.
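The LSTM recurrence at the core of all three modules can be illustrated with a minimal sketch (plain NumPy, not the thesis's actual model; all dimensions, weights, and the toy input sequence are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: gates computed from the current input and previous hidden state.

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) biases,
    stacked in the order [input gate, forget gate, cell candidate, output gate].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2*H])          # forget gate
    g = np.tanh(z[2*H:3*H])        # candidate cell state
    o = sigmoid(z[3*H:4*H])        # output gate
    c = f * c_prev + i * g         # cell state carries long-term context
    h = o * np.tanh(c)             # hidden state exposed to the next layer
    return h, c

# Run a toy sequence of skeleton-like feature vectors through the cell.
rng = np.random.default_rng(0)
D, H, T = 6, 4, 5                  # input dim, hidden dim, sequence length
W = rng.standard_normal((4*H, D)) * 0.1
U = rng.standard_normal((4*H, H)) * 0.1
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    h, c = lstm_step(rng.standard_normal(D), h, c, W, U, b)
```

The multiplicative forget gate `f` is what lets the cell state retain or discard information across many time steps, which is why the thesis favours LSTMs for long body-movement sequences.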

    A Comprehensive Review of Deep Learning Architectures for Computer Vision Applications

    The emergence of machine learning within artificial intelligence has led the world of technology to make great strides. Today's advanced systems, designed to mimic the functioning of the human brain, give practitioners the ability to train models that can process, analyze, classify, and predict different classes of data. The machine learning field has therefore become a hot topic for scientists and researchers seeking the best-performing network for these purposes. This article presents computer vision science, image classification, and deep neural networks, and discusses how models have been designed based on concepts from the human brain. The development of the Convolutional Neural Network (CNN) and its various architectures, which have shown great efficiency in object detection, face recognition, image classification, and localization, is also introduced. Furthermore, applications of CNNs, including voice recognition, image processing, video processing, and text recognition, are examined closely. A literature review is conducted to illustrate the significance and details of Convolutional Neural Networks in various applications.
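The convolution operation underlying every CNN architecture the review surveys can be sketched in a few lines (a toy NumPy implementation of valid cross-correlation, not any framework's production kernel; the edge-detector filter is illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation, the core operation of a CNN layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Each output value is a dot product of the kernel with one patch.
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])     # horizontal edge-detector filter
feat = conv2d(image, edge)         # shape (4, 3)
```

Stacking many such learned filters, interleaved with nonlinearities and pooling, is what yields the feature hierarchies used for detection, recognition, and localization.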

    Language Classification of General Text Images

    Doctoral dissertation -- Seoul National University Graduate School, Interdisciplinary Program in Computational Science, February 2021. Advisor: Myungjoo Kang. As in other machine learning fields, there has been much progress since the advent of deep learning in text detection and recognition, which aim to extract the text information contained in images. When multiple languages are mixed in an image, recognition typically proceeds through detection, language classification, and recognition in turn. This dissertation aims to classify the languages of the image patches produced by text detection. As far as we know, there is no prior research targeting language classification of images specifically, so we started from backbone networks commonly used in many general object detection fields. With a ResNeSt-based network, which builds on ResNet, and automated pre-processing of the ground-truth data to improve classification performance, we achieve a state-of-the-art result on this task on a public benchmark dataset.
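The final stage described above, mapping the backbone's feature vector for each detected text patch to a language label, can be sketched as a softmax classification head (a hypothetical NumPy illustration; the label set, feature dimension, and random weights stand in for the dissertation's trained ResNeSt model):

```python
import numpy as np

LANGUAGES = ["Latin", "Korean", "Arabic", "Chinese"]   # illustrative label set

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_patch(features, W, b):
    """Map a pooled backbone feature vector for one text patch to language probabilities."""
    return softmax(W @ features + b)

rng = np.random.default_rng(1)
D = 8                              # toy feature dimension
W = rng.standard_normal((len(LANGUAGES), D))
b = np.zeros(len(LANGUAGES))
probs = classify_patch(rng.standard_normal(D), W, b)
pred = LANGUAGES[int(np.argmax(probs))]
```

In the full pipeline this step sits between the detector (which crops the patches) and a per-language recognizer chosen according to `pred`.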

    Comparing brain-like representations learned by vanilla, residual, and recurrent CNN architectures

    Though it has been hypothesized that state-of-the-art residual networks approximate the recurrent visual system, it remains to be seen whether the representations learned by these biologically inspired CNNs are actually closer to neural data. CNNs and DNNs that are most functionally similar to the brain are likely to contain mechanisms most like those the brain uses. In this thesis, we investigate how different CNN architectures approximate the representations learned along the ventral (object recognition and processing) stream of the brain. We specifically evaluate how recent approximations of biological neural recurrence, such as residual connections, dense residual connections, and a biologically inspired implementation of recurrence, affect the representations learned by each CNN. We first investigate the representations learned by layers throughout a few state-of-the-art CNNs: VGG-19 (a vanilla CNN), ResNet-152 (a CNN with residual connections), and DenseNet-161 (a CNN with dense connections). To control for differences in model depth, we then extend this analysis to the CORnet family of biologically inspired CNN models with matching high-level architectures. The CORnet family has three models: a vanilla CNN (CORnet-Z), a CNN with biologically valid recurrent dynamics (CORnet-R), and a CNN with both recurrent and residual connections (CORnet-S). We compare the representations of these six models to functionally aligned (via hyperalignment) fMRI brain data acquired during a naturalistic visual task. We take two approaches to comparing these CNN and brain representations. We first use forward encoding, a predictive approach that uses CNN features to predict neural responses across the whole brain. We next use representational similarity analysis (RSA) and centered kernel alignment (CKA) to measure the similarities between representations within CNN layers and specific brain ROIs.
We show that, compared to vanilla CNNs, CNNs with residual and recurrent connections exhibit representations that are even more similar to those learned by the human ventral visual stream. We also achieve state-of-the-art forward encoding and RSA performance with the residual and recurrent CNN models.
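Of the two similarity measures named above, linear CKA has a particularly compact form and can be sketched directly (a NumPy illustration with random matrices standing in for layer activations and ROI voxel responses, not the thesis's actual data or code):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation matrices.

    X: (n_samples, d1), Y: (n_samples, d2); rows are matched stimuli.
    Returns a value in [0, 1]; 1 means identical representational geometry.
    """
    X = X - X.mean(axis=0)                       # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, 'fro') ** 2    # alignment of the two Gram matrices
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

rng = np.random.default_rng(2)
layer_feats = rng.standard_normal((20, 10))      # e.g. one CNN layer's activations
roi_feats = rng.standard_normal((20, 7))         # e.g. voxel responses in one ROI
score = linear_cka(layer_feats, roi_feats)
self_score = linear_cka(layer_feats, layer_feats)  # a representation matches itself
```

Because CKA is invariant to orthogonal transforms and isotropic scaling of either representation, it can compare a CNN layer and a brain ROI even though their dimensionalities differ.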