42 research outputs found

    Spatio-temporal Video Re-localization by Warp LSTM

    The need to efficiently find the video content a user wants is growing as user-generated videos proliferate on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video, but not when and where. In this paper, we answer the question of when and where by formulating a new task, spatio-temporal video re-localization: given a query video and a reference video, the goal is to localize tubelets in the reference video that semantically correspond to the query. To accurately localize the desired tubelets, we propose a novel warp LSTM network, which propagates spatio-temporal information over long periods and thereby captures long-term dependencies. Another obstacle for spatio-temporal video re-localization is the lack of properly labeled video datasets, so we reorganize the videos in the AVA dataset into a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performance over the designed baselines on the spatio-temporal video re-localization task.
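
    As a rough, hypothetical sketch of the core idea (not the authors' released code), the snippet below shows a ConvLSTM-style cell whose hidden and cell states are warped along a flow field before the recurrent update, so that propagated memory follows the moving content; the names `WarpLSTMCell` and `warp`, the flow layout `(N, 2, H, W)`, and the use of `grid_sample` are all assumptions.

    ```python
    # Hypothetical sketch, not the paper's code: a ConvLSTM-style cell whose
    # previous hidden/cell states are backward-warped along a flow field before
    # the gate update, so propagated memory follows the moving content.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def warp(x, flow):
        """Backward-warp x (N,C,H,W) by flow (N,2,H,W); flow[:,0]=dx, flow[:,1]=dy."""
        n, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        base = torch.stack((xs, ys), dim=0).float().to(x.device)  # (2,H,W) pixel grid
        coords = base.unsqueeze(0) + flow                         # displaced coordinates
        gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0             # normalize to [-1,1]
        gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
        return F.grid_sample(x, torch.stack((gx, gy), dim=-1), align_corners=True)

    class WarpLSTMCell(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # a single convolution produces all four gates from [input, warped hidden]
            self.gates = nn.Conv2d(2 * channels, 4 * channels, 3, padding=1)

        def forward(self, x, h, c, flow):
            h, c = warp(h, flow), warp(c, flow)    # align memory with motion first
            i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c
    ```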

    Recent Trends and Techniques in Text Detection and Text Localization in a Natural Scene: A Survey

    Text information extraction from natural scene images is a rising area of research. Since text in natural scene images generally carries valuable details, detecting and recognizing scene text has been deemed essential for a variety of advanced computer vision applications. Much effort has gone into extracting text regions from scene text images in an effective and reliable manner. Because most text recognition applications demand robust algorithms for detecting and localizing text in a given scene image, researchers mainly focus on two important stages: text detection and text localization. This paper provides a review of various techniques for text detection and text localization.

    Image Quality Enhancement Techniques Using Deep Neural Networks

    Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Computational Science, College of Natural Sciences, 2021.8. Hyeongmin Noh. In this thesis, we focus on deep learning methods to enhance the quality of a single image. We first categorize the image quality enhancement problem into three tasks: denoising, deblurring, and super-resolution, then introduce deep learning techniques optimized for each problem. To solve these problems, we introduce a novel deep neural network suitable for multi-scale analysis and propose efficient model-agnostic methods that help the network extract information from high-frequency domains to reconstruct clearer images. Experiments on the SIDD, Flickr2K, DIV2K, and REDS datasets show that our method achieves state-of-the-art performance on each task. Furthermore, we show that our model can overcome the over-smoothing problem commonly observed in existing PSNR-oriented methods and generate more natural high-resolution images by applying adversarial training.
    Contents: 1. Introduction; 2. Preliminaries (Image Denoising: AWGN formulation and existing methods; Image Deblurring: blind-deblur formulation and existing methods; Single Image Super-Resolution: SISR formulation and existing methods); 3. Image Denoising (Multi-scale Edge Filtering; Feature Attention Module; Network Architecture; experiments on DIV2K+AWGN and SIDD); 4. Image Deblurring (Multi-Scale Feature Analysis; Network Architecture; experiments on Flickr2K and REDS); 5. Single Image Super-Resolution (High-Pass Filtering Loss; Gradient Magnitude Similarity Map Masking; Soft Gradient Magnitude Similarity Map Masking; Network Architecture; Adversarial Training for Perceptual Generative Model; experiments on DIV2K, Set5/Set14, and REDS); 6. Conclusion and Future Works.
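
    As one concrete illustration, the sketch below shows what a high-pass filtering loss of the kind listed in Chapter 5 could look like: the restored and ground-truth images are passed through a fixed high-pass filter and compared, encouraging the network to reconstruct high-frequency detail. The Laplacian kernel, the L1 metric, and the weighting are illustrative assumptions, not the thesis's actual design.

    ```python
    # Hypothetical sketch of a high-pass filtering loss: penalize the difference
    # between the high-frequency components of restored and ground-truth images.
    # The fixed Laplacian kernel and the weighting are assumptions.
    import torch
    import torch.nn.functional as F

    _LAPLACIAN = torch.tensor([[0., 1., 0.],
                               [1., -4., 1.],
                               [0., 1., 0.]]).view(1, 1, 3, 3)

    def high_pass(x):
        """Apply the Laplacian high-pass filter to each channel of x (N,C,H,W)."""
        c = x.shape[1]
        kernel = _LAPLACIAN.to(x.device).repeat(c, 1, 1, 1)
        return F.conv2d(x, kernel, padding=1, groups=c)  # depthwise filtering

    def high_pass_loss(restored, target, weight=0.1):
        """Pixel-wise L1 loss plus a weighted L1 loss on high-frequency residuals."""
        return F.l1_loss(restored, target) + weight * F.l1_loss(
            high_pass(restored), high_pass(target))
    ```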

    DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

    Denoising diffusion probabilistic models, initially proposed for realistic image generation, have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are increasingly gaining attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial because of the additional temporal dimension in videos. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints, yet it remains unclear how diffusion-based methods can be adapted to achieve this objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose a SpatioTemporal Representation Learner that aggregates visual evidence across frames and uses the resulting features in each denoising step as a condition. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction that determines the correlations between local joints and global contexts across multiple scales, generating fine-grained representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
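
    A minimal, hypothetical sketch of the sampling side of such a formulation is given below: starting from Gaussian noise, a denoiser iteratively refines keypoint heatmaps while being conditioned at every step on features aggregated across frames. The `denoiser(x, t, cond)` interface and the DDPM-style update are assumptions, not the authors' implementation.

    ```python
    # Hypothetical sketch, not the authors' implementation: DDPM-style ancestral
    # sampling of keypoint heatmaps, conditioned at every step on features
    # aggregated across video frames.
    import torch

    @torch.no_grad()
    def sample_heatmaps(denoiser, cond_features, shape, betas):
        """denoiser(x_t, t, cond) is assumed to predict the noise present in x_t."""
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        x = torch.randn(shape)                      # start from pure Gaussian noise
        for t in reversed(range(len(betas))):
            eps = denoiser(x, t, cond_features)     # conditioned noise prediction
            x = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:                               # no added noise on the final step
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x                                    # refined keypoint heatmaps
    ```

    Because conditioning enters only through the denoiser call, the length of the `betas` schedule can be varied at inference time, which is one way to realize property (ii) above without retraining.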

    Visual and Camera Sensors

    This book includes 13 papers published in the Special Issue "Visual and Camera Sensors" of the journal Sensors. The goal of this Special Issue was to invite high-quality, state-of-the-art research papers dealing with challenging issues in visual and camera sensors.

    Evolution of A Common Vector Space Approach to Multi-Modal Problems

    A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video frame background modeling involved methods that ranged from simple and efficient to accurate but computationally complex. It will be shown in this research that the novel approach to object extraction is efficient and effective, outperforming existing state-of-the-art methods. However, the drawback of this method is its inability to deal with non-rigid motion. With the rapid development of artificial neural networks, deep learning approaches are explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a common vector space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNN); greater RNN depth leads to smaller error, but it makes the gradient in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In BNRHN, the highway layers incorporate batch normalization, which diminishes the gradient vanishing and exploding problem. In addition, a sentence-to-vector encoding framework suitable for advanced natural language processing is developed. This semantic text embedding makes use of an encoder-decoder model trained on sentence paraphrase pairs (text-to-text). With this scheme, the latent representation of the text is shown to encode sentences with common semantic information in similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), builds and encodes different modalities into a common vector space, preserving semantics and making conversion between text and image bidirectional. The concept of the CVS is introduced in this research to deal with multi-modal conversion problems. In theory, this method works not only on text and image but can also be generalized to other modalities, such as video and audio. The characteristics and performance are supported by both theoretical analysis and experimental results. Interestingly, the MMVR model is one of many possible ways to build a CVS. In the final stages of this research, a simple and straightforward framework to build a CVS, considered an alternative to the MMVR model, is presented.
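
    As a hypothetical sketch (the work's exact design is not reproduced here), one batch-normalized highway layer of the kind BNRHN stacks could look like the following, with batch normalization applied to the gate and candidate pre-activations to tame vanishing and exploding gradients; layer sizes and the coupled gate are assumptions.

    ```python
    # Hypothetical sketch of one batch-normalized highway layer: a gated blend
    # of a candidate transform and the carried-over state, with batch norm on
    # the pre-activations to stabilize recurrent gradients. Not the paper's code.
    import torch
    import torch.nn as nn

    class BNHighwayLayer(nn.Module):
        def __init__(self, size):
            super().__init__()
            self.h_lin = nn.Linear(size, size)   # candidate transform
            self.t_lin = nn.Linear(size, size)   # transform gate
            self.h_bn = nn.BatchNorm1d(size)
            self.t_bn = nn.BatchNorm1d(size)

        def forward(self, s):
            h = torch.tanh(self.h_bn(self.h_lin(s)))     # normalized candidate
            t = torch.sigmoid(self.t_bn(self.t_lin(s)))  # normalized gate
            return t * h + (1.0 - t) * s                 # coupled highway update
    ```

    Stacking several such layers inside each recurrent step is what gives a recurrent highway network its depth while the gating keeps an identity path open for gradients.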

    Dataset Pre-Processing and Artificial Augmentation, Network Architecture and Training Parameters used in Appropriate Training of Convolutional Neural Networks for Classification Based Computer Vision Applications: A Survey

    Training a Convolutional Neural Network (CNN) based classifier depends on a large number of factors: aggregating an apt dataset, arriving at a suitable CNN architecture, processing the dataset, and selecting the training parameters needed to reach the desired classification results. This review covers the pre-processing and dataset augmentation techniques used in various CNN-based classification studies. In many classification problems, the quality of the dataset governs how well the CNN trains, and this quality is judged by the variation in data for every class. A ready-made dataset with such variation is rarely available, and although a large dataset is recommended, one is usually not available directly either. In some cases, the noise present in the dataset may not prove useful for training, while in others, researchers prefer to add noise to certain images to make the network less vulnerable to unwanted variations. Hence, researchers use artificial digital imaging techniques to derive variations in the dataset and to remove or add noise. The presented paper therefore accumulates state-of-the-art works that applied pre-processing and artificial augmentation to the dataset before training. The counterpart to data augmentation is training itself, which requires proper selection of several parameters and a suitable CNN architecture. This paper also covers the network characteristics, dataset characteristics, and training methodologies used in biomedical imaging, vision modules of autonomous driverless cars, and a few general vision-based applications.
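
    As a small, hypothetical example of the kind of artificial augmentation pipeline such surveys catalog, the sketch below combines geometric and photometric variations with optional additive Gaussian noise; the specific transforms and parameter values are illustrative assumptions, not recommendations from the survey.

    ```python
    # Hypothetical augmentation pipeline: geometric and photometric variation
    # plus additive Gaussian noise for robustness. Parameters are illustrative.
    import torch
    from torchvision import transforms

    def add_gaussian_noise(img, sigma=0.05):
        """Add zero-mean Gaussian noise to a tensor image in [0, 1]."""
        return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),       # scale/translation variation
        transforms.RandomHorizontalFlip(),       # mirror variation
        transforms.ColorJitter(0.2, 0.2, 0.2),   # photometric variation
        transforms.ToTensor(),
        transforms.Lambda(add_gaussian_noise),   # robustness to sensor noise
    ])
    ```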