246 research outputs found

    Super-resolution of low-quality videos for forensic, surveillance, and mobile applications

    Advisors: Siome Klein Goldenstein, Anderson de Rezende Rocha. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Super-resolution (SR) algorithms are methods for achieving high-resolution (HR) enlargements of pixel-based images. In multi-frame super-resolution, a set of low-resolution (LR) images of a scene is combined to construct an image with higher resolution. Super-resolution is an inexpensive solution to overcome the limitations of image acquisition hardware, and can be useful in the many cases in which the device cannot be upgraded or replaced but multiple frames of the same scene can be obtained. In this work, we explore SR possibilities for natural images in scenarios wherein we have multiple frames of the same scene. We design and develop five variations of an algorithm that explores geometric properties to combine pixels from LR observations into an HR grid; two variations of a method that combines inpainting techniques with multi-frame super-resolution; and three variations of an algorithm that uses adaptive filtering and Tikhonov regularization to solve a least-squares problem. Multi-frame super-resolution is possible when there is motion and non-redundant information across the LR observations. However, combining a large number of frames into a higher-resolution image may not be computationally feasible with complex super-resolution techniques. The first application of the proposed methods is in consumer-grade photography, with a setup in which several low-resolution images gathered by recent mobile devices are combined to create a much higher-resolution image. Such an always-on, low-power environment requires effective, high-performance algorithms that run fast with a low memory footprint. The second application is in digital forensics, with a setup in which low-quality surveillance cameras throughout a city could provide important cues to identify a suspect vehicle, for example, at a crime scene. However, license-plate recognition is especially difficult at poor image resolutions. Hence, we design and develop a novel, free and open-source framework underpinned by SR and Automatic License-Plate Recognition (ALPR) techniques to identify license-plate characters in low-quality real-world traffic videos, captured by cameras not designed for the ALPR task, aiding forensic analysts in understanding an event of interest. The framework handles the necessary conditions to identify a target license plate, using a novel methodology to locate, track, align, super-resolve, and recognize its alphanumerics. The user receives as outputs the rectified and super-resolved license-plate image, richer in details, and also the sequence of license-plate characters automatically recognized in the super-resolved image. We present quantitative and qualitative validations of the proposed algorithms and their applications. Our experiments show, for example, that SR can increase the number of correctly recognized characters, positioning the framework as an important step toward providing forensic experts and practitioners with a solution to the license-plate recognition problem under difficult acquisition conditions. Finally, we also suggest a minimum number of images to use as input in each application.
    Doctorate in Computer Science. Funding: CAPES, CNPq.
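The adaptive-filtering variant mentioned in the abstract solves a Tikhonov-regularized least-squares problem. As a rough illustration of that formulation (not the thesis's actual method), a minimal 1-D sketch, in which each LR frame is a shifted and decimated view of the HR signal, might look like:

```python
import numpy as np

# Hedged sketch: multi-frame SR as Tikhonov-regularized least squares.
# Each LR frame y_k is modeled as y_k = A_k @ x, where A_k combines a
# sub-pixel shift and 2x decimation of the HR signal x (1-D for brevity).
# The forward model, shift set, and regularization weight are illustrative.

def shift_decimate_matrix(n_hr, shift, factor=2):
    """Operator that circularly shifts an HR signal, then decimates it."""
    S = np.roll(np.eye(n_hr), shift, axis=1)  # integer circular shift
    D = np.eye(n_hr)[::factor]                # keep every `factor`-th sample
    return D @ S

def tikhonov_sr(lr_frames, shifts, n_hr, lam=1e-2, factor=2):
    """Solve min_x sum_k ||A_k x - y_k||^2 + lam ||x||^2 in closed form."""
    A = np.vstack([shift_decimate_matrix(n_hr, s, factor) for s in shifts])
    y = np.concatenate(lr_frames)
    return np.linalg.solve(A.T @ A + lam * np.eye(n_hr), A.T @ y)

# Toy demo: recover an 8-sample HR signal from two shifted 4-sample LR frames.
x_true = np.arange(8, dtype=float)
shifts = [0, 1]
frames = [shift_decimate_matrix(8, s) @ x_true for s in shifts]
x_hat = tikhonov_sr(frames, shifts, n_hr=8, lam=1e-6)
```

With shifts that make the combined observations cover every HR sample, the regularized solve recovers the HR signal almost exactly; real SR adds blur modeling and sub-pixel registration on top of this skeleton.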

    Deep learning for small object detection

    Small object detection has become increasingly relevant because the performance of common object detectors falls significantly as objects become smaller. Many computer vision applications require analyzing the entire set of objects in an image, including extremely small ones. Moreover, detecting small objects makes it possible to perceive objects at a greater distance, thus giving more time to adapt to any situation or unforeseen event.

    Multi-wavelength infrared imaging computer systems and applications

    This dissertation presents the development of three computer systems for multi-wavelength thermal imaging. Two computer systems were developed for multi-wavelength imaging pyrometers (M-WIPs) that yield non-contact temperature measurements by remotely sensing the surfaces of objects with unknown wavelength-dependent emissivity. These M-WIP computer systems represent the state of the art in remote temperature measurement based on the multi-wavelength approach. The dissertation research includes M-WIP computer system integration, software development, performance evaluation, and applications in monitoring and controlling the temperature distribution of silicon wafers in a rapid thermal processing system. The two M-WIPs are capable of data acquisition, signal processing, system calibration, radiometric measurement, parallel processing, and process control. Temperature measurement experiments demonstrated an accuracy of ±1°C against a blackbody and ±4°C for colorbody objects. Various algorithms were developed and implemented, including real-time two-point non-uniformity correction, thermal image pseudocoloring, PC to SUN workstation data transfer, automatic IR camera integration-time control, and parallel processing for radiometric measurement. A third computer system was developed to demonstrate a 3-color InGaAs FPA that can provide images with information in three different IR wavelength ranges simultaneously. A number of functions were developed to demonstrate and characterize 3-color FPAs, and the system was delivered to the 3-color FPA manufacturer for use.
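The real-time two-point non-uniformity correction mentioned above is a standard focal-plane-array calibration technique. A minimal sketch, assuming two uniform blackbody calibration frames and a simple per-pixel linear response model (array sizes and values are illustrative, not from the dissertation):

```python
import numpy as np

# Two-point NUC: two uniform calibration frames (e.g. a cold and a hot
# blackbody) give each pixel a gain and offset that map its raw response
# onto the mean detector response, removing fixed-pattern noise.

def two_point_nuc(cold, hot):
    """Per-pixel gain/offset from two uniform calibration frames."""
    gain = (hot.mean() - cold.mean()) / (hot - cold)
    offset = cold.mean() - gain * cold
    return gain, offset

def correct(raw, gain, offset):
    """Apply the per-pixel linear correction to a raw frame."""
    return gain * raw + offset

# Simulate a 4x4 FPA with fixed-pattern gain/offset non-uniformity.
rng = np.random.default_rng(0)
true_gain = 1.0 + 0.1 * rng.standard_normal((4, 4))
true_off = 5.0 * rng.standard_normal((4, 4))
scene = lambda level: true_gain * level + true_off   # pixel response model

gain, offset = two_point_nuc(scene(100.0), scene(1000.0))
corrected = correct(scene(500.0), gain, offset)      # mid-range uniform scene
```

For a truly uniform scene, the corrected frame comes out flat: every pixel maps to the same value, which is the point of the correction.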

    Facial Texture Super-Resolution by Fitting 3D Face Models

    This book proposes to solve the low-resolution (LR) facial analysis problem with 3D face super-resolution (FSR). A complete processing chain is presented for effective 3D FSR in the real world. To deal with the extreme challenges of incorporating 3D modeling under the ill-posed LR condition, a novel workflow coupling automatic localization of 2D facial feature points with 3D shape reconstruction is developed, leading to a robust pipeline for pose-invariant hallucination of the 3D facial texture.

    Artificial Intelligence for Multimedia Signal Processing

    Artificial intelligence technologies are being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted across a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years these efforts have improved image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies for media creation, processing, editing, and scenario generation are very important research areas in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing: computer vision, speech/sound/text processing, and content analysis/information mining.

    Improving Object Detection under Challenging Conditions of Size, Occlusion, and Labels

    Thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Gunhee Kim.
    Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality. Due to its important role in computer vision, object detection has been studied for decades and has advanced dramatically with the emergence of deep neural networks. Despite the recent rapid advancement, however, the performance of many detection models is limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems they cause. We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has an advantage in terms of speed. More specifically, we introduce an additional post-processing module that refines initial prediction results based on reliable classification of a person's body parts and of the image grid. We then study how to better detect small objects for general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, the performance of two-stage models on small objects is still far below human-level performance. This is mainly due to the lack of information in the features of a small region of interest. In this thesis, we propose a feature-level super-resolution method based on two-stage object detection models to improve the detection of small objects. More specifically, by properly pairing input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve detection performance on small objects. Lastly, we address object detection under weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding-box annotations, one dominant approach to WSOL uses class activation maps (CAM), which are generated through classification training and used to estimate object locations. However, since a classification model is trained to concentrate on discriminative features, the localization results are often limited to small object regions. To resolve this problem, we propose methods that properly utilize the information in class activation maps. The proposed methods significantly improved the performance of base models on each benchmark dataset and achieved state-of-the-art performance in some settings. Thanks to their flexibility, they are also expected to apply to more recent models, yielding additional performance improvements.
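As background for the weakly supervised localization part, GAP-based CAM localization (the baseline the thesis improves on) can be sketched with numpy. The feature tensor, classifier weights, and threshold below are toy assumptions for illustration, not the thesis's configuration:

```python
import numpy as np

# GAP-based CAM: the class activation map is a weighted sum over channels
# of the last conv feature maps, using the fully connected weights learned
# after global average pooling (GAP). Thresholding the map gives a box.

def class_activation_map(features, weights, class_idx):
    """features: (C, H, W) conv activations; weights: (num_classes, C)."""
    cam = np.tensordot(weights[class_idx], features, axes=1)  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                                      # scale to [0, 1]
    return cam

def localize(cam, threshold=0.2):
    """Bounding box (y_min, x_min, y_max, x_max) of the thresholded CAM."""
    ys, xs = np.nonzero(cam >= threshold)
    return ys.min(), xs.min(), ys.max(), xs.max()

# Toy example: one channel fires on a 2x2 object region.
features = np.zeros((3, 8, 8))
features[0, 2:4, 2:4] = 1.0
weights = np.array([[1.0, 0.0, 0.0]])  # class 0 attends only to channel 0
cam = class_activation_map(features, weights, class_idx=0)
box = localize(cam)                    # (2, 2, 3, 3)
```

The thesis's rectified variant modifies how the pooled activations and weights enter this computation (e.g. thresholded average pooling and negative-weight clamping) so the thresholded region covers more of the object.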

    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and rehabilitation of persons who have undergone laryngectomy. In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical-flow-based approaches to visual feature extraction, which capture mouth motion in an image sequence. The motivation for using motion features is that human lip-reading perception is concerned with the temporal dynamics of mouth motion. The first approach is based on extracting features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the issue of large variation in speaking speed, each utterance is normalized using a simple linear interpolation method.
In the second approach, four directional motion templates based on optical flow, called directional motion history images (DMHIs), are developed, each representing the consolidated motion information of an utterance in one of four directions (i.e., up, down, left, and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is motion overwriting caused by self-occlusion; DMHIs resolve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs. A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component and from the Zernike and Hu moments separately. For identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed lip-reading methods. The experimental results demonstrate the promising performance of the optical-flow-based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. A video-based ad hoc temporal segmentation method for isolated utterances is also proposed in the thesis. It detects the start and end frames of an utterance in an image sequence, based on a pair-wise pixel comparison method. The efficiency of the proposed technique was tested on the available dataset with short pauses between utterances.
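The first approach's block-wise statistics of the vertical optical-flow component, together with the linear-interpolation length normalization, can be sketched as follows. The block size, choice of statistics (mean and variance), and the toy flow fields are assumptions; a real pipeline would compute the flow with an actual optical-flow routine:

```python
import numpy as np

# Sketch of the first feature-extraction approach: the vertical optical-flow
# component of each frame pair is split into non-overlapping fixed-size
# blocks, and per-block statistics form the feature vector. Utterances of
# different speeds are then resampled to a common length by interpolation.

def block_features(flow_v, block=4):
    """Mean and variance of each non-overlapping block of the vertical flow."""
    h, w = flow_v.shape
    feats = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            patch = flow_v[y:y + block, x:x + block]
            feats.extend([patch.mean(), patch.var()])
    return np.array(feats)

def normalize_length(sequence, target_len):
    """Linearly resample a per-frame feature matrix to a fixed frame count."""
    src = np.linspace(0.0, 1.0, len(sequence))
    dst = np.linspace(0.0, 1.0, target_len)
    cols = np.array(sequence).T
    return np.stack([np.interp(dst, src, col) for col in cols], axis=1)

# Toy utterance: three 8x8 vertical-flow frames -> fixed-length feature matrix.
frames = [np.full((8, 8), float(t)) for t in range(3)]
feats = np.stack([block_features(f) for f in frames])  # (3, 8): 4 blocks x 2 stats
feats5 = normalize_length(feats, target_len=5)         # resampled to 5 frames
```

The same resampling step would let utterances spoken at different speeds produce feature matrices of identical shape before SVM classification.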