
33 research outputs found

Structured Understanding of Visual Data Using Partial Information: Sparsity, Randomness, Relevance, and Deep Networks

Author: 이동훈
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2019

Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 2. Oh, Songhwai. For a deeper understanding of visual data, the relationship between local parts and a global scene has to be carefully examined. Examples of such relationships in vision problems include, but are not limited to, detecting a region of interest in the scene, classifying an image based on limited visual cues, and synthesizing new images conditioned on local or global inputs. In this thesis, we aim to learn the relationship and demonstrate its importance by showing that it is one of the critical keys to addressing four challenging vision problems. For each problem, we construct deep neural networks suited to the task. The first problem considered in the thesis is object detection. It requires not only finding local patches that look like target objects conditioned on the context of the input scene but also comparing local patches with one another to assign a single detection to each object. To this end, we introduce individualness of detection candidates as a complement to objectness for object detection. The individualness assigns a single detection to each object out of raw detection candidates given by either object proposals or sliding windows. We show that conventional approaches, such as non-maximum suppression, are sub-optimal since they suppress nearby detections using only detection scores. We use a determinantal point process combined with the individualness to optimally select final detections. It models each detection using its quality and its similarity to other detections based on the individualness. Then, detections with high detection scores and low correlations are selected by measuring their probability using the determinant of a matrix composed of quality terms on the diagonal entries and similarities on the off-diagonal entries.
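The determinant-based selection described above can be illustrated with a minimal sketch. This is not the thesis's mode-finding algorithm: it is a plain greedy search, and the kernel form L[i][j] = q_i * S[i][j] * q_j, the function names, and the toy numbers are all assumptions made for illustration. Quality scores are assumed to be scaled so that good detections exceed 1, since otherwise adding a detection cannot increase the determinant.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return result


def greedy_dpp_select(quality, similarity):
    """Greedily pick detections that increase det(L restricted to the picks).

    Assumed kernel: L[i][j] = quality[i] * similarity[i][j] * quality[j],
    with similarity[i][i] == 1.
    """
    n = len(quality)
    kernel = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
              for i in range(n)]
    selected = []
    best = 1.0  # determinant of the empty selection
    while True:
        candidates = []
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            candidates.append((det(sub), i))
        if not candidates:
            break
        value, index = max(candidates)
        if value <= best:  # no candidate improves the determinant
            break
        selected.append(index)
        best = value
    return selected


# Toy example: detections 0 and 1 overlap heavily, detection 2 is separate.
quality = [2.0, 1.8, 1.5]
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = greedy_dpp_select(quality, similarity)
```

In this toy case the search keeps detections 0 and 2 and drops detection 1, whose high similarity to detection 0 shrinks the determinant even though its score is high.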
For concreteness, we focus on the pedestrian detection problem, as it is one of the most challenging problems due to frequent occlusions and unpredictable human motions. Experimental results demonstrate that the proposed algorithm performs favorably against existing methods, including non-maximum suppression and a quadratic unconstrained binary optimization based method. For the second problem, we classify images based on observations of local patches. More specifically, we consider the problem of estimating the head pose and body orientation of a person from a low-resolution image. Under this setting, it is difficult to reliably extract facial features or detect body parts. We propose a convolutional random projection forest (CRPforest) algorithm for these tasks. A convolutional random projection network (CRPnet) is used at each node of the forest. It maps an input image to a high-dimensional feature space using a rich filter bank. The filter bank is designed to generate sparse responses so that they can be efficiently computed by compressive sensing. A sparse random projection matrix can capture the most essential information contained in the filter bank without using all the filters in it. Therefore, the CRPnet is fast, e.g., it requires 0.04 ms to process an image of 50×50 pixels, due to the small number of convolutions (e.g., 0.01% of a layer of a neural network), at the cost of less than 2% in accuracy. The overall forest estimates head and body pose well on benchmark datasets, e.g., over 98% on the HIIT dataset, while requiring 3.8 ms without using a GPU. Extensive experiments on challenging datasets show that the proposed algorithm performs favorably against state-of-the-art methods on low-resolution images with noise, occlusion, and motion blur. Then, we shift our attention to image synthesis based on the local-global relationship.
Learning how to synthesize and place object instances into an image (semantic map) based on the scene context is a challenging and interesting problem in vision and learning. On one hand, solving this problem requires a joint decision of (a) generating an object mask from a certain class at a plausible scale, location, and shape, and (b) inserting the object instance mask into an existing scene so that the synthesized content is semantically realistic. On the other hand, such a model can synthesize realistic outputs to potentially facilitate numerous image editing and scene parsing tasks. In this paper, we propose an end-to-end trainable neural network that can synthesize and insert object instances into an image via a semantic map. The proposed network contains two generative modules that determine where the inserted object should be (i.e., location and scale) and what the object shape (and pose) should look like. The two modules are connected together with a spatial transformation network and jointly trained and optimized in a purely data-driven way. Specifically, we propose a novel network architecture with parallel supervised and unsupervised paths to guarantee diverse results. We show that the proposed network architecture learns the context-aware distribution of the location and shape of object instances to be inserted, and it can generate realistic and statistically meaningful object instances that simultaneously address the where and what sub-problems. As the final topic of the thesis, we introduce a new vision problem: generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem since it requires generating realistic images and predicting locations of parts at the same time. We construct adversarial networks to tackle this problem. 
A generator network generates a fake image as well as a mask based on the encoder-decoder framework, while a discriminator network aims to detect fake images. The network is trained with three losses that account for spatial, appearance, and adversarial information. The spatial loss determines whether the locations of predicted parts are correct. The appearance loss ensures that input patches are restored in the output image without much modification. The adversarial loss ensures that output images are realistic. The proposed network is trained without supervisory signals since no labels of key parts are required. Experimental results on seven datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.

To understand visual data in depth, the relationships or interactions between the global region and its local regions must be analyzed carefully. Related computer vision problems include detecting a desired region in an image, classifying a whole image from limited partial information, and generating a desired image from given information. This thesis aims to show that learning these relationships is an important key to solving the problems mentioned above. In addition, it discusses the design of deep networks suited to each problem. The first topic is an analysis of object detection. This problem requires not only finding regions that resemble the target object but also assigning exactly one detection to each object by analyzing the relationships among the regions found. To this end, we propose the concept of individualness as a complement to objectness. It is used to assign one candidate region, obtained by any method, to each object; it was introduced to improve on conventional approaches such as non-maximum suppression, which post-process using detection scores alone and therefore inevitably yield sub-optimal results. To select the optimal regions from the candidates, we use a determinantal point process, a kind of random process. It first models each detection by its quality (detection score) and its similarity (correlation) to other detections, computed from the individualness. The probability that a detection is selected is then expressed as the determinant of a kernel built from quality and similarity. The diagonal entries of the kernel hold the quality terms and the off-diagonal entries hold the similarities. Consequently, for a candidate to have a high probability of being selected as a final detection, it must have high quality and low similarity to the other detections. This thesis focuses on pedestrian detection because it is an important problem and pedestrians are hard to detect: compared with other objects, they are frequently occluded and move in varied ways.
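Experimental results show that the proposed method outperforms non-maximum suppression and quadratic unconstrained binary optimization based methods. The next problem is classifying a whole image from partial information. Among the many classification problems, this thesis focuses on estimating the direction a person's head and body are facing from a low-resolution image. In this setting it is difficult to locate the eyes, nose, and mouth or to identify body parts accurately. For this task we propose the convolutional random projection forest (CRPforest). Each node of the forest contains a convolutional random projection network (CRPnet), which maps the input image to a high-dimensional space using a variety of filters. To handle this efficiently, the filters are chosen to produce sparse responses so that compressive sensing can be applied. In other words, the goal is to capture all the important information in the image while actually using only a small number of filters. As a result, the CRPnet is very fast, processing a 50×50-pixel image in 0.04 ms, while the drop in performance is only about 2%. The full forest built on it runs within 3.8 ms without a GPU and achieves the best performance on various datasets for head and body orientation estimation. It also performs well under low resolution, noise, occlusion, and blur. Next, we explore image generation based on the local-global relationship. Inferring which object to place in an input image, and how, is a very interesting problem for computer vision and machine learning. It requires generating an object mask with an appropriate size, location, and shape while ensuring that the object still looks plausible once placed in the input image. If this is achieved, it can be applied to various problems such as image editing and scene parsing. In this thesis we construct a deep network, trainable end to end, for placing a new object at a suitable location given an input semantic map. The network is built from a where module and a what module, connected through a spatial transformer network so that they can be trained jointly. In addition, each module has supervised and unsupervised paths arranged in parallel so that diverse results can be obtained from the same input. Experiments show that the proposed method can jointly learn the distributions of the location and shape of the object to be inserted and, from those distributions, place realistic objects at suitable locations. The final problem, new to computer vision, is reconstructing a whole image from a small number of local patches whose location information has been lost. It is difficult because the location of each patch must be inferred at the same time as the image is generated. We address it with adversarial networks. The generator network uses an encoder-decoder design to produce an image and a location mask, whereas the discriminator network tries to identify generated fake images. The whole network is trained with three objective functions: spatial, appearance, and adversarial. The spatial objective is used to predict the correct locations, the appearance objective keeps the input patches in the output image with little modification, and the adversarial objective makes the generated images resemble real ones.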
A network built this way has the advantage that it can be trained on existing datasets without additional annotation. Experiments also show that the proposed method works well on a variety of datasets.

1 Introduction
1.1 Organization of the Dissertation
2 Related Work
2.1 Detection methods
2.2 Orientation estimation methods
2.3 Instance synthesis methods
2.4 Image generation methods
3 Pedestrian detection
3.1 Introduction
3.2 Proposed Algorithm
3.2.1 Determinantal Point Process Formulation
3.2.2 Quality Term
3.2.3 Individualness and Diversity Feature
3.2.4 Mode Finding
3.2.5 Relationship to Quadratic Unconstrained Binary Optimization
3.3 Experiments
3.3.1 Experimental Settings
3.3.2 Evaluation Results
3.3.3 DET curves
3.3.4 Sensitivity analysis
3.3.5 Effectiveness of the quality and similarity term design
3.4 Summary
4 Head and body orientation estimation
4.1 Introduction
4.2 Algorithmic Overview
4.3 Rich Filter Bank
4.3.1 Compressed Filter Bank
4.3.2 Box Filter Bank
4.4 Convolutional Random Projection Net
4.4.1 Input Layer
4.4.2 Convolutional and ReLU Layers
4.4.3 Random Projection Layer
4.4.4 Fully-Connected and Output Layers
4.5 Convolutional Random Projection Forest
4.6 Experimental Results
4.6.1 Evaluation Datasets
4.6.2 CRPnet Characteristics
4.6.3 Head and Body Orientation Estimation
4.6.4 Analysis of the Proposed Algorithm
4.6.5 Classification Examples
4.6.6 Regression Examples
4.6.7 Experiments on the Original Datasets
4.6.8 Dataset Corrections
4.7 Summary
5 Instance synthesis and placement
5.1 Introduction
5.2 Approach
5.2.1 The where module: learning a spatial distribution of object instances
5.2.2 The what module: learning a shape distribution of object instances
5.2.3 The complete pipeline
5.3 Experimental Results
5.4 Summary
6 Image generation
6.1 Introduction
6.2 Proposed Algorithm
6.2.1 Key Part Detection
6.2.2 Part Encoding Network
6.2.3 Mask Prediction Network
6.2.4 Image Generation Network
6.2.5 Real-Fake Discriminator Network
6.3 Experiments
6.3.1 Datasets
6.3.2 Image Generation Results
6.3.3 Experimental Details
6.3.4 Image Generation from Local Patches
6.3.5 Part Combination
6.3.6 Unsupervised Feature Learning
6.3.7 An Alternative Objective Function
6.3.8 An Alternative Network Structure
6.3.9 Different Number of Input Patches
6.3.10 Smaller Size of Input Patches
6.3.11 Degraded Input Patches
6.3.12 User Study
6.3.13 Failure cases
6.4 Summary
7 Conclusion and Future Work
Docto

Use of Synthetic Data for 3D Hand Pose Recognition

Author: Yang John
Publication venue: Seoul National University Graduate School
Publication date: 01/08/2021

Doctoral dissertation -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), 2021. 8. 양한열. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on the optimization of neural frameworks for graphically connected finger joints. Training RGB-based HPE models has not been easy because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes it difficult to accurately annotate each joint with unique 3D world coordinates, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset consists of very precise ground-truth annotations and further allows control over the variety of data samples, so that a learning model can be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the need for a large-scale dataset of sequential RGB hand images.
We propose a novel method that generates a synthetic dataset that mimics natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements, which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation by outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB, and viewpoint spaces. We further propose to augment the data automatically such that the augmented pose sampling favors the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples at every update iteration. The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. This improves training efficiency by requiring fewer real data samples, and the efficient augmentation enhances both generalization across multiple dataset domains and estimation performance.

Research on recognizing and reconstructing the shape and pose of a human hand from 2D images aims to detect the 3D location of each finger joint.
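A hand pose consists of finger joints, the anatomical elements that make up a human hand, from the wrist joint through the MCP, PIP, and DIP joints. Hand pose information can be used in many fields, and in hand gesture recognition research it serves as an excellent input feature. Applying hand pose estimation to real systems requires high accuracy, real-time operation, and a model light enough to run on a variety of devices, and training a neural network model that achieves this requires a large amount of data. However, the devices that measure hand pose are rather unstable, and images of hands wearing these devices look very different from bare skin, which makes them unsuitable for training. To address this problem, this dissertation reprocesses and augments synthetically generated data for training and thereby aims to achieve better learning results. Synthetic hand images may resemble real skin color, but their fine texture differs considerably, so a model trained on synthetic data performs markedly worse on real hand data. To narrow the gap between the two domains, we first have the network learn the structure of the human hand from reprocessed hand motions, and then train only the remaining parts, excluding the temporal information learned from that motion structure, on real hand images; this was highly effective. For this we present a method that mimics real human hand motion. Second, we align the data from the two domains in the network's feature space. Furthermore, rather than augmenting synthetic poses with specific data, we propose a scheme that defines a probabilistic model from which poses the network has rarely seen are sampled. By using synthetic data more effectively, this dissertation proposes methods that not only generate synthetic data more effectively, without the effort of collecting more hard-to-annotate real data, but also improve pose estimation in a safer way by exploiting local and temporal features. We also propose an automatic data augmentation method that lets the network find and learn from the data it needs on its own. Combining the proposed methods yields better hand pose estimation performance.

1. Introduction
2. Related Works
3. Preliminaries: 3D Hand Mesh Model
4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation
5. Hand Pose Auto-Augment
6. Conclusion
Abstract (Korean)
Acknowledgments
박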

Mathematical Approaches for Image Enhancement Problems

Author: Cho Dongwook
Publication venue
Publication date: 01/01/2012

This thesis develops novel techniques for several image enhancement problems using mathematical tools that are theoretically and technically proven and highly useful in image processing, such as wavelet transforms, partial differential equations, and variational models. Three subtopics are covered. First, a color image denoising framework is introduced that achieves high-quality denoising results by considering correlations between color components, while allowing existing denoising approaches to be plugged in flexibly. Second, a new and efficient framework for image contrast and color enhancement in the compressed wavelet domain is proposed. The proposed approach is capable of enhancing both global and local contrast and brightness as well as preserving color consistency. The framework does not require an inverse transform for image enhancement, since linear scale factors are applied directly to both scaling and wavelet coefficients in the compressed domain, which results in high computational efficiency. Noise in the image can also be reduced efficiently by adaptively introducing wavelet shrinkage terms at different scales. The proposed method can enhance a wavelet-coded image with high computational efficiency, high image quality, and little noise or other artifacts. The experimental results show that the proposed method produces encouraging results both visually and numerically compared to some existing approaches. Finally, the image inpainting problem is discussed. A literature review, psychological analysis, and the challenges of image inpainting and related topics are described. An inpainting algorithm using energy minimization and texture mapping is proposed. The Mumford-Shah energy minimization model detects and preserves edges in the inpainting domain by detecting both the main structure and the detailed edges. This approach uses a faster hierarchical level set method and guarantees convergence independent of initial conditions.
The estimated segmentation results in the inpainting domain are stored in a segmentation map, which is referenced by a texture mapping algorithm for filling textured regions. We also propose an inpainting algorithm using the wavelet transform, which is expected to give better global structure estimation of the unknown region in addition to shape and texture properties, since wavelet transforms have been used for various image analysis problems owing to their multi-resolution properties and decoupling characteristics.
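The wavelet-domain enhancement mentioned in the abstract above (linear scale factors on scaling and wavelet coefficients, plus wavelet shrinkage for noise) can be sketched roughly as follows. This is only an illustration under assumptions: a one-level 1D Haar transform stands in for whatever wavelet coding the thesis uses, soft thresholding stands in for its shrinkage terms, and the function names, gains, and threshold are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * _S, (a - d) * _S])
    return out


def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero by `threshold` (soft thresholding)."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def enhance_in_wavelet_domain(approx, detail, brightness_gain, contrast_gain, threshold):
    """Scale coefficients directly, with shrinkage applied to the details.

    Works on the coefficients themselves, so a wavelet-coded input
    needs no inverse transform before enhancement.
    """
    new_approx = [brightness_gain * a for a in approx]
    new_detail = [contrast_gain * soft_threshold(d, threshold) for d in detail]
    return new_approx, new_detail


# Toy usage on a short signal; all parameter values are arbitrary.
approx, detail = haar_forward([4.0, 2.0, 5.0, 5.0])
enhanced_approx, enhanced_detail = enhance_in_wavelet_domain(
    approx, detail, brightness_gain=1.0, contrast_gain=2.0, threshold=0.5
)
enhanced_signal = haar_inverse(enhanced_approx, enhanced_detail)
```

The inverse transform appears here only to inspect the result; the enhancement step itself touches coefficients alone.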

CiteSeerX

Concordia University Research Repository

Robust density modelling using the student's t-distribution for human action recognition

Author: Moghaddam Z, Piccardi M
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/12/2011

The extraction of human features from videos is often inaccurate and prone to outliers. Such outliers can severely affect density modelling when the Gaussian distribution is used as the model, since it is highly sensitive to outliers. The Gaussian distribution is also often used as the base component of graphical models for recognising human actions in videos (hidden Markov models and others), and the presence of outliers can significantly affect the recognition accuracy. In contrast, the Student's t-distribution is more robust to outliers and can be exploited to improve the recognition rate in the presence of abnormal data. In this paper, we present an HMM that uses mixtures of t-distributions as observation probabilities and show that experiments on two well-known datasets (Weizmann, MuHAVi) yield a remarkable improvement in classification accuracy. © 2011 IEEE
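The robustness argument in the abstract above can be made concrete with a small sketch comparing univariate log-densities. It is not the paper's HMM or mixture model; the function names, the choice of 3 degrees of freedom, and the outlier value are assumptions for illustration.

```python
import math


def gaussian_logpdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, dof, loc=0.0, scale=1.0):
    """Log-density of a univariate Student's t with `dof` degrees of freedom."""
    z = (x - loc) / scale
    return (
        math.lgamma((dof + 1.0) / 2.0)
        - math.lgamma(dof / 2.0)
        - 0.5 * math.log(dof * math.pi)
        - math.log(scale)
        - (dof + 1.0) / 2.0 * math.log1p(z * z / dof)
    )


# An observation far from the bulk of the data (hypothetical value).
outlier = 8.0
gaussian_score = gaussian_logpdf(outlier)
student_score = student_t_logpdf(outlier, dof=3.0)
```

Because the t-distribution has heavier tails, `student_score` is much less negative than `gaussian_score`, so a single outlier pulls a t-based density estimate far less than a Gaussian one.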

OPUS - University of Technology Sydney