149 research outputs found

    Computer vision reading on stickers and direct part marking on horticultural products : challenges and possible solutions

    Get PDF
    Traceability of products from production to the consumer has driven technological advancement in product identification. Identification has developed from traditional one-dimensional barcodes (EAN-13, Code 128, etc.) to two-dimensional (2D) barcodes such as QR (Quick Response) and Data Matrix codes. Over the last two decades there has been increased use of Radio Frequency Identification (RFID) and laser-based Direct Part Marking (DPM) for product identification in agriculture. However, agriculture still faces considerable challenges in adopting barcode, RFID and DPM technologies, unlike industry, where these technologies have been very successful. This study was divided into three main objectives. Firstly, the effect of speed, dirt, moisture and bar width on barcode detection was determined both in the laboratory and at a flower-producing company, Brandkamp GmbH. This study developed algorithms for the automated detection of Code 128 barcodes under rough production conditions. Secondly, the effect of low laser marking energy, barcode size, print growth, colour and contrast on decoding 2D Data Matrix codes printed directly on apples was investigated. Three apple varieties (Golden Delicious, Kanzi and Red Jonaprince) were marked with various energy levels and barcode sizes, and the markings were evaluated by image processing with Halcon 11.0.1 (MVTec). Finally, the third objective was to evaluate both the 1D and 2D barcode algorithms. According to the results, increasing the speed and the angle of inclination of the barcode decreased barcode recognition, and increasing the dirt on the barcode surface reduced successful detection. However, the proposed algorithm achieved 100% detection of Code 128 barcodes at the company's production speed (0.15 m/s). Overall, the results from the company showed that the image-based system is a promising candidate for automation in horticultural production systems, and it overcomes the problems of laser barcode readers. The results for apples showed that laser energy, barcode size, print growth, product type, the contrast between the markings and the colour of the product, the inertia of the laser system and the days of storage, singly or in combination, all influence the readability of laser-marked Data Matrix codes on apples. Detection of the Data Matrix code on Kanzi and Red Jonaprince was poor owing to the low contrast between the markings and their skins. The proposed algorithm currently works successfully on Golden Delicious, with 100% detection for 10 days using an energy of 0.108 J mm⁻² and a barcode size of 10 × 10 mm². This shows that there is a future prospect of marking barcodes not only on apples but also on other agricultural products for real-time production.
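    The decoding experiments above rely on the proprietary Halcon library. As a rough open-source stand-in (not the author's pipeline), the sketch below reads both symbologies, pyzbar for Code 128 and pylibdmtx for Data Matrix; the image path is hypothetical.

```python
# Hedged open-source stand-in for the Halcon-based evaluation described
# above: pyzbar decodes the 1D Code 128 stickers, pylibdmtx decodes the
# laser-marked 2D Data Matrix codes. The image path is hypothetical.
import cv2
from pyzbar.pyzbar import decode as decode_1d, ZBarSymbol
from pylibdmtx.pylibdmtx import decode as decode_dm

def read_codes(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    results = []
    # Code 128 symbols, e.g. on plant-tray stickers
    for sym in decode_1d(img, symbols=[ZBarSymbol.CODE128]):
        results.append(("CODE128", sym.data.decode("ascii"), sym.rect))
    # Data Matrix symbols, e.g. marked directly on apple skin
    for sym in decode_dm(img, max_count=1):
        results.append(("DATAMATRIX", sym.data.decode("ascii"), sym.rect))
    return results

print(read_codes("apple_mark.png"))  # hypothetical test image
```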

    Automatic human behaviour anomaly detection in surveillance video

    Get PDF
    This thesis work focusses upon developing the capability to automatically evaluate and detect anomalies in human behaviour from surveillance video. We work with static monocular cameras in crowded urban surveillance scenarios, particularly airports and commercial shopping areas. Typically a person is 100 to 200 pixels high in a scene ranging from 10 to 20 metres in width and depth, populated by 5 to 40 people at any given time. Our procedure evaluates human behaviour unobtrusively to determine outlying behavioural events, flagging abnormal events to the operator. In order to achieve automatic human behaviour anomaly detection we address the challenge of interpreting behaviour within the context of the social and physical environment. We develop and evaluate a process for measuring social connectivity between individuals in a scene using motion and visual attention features. To do this we use mutual information and Euclidean distance to build a social similarity matrix which encodes the social connection strength between any two individuals. We develop a second contextual basis which acts by segmenting a surveillance environment into behaviourally homogeneous subregions, representing high-traffic slow regions and queuing areas. We model the heterogeneous scene in homogeneous subgroups using both contextual elements. We bring the social contextual information, the scene context, the motion, and visual attention features together to demonstrate a novel human behaviour anomaly detection process which finds outlier behaviour from a short sequence of video. The method, Nearest Neighbour Ranked Outlier Clusters (NN-RCO), is based upon modelling behaviour as a time-independent sequence of behaviour events, and can be trained in advance or set upon a single sequence. We find that in a crowded scene the application of mutual-information-based social context prevents self-justifying groups and propagates anomalies through a social network, granting a greater anomaly detection capability. Scene context uniformly improves the detection of anomalies in all the datasets we test upon. We additionally demonstrate that our work is applicable to other data domains, demonstrating upon Automatic Identification Signal data in the maritime domain. Our work is capable of identifying abnormal shipping behaviour, using joint motion dependency as an analogue for social connectivity, and similarly segmenting the shipping environment into homogeneous regions.
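    The social similarity matrix described above combines mutual information with Euclidean distance; a minimal sketch of one plausible reading follows, using numpy and scikit-learn. The heading quantisation, the distance weighting and the equal-length tracks are illustrative assumptions, not the thesis' exact formulation.

```python
# Minimal sketch of a mutual-information social similarity matrix in the
# spirit described above. Quantising headings into symbols, the distance
# weighting, and equal-length tracks are illustrative assumptions.
import numpy as np
from sklearn.metrics import mutual_info_score

def social_similarity(tracks, n_bins=8):
    """tracks: list of (T, 2) position arrays, one per person (same T)."""
    vel = [np.diff(t, axis=0) for t in tracks]
    # quantise heading angle into discrete symbols for MI estimation
    sym = [np.digitize(np.arctan2(v[:, 1], v[:, 0]),
                       np.linspace(-np.pi, np.pi, n_bins)) for v in vel]
    n = len(tracks)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi = mutual_info_score(sym[i], sym[j])
            dist = np.linalg.norm(tracks[i] - tracks[j], axis=1).mean()
            S[i, j] = S[j, i] = mi / (1.0 + dist)  # nearer pairs weigh more
    return S
```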

    Object Detection in 20 Years: A Survey

    Full text link
    Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century (from the 1990s to 2019). A number of topics are covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed-up techniques, and the recent state-of-the-art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection and text detection, and makes an in-depth analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to the IEEE TPAMI for possible publication.
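    Among the metrics the survey covers, the localisation criterion common to essentially all detection benchmarks of this period is intersection over union (IoU) between predicted and ground-truth boxes. A small reference implementation, assuming the common (x1, y1, x2, y2) box convention:

```python
# Intersection-over-union, the box-overlap criterion used by virtually
# all detection benchmarks the survey covers (boxes as x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428...
```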

    Enhancing low-level features with mid-level cues

    Get PDF
    Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency or invariance properties. This thesis belongs to the latter group. Invariance is a particularly relevant aspect if we intend to work with dense features. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, enforcing invariance; dense features must instead be computed on arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work. In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features. We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. We prioritize solutions that are simple, general, and efficient. Our main contributions are as follows: (a) an approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines; (b) a technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion; (c) a technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.
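    The background-suppression idea in contribution (b) weights image measurements by how likely they are to belong to the descriptor's own region. A minimal numpy sketch of that idea on a single HOG-like cell follows; the soft mask source, the cell layout and the unsigned 9-bin histogram are illustrative assumptions, not the thesis' exact descriptor.

```python
# Sketch of the background-suppression idea in contribution (b):
# attenuate gradient magnitudes by a soft segmentation mask before
# aggregating a HOG-like cell histogram, so measurements likely to come
# from another region are downplayed. Mask source, cell layout and the
# unsigned 9-bin histogram are illustrative assumptions.
import numpy as np

def masked_hog_cell(patch, mask, n_bins=9):
    """patch: (H, W) grayscale cell; mask: (H, W) in [0, 1], 1 = own region."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy) * mask          # downplay foreign-region pixels
    ang = np.arctan2(gy, gx) % np.pi       # unsigned gradient orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)
```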

    Automatic Multi-Scale and Multi-Object Pedestrian and Car Detection in Digital Images Based on the Discriminative Generalized Hough Transform and Deep Convolutional Neural Networks

    Get PDF
    Many approaches have been suggested for automatic pedestrian and car detection to cope with the large variability regarding object size, occlusion, background variability, aspect and so forth. Current state-of-the-art deep learning-based frameworks rely either on a proposal generation mechanism (e.g., "Faster R-CNN") or on the inspection of image quadrants/octants (e.g., "YOLO" or "SSD"), which are then further processed with deep convolutional neural networks (CNNs). In this thesis, the Discriminative Generalized Hough Transform (DGHT), which operates on edge images, is analyzed for application to automatic multi-scale and multi-object pedestrian and car detection in 2D digital images. The analysis motivates using the DGHT as an efficient proposal generation mechanism, followed by proposal (bounding box) refinement and proposal acceptance or rejection based on a deep CNN. The impact of the different components of the resulting DGHT object detection pipeline, as well as the amount of DGHT training data, on the detection performance is analyzed in detail. Due to the low false negative rate and the low number of candidates of the DGHT, as well as the high classification accuracy of the CNN, performance competitive with the state of the art in pedestrian and car detection is obtained on the IAIR database with far fewer generated proposals than other proposal-generating algorithms, being outperformed only by YOLOv2 fine-tuned to IAIR cars. Evaluations on further databases (without retraining or adaptation) demonstrate the generalization capability of the DGHT object detection pipeline.
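    The two-stage structure described above, Hough voting for proposals followed by CNN verification, can be sketched schematically. Below, a toy generalized Hough transform casts votes from edge pixels through a displacement table, and a stub scorer stands in for the CNN stage; the displacement table, vote accumulator and constant scorer are illustrative assumptions, not the trained DGHT model.

```python
# Schematic sketch of the proposal-then-verify pipeline described above:
# edge pixels vote for object centres through a displacement table (the
# Hough model), vote peaks become proposals, and a classifier stub stands
# in for the CNN acceptance stage.
import numpy as np

def hough_proposals(edges, displacements, top_k=10):
    """edges: (N, 2) edge pixel coords; displacements: (M, 2) model offsets."""
    votes = {}
    for e in edges:
        for d in displacements:            # each edge pixel votes for centres
            c = (int(e[0] + d[0]), int(e[1] + d[1]))
            votes[c] = votes.get(c, 0) + 1
    peaks = sorted(votes, key=votes.get, reverse=True)[:top_k]
    return [(c, votes[c]) for c in peaks]  # candidate centres with vote mass

def verify(candidates, cnn_score, threshold=0.5):
    # second stage: a CNN would refine and score each candidate here
    return [c for c, _ in candidates if cnn_score(c) >= threshold]

cands = hough_proposals(np.array([[5, 5], [6, 5]]), np.array([[2, 0], [0, 2]]))
print(verify(cands, cnn_score=lambda c: 0.9))  # stub accepts everything
```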

    Learning visual representations with deep neural networks for intelligent transportation systems problems

    Get PDF
    This thesis focuses on two major problems in the area of intelligent transportation systems (ITS): counting vehicles in traffic congestion scenes, and the simultaneous detection and viewpoint estimation of objects in a scene. Regarding the counting problem, this work first focuses on designing deep neural network architectures capable of learning deep multi-scale representations that can accurately estimate object counts via density maps. It also addresses the problem of object scale introduced by the strong perspective typically present in object counting scenes. Furthermore, following the success of deep hourglass networks in object counting, this work proposes a new type of deep hourglass network with self-managed short-circuit (skip) connections. The proposed models are evaluated on the most widely used public databases and achieve results that matched or surpassed the state of the art at the time of publication. For the second part, a comprehensive comparative study of the problem of simultaneous object detection and pose estimation is carried out. The trade-off between localizing an object and estimating its pose is exposed: a detector ideally needs a representation that is invariant to viewpoint, whereas a pose estimator needs one that is discriminative. Therefore, three new deep neural network architectures are proposed in which object detection and pose estimation are progressively decoupled. In addition, the question of whether pose should be expressed as a discrete or a continuous value is addressed. Despite offering similar performance, the results show that continuous approaches are more sensitive to the bias toward the dominant viewpoint of the object category. A detailed comparative analysis is carried out on the two main databases, PASCAL3D+ and ObjectNet3D, and competitive results are achieved with all the proposed models on both datasets.
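    The counting models above regress density maps whose integral is the object count. A minimal sketch of the ground-truth construction behind that formulation, assuming a fixed Gaussian kernel width (real systems adapt it to perspective and scale):

```python
# Sketch of the density-map formulation behind the counting models above:
# each annotated vehicle deposits one unit of mass, a Gaussian spreads it,
# and the count is the integral of the map. The fixed kernel width is a
# simplifying assumption.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    dmap = np.zeros(shape, dtype=float)
    for y, x in points:                    # one unit of mass per object
        dmap[int(y), int(x)] += 1.0
    return gaussian_filter(dmap, sigma)    # spreads mass, preserves the sum

gt = density_map([(30, 40), (32, 44), (80, 90)], shape=(120, 160))
print(round(gt.sum(), 2))                  # ~3.0: the count is the integral
```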

    Structured understanding of visual data using partial information: sparsity, randomness, relationships, and deep networks

    Get PDF
    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, February 2019; advisor: Songhwai Oh. For a deeper understanding of visual data, the relationship between local parts and the global scene has to be carefully examined. Examples of such relationships in vision problems include, but are not limited to, detecting a region of interest in the scene, classifying an image based on limited visual cues, and synthesizing new images conditioned on local or global inputs. In this thesis, we aim to learn this relationship and demonstrate its importance by showing that it is one of the critical keys to addressing the four challenging vision problems mentioned above. For each problem, we construct deep neural networks suited to each task. The first problem considered in the thesis is object detection. It requires not only finding local patches that look like target objects conditioned on the context of the input scene, but also comparing local patches themselves to assign a single detection to each object. To this end, we introduce the individualness of detection candidates as a complement to objectness for object detection. The individualness assigns a single detection to each object out of raw detection candidates given by either object proposals or sliding windows. We show that conventional approaches, such as non-maximum suppression, are sub-optimal since they suppress nearby detections using only detection scores. We use a determinantal point process combined with the individualness to optimally select final detections. It models each detection using its quality and its similarity to other detections based on the individualness. Then, detections with high detection scores and low correlations are selected by measuring their probability using the determinant of a matrix, which is composed of quality terms on the diagonal entries and similarities on the off-diagonal entries. For concreteness, we focus on the pedestrian detection problem, as it is one of the most challenging problems due to frequent occlusions and unpredictable human motions. Experimental results demonstrate that the proposed algorithm works favorably against existing methods, including non-maximum suppression and a quadratic unconstrained binary optimization based method. For the second problem, we classify images based on observations of local patches. More specifically, we consider the problem of estimating the head pose and body orientation of a person from a low-resolution image. Under this setting, it is difficult to reliably extract facial features or detect body parts. We propose a convolutional random projection forest (CRPforest) algorithm for these tasks. A convolutional random projection network (CRPnet) is used at each node of the forest. It maps an input image to a high-dimensional feature space using a rich filter bank. The filter bank is designed to generate sparse responses so that they can be efficiently computed by compressive sensing. A sparse random projection matrix can capture the most essential information contained in the filter bank without using all the filters in it. Therefore, the CRPnet is fast, e.g., it requires 0.04 ms to process an image of 50×50 pixels, due to the small number of convolutions (e.g., 0.01% of a layer of a neural network), at the expense of less than 2% accuracy. The overall forest estimates head and body pose well on benchmark datasets, e.g., over 98% on the HIIT dataset, while requiring only 3.8 ms without using a GPU. Extensive experiments on challenging datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in low-resolution images with noise, occlusion, and motion blur. Then, we shift our attention to image synthesis based on the local-global relationship. Learning how to synthesize and place object instances into an image (semantic map) based on the scene context is a challenging and interesting problem in vision and learning. On one hand, solving this problem requires a joint decision of (a) generating an object mask from a certain class at a plausible scale, location, and shape, and (b) inserting the object instance mask into an existing scene so that the synthesized content is semantically realistic. On the other hand, such a model can synthesize realistic outputs to potentially facilitate numerous image editing and scene parsing tasks. In this work, we propose an end-to-end trainable neural network that can synthesize and insert object instances into an image via a semantic map. The proposed network contains two generative modules that determine where the inserted object should be (i.e., location and scale) and what the object shape (and pose) should look like. The two modules are connected together with a spatial transformation network and jointly trained and optimized in a purely data-driven way. Specifically, we propose a novel network architecture with parallel supervised and unsupervised paths to guarantee diverse results. We show that the proposed network architecture learns the context-aware distribution of the location and shape of object instances to be inserted, and it can generate realistic and statistically meaningful object instances that simultaneously address the where and what sub-problems. As the final topic of the thesis, we introduce a new vision problem: generating an image based on a small number of key local patches without any geometric prior. In this work, key local patches are defined as informative regions of the target object or scene. This is a challenging problem, since it requires generating realistic images and predicting locations of parts at the same time. We construct adversarial networks to tackle this problem. A generator network generates a fake image as well as a mask based on the encoder-decoder framework, while a discriminator network aims to detect fake images. The network is trained with three losses to consider spatial, appearance, and adversarial information. The spatial loss determines whether the locations of predicted parts are correct. Input patches are restored in the output image without much modification due to the appearance loss. The adversarial loss ensures output images are realistic. The proposed network is trained without supervisory signals, since no labels of key parts are required. Experimental results on seven datasets demonstrate that the proposed algorithm performs favorably on challenging objects and scenes.
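    The determinantal selection above scores a subset of detections by the determinant of a kernel with quality on the diagonal and similarity off the diagonal. A greedy numpy sketch of that selection follows; the kernel form, the greedy MAP approximation and the acceptance threshold tau are standard stand-ins and assumptions, not the thesis' exact optimizer.

```python
# Greedy sketch of determinantal point process selection as described
# above: the kernel holds detection quality on the diagonal and pairwise
# similarity off the diagonal, and a detection is kept only if it raises
# the submatrix determinant enough, i.e. it is good and weakly correlated
# with everything already selected. tau is an assumed threshold.
import numpy as np

def dpp_select(quality, similarity, tau=0.1):
    L = similarity * np.sqrt(np.outer(quality, quality))
    np.fill_diagonal(L, quality)               # diagonal: quality terms
    chosen = [int(np.argmax(quality))]         # seed with the best box
    for i in np.argsort(quality)[::-1]:
        if i in chosen:
            continue
        trial = chosen + [int(i)]
        gain = (np.linalg.det(L[np.ix_(trial, trial)])
                / np.linalg.det(L[np.ix_(chosen, chosen)]))
        if gain > tau:                         # high quality, low correlation
            chosen = trial
    return sorted(chosen)

q = np.array([0.9, 0.85, 0.3])                 # detection scores
S = np.array([[1.0, 0.95, 0.1],                # boxes 0 and 1 overlap
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(dpp_select(q, S))                        # [0, 2]: duplicate 1 dropped
```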

    Image-based human pose estimation

    Get PDF

    Vehicle make and model recognition for intelligent transportation monitoring and surveillance.

    Get PDF
    Vehicle Make and Model Recognition (VMMR) has evolved into a significant subject of study due to its importance in numerous Intelligent Transportation Systems (ITS) applications, such as autonomous navigation, traffic analysis, traffic surveillance and security systems. A highly accurate and real-time VMMR system significantly reduces the overhead cost of resources otherwise required. The VMMR problem is a multi-class classification task with a peculiar set of issues and challenges, such as multiplicity and inter- and intra-make ambiguity among various vehicle makes and models, which need to be solved in an efficient and reliable manner to achieve a highly robust VMMR system. In this dissertation, facing the growing importance of make and model recognition of vehicles, we present a VMMR system that provides very high accuracy rates and is robust to several challenges. We demonstrate that the VMMR problem can be addressed by locating discriminative parts where the most significant appearance variations occur in each category, and learning expressive appearance descriptors. Given these insights, we consider two data-driven frameworks: a Multiple-Instance Learning-based (MIL) system using hand-crafted features and an extended application of deep neural networks using MIL. Our approach requires only image-level class labels, and the discriminative parts of each target class are selected in a fully unsupervised manner, without any use of part annotations or segmentation masks, which may be costly to obtain. This advantage makes our system more intelligent, scalable, and applicable to other fine-grained recognition tasks. We constructed a dataset with 291,752 images representing 9,170 different vehicles to validate and evaluate our approach. Experimental results demonstrate that the localization of parts and the distinguishing of their discriminative powers for categorization improve the performance of fine-grained categorization. Extensive experiments conducted using our approaches yield superior results for images that were occluded, under low illumination, from partial camera views, or even from non-frontal views, available in our real-world VMMR dataset. The approaches presented herewith provide a highly accurate VMMR system for real-time applications in realistic environments. We also validate our system with a significant application of VMMR to ITS that involves automated vehicular surveillance. We show that our application can provide law enforcement agencies with efficient tools to search for a specific vehicle type, make, or model, and to track the path of a given vehicle using the positions of multiple cameras.
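    The MIL formulation above treats an image as a bag of candidate parts carrying only an image-level label; max-pooling over per-part scores lets the most discriminative part speak for the bag, which is how part selection can stay unsupervised. A minimal numpy sketch, with the shapes and the linear scorer as illustrative assumptions:

```python
# Sketch of the multiple-instance view described above: an image (bag)
# is a set of candidate parts (instances), only the bag carries a make/
# model label, and max-pooling over instance scores lets the most
# discriminative part speak for the bag.
import numpy as np

def bag_scores(part_features, W):
    """part_features: (n_parts, d) descriptors from one image;
    W: (n_classes, d) per-class part scorers."""
    inst = part_features @ W.T                    # score every part per class
    return inst.max(axis=0), inst.argmax(axis=0)  # bag score + chosen part

rng = np.random.default_rng(0)
scores, parts = bag_scores(rng.normal(size=(12, 64)), rng.normal(size=(5, 64)))
print(scores.shape, parts)                        # (5,) and the parts that fired
```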