
    Object Detection in 20 Years: A Survey

    Full text link
    Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development over the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetic under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold-weapon era. This paper extensively reviews 400+ papers on object detection in the light of its technical evolution, spanning over a quarter-century (from the 1990s to 2019). A number of topics are covered, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of detection systems, speed-up techniques, and recent state-of-the-art detection methods. The paper also reviews some important detection applications, such as pedestrian detection, face detection, and text detection, and makes an in-depth analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to the IEEE TPAMI for possible publication.
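    One of the evaluation ingredients the survey covers, intersection over union (IoU), underlies most detection metrics. Below is a minimal sketch of the standard IoU computation, not code from the survey; the box format and the 0.5 threshold are the usual conventions, included here only for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is typically counted as a true positive when IoU >= 0.5
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143
```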

    Enhanced contextual based deep learning model for niqab face detection

    Get PDF
    Human face detection is one of the most investigated areas in computer vision, and it plays a fundamental role as the first step for all face processing and facial analysis systems, such as face recognition, security monitoring, and facial emotion recognition. Despite the great impact of Deep Learning Convolutional Neural Network (DL-CNN) approaches on solving many unconstrained face detection problems in recent years, the low performance of current face detection models when detecting highly occluded faces remains a challenging problem worthy of investigation. This challenge tends to be greater when the occlusion covers most of the face, which dramatically reduces the number of learned representative features that the Feature Extraction Network (FEN) uses to discriminate face parts from the background. The lack of an occluded-face dataset with sufficient images of heavily occluded faces is another challenge that degrades performance. Therefore, this research addressed the issue of low performance and developed an enhanced occluded face detection model for detecting and localizing heavily occluded faces. First, a highly occluded faces dataset was developed to provide sufficient training examples, incorporating a contextual-based annotation technique to maximize the amount of salient facial features. Second, using the training half of the dataset, a deep-learning CNN Occluded Face Detection (OFD) model with an enhanced feature extraction and detection network was proposed and trained. Common deep learning techniques, namely transfer learning and data augmentation, were used to speed up the training process. False-positive reduction based on the max-in-out strategy was adopted to reduce the high false-positive rate. The proposed model was evaluated and benchmarked against five current face detection models on the dataset. The obtained results show that OFD achieved improved performance in terms of accuracy (37% on average) and average precision (16.6%) compared to current face detection models. The findings revealed that the proposed model outperformed current face detection models in detecting highly occluded faces. Based on the findings, an improved contextual-based labeling technique has been successfully developed to address the insufficient functionality of current labeling techniques. Faculty of Engineering - School of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:150777. Keywords: Deep Learning Convolutional Neural Network (DL-CNN), Feature Extraction Network (FEN), Occluded Face Detection (OFD) model.
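    The max-in-out strategy mentioned above is generally described in the face-detection literature as predicting several classification scores per anchor and keeping only the maximum of each group, which suppresses spurious background responses. The following is a hedged sketch of that general idea; the tensor shapes, group sizes, and function names are illustrative assumptions, not taken from the thesis.

```python
import torch

def max_in_out_scores(logits, n_pos=1, n_neg=3):
    """Max-in-out scoring as commonly used to curb false positives in
    anchor-based face detectors: each anchor predicts several face and
    background logits, and the maximum of each group is kept.

    logits: (num_anchors, n_pos + n_neg) raw scores per anchor.
    Returns (num_anchors, 2) logits ordered as [face, background].
    """
    face_logit = logits[:, :n_pos].max(dim=1).values
    bg_logit = logits[:, n_pos:].max(dim=1).values
    return torch.stack([face_logit, bg_logit], dim=1)

# Example: 4 anchors, each predicting 1 face logit and 3 background logits
scores = max_in_out_scores(torch.randn(4, 4))
probs = torch.softmax(scores, dim=1)  # per-anchor face vs. background probability
```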

    Vehicle Detection and Tracking Techniques: A Concise Review

    Get PDF
    Vehicle detection and tracking applications play an important role in civilian and military settings, such as highway traffic surveillance, control and management, and urban traffic planning. Vehicle detection on roads is used for vehicle tracking, counting, estimating the average speed of each individual vehicle, traffic analysis, and vehicle categorization, and may be implemented under varying environmental conditions. In this review, we present a concise overview of the image processing methods and analysis tools used in building the aforementioned traffic surveillance applications. More precisely, and in contrast with other reviews, we classify the processing methods into three categories to explain the traffic systems more clearly.
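    As an illustration of the kind of classical image-processing pipeline such reviews classify (this is not a method proposed by the review itself), a minimal background-subtraction vehicle detector with OpenCV is sketched below; the video path, morphology kernel, and blob-size filter are illustrative values.

```python
import cv2

# Background-subtraction based moving-vehicle detection, a common first
# stage in classical traffic-surveillance pipelines (OpenCV 4.x API).
cap = cv2.VideoCapture("traffic.mp4")                 # illustrative input path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                    # foreground mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:                  # ignore small blobs (noise)
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("vehicles", frame)
    if cv2.waitKey(1) == 27:                          # Esc to quit
        break
cap.release()
```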

    Extraction of Vehicle Groups in Airborne Lidar Point Clouds with Two-Level Point Processes

    Get PDF
    In this paper we present a new object-based hierarchical model for the joint probabilistic extraction of vehicles and groups of corresponding vehicles, called traffic segments, in airborne Lidar point clouds collected from dense urban areas. First, the 3-D point set is classified into terrain, vehicle, roof, vegetation and clutter classes. Then the points, with their class labels and echo strength (i.e. intensity) values, are projected to the ground. In the obtained 2-D class and intensity maps we approximate the top-view projections of vehicles by rectangles. Since our task is simultaneously to extract the rectangle population which describes the position, size and orientation of the vehicles and to group the vehicles into traffic segments, we propose a hierarchical Two-Level Marked Point Process (L2MPP) model for the problem. The output vehicle and traffic segment configurations are extracted by an iterative stochastic optimization algorithm. We have tested the proposed method with real data from a discrete-return Lidar sensor providing up to four range measurements for each laser pulse. Using manually annotated ground truth on a data set containing 1009 vehicles, we provide quantitative evaluation results showing that the L2MPP model surpasses two earlier grid-based approaches, a 3-D point-cloud-based process and a single-layer MPP solution. The accuracy of the proposed method measured in F-rate is 97% at object level, 83% at pixel level and 95% at group level.
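    The first two steps, classifying the 3-D points and projecting class labels and echo intensities onto a ground-plane grid, could look roughly like the sketch below. The grid resolution, the priority rule, and all names are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def project_to_ground(points, labels, intensity, cell=0.2):
    """Project classified Lidar points onto 2-D class and intensity maps.

    points:    (N, 3) x, y, z coordinates
    labels:    (N,) integer class id (e.g. terrain / vehicle / roof / ...)
    intensity: (N,) echo strength per point
    cell:      grid cell size in metres (illustrative value)
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    ij = ((xy - origin) / cell).astype(int)          # grid indices per point
    h, w = ij.max(axis=0) + 1
    class_map = np.zeros((h, w), dtype=int)
    intensity_map = np.zeros((h, w), dtype=float)
    for (i, j), lab, inten in zip(ij, labels, intensity):
        class_map[i, j] = max(class_map[i, j], lab)  # simple priority rule per cell
        intensity_map[i, j] = max(intensity_map[i, j], inten)
    return class_map, intensity_map
```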

    A perception pipeline exploiting trademark databases for service robots

    Get PDF

    Predictive World Models from Real-World Partial Observations

    Full text link
    Cognitive scientists believe adaptable intelligent agents like humans perform reasoning through learned causal mental simulations of agents and environments. The problem of learning such simulations is called predictive world modeling. Recently, reinforcement learning (RL) agents leveraging world models have achieved state-of-the-art performance in game environments. However, understanding how to apply the world modeling approach in complex real-world environments relevant to mobile robots remains an open question. In this paper, we present a framework for learning a probabilistic predictive world model for real-world road environments. We implement the model using a hierarchical VAE (HVAE) capable of predicting a diverse set of fully observed plausible worlds from accumulated sensor observations. While prior HVAE methods require complete states as ground truth for learning, we present a novel sequential training method that allows HVAEs to learn to predict complete states from partially observed states only. We experimentally demonstrate accurate spatial structure prediction of deterministic regions, achieving 96.21 IoU, and close the gap to perfect prediction by 62% for stochastic regions using the best prediction. By extending HVAEs to cases where complete ground truth states do not exist, we facilitate continual learning of spatial prediction as a step towards realizing explainable and comprehensive predictive world models for real-world mobile robotics applications. Code is available at https://github.com/robin-karlsson0/predictive-world-models. Comment: Accepted for IEEE MOST 202
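    A hedged sketch of the kind of partially supervised reconstruction objective a sequential training scheme like this might rely on, penalizing predictions only where the accumulated target was actually observed, is shown below. The names, tensor shapes, and the choice of loss are assumptions; the authors' actual implementation is in the linked repository.

```python
import torch

def masked_reconstruction_loss(pred, target, observed_mask):
    """Reconstruction loss restricted to observed regions of a partial target.

    pred:          (B, 1, H, W) predicted occupancy probabilities in [0, 1]
    target:        (B, 1, H, W) accumulated (still partial) pseudo ground truth
    observed_mask: (B, 1, H, W) 1 where the target cell was actually observed
    """
    per_cell = torch.nn.functional.binary_cross_entropy(pred, target, reduction="none")
    # Average only over observed cells; unobserved cells contribute no gradient
    return (per_cell * observed_mask).sum() / observed_mask.sum().clamp(min=1)

# Illustrative usage with random tensors
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
mask = (torch.rand(2, 1, 64, 64) > 0.3).float()
loss = masked_reconstruction_loss(pred, target, mask)
```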

    Extracting structured information from 2D images

    Get PDF
    Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly exploit the structure of the problem within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems. We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about the neuron positions, to guide its focus across characters sequentially. It beats the previous state of the art by a significant margin. We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for the upsampling of high-dimensional encodings. We show that these better suit the structured per-pixel output across the board of all tasks. Finally, we turn to the problem of urban scene understanding. There is an elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple, cuboid-like geometry of buildings to create a single unified representation from multiple views. To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes and show systematic improvements when compared to a broad range of strong CNN-based baselines.
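    For the dense-labeling part, a generic decoder building block of the kind such decoder-design comparisons typically include (bilinear upsampling followed by convolution) is sketched below; the channel sizes and names are illustrative and this is not the thesis' proposed layer.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Generic decoder block: bilinear upsampling followed by a 3x3 convolution,
    a common baseline when comparing decoder designs for dense prediction."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = nn.functional.interpolate(x, scale_factor=2, mode="bilinear",
                                      align_corners=False)
        return self.conv(x)

# 2048-channel encoding at 1/32 resolution -> 256 channels at 1/16 resolution
feat = torch.randn(1, 2048, 8, 8)
out = UpsampleBlock(2048, 256)(feat)   # shape (1, 256, 16, 16)
```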

    ํฌ๊ธฐ, ํ์ƒ‰ ๋ฐ ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•œ ์–ด๋ ค์šด ์กฐ๊ฑด ํ•˜ ๋ฌผ์ฒด ๊ฒ€์ถœ ๊ฐœ์„ 

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Gunhee Kim.
    Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality. Due to its important role in computer vision, object detection has been studied for decades and has developed drastically with the emergence of deep neural networks. Despite the recent rapid advancement, however, the performance of many detection models is limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems caused by these conditions. We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades the detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has the advantage in speed. More specifically, we introduce an additional post-processing module that refines initial prediction results based on reliable classification of a person's body parts and of image grids. We then study how to better detect small objects for general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, the performance of two-stage models on small objects is still much lower than human-level performance. This is mainly due to the lack of information in the features of a small region of interest. In this thesis, we propose a feature-level super-resolution method based on two-stage object detection models to improve the performance of detecting small objects. More specifically, by properly pairing input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve the detection performance on small objects. Lastly, we address the object detection problem under the setting of weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding box annotations, one dominant approach to WSOL uses class activation maps (CAM), which are generated by training for classification and used to estimate the location of objects.
    However, since a classification model is trained to concentrate on the discriminative features, the localization results are often limited to a small, discriminative region of the object. To resolve this problem, we propose methods that properly utilize the information in class activation maps. Our proposed methods significantly improved the performance of base models on each benchmark dataset and achieved state-of-the-art performance in some settings. Thanks to their flexibility, the proposed methods are also expected to be applicable to more recent models, bringing additional performance improvements.
    Contents:
    1. Introduction
       1.1. Contributions
       1.2. Thesis Organization
    2. Related Work
       2.1. General Methods
       2.2. Methods for Occluded Objects and Hard Negatives
       2.3. Methods for Small Objects
       2.4. Methods for Weakly Labeled Objects
    3. Part and Grid Classification Based Post-Refinement for Occluded Objects and Hard Negatives
       3.1. Overview
       3.2. Our Approach (3.2.1. A Unified View of Output Tensors; 3.2.2. Refinement for Occlusion Handling; 3.2.3. Refinement for Hard Negative Handling)
       3.3. Experiment Settings (3.3.1. Datasets; 3.3.2. Configuration Details)
       3.4. Experiment Results (3.4.1. Quantitative Results; 3.4.2. Ablation Experiments; 3.4.3. Memory and Computation Time Analysis; 3.4.4. Qualitative Results)
       3.5. Conclusion
    4. Self-Supervised Feature Super-Resolution for Small Objects
       4.1. Overview
       4.2. Mismatch of Relative Receptive Fields
       4.3. Our Approach (4.3.1. Super-Resolution Target Extractor; 4.3.2. Super-Resolution Feature Generator and Discriminator; 4.3.3. Training; 4.3.4. Inference)
       4.4. Experiment Settings (4.4.1. Datasets; 4.4.2. Configuration Details)
       4.5. Experiment Results (4.5.1. Quantitative Results; 4.5.2. Ablation Experiments; 4.5.3. Qualitative Results)
       4.6. Conclusion
    5. Rectified Class Activation Mapping for Weakly Labeled Objects
       5.1. Overview
       5.2. Our Approach (5.2.1. GAP-based CAM Localization; 5.2.2. Thresholded Average Pooling; 5.2.3. Negative Weight Clamping; 5.2.4. Percentile as a Thresholding Standard)
       5.3. Experiment Settings (5.3.1. Datasets; 5.3.2. Configuration Details)
       5.4. Experiment Results (5.4.1. Quantitative Results; 5.4.2. Ablation Experiments; 5.4.3. Qualitative Results)
       5.5. Conclusion
    6. Conclusion
       6.1. Summary
       6.2. Future Works
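    A minimal sketch of the standard GAP-based CAM localization that chapter 5 starts from, using a percentile-based threshold as one of the ingredients the thesis rectifies, is given below; the array names and the threshold value are illustrative assumptions, not the thesis' rectified method.

```python
import numpy as np

def cam_localization(features, class_weights, percentile=80):
    """Standard CAM: weight the final conv feature maps by the classifier
    weights of the predicted class, then threshold to a binary object mask.

    features:      (C, H, W) activations of the final convolutional layer
    class_weights: (C,) fully connected weights for the predicted class
    percentile:    cells above this percentile of the map are kept
    """
    cam = np.tensordot(class_weights, features, axes=1)    # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                                # keep positive evidence
    cam = cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
    return cam >= np.percentile(cam, percentile)            # binary localization mask

# Illustrative usage with random activations and weights
mask = cam_localization(np.random.rand(512, 14, 14), np.random.rand(512))
```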