
    Object Detection in 20 Years: A Survey

    Full text link
    Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development over the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetic under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold-weapon era. This paper extensively reviews 400+ papers on object detection in the light of its technical evolution, spanning over a quarter-century (from the 1990s to 2019). A number of topics are covered, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of detection systems, speed-up techniques, and recent state-of-the-art detection methods. The paper also reviews some important detection applications, such as pedestrian detection, face detection, and text detection, and makes an in-depth analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to the IEEE TPAMI for possible publication.
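    One of the evaluation ingredients the survey covers, intersection over union (IoU), underlies most detection metrics. Below is a minimal sketch of the standard IoU computation, not code from the survey; the box format and the 0.5 threshold are the usual conventions, included here only for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is typically counted as a true positive when IoU >= 0.5
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143
```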

    Enhanced contextual based deep learning model for niqab face detection

    Get PDF
    Human face detection is one of the most investigated areas in computer vision, and it plays a fundamental role as the first step for all face processing and facial analysis systems, such as face recognition, security monitoring, and facial emotion recognition. Despite the great impact of Deep Learning Convolutional Neural Network (DL-CNN) approaches on solving many unconstrained face detection problems in recent years, the low performance of current face detection models when detecting highly occluded faces remains a challenging problem worthy of investigation. This challenge tends to be greater when the occlusion covers most of the face, which dramatically reduces the number of learned representative features that the Feature Extraction Network (FEN) uses to discriminate face parts from the background. The lack of an occluded-face dataset with sufficient images of heavily occluded faces is another challenge that degrades performance. Therefore, this research addressed the issue of low performance and developed an enhanced occluded face detection model for detecting and localizing heavily occluded faces. First, a highly occluded faces dataset was developed to provide sufficient training examples, incorporating a contextual-based annotation technique to maximize the amount of salient facial features. Second, using the training half of the dataset, a deep-learning CNN Occluded Face Detection (OFD) model with an enhanced feature extraction and detection network was proposed and trained. Common deep learning techniques, namely transfer learning and data augmentation, were used to speed up the training process. False-positive reduction based on the max-in-out strategy was adopted to reduce the high false-positive rate. The proposed model was evaluated and benchmarked against five current face detection models on the dataset. The obtained results show that OFD achieved improved performance in terms of accuracy (37% on average) and average precision (16.6%) compared to current face detection models. The findings revealed that the proposed model outperformed current face detection models in detecting highly occluded faces. Based on the findings, an improved contextual-based labeling technique has been successfully developed to address the insufficient functionality of current labeling techniques. Faculty of Engineering - School of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:150777. Keywords: Deep Learning Convolutional Neural Network (DL-CNN), Feature Extraction Network (FEN), Occluded Face Detection (OFD) model.
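    The max-in-out strategy mentioned above is generally described in the face-detection literature as predicting several classification scores per anchor and keeping only the maximum of each group, which suppresses spurious background responses. The following is a hedged sketch of that general idea; the tensor shapes, group sizes, and function names are illustrative assumptions, not taken from the thesis.

```python
import torch

def max_in_out_scores(logits, n_pos=1, n_neg=3):
    """Max-in-out scoring as commonly used to curb false positives in
    anchor-based face detectors: each anchor predicts several face and
    background logits, and the maximum of each group is kept.

    logits: (num_anchors, n_pos + n_neg) raw scores per anchor.
    Returns (num_anchors, 2) logits ordered as [face, background].
    """
    face_logit = logits[:, :n_pos].max(dim=1).values
    bg_logit = logits[:, n_pos:].max(dim=1).values
    return torch.stack([face_logit, bg_logit], dim=1)

# Example: 4 anchors, each predicting 1 face logit and 3 background logits
scores = max_in_out_scores(torch.randn(4, 4))
probs = torch.softmax(scores, dim=1)  # per-anchor face vs. background probability
```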

    Vehicle Detection and Tracking Techniques: A Concise Review

    Get PDF
    Vehicle detection and tracking applications play an important role in civilian and military settings, such as highway traffic surveillance, control and management, and urban traffic planning. Vehicle detection on roads is used for vehicle tracking, counting, estimating the average speed of each individual vehicle, traffic analysis, and vehicle categorization, and may be implemented under varying environmental conditions. In this review, we present a concise overview of the image processing methods and analysis tools used in building the aforementioned traffic surveillance applications. More precisely, and in contrast with other reviews, we classify the processing methods into three categories to explain the traffic systems more clearly.
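    As an illustration of the kind of classical image-processing pipeline such reviews classify (this is not a method proposed by the review itself), a minimal background-subtraction vehicle detector with OpenCV is sketched below; the video path, morphology kernel, and blob-size filter are illustrative values.

```python
import cv2

# Background-subtraction based moving-vehicle detection, a common first
# stage in classical traffic-surveillance pipelines (OpenCV 4.x API).
cap = cv2.VideoCapture("traffic.mp4")                 # illustrative input path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                    # foreground mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:                  # ignore small blobs (noise)
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("vehicles", frame)
    if cv2.waitKey(1) == 27:                          # Esc to quit
        break
cap.release()
```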

    Extraction of Vehicle Groups in Airborne Lidar Point Clouds with Two-Level Point Processes

    Get PDF
    In this paper we present a new object-based hierarchical model for the joint probabilistic extraction of vehicles and groups of corresponding vehicles, called traffic segments, in airborne Lidar point clouds collected from dense urban areas. First, the 3-D point set is classified into terrain, vehicle, roof, vegetation and clutter classes. Then the points, with their class labels and echo strength (i.e. intensity) values, are projected to the ground. In the obtained 2-D class and intensity maps we approximate the top-view projections of vehicles by rectangles. Since our task is simultaneously to extract the rectangle population which describes the position, size and orientation of the vehicles and to group the vehicles into traffic segments, we propose a hierarchical Two-Level Marked Point Process (L2MPP) model for the problem. The output vehicle and traffic segment configurations are extracted by an iterative stochastic optimization algorithm. We have tested the proposed method with real data from a discrete-return Lidar sensor providing up to four range measurements for each laser pulse. Using manually annotated ground truth on a data set containing 1009 vehicles, we provide quantitative evaluation results showing that the L2MPP model surpasses two earlier grid-based approaches, a 3-D point-cloud-based process and a single-layer MPP solution. The accuracy of the proposed method measured in F-rate is 97% at object level, 83% at pixel level and 95% at group level.
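    The first two steps, classifying the 3-D points and projecting class labels and echo intensities onto a ground-plane grid, could look roughly like the sketch below. The grid resolution, the priority rule, and all names are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def project_to_ground(points, labels, intensity, cell=0.2):
    """Project classified Lidar points onto 2-D class and intensity maps.

    points:    (N, 3) x, y, z coordinates
    labels:    (N,) integer class id (e.g. terrain / vehicle / roof / ...)
    intensity: (N,) echo strength per point
    cell:      grid cell size in metres (illustrative value)
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    ij = ((xy - origin) / cell).astype(int)          # grid indices per point
    h, w = ij.max(axis=0) + 1
    class_map = np.zeros((h, w), dtype=int)
    intensity_map = np.zeros((h, w), dtype=float)
    for (i, j), lab, inten in zip(ij, labels, intensity):
        class_map[i, j] = max(class_map[i, j], lab)  # simple priority rule per cell
        intensity_map[i, j] = max(intensity_map[i, j], inten)
    return class_map, intensity_map
```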

    A perception pipeline exploiting trademark databases for service robots

    Get PDF

    Predictive World Models from Real-World Partial Observations

    Full text link
    Cognitive scientists believe adaptable intelligent agents like humans perform reasoning through learned causal mental simulations of agents and environments. The problem of learning such simulations is called predictive world modeling. Recently, reinforcement learning (RL) agents leveraging world models have achieved state-of-the-art performance in game environments. However, understanding how to apply the world modeling approach in complex real-world environments relevant to mobile robots remains an open question. In this paper, we present a framework for learning a probabilistic predictive world model for real-world road environments. We implement the model using a hierarchical VAE (HVAE) capable of predicting a diverse set of fully observed plausible worlds from accumulated sensor observations. While prior HVAE methods require complete states as ground truth for learning, we present a novel sequential training method that allows HVAEs to learn to predict complete states from partially observed states only. We experimentally demonstrate accurate spatial structure prediction of deterministic regions, achieving 96.21 IoU, and close the gap to perfect prediction by 62% for stochastic regions using the best prediction. By extending HVAEs to cases where complete ground truth states do not exist, we facilitate continual learning of spatial prediction as a step towards realizing explainable and comprehensive predictive world models for real-world mobile robotics applications. Code is available at https://github.com/robin-karlsson0/predictive-world-models. Comment: Accepted for IEEE MOST 202
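    A hedged sketch of the kind of partially supervised reconstruction objective a sequential training scheme like this might rely on, penalizing predictions only where the accumulated target was actually observed, is shown below. The names, tensor shapes, and the choice of loss are assumptions; the authors' actual implementation is in the linked repository.

```python
import torch

def masked_reconstruction_loss(pred, target, observed_mask):
    """Reconstruction loss restricted to observed regions of a partial target.

    pred:          (B, 1, H, W) predicted occupancy probabilities in [0, 1]
    target:        (B, 1, H, W) accumulated (still partial) pseudo ground truth
    observed_mask: (B, 1, H, W) 1 where the target cell was actually observed
    """
    per_cell = torch.nn.functional.binary_cross_entropy(pred, target, reduction="none")
    # Average only over observed cells; unobserved cells contribute no gradient
    return (per_cell * observed_mask).sum() / observed_mask.sum().clamp(min=1)

# Illustrative usage with random tensors
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
mask = (torch.rand(2, 1, 64, 64) > 0.3).float()
loss = masked_reconstruction_loss(pred, target, mask)
```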

    Extracting structured information from 2D images

    Get PDF
    Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly exploit the structure of the problem within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems. We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about the neuron positions, to guide its focus across characters sequentially. It beats the previous state of the art by a significant margin. We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for the upsampling of high-dimensional encodings. We show that these better suit the structured per-pixel output across the board of all tasks. Finally, we turn to the problem of urban scene understanding. There is an elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple, cuboid-like geometry of buildings to create a single unified representation from multiple views. To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes and show systematic improvements when compared to a broad range of strong CNN-based baselines.
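    For the dense-labeling part, a generic decoder building block of the kind such decoder-design comparisons typically include (bilinear upsampling followed by convolution) is sketched below; the channel sizes and names are illustrative and this is not the thesis' proposed layer.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Generic decoder block: bilinear upsampling followed by a 3x3 convolution,
    a common baseline when comparing decoder designs for dense prediction."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = nn.functional.interpolate(x, scale_factor=2, mode="bilinear",
                                      align_corners=False)
        return self.conv(x)

# 2048-channel encoding at 1/32 resolution -> 256 channels at 1/16 resolution
feat = torch.randn(1, 2048, 8, 8)
out = UpsampleBlock(2048, 256)(feat)   # shape (1, 256, 16, 16)
```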

    ํฌ๊ธฐ, ํ์ƒ‰ ๋ฐ ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•œ ์–ด๋ ค์šด ์กฐ๊ฑด ํ•˜ ๋ฌผ์ฒด ๊ฒ€์ถœ ๊ฐœ์„ 

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Gunhee Kim.
    Object detection is one of the most essential and fundamental fields in computer vision. It is the foundation not only of high-level vision tasks such as instance segmentation, object tracking, image captioning, scene understanding, and action recognition, but also of real-world applications such as video surveillance, self-driving cars, robot vision, and augmented reality. Due to its important role in computer vision, object detection has been studied for decades and has developed drastically with the emergence of deep neural networks. Despite the recent rapid advancement, however, the performance of many detection models is limited under certain conditions. In this thesis, we examine three challenging conditions that hinder the robust application of object detection models and propose novel approaches to resolve the problems caused by these conditions. We first investigate how to improve the detection of occluded objects and hard negatives in the domain of pedestrian detection. Occluded pedestrians are often recognized as background, whereas hard negative examples such as vertical objects are mistaken for pedestrians, which significantly degrades the detection performance. Since pedestrian detection often requires real-time processing, we propose a method that alleviates both problems by improving a single-stage detection model, which has the advantage in speed. More specifically, we introduce an additional post-processing module that refines initial prediction results based on reliable classification of a person's body parts and of image grids. We then study how to better detect small objects for general object classes. Although two-stage object detection models significantly outperform single-stage models in terms of accuracy, the performance of two-stage models on small objects is still much lower than human-level performance. This is mainly due to the lack of information in the features of a small region of interest. In this thesis, we propose a feature-level super-resolution method based on two-stage object detection models to improve the performance of detecting small objects. More specifically, by properly pairing input and target features for super-resolution, we stabilize the training process and, as a result, significantly improve the detection performance on small objects. Lastly, we address the object detection problem under the setting of weak supervision. In particular, weakly supervised object localization (WSOL) assumes there is only one object per image and provides only class labels for training. In the absence of bounding box annotations, one dominant approach to WSOL uses class activation maps (CAM), which are generated by training for classification and used to estimate the location of objects.
    However, since a classification model is trained to concentrate on the discriminative features, the localization results are often limited to a small, discriminative region of the object. To resolve this problem, we propose methods that properly utilize the information in class activation maps. Our proposed methods significantly improved the performance of base models on each benchmark dataset and achieved state-of-the-art performance in some settings. Thanks to their flexibility, the proposed methods are also expected to be applicable to more recent models, bringing additional performance improvements.
    Contents:
    1. Introduction
       1.1. Contributions
       1.2. Thesis Organization
    2. Related Work
       2.1. General Methods
       2.2. Methods for Occluded Objects and Hard Negatives
       2.3. Methods for Small Objects
       2.4. Methods for Weakly Labeled Objects
    3. Part and Grid Classification Based Post-Refinement for Occluded Objects and Hard Negatives
       3.1. Overview
       3.2. Our Approach (3.2.1. A Unified View of Output Tensors; 3.2.2. Refinement for Occlusion Handling; 3.2.3. Refinement for Hard Negative Handling)
       3.3. Experiment Settings (3.3.1. Datasets; 3.3.2. Configuration Details)
       3.4. Experiment Results (3.4.1. Quantitative Results; 3.4.2. Ablation Experiments; 3.4.3. Memory and Computation Time Analysis; 3.4.4. Qualitative Results)
       3.5. Conclusion
    4. Self-Supervised Feature Super-Resolution for Small Objects
       4.1. Overview
       4.2. Mismatch of Relative Receptive Fields
       4.3. Our Approach (4.3.1. Super-Resolution Target Extractor; 4.3.2. Super-Resolution Feature Generator and Discriminator; 4.3.3. Training; 4.3.4. Inference)
       4.4. Experiment Settings (4.4.1. Datasets; 4.4.2. Configuration Details)
       4.5. Experiment Results (4.5.1. Quantitative Results; 4.5.2. Ablation Experiments; 4.5.3. Qualitative Results)
       4.6. Conclusion
    5. Rectified Class Activation Mapping for Weakly Labeled Objects
       5.1. Overview
       5.2. Our Approach (5.2.1. GAP-based CAM Localization; 5.2.2. Thresholded Average Pooling; 5.2.3. Negative Weight Clamping; 5.2.4. Percentile as a Thresholding Standard)
       5.3. Experiment Settings (5.3.1. Datasets; 5.3.2. Configuration Details)
       5.4. Experiment Results (5.4.1. Quantitative Results; 5.4.2. Ablation Experiments; 5.4.3. Qualitative Results)
       5.5. Conclusion
    6. Conclusion
       6.1. Summary
       6.2. Future Works
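    A minimal sketch of the standard GAP-based CAM localization that chapter 5 starts from, using a percentile-based threshold as one of the ingredients the thesis rectifies, is given below; the array names and the threshold value are illustrative assumptions, not the thesis' rectified method.

```python
import numpy as np

def cam_localization(features, class_weights, percentile=80):
    """Standard CAM: weight the final conv feature maps by the classifier
    weights of the predicted class, then threshold to a binary object mask.

    features:      (C, H, W) activations of the final convolutional layer
    class_weights: (C,) fully connected weights for the predicted class
    percentile:    cells above this percentile of the map are kept
    """
    cam = np.tensordot(class_weights, features, axes=1)    # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                                # keep positive evidence
    cam = cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
    return cam >= np.percentile(cam, percentile)            # binary localization mask

# Illustrative usage with random activations and weights
mask = cam_localization(np.random.rand(512, 14, 14), np.random.rand(512))
```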