8 research outputs found

    Activity Driven Weakly Supervised Object Detection

    Full text link
    Weakly supervised object detection aims to reduce the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class, not the object bounding box. In our work, we leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the image/video can provide strong cues about the location of the associated object. We learn a spatial prior for the object that depends on the action (e.g. the "ball" is close to the "leg of the person" in "kicking ball"), and incorporate this prior to simultaneously train a joint object detection and action classification model. We conducted experiments on both video and image datasets to evaluate our weakly supervised object detection model. Our approach outperforms the current state-of-the-art (SOTA) method by more than 6% in mAP on the Charades video dataset. Comment: CVPR'19 camera-ready
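
    To make the spatial-prior idea concrete, here is a minimal sketch (not the authors' code) of how an action-conditioned prior over object locations could re-rank weakly labelled proposals; the Gaussian parameterization, the SPATIAL_PRIOR table, and all names are illustrative assumptions:

        # A minimal sketch of an action-conditioned spatial prior combined
        # with a class score to rank weakly labelled proposals.
        import numpy as np

        # Hypothetical learned prior: for each action, the mean/std of the
        # object centre relative to a person keypoint (e.g. "kick ball" ->
        # near the foot), in normalised image coordinates.
        SPATIAL_PRIOR = {
            "kick_ball": {"mean": np.array([0.0, 0.9]), "std": np.array([0.2, 0.15])},
            "hold_cup":  {"mean": np.array([0.3, 0.4]), "std": np.array([0.15, 0.15])},
        }

        def prior_score(action, proposal_centre, person_centre):
            """Gaussian density of the proposal centre under the action's prior,
            measured relative to the person."""
            p = SPATIAL_PRIOR[action]
            d = proposal_centre - person_centre - p["mean"]
            return float(np.exp(-0.5 * np.sum((d / p["std"]) ** 2)))

        def rank_proposals(action, proposals, class_scores, person_centre):
            """Re-rank object proposals by classifier score * spatial prior."""
            scores = [s * prior_score(action, c, person_centre)
                      for c, s in zip(proposals, class_scores)]
            return int(np.argmax(scores))

        # Toy usage: two candidate boxes; the prior favours the one near the foot.
        proposals = [np.array([0.05, 0.85]), np.array([0.5, 0.1])]
        best = rank_proposals("kick_ball", proposals, [0.6, 0.7], np.array([0.0, 0.0]))
        print("selected proposal:", best)  # prior pulls selection toward index 0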

    SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervision and Dynamic Self-Training

    Full text link
    Although a polygon is a more accurate representation than an upright bounding box for text detection, polygon annotations are extremely expensive and challenging to obtain. Unlike existing works that rely on fully supervised training with polygon annotations, we propose a novel text detection system termed SelfText Beyond Polygon (SBP), comprising Bounding Box Supervision (BBS) and Dynamic Self-Training (DST), which trains a polygon-based text detector with only a limited set of upright bounding box annotations. For BBS, we first use synthetic data with character-level annotations to train a Skeleton Attention Segmentation Network (SASN). The box-level annotations are then used to guide the generation of high-quality polygon-like pseudo labels, which can be used to train any detector. In this way, our method achieves the same performance as text detectors trained with polygon annotations (i.e., both reach an 85.0% F-score for PSENet on ICDAR2015). For DST, by dynamically removing false alarms, we can leverage limited labeled data as well as massive unlabeled data to further outperform the expensively annotated baseline. We hope SBP can provide a new perspective for text detection that saves substantial labeling cost. Code is available at: github.com/weijiawu/SBP
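
    As a rough illustration of the DST idea, the following sketch filters pseudo labels with a confidence threshold that tightens over training; the linear schedule, the PseudoBox structure, and all names are assumptions rather than the SBP implementation:

        # A minimal sketch of dynamic self-training: keep only high-confidence
        # pseudo labels on unlabeled images, with a threshold that tightens as
        # training progresses to prune false alarms.
        from dataclasses import dataclass

        @dataclass
        class PseudoBox:
            polygon: list   # predicted text polygon (list of (x, y) points)
            score: float    # detector confidence

        def dynamic_threshold(epoch, start=0.5, end=0.9, total_epochs=20):
            """Linearly tighten the confidence threshold over training (an
            assumed schedule; the paper's exact removal rule may differ)."""
            t = min(epoch / total_epochs, 1.0)
            return start + t * (end - start)

        def filter_pseudo_labels(predictions, epoch):
            """Drop likely false alarms: predictions under the current threshold."""
            thr = dynamic_threshold(epoch)
            return [p for p in predictions if p.score >= thr]

        # Toy usage: early training keeps more pseudo labels than late training.
        preds = [PseudoBox([(0, 0), (10, 0), (10, 5), (0, 5)], 0.55),
                 PseudoBox([(20, 20), (40, 20), (40, 30), (20, 30)], 0.95)]
        print(len(filter_pseudo_labels(preds, epoch=1)))   # 2 kept
        print(len(filter_pseudo_labels(preds, epoch=19)))  # 1 kept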

    Self-supervised object detection from audio-visual correspondence

    Get PDF
    We tackle the problem of learning object detectors without supervision. Unlike weakly supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify objects by type, enumerate each instance of an object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly supervised detectors on object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats. Comment: Under review
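
    The contrastive objective could look roughly like the following NumPy sketch, which scores each audio clip against the visual regions of every clip in a batch and applies an InfoNCE loss; the max-pool over regions and all shapes are assumptions, not the paper's exact formulation:

        # A sketch of an audio-visual contrastive objective: pull a clip's
        # audio embedding toward its own visual region features and push it
        # from other clips' regions.
        import numpy as np

        def l2norm(x, axis=-1):
            return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

        def audio_visual_nce(audio, regions, tau=0.07):
            """audio: (B, D) clip embeddings; regions: (B, R, D) region
            embeddings. Score a clip pair by its best-matching region (max
            over R), then apply a standard InfoNCE loss over the batch."""
            a = l2norm(audio)                       # (B, D)
            v = l2norm(regions)                     # (B, R, D)
            sim = np.einsum("bd,krd->bkr", a, v)    # audio b vs regions of clip k
            logits = sim.max(axis=2) / tau          # (B, B): best region per pair
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            return float(-np.mean(np.diag(log_prob)))    # positives on diagonal

        # Toy usage with random features.
        rng = np.random.default_rng(0)
        loss = audio_visual_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 3, 16)))
        print(f"contrastive loss: {loss:.3f}")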

    Tell Me What They're Holding: Weakly-supervised Object Detection with Transferable Knowledge from Human-object Interaction

    Full text link
    In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we focus on HOI as a context that can strongly supervise complex semantics in images. We propose a novel module called the relational region proposal network (RRPN), which outputs an object-localizing attention map from only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the localization knowledge transferred from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI, without bounding box annotations, in the target domain. Because the RRPN is designed as an add-on, it can be applied not only to object detection but also to other domains such as semantic segmentation. Experimental results on the HICO-DET dataset suggest that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on the HICO-DET and V-COCO datasets. Comment: AAAI 2020 Oral, camera-ready
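
    For intuition, here is a hypothetical sketch of an RRPN-style attention map: a verb selects a keypoint and a soft Gaussian is placed around it. The verb-to-keypoint table and the fixed spread are invented for illustration; the actual RRPN learns this mapping end to end:

        # An illustrative object-localizing attention map from a human pose
        # and an action verb.
        import numpy as np

        # Hypothetical mapping from action verbs to the most informative keypoint.
        VERB_TO_KEYPOINT = {"kick": "ankle", "hold": "wrist", "ride": "hip"}

        def attention_map(verb, keypoints, h=64, w=64, sigma=6.0):
            """keypoints: dict name -> (x, y) in pixel coords. Returns an
            (h, w) map that peaks near the keypoint associated with the verb."""
            cx, cy = keypoints[VERB_TO_KEYPOINT[verb]]
            ys, xs = np.mgrid[0:h, 0:w]
            att = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            return att / att.sum()  # normalise so it acts like a spatial prior

        # Toy usage: "kick" attends around the ankle; the map can then weight
        # region proposals when no box annotations are available.
        kps = {"ankle": (20, 50), "wrist": (40, 25), "hip": (32, 35)}
        att = attention_map("kick", kps)
        print(np.unravel_index(att.argmax(), att.shape))  # (50, 20): (y, x) peak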

    A Survey of Deep Learning-Based Object Detection

    Get PDF
    Object detection is one of the most important and challenging branches of computer vision. It has been widely applied in daily life, for example in security monitoring and autonomous driving, with the purpose of locating instances of semantic objects of a certain class. With the rapid development of deep learning networks for detection tasks, the performance of object detectors has greatly improved. To give a thorough and deep understanding of the development of the object detection pipeline, this survey first analyzes the methods of existing typical detection models and describes the benchmark datasets. Afterwards, and primarily, we provide a comprehensive, systematic overview of a variety of object detection methods, covering one-stage and two-stage detectors. Moreover, we list traditional and new applications, and analyze some representative branches of object detection. Finally, we discuss how to exploit these object detection methods to build an effective and efficient system, and point out a set of development trends for better following the state-of-the-art algorithms and further research. Comment: 30 pages, 12 figures

    ์ด๋ฏธ์ง€์˜ ์˜๋ฏธ์  ์ดํ•ด๋ฅผ ์œ„ํ•œ ์‹œ๊ฐ์  ๊ด€๊ณ„์˜ ์ด์šฉ

    Get PDF
    Doctoral dissertation, Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Information Convergence), August 2019. Advisor: Nojun Kwak. Understanding an image is one of the fundamental goals of computer vision and can provide important breakthroughs for various industries. In particular, the ability to recognize objective instances such as objects and poses has developed rapidly thanks to recent deep learning approaches. However, deeply comprehending a visual scene requires higher-level understanding, such as is found in human beings. Humans usually exploit contextual information from visual inputs to detect meaningful features. In this dissertation, visual relations in various contexts, from the construction phase to the application phase, are studied through three tasks. We first propose a new algorithm for constructing relation graphs that contain the relational knowledge in diagrams. Although diagrams contain richer information than individual image-based or language-based data, proper solutions for automatically understanding diagrams have not been proposed, due to their innate multimodality and the arbitrariness of their layouts. To address this problem, we propose a unified diagram-parsing network for generating knowledge from diagrams, based on an object detector and a recurrent neural network designed for a graphical structure. Specifically, we propose a dynamic graph-generation network (DGGN) based on dynamic memory and graph theory. We explore the dynamics of information in a diagram through the activation of gates in gated recurrent unit (GRU) cells. On publicly available diagram datasets, our model demonstrates state-of-the-art results that outperform other baselines. Moreover, further experiments on question answering demonstrate the potential of the proposed method for various applications. Next, we introduce a novel algorithm to solve the Textbook Question Answering (TQA) task, which describes more realistic question-answering (QA) problems than other recent tasks. We mainly focus on two issues in the analysis of the TQA dataset. First, solving TQA problems requires understanding multimodal contexts in complicated input data. To extract knowledge features from long text lessons and merge them with visual features, we establish a context graph from texts and images and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, in the TQA dataset, subjects are split across chapters and scientific terms barely overlap between them. To overcome this so-called out-of-domain issue, we introduce a novel self-supervised, open-set learning process that requires no annotations before learning the QA problems. The experimental results indicate that our model significantly outperforms prior state-of-the-art methods.
Moreover, ablation studies confirm that both methods (incorporating f-GCN to extract knowledge from multimodal contexts, and our newly proposed self-supervised learning process) are effective for TQA problems. Third, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we focus on HOI as a context that can strongly supervise complex semantics in images. We therefore propose a novel module called the relational region proposal network (RRPN), which outputs an object-localizing attention map from only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the localization knowledge transferred from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI, without bounding box annotations, in the target domain. Because the RRPN is designed as an add-on, it can be applied not only to object detection but also to other domains such as semantic segmentation. Experimental results on the HICO-DET dataset suggest that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects in the HICO-DET and V-COCO datasets.
    Contents:
    1. Introduction
       1.1 Problem Definition
       1.2 Motivation
       1.3 Challenges
       1.4 Contributions
           1.4.1 Generating Visual Relation Graphs from Diagrams
           1.4.2 Application of the Relation Graph in Textbook Question Answering
           1.4.3 Weakly Supervised Object Detection with Human-object Interaction
       1.5 Outline
    2. Background
       2.1 Visual relationships
       2.2 Neural networks on a graph
       2.3 Human-object interaction
    3. Generating Visual Relation Graphs from Diagrams
       3.1 Related Work
       3.2 Proposed Method
           3.2.1 Detecting Constituents in a Diagram
           3.2.2 Generating a Graph of Relationships
           3.2.3 Multi-task Training and Cascaded Inference
           3.2.4 Details of Post-processing
       3.3 Experiment
           3.3.1 Datasets
           3.3.2 Baseline
           3.3.3 Metrics
           3.3.4 Implementation Details
           3.3.5 Quantitative Results
           3.3.6 Qualitative Results
       3.4 Discussion
       3.5 Conclusion
    4. Application of the Relation Graph in Textbook Question Answering
       4.1 Related Work
       4.2 Problem
       4.3 Proposed Method
           4.3.1 Multi-modal Context Graph Understanding
           4.3.2 Multi-modal Problem Solving
           4.3.3 Self-supervised Open-set Comprehension
           4.3.4 Process of Building Textual Context Graph
       4.4 Experiment
           4.4.1 Implementation Details
           4.4.2 Dataset
           4.4.3 Baselines
           4.4.4 Quantitative Results
           4.4.5 Qualitative Results
       4.5 Conclusion
    5. Weakly Supervised Object Detection with Human-object Interaction
       5.1 Related Work
       5.2 Algorithm Overview
       5.3 Proposed Method
           5.3.1 Training on the Source Classes Ds
           5.3.2 Training on the Target Classes Dt
       5.4 Experiment
           5.4.1 Implementation Details
           5.4.2 Dataset and Pre-processing
           5.4.3 Metrics
           5.4.4 Comparison with Different Feature Combinations
           5.4.5 Comparison with Different Attention Loss Balances and Box Thresholds
           5.4.6 Comparison with Prior Works
           5.4.7 Qualitative Results
       5.5 Conclusion
    6. Concluding Remarks
       6.1 Summary
       6.2 Limitation and Future Directions
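
    As background for the f-GCN module described above, the following is a minimal NumPy sketch of the standard graph-convolution layer it builds on; the multimodal fusion of f-GCN itself is not reproduced, and all shapes are illustrative:

        # One Kipf & Welling-style GCN layer over a context graph.
        import numpy as np

        def gcn_layer(A, X, W):
            """H = ReLU(D^-1/2 (A + I) D^-1/2 X W).
            A: (N, N) adjacency, X: (N, F_in) node features, W: (F_in, F_out)."""
            A_hat = A + np.eye(A.shape[0])               # add self-loops
            d = A_hat.sum(axis=1)
            D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # symmetric normalisation
            return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

        # Toy usage: a 3-node context graph with 4-d features projected to 2-d.
        rng = np.random.default_rng(0)
        A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
        H = gcn_layer(A, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
        print(H.shape)  # (3, 2)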

    Deep Video Understanding with Model Efficiency and Sparse Active Labeling

    Get PDF
    Videos capture the inherently sequential nature of the real world, making automatic video understanding essential for automatic understanding of the real world. Due to major advancements in camera, communication, and storage hardware, video has become a widely used data format for crucial applications such as home automation, security, analysis, robotics, and autonomous driving. Existing methods for video understanding require heavy computation and large training data for good performance, which limits how quickly videos can be processed and how much data can be labeled for training. Real-world video understanding requires analyzing dense scenes and sequential information, so processing time and labeling cost grow with scene density and video length. It is therefore crucial to develop video understanding methods that reduce processing time and labeling cost. In this dissertation, we first propose a method to improve network efficiency for video understanding, and then provide methods to improve annotation efficiency. Through these works, we aim to improve both network efficiency and data annotation efficiency, in an effort to encourage wider development and adoption of large-scale video understanding methods. First, we propose an end-to-end neural network that performs faster video actor-action detection. Our proposed network removes the need for extra region-proposal computation and post-process filtering, making training easy and increasing inference speed. Next, we propose an active-learning-based sparse labeling method that makes annotation of large video datasets efficient. It selects a few useful frames from each video for annotation, reducing annotation cost while maintaining the dataset's usefulness for video understanding; we also provide a method to train existing video understanding models with such sparse annotations. Then, we propose a clustering-based hybrid active learning method that also selects useful videos along with useful frames for annotation, reducing annotation cost even further. Finally, we study the relation between different types of annotations and how they impact video understanding tasks. We extensively evaluate and analyze our methods on various datasets and downstream tasks to show that they enable efficient video understanding with faster networks and limited sparse annotations.
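
    As one plausible reading of the sparse-labeling idea, the sketch below scores frames by prediction entropy and picks a spread-out top-k for annotation; the entropy criterion and the min_gap heuristic are assumptions, not the dissertation's exact strategy:

        # Uncertainty-driven sparse frame selection for video annotation.
        import numpy as np

        def entropy(probs, eps=1e-12):
            """Per-frame entropy of class probabilities, shape (T, C) -> (T,)."""
            return -np.sum(probs * np.log(probs + eps), axis=1)

        def select_frames(frame_probs, k=3, min_gap=5):
            """Pick up to k high-entropy frames, at least min_gap frames apart,
            so the annotations are spread across the video."""
            order = np.argsort(-entropy(frame_probs))    # most uncertain first
            chosen = []
            for t in order:
                if all(abs(t - c) >= min_gap for c in chosen):
                    chosen.append(int(t))
                if len(chosen) == k:
                    break
            return sorted(chosen)

        # Toy usage: 30 frames, 5 classes; near-uniform rows have high entropy.
        rng = np.random.default_rng(0)
        p = rng.dirichlet(np.ones(5) * 0.3, size=30)
        print(select_frames(p, k=3))  # indices of frames to send for annotation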