8 research outputs found

    Activity Driven Weakly Supervised Object Detection

    Full text link
    Weakly supervised object detection aims to reduce the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class, not the object bounding box. In our work, we leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the image/video can provide strong cues about the location of the associated object. We learn a spatial prior for the object that depends on the action (e.g. the "ball" is close to the "leg of the person" in "kicking ball"), and incorporate this prior to simultaneously train a joint object detection and action classification model. We conducted experiments on both video and image datasets to evaluate our weakly supervised object detection model. Our approach outperforms the current state-of-the-art (SOTA) method by more than 6% in mAP on the Charades video dataset. Comment: CVPR'19 camera-ready
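
    To make the spatial-prior idea concrete, here is a minimal sketch (not the authors' code) of how an action-conditioned prior over object locations could re-rank weakly labelled proposals; the Gaussian parameterization, the SPATIAL_PRIOR table, and all names are illustrative assumptions:

        # A minimal sketch of an action-conditioned spatial prior combined
        # with a class score to rank weakly labelled proposals.
        import numpy as np

        # Hypothetical learned prior: for each action, the mean/std of the
        # object centre relative to a person keypoint (e.g. "kick ball" ->
        # near the foot), in normalised image coordinates.
        SPATIAL_PRIOR = {
            "kick_ball": {"mean": np.array([0.0, 0.9]), "std": np.array([0.2, 0.15])},
            "hold_cup":  {"mean": np.array([0.3, 0.4]), "std": np.array([0.15, 0.15])},
        }

        def prior_score(action, proposal_centre, person_centre):
            """Gaussian density of the proposal centre under the action's prior,
            measured relative to the person."""
            p = SPATIAL_PRIOR[action]
            d = proposal_centre - person_centre - p["mean"]
            return float(np.exp(-0.5 * np.sum((d / p["std"]) ** 2)))

        def rank_proposals(action, proposals, class_scores, person_centre):
            """Re-rank object proposals by classifier score * spatial prior."""
            scores = [s * prior_score(action, c, person_centre)
                      for c, s in zip(proposals, class_scores)]
            return int(np.argmax(scores))

        # Toy usage: two candidate boxes; the prior favours the one near the foot.
        proposals = [np.array([0.05, 0.85]), np.array([0.5, 0.1])]
        best = rank_proposals("kick_ball", proposals, [0.6, 0.7], np.array([0.0, 0.0]))
        print("selected proposal:", best)  # prior pulls selection toward index 0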

    SelfText Beyond Polygon: Unconstrained Text Detection with Box Supervision and Dynamic Self-Training

    Full text link
    Although a polygon is a more accurate representation than an upright bounding box for text detection, polygon annotations are extremely expensive and challenging to obtain. Unlike existing works that rely on fully supervised training with polygon annotations, we propose a novel text detection system termed SelfText Beyond Polygon (SBP), comprising Bounding Box Supervision (BBS) and Dynamic Self-Training (DST), which trains a polygon-based text detector with only a limited set of upright bounding box annotations. For BBS, we first use synthetic data with character-level annotations to train a Skeleton Attention Segmentation Network (SASN). The box-level annotations are then used to guide the generation of high-quality polygon-like pseudo labels, which can be used to train any detector. In this way, our method achieves the same performance as text detectors trained with polygon annotations (i.e., both reach an 85.0% F-score for PSENet on ICDAR2015). For DST, by dynamically removing false alarms, we can leverage limited labeled data as well as massive unlabeled data to further outperform the expensively annotated baseline. We hope SBP can provide a new perspective for text detection that saves substantial labeling cost. Code is available at: github.com/weijiawu/SBP
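
    As a rough illustration of the DST idea, the following sketch filters pseudo labels with a confidence threshold that tightens over training; the linear schedule, the PseudoBox structure, and all names are assumptions rather than the SBP implementation:

        # A minimal sketch of dynamic self-training: keep only high-confidence
        # pseudo labels on unlabeled images, with a threshold that tightens as
        # training progresses to prune false alarms.
        from dataclasses import dataclass

        @dataclass
        class PseudoBox:
            polygon: list   # predicted text polygon (list of (x, y) points)
            score: float    # detector confidence

        def dynamic_threshold(epoch, start=0.5, end=0.9, total_epochs=20):
            """Linearly tighten the confidence threshold over training (an
            assumed schedule; the paper's exact removal rule may differ)."""
            t = min(epoch / total_epochs, 1.0)
            return start + t * (end - start)

        def filter_pseudo_labels(predictions, epoch):
            """Drop likely false alarms: predictions under the current threshold."""
            thr = dynamic_threshold(epoch)
            return [p for p in predictions if p.score >= thr]

        # Toy usage: early training keeps more pseudo labels than late training.
        preds = [PseudoBox([(0, 0), (10, 0), (10, 5), (0, 5)], 0.55),
                 PseudoBox([(20, 20), (40, 20), (40, 30), (20, 30)], 0.95)]
        print(len(filter_pseudo_labels(preds, epoch=1)))   # 2 kept
        print(len(filter_pseudo_labels(preds, epoch=19)))  # 1 kept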

    Self-supervised object detection from audio-visual correspondence

    Get PDF
    We tackle the problem of learning object detectors without supervision. Unlike weakly supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify objects by type, enumerate each instance of an object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly supervised detectors on object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats. Comment: Under review
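
    The contrastive objective could look roughly like the following NumPy sketch, which scores each audio clip against the visual regions of every clip in a batch and applies an InfoNCE loss; the max-pool over regions and all shapes are assumptions, not the paper's exact formulation:

        # A sketch of an audio-visual contrastive objective: pull a clip's
        # audio embedding toward its own visual region features and push it
        # from other clips' regions.
        import numpy as np

        def l2norm(x, axis=-1):
            return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

        def audio_visual_nce(audio, regions, tau=0.07):
            """audio: (B, D) clip embeddings; regions: (B, R, D) region
            embeddings. Score a clip pair by its best-matching region (max
            over R), then apply a standard InfoNCE loss over the batch."""
            a = l2norm(audio)                       # (B, D)
            v = l2norm(regions)                     # (B, R, D)
            sim = np.einsum("bd,krd->bkr", a, v)    # audio b vs regions of clip k
            logits = sim.max(axis=2) / tau          # (B, B): best region per pair
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            return float(-np.mean(np.diag(log_prob)))    # positives on diagonal

        # Toy usage with random features.
        rng = np.random.default_rng(0)
        loss = audio_visual_nce(rng.normal(size=(4, 16)), rng.normal(size=(4, 3, 16)))
        print(f"contrastive loss: {loss:.3f}")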

    Tell Me What They're Holding: Weakly-supervised Object Detection with Transferable Knowledge from Human-object Interaction

    Full text link
    In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we focus on HOI as a context that can strongly supervise complex semantics in images. We propose a novel module called the relational region proposal network (RRPN), which outputs an object-localizing attention map from only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the localization knowledge transferred from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI, without bounding box annotations, in the target domain. Because the RRPN is designed as an add-on, it can be applied not only to object detection but also to other domains such as semantic segmentation. Experimental results on the HICO-DET dataset suggest that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on the HICO-DET and V-COCO datasets. Comment: AAAI 2020 Oral, camera-ready
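
    For intuition, here is a hypothetical sketch of an RRPN-style attention map: a verb selects a keypoint and a soft Gaussian is placed around it. The verb-to-keypoint table and the fixed spread are invented for illustration; the actual RRPN learns this mapping end to end:

        # An illustrative object-localizing attention map from a human pose
        # and an action verb.
        import numpy as np

        # Hypothetical mapping from action verbs to the most informative keypoint.
        VERB_TO_KEYPOINT = {"kick": "ankle", "hold": "wrist", "ride": "hip"}

        def attention_map(verb, keypoints, h=64, w=64, sigma=6.0):
            """keypoints: dict name -> (x, y) in pixel coords. Returns an
            (h, w) map that peaks near the keypoint associated with the verb."""
            cx, cy = keypoints[VERB_TO_KEYPOINT[verb]]
            ys, xs = np.mgrid[0:h, 0:w]
            att = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            return att / att.sum()  # normalise so it acts like a spatial prior

        # Toy usage: "kick" attends around the ankle; the map can then weight
        # region proposals when no box annotations are available.
        kps = {"ankle": (20, 50), "wrist": (40, 25), "hip": (32, 35)}
        att = attention_map("kick", kps)
        print(np.unravel_index(att.argmax(), att.shape))  # (50, 20): (y, x) peak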

    A Survey of Deep Learning-Based Object Detection

    Get PDF
    Object detection is one of the most important and challenging branches of computer vision. It has been widely applied in daily life, for example in security monitoring and autonomous driving, with the purpose of locating instances of semantic objects of a certain class. With the rapid development of deep learning networks for detection tasks, the performance of object detectors has greatly improved. To give a thorough and deep understanding of the development of the object detection pipeline, this survey first analyzes the methods of existing typical detection models and describes the benchmark datasets. Afterwards, and primarily, we provide a comprehensive, systematic overview of a variety of object detection methods, covering one-stage and two-stage detectors. Moreover, we list traditional and new applications, and analyze some representative branches of object detection. Finally, we discuss how to exploit these object detection methods to build an effective and efficient system, and point out a set of development trends for better following the state-of-the-art algorithms and further research. Comment: 30 pages, 12 figures

    ์ด๋ฏธ์ง€์˜ ์˜๋ฏธ์  ์ดํ•ด๋ฅผ ์œ„ํ•œ ์‹œ๊ฐ์  ๊ด€๊ณ„์˜ ์ด์šฉ

    Get PDF
    Doctoral dissertation, Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Digital Information Convergence), August 2019. Advisor: Nojun Kwak. Understanding an image is one of the fundamental goals of computer vision and can provide important breakthroughs for various industries. In particular, the ability to recognize objective instances such as objects and poses has developed rapidly thanks to recent deep learning approaches. However, deeply comprehending a visual scene requires higher-level understanding, such as is found in human beings. Humans usually exploit contextual information from visual inputs to detect meaningful features. In this dissertation, visual relations in various contexts, from the construction phase to the application phase, are studied through three tasks. We first propose a new algorithm for constructing relation graphs that contain the relational knowledge in diagrams. Although diagrams contain richer information than individual image-based or language-based data, proper solutions for automatically understanding diagrams have not been proposed, due to their innate multimodality and the arbitrariness of their layouts. To address this problem, we propose a unified diagram-parsing network for generating knowledge from diagrams, based on an object detector and a recurrent neural network designed for a graphical structure. Specifically, we propose a dynamic graph-generation network (DGGN) based on dynamic memory and graph theory. We explore the dynamics of information in a diagram through the activation of gates in gated recurrent unit (GRU) cells. On publicly available diagram datasets, our model demonstrates state-of-the-art results that outperform other baselines. Moreover, further experiments on question answering demonstrate the potential of the proposed method for various applications. Next, we introduce a novel algorithm to solve the Textbook Question Answering (TQA) task, which describes more realistic question-answering (QA) problems than other recent tasks. We mainly focus on two issues in the analysis of the TQA dataset. First, solving TQA problems requires understanding multimodal contexts in complicated input data. To extract knowledge features from long text lessons and merge them with visual features, we establish a context graph from texts and images and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, in the TQA dataset, subjects are split across chapters and scientific terms barely overlap between them. To overcome this so-called out-of-domain issue, we introduce a novel self-supervised, open-set learning process that requires no annotations before learning the QA problems. The experimental results indicate that our model significantly outperforms prior state-of-the-art methods.
Moreover, ablation studies confirm that both methods (incorporating f-GCN to extract knowledge from multimodal contexts, and our newly proposed self-supervised learning process) are effective for TQA problems. Third, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes with few examples, using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we focus on HOI as a context that can strongly supervise complex semantics in images. We therefore propose a novel module called the relational region proposal network (RRPN), which outputs an object-localizing attention map from only human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With the localization knowledge transferred from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI, without bounding box annotations, in the target domain. Because the RRPN is designed as an add-on, it can be applied not only to object detection but also to other domains such as semantic segmentation. Experimental results on the HICO-DET dataset suggest that the proposed method can be a cheap alternative to the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects in the HICO-DET and V-COCO datasets.
    Contents:
    1. Introduction
       1.1 Problem Definition
       1.2 Motivation
       1.3 Challenges
       1.4 Contributions
           1.4.1 Generating Visual Relation Graphs from Diagrams
           1.4.2 Application of the Relation Graph in Textbook Question Answering
           1.4.3 Weakly Supervised Object Detection with Human-object Interaction
       1.5 Outline
    2. Background
       2.1 Visual relationships
       2.2 Neural networks on a graph
       2.3 Human-object interaction
    3. Generating Visual Relation Graphs from Diagrams
       3.1 Related Work
       3.2 Proposed Method
           3.2.1 Detecting Constituents in a Diagram
           3.2.2 Generating a Graph of Relationships
           3.2.3 Multi-task Training and Cascaded Inference
           3.2.4 Details of Post-processing
       3.3 Experiment
           3.3.1 Datasets
           3.3.2 Baseline
           3.3.3 Metrics
           3.3.4 Implementation Details
           3.3.5 Quantitative Results
           3.3.6 Qualitative Results
       3.4 Discussion
       3.5 Conclusion
    4. Application of the Relation Graph in Textbook Question Answering
       4.1 Related Work
       4.2 Problem
       4.3 Proposed Method
           4.3.1 Multi-modal Context Graph Understanding
           4.3.2 Multi-modal Problem Solving
           4.3.3 Self-supervised Open-set Comprehension
           4.3.4 Process of Building Textual Context Graph
       4.4 Experiment
           4.4.1 Implementation Details
           4.4.2 Dataset
           4.4.3 Baselines
           4.4.4 Quantitative Results
           4.4.5 Qualitative Results
       4.5 Conclusion
    5. Weakly Supervised Object Detection with Human-object Interaction
       5.1 Related Work
       5.2 Algorithm Overview
       5.3 Proposed Method
           5.3.1 Training on the Source Classes Ds
           5.3.2 Training on the Target Classes Dt
       5.4 Experiment
           5.4.1 Implementation Details
           5.4.2 Dataset and Pre-processing
           5.4.3 Metrics
           5.4.4 Comparison with Different Feature Combinations
           5.4.5 Comparison with Different Attention Loss Balances and Box Thresholds
           5.4.6 Comparison with Prior Works
           5.4.7 Qualitative Results
       5.5 Conclusion
    6. Concluding Remarks
       6.1 Summary
       6.2 Limitation and Future Directions
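
    As background for the f-GCN module described above, the following is a minimal NumPy sketch of the standard graph-convolution layer it builds on; the multimodal fusion of f-GCN itself is not reproduced, and all shapes are illustrative:

        # One Kipf & Welling-style GCN layer over a context graph.
        import numpy as np

        def gcn_layer(A, X, W):
            """H = ReLU(D^-1/2 (A + I) D^-1/2 X W).
            A: (N, N) adjacency, X: (N, F_in) node features, W: (F_in, F_out)."""
            A_hat = A + np.eye(A.shape[0])               # add self-loops
            d = A_hat.sum(axis=1)
            D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # symmetric normalisation
            return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

        # Toy usage: a 3-node context graph with 4-d features projected to 2-d.
        rng = np.random.default_rng(0)
        A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
        H = gcn_layer(A, rng.normal(size=(3, 4)), rng.normal(size=(4, 2)))
        print(H.shape)  # (3, 2)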

    Deep Video Understanding with Model Efficiency and Sparse Active Labeling

    Get PDF
    Videos capture the inherently sequential nature of the real world, making automatic video understanding essential for automatic understanding of the real world. Due to major advancements in camera, communication, and storage hardware, video has become a widely used data format for crucial applications such as home automation, security, analysis, robotics, and autonomous driving. Existing methods for video understanding require heavy computation and large training data for good performance, which limits how quickly videos can be processed and how much data can be labeled for training. Real-world video understanding requires analyzing dense scenes and sequential information, so processing time and labeling cost grow with scene density and video length. It is therefore crucial to develop video understanding methods that reduce processing time and labeling cost. In this dissertation, we first propose a method to improve network efficiency for video understanding, and then provide methods to improve annotation efficiency. Through these works, we aim to improve both network efficiency and data annotation efficiency, in an effort to encourage wider development and adoption of large-scale video understanding methods. First, we propose an end-to-end neural network that performs faster video actor-action detection. Our proposed network removes the need for extra region-proposal computation and post-process filtering, making training easy and increasing inference speed. Next, we propose an active-learning-based sparse labeling method that makes annotation of large video datasets efficient. It selects a few useful frames from each video for annotation, reducing annotation cost while maintaining the dataset's usefulness for video understanding; we also provide a method to train existing video understanding models with such sparse annotations. Then, we propose a clustering-based hybrid active learning method that also selects useful videos along with useful frames for annotation, reducing annotation cost even further. Finally, we study the relation between different types of annotations and how they impact video understanding tasks. We extensively evaluate and analyze our methods on various datasets and downstream tasks to show that they enable efficient video understanding with faster networks and limited sparse annotations.
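
    As one plausible reading of the sparse-labeling idea, the sketch below scores frames by prediction entropy and picks a spread-out top-k for annotation; the entropy criterion and the min_gap heuristic are assumptions, not the dissertation's exact strategy:

        # Uncertainty-driven sparse frame selection for video annotation.
        import numpy as np

        def entropy(probs, eps=1e-12):
            """Per-frame entropy of class probabilities, shape (T, C) -> (T,)."""
            return -np.sum(probs * np.log(probs + eps), axis=1)

        def select_frames(frame_probs, k=3, min_gap=5):
            """Pick up to k high-entropy frames, at least min_gap frames apart,
            so the annotations are spread across the video."""
            order = np.argsort(-entropy(frame_probs))    # most uncertain first
            chosen = []
            for t in order:
                if all(abs(t - c) >= min_gap for c in chosen):
                    chosen.append(int(t))
                if len(chosen) == k:
                    break
            return sorted(chosen)

        # Toy usage: 30 frames, 5 classes; near-uniform rows have high entropy.
        rng = np.random.default_rng(0)
        p = rng.dirichlet(np.ones(5) * 0.3, size=30)
        print(select_frames(p, k=3))  # indices of frames to send for annotation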