2,936 research outputs found
The role of object instance re-identification in 3D object localization and semantic 3D reconstruction.
For an autonomous system to fully understand a scene, it requires a 3D reconstruction of the world that carries both geometric information, such as camera poses, and semantic information, such as the labels (tree, chair, dog, etc.) associated with the objects mapped within the reconstruction.
In this thesis, we will study the problem of object-centric 3D reconstruction of a scene, in contrast with most previous work in the literature, which focuses on building a 3D point cloud that captures structure but lacks semantic information. We will study how crucial 3D object localization is for this problem and discuss the limitations faced by previous related methods. We will present an approach for 3D object localization that uses only 2D detections observed in multiple views, aided by 3D object shape priors.
Since our first approach relies on associating 2D detections across multiple views, we will also study an approach to re-identify object instances in rigid scenes, and will propose a novel method that jointly learns the foreground and background of an object instance with a triplet-based network in order to identify multiple instances of the same object in multiple views. We will also propose an Augmented Reality application based on Google's Tango that integrates both of the proposed approaches. Finally, we will conclude with some open problems that might benefit from the suggested future work.
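The triplet-based learning mentioned in this abstract can be illustrated with a standard triplet margin loss, which pulls embeddings of the same object instance together and pushes different instances apart. This is a minimal sketch of the generic loss only; the embedding vectors and margin value are illustrative assumptions, not the thesis's actual network or features.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on embedding vectors.

    Pulls the anchor toward the positive (same object instance)
    and pushes it away from the negative (a different instance).
    The margin value here is an illustrative choice.
    """
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same instance
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to other instance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: anchor close to the positive, far from the negative.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: this triplet is already well separated
```

A re-identification system trained with such a loss can then match instances across views by nearest-neighbor search in the learned embedding space.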
Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation
We propose a new method to analyze the impact of errors in algorithms for multi-instance pose estimation, along with a principled benchmark that can be used to compare them. We define and characterize three classes of errors - localization, scoring, and background - and study how they are influenced by instance attributes and how they affect an algorithm's performance. Our technique is applied to compare the two leading methods for human pose estimation on the COCO Dataset and to measure the sensitivity of pose estimation with respect to instance size, type and number of visible keypoints, clutter due to multiple instances, and the relative score of instances. The performance of algorithms, and the types of error they make, are highly dependent on all these variables, but mostly on the number of keypoints and the clutter. The analysis and software tools we propose offer a novel and insightful approach to understanding the behavior of pose estimation algorithms and an effective method for measuring their strengths and weaknesses.
Comment: Project page available at http://www.vision.caltech.edu/~mronchi/projects/PoseErrorDiagnosis/; code available at https://github.com/matteorr/coco-analyze; published at ICCV 2017
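Localization errors on COCO are typically measured through Object Keypoint Similarity (OKS), the metric underlying benchmarks like the one described above. The sketch below shows a minimal OKS computation; the per-keypoint falloff constants and the toy keypoints are made-up values, and COCO's official constants differ per keypoint type.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity between predicted and ground-truth
    keypoints, averaged over visible keypoints (COCO-style).

    pred, gt : (N, 2) arrays of keypoint coordinates
    visible  : (N,) boolean mask of annotated keypoints
    area     : object segment area (scale term)
    k        : (N,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)      # squared pixel distances
    ks = np.exp(-d2 / (2.0 * area * k ** 2))   # per-keypoint similarity in (0, 1]
    return float(np.mean(ks[visible]))

gt = np.array([[10.0, 10.0], [20.0, 20.0]])
pred_perfect = gt.copy()
vis = np.array([True, True])
k = np.array([0.1, 0.1])
print(oks(pred_perfect, gt, vis, area=100.0, k=k))  # 1.0 for a perfect prediction
```

An error-diagnosis tool can then bucket predictions by OKS value, e.g. treating high-OKS mismatches as localization jitter and near-zero-OKS detections as background errors.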
Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection
Top-down saliency models produce a probability map that peaks at target locations specified by a task/goal, such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence/absence of an object in an image. First, the probabilistic contribution of each image region to the confidence of a CNN-based image classifier is computed through a backtracking strategy to produce top-down saliency. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we select the best saliency map suitable for the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate feature saliency. This is integrated with the combined saliency and further refined through multi-scale superpixel averaging of the saliency map. We evaluate the performance of the proposed weakly supervised top-down saliency framework and achieve performance comparable to that of fully supervised approaches. Experiments are carried out on seven challenging datasets, and quantitative results are compared with those of 40 closely related approaches across 4 different applications.
Comment: 14 pages, 7 figures
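The fusion step described above, combining a selected bottom-up map with the top-down map and keeping high-saliency regions for SVM training, can be sketched as follows. The pixel-wise multiplicative fusion and the threshold are illustrative assumptions, not the paper's exact combination rule.

```python
import numpy as np

def combine_saliency(top_down, bottom_up, thresh=0.5):
    """Fuse a top-down and a bottom-up saliency map and pick
    high-saliency pixels as candidate training samples.

    Multiplicative fusion rewards pixels where both cues agree;
    the fusion rule and threshold are illustrative choices.
    """
    fused = top_down * bottom_up          # agreement between the two cues
    fused /= fused.max() + 1e-8           # normalize to [0, 1]
    return fused, fused > thresh          # fused map + mask of salient pixels

td = np.array([[0.9, 0.1], [0.8, 0.2]])  # toy 2x2 top-down map
bu = np.array([[0.8, 0.2], [0.9, 0.1]])  # toy 2x2 bottom-up map
fused, mask = combine_saliency(td, bu)
print(mask)  # only the left column, where both maps are high, survives
```

Pixels selected by the mask would supply positive features for the linear SVM, with low-saliency pixels serving as negatives.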
Text Localization in Video Using Multiscale Weber's Local Descriptor
In this paper, we propose a novel approach for detecting the text present in videos and scene images based on the Multiscale Weber's Local Descriptor (MWLD). Given an input video, the shots are identified and the key frames are extracted based on their spatio-temporal relationship. From each key frame, we extract local region information using WLD computed over different radii and neighborhood relationships of pixel values, thereby obtaining intensity-enhanced key frames at multiple scales. These multiscale WLD key frames are merged together, and the horizontal gradients are then computed using morphological operations. The results are binarized, and false positives are eliminated based on geometrical properties. Finally, we employ connected component analysis and a morphological dilation operation to determine the text regions, which aids text localization. Experimental results obtained on the publicly available standard Hua, Horizontal-1 and Horizontal-2 video datasets illustrate that the proposed method can accurately detect and localize text of various sizes, fonts and colors in videos.
Comment: IEEE SPICES, 201
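The WLD computation at the core of this pipeline is built on a differential excitation term: the arctangent of the summed relative intensity differences between a pixel and its neighborhood. The sketch below shows this term for a single 3x3 patch; the multiscale variant in the paper varies the radius and neighborhood, which this single-scale toy omits.

```python
import numpy as np

def differential_excitation(patch):
    """Differential excitation of Weber's Local Descriptor for the
    center pixel of a square patch: arctan of the summed intensity
    differences against the neighborhood, relative to the center.
    """
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    center = patch[cy, cx]
    diff = patch.astype(float) - center
    diff[cy, cx] = 0.0  # exclude the center pixel itself
    return np.arctan(diff.sum() / (center + 1e-8))

flat = np.full((3, 3), 100.0)
print(differential_excitation(flat))  # 0.0 on a flat patch (no local change)
```

Strong positive or negative excitation marks pixels with sharp local contrast, which is why thresholded WLD responses highlight text strokes against their background.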