
    Real-time factored ConvNets: Extracting the x factor in human parsing

    © 2017. The copyright of this document resides with its authors. We propose a real-time and lightweight multi-task-style ConvNet (termed a Factored ConvNet) for human body parsing in images or video. Factored ConvNets have isolated areas which perform known sub-tasks, such as object localization or edge detection. We call such an area and sub-task pair an X factor. Unlike multi-task ConvNets, whose tasks are independent, the Factored ConvNet's sub-task has a direct effect on the main task outcome. In this paper we show how to isolate the X factor of foreground/background (f/b) subtraction from the main task of segmenting human body images into 31 different body part types. Knowledge of this X factor leads to a number of benefits for the Factored ConvNet: 1) ease of network transfer to other image domains, 2) the ability to personalize to humans in video and 3) easy model performance boosts. All are achieved by either efficiently updating or replacing the X factor, while avoiding catastrophic forgetting of previously learnt body part dependencies and structure. We show these benefits on a large dataset of images and also on YouTube videos.
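The core idea of the abstract, an isolated, swappable sub-task module inside a larger pipeline, can be sketched in miniature. All names below are hypothetical illustrations, not the paper's actual architecture: a toy "network" whose foreground/background stage (the X factor) can be replaced without touching the part that encodes body-part structure.

```python
# Toy sketch of the Factored ConvNet idea (hypothetical names): the model is
# a pipeline in which one isolated stage -- the X factor, here f/b
# subtraction -- can be swapped or updated while the body-part head stays
# fixed, so previously learnt part structure is not forgotten.

class FactoredNet:
    """Stand-in pipeline: X factor (f/b sub-task) -> body-part head."""

    def __init__(self, x_factor):
        self.x_factor = x_factor  # isolated f/b sub-task module

    def replace_x_factor(self, new_x_factor):
        # Domain transfer / personalisation: only the X factor changes;
        # the head that models body parts is untouched.
        self.x_factor = new_x_factor

    def forward(self, pixels):
        # Head labels only pixels the X factor marks as foreground.
        return ["body_part" if self.x_factor(p) else "background"
                for p in pixels]

generic_fb = lambda p: p > 0.5       # generic f/b threshold
personalised_fb = lambda p: p > 0.3  # tuned to one person in a video

net = FactoredNet(generic_fb)
print(net.forward([0.2, 0.4, 0.9]))  # ['background', 'background', 'body_part']
net.replace_x_factor(personalised_fb)
print(net.forward([0.2, 0.4, 0.9]))  # ['background', 'body_part', 'body_part']
```

The point of the sketch is the update pattern: swapping one module changes the main-task output without retraining the rest of the pipeline.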

    Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression

    In this work we present a novel approach to joint semantic localisation and scene understanding. Our work is motivated by the need for localisation algorithms which not only predict 6-DoF camera pose but also simultaneously recognise surrounding objects and estimate 3D geometry. Such capabilities are crucial for computer vision guided systems which interact with the environment: autonomous driving, augmented reality and robotics. In particular, we propose a two-step procedure. During the first step we train a convolutional neural network to jointly predict per-pixel globally unique instance labels and corresponding local coordinates for each instance of a static object (e.g. a building). During the second step we obtain scene coordinates by combining object center coordinates and local coordinates, and use them to perform 6-DoF camera pose estimation. We evaluate our approach on real world (CamVid-360) and artificial (SceneCity) autonomous driving datasets. We obtain smaller mean distance and angular errors than state-of-the-art 6-DoF pose estimation algorithms based on direct pose regression and pose estimation from scene coordinates on all datasets. Our contributions include: (i) a novel formulation of scene coordinate regression as two separate tasks of object instance recognition and local coordinate regression, and a demonstration that our proposed solution makes it possible to predict accurate 3D geometry of static objects and estimate the 6-DoF camera pose on (ii) maps larger by several orders of magnitude than previously attempted by scene coordinate regression methods, as well as on (iii) lightweight, approximate 3D maps built from 3D primitives such as building-aligned cuboids.
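The second step described above, recovering scene coordinates by combining object-centre coordinates with per-pixel local coordinates, reduces to a simple addition per pixel. A minimal sketch, with hypothetical names and toy data (the real system would feed these scene coordinates into a standard PnP-style 6-DoF pose solver):

```python
# Hedged sketch of scene-coordinate recovery: per pixel the network
# predicts a globally unique instance id and a local coordinate; the
# scene coordinate is the known centre of that instance plus the local
# offset. All names and numbers here are illustrative.

def scene_coordinates(instance_ids, local_coords, object_centres):
    """Combine instance labels and local offsets into 3D scene coordinates."""
    out = []
    for inst, local in zip(instance_ids, local_coords):
        cx, cy, cz = object_centres[inst]  # centre of this object instance
        lx, ly, lz = local                 # predicted local coordinate
        out.append((cx + lx, cy + ly, cz + lz))
    return out

centres = {7: (100.0, 20.0, 0.0)}  # globally unique instance 7 (a building)
ids     = [7, 7]                   # per-pixel instance predictions
locals_ = [(1.0, 2.0, 0.5), (-1.0, 0.0, 0.5)]

print(scene_coordinates(ids, locals_, centres))
# -> [(101.0, 22.0, 0.5), (99.0, 20.0, 0.5)]
```

Splitting regression into instance recognition plus a local offset keeps each predicted quantity small and bounded, which is what lets the method scale to maps orders of magnitude larger than direct scene-coordinate regression.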

    Making a shallow network deep: Conversion of a boosting classifier into a decision tree by boolean optimisation

    This paper presents a novel way to speed up the evaluation time of a boosting classifier. We make a shallow (flat) network deep (hierarchical) by growing a tree from the decision regions of a given boosting classifier. The tree provides many short paths for speeding up evaluation while preserving the reasonably smooth decision regions of the boosting classifier for good generalisation. For converting a boosting classifier into a decision tree, we formulate a Boolean optimisation problem, which has been previously studied for circuit design but was limited to a small number of binary variables. In this work, a novel optimisation method is proposed for, firstly, several tens of variables (i.e. the weak learners of a boosting classifier), and then for any larger number of weak learners by using a two-stage cascade. Experiments on synthetic and face image data sets show that the obtained tree achieves a significant speed-up, at the same accuracy, over both a standard boosting classifier and Fast-exit, a previously described method for speeding up boosting classification. The proposed method, as a general meta-algorithm, is also useful for a boosting cascade, where it speeds up individual stage classifiers by different gains. The proposed method is further demonstrated on fast-moving object tracking and segmentation problems. © 2011 Springer Science+Business Media, LLC
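To make the "short paths" motivation concrete, here is a sketch of the early-exit principle behind fast boosting evaluation. Note this illustrates the Fast-exit baseline the paper compares against, not the paper's Boolean-optimised tree itself; all names are illustrative:

```python
# Early-exit evaluation of a boosting classifier: stop summing weak
# learners once the remaining total weight can no longer flip the sign
# of the partial score. This is the idea behind the Fast-exit baseline.

def fast_exit_classify(x, weak_learners, weights):
    """weak_learners: callables returning +1/-1; weights: boosting alphas."""
    remaining = sum(weights)
    score = 0.0
    for h, w in zip(weak_learners, weights):
        score += w * h(x)
        remaining -= w
        if abs(score) > remaining:  # sign can no longer change: exit early
            break
    return 1 if score >= 0 else -1

# Toy ensemble: three weighted decision stumps on a scalar input.
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 1 else -1,
          lambda x: 1 if x > 2 else -1]
alphas = [2.0, 1.0, 0.5]

print(fast_exit_classify(5.0, stumps, alphas))   # -> 1  (exits after 1 stump)
print(fast_exit_classify(-5.0, stumps, alphas))  # -> -1 (exits after 1 stump)
```

Confident inputs exit after very few weak learners, while ambiguous inputs near the decision boundary traverse the full ensemble; the paper's tree conversion generalises this by giving the classifier many such short paths by construction.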

    MoT - Mixture of trees probabilistic graphical model for video segmentation

    We present a novel mixture of trees (MoT) graphical model for video segmentation. Each component in this mixture represents a tree-structured temporal linkage between super-pixels from the first to the last frame of a video sequence. Our time-series model explicitly captures the uncertainty in temporal linkage between adjacent frames, which improves segmentation accuracy. We provide a variational inference scheme for this model to estimate super-pixel labels and their confidences in near real time. The efficacy of our approach is demonstrated via quantitative comparisons on the challenging SegTrack joint segmentation and tracking dataset [23].

    Semantic localisation via globally unique instance segmentation

    © 2018. The copyright of this document resides with its authors. In this work we propose a novel approach to semantic localisation. Our work is motivated by the need for environment perception techniques which not only perform self-localisation within a map but also simultaneously recognise surrounding objects. Such capabilities are crucial for computer vision applications which interact with the environment: autonomous driving, augmented reality or robotics. In order to achieve this goal we propose a solution which consists of three key steps. Firstly, a database of panoramic RGB images and corresponding globally unique, per-pixel object instance labels is built for the desired environment, where we typically consider objects from static categories such as "building" or "tree". Secondly, a semantic segmentation network capable of predicting more than 3000 labels is trained on the collected data. Finally, for a given panoramic query image, the corresponding instance label image predicted by the network is used for semantic matching within the database. The matching is performed in two stages: (i) a fast retrieval of a small subset of database images (~100) with highly overlapping instance label histograms, followed by (ii) an explicit approximate 3-DoF (yaw, pitch, roll) alignment of the selected subset of images and the query image. We evaluate our approach in challenging indoor and outdoor navigation scenarios, achieving better or similar performance when compared to state-of-the-art image retrieval-based localisation approaches using key-point matching [29, 63] and image-level embedding [3]. Our contributions include: (i) a description of a novel semantic localisation approach using globally unique instance segmentation, (ii) corresponding quantitative and qualitative analysis and (iii) a novel CamVid-360 dataset containing 986 labelled instances of buildings, trees, road signs and poles.
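Stage (i) of the matching above, fast retrieval by overlapping instance-label histograms, can be sketched with histogram intersection over per-pixel label counts. The data layout and names below are hypothetical toy stand-ins for the real panoramic label images:

```python
# Sketch of retrieval by instance-label histogram overlap: each image is
# reduced to a histogram of its globally unique instance labels, and
# database images are ranked by histogram intersection with the query.
from collections import Counter

def histogram_intersection(h1, h2):
    # Sum of per-label minima: how many labelled pixels the two share.
    return sum(min(h1[k], h2.get(k, 0)) for k in h1)

def retrieve(query_labels, database, top_k=2):
    q = Counter(query_labels)
    scored = [(histogram_intersection(q, Counter(labels)), name)
              for name, labels in database.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy database: per-pixel instance labels for three tiny "images".
db = {
    "img_a": [1, 1, 2, 3],
    "img_b": [4, 4, 4, 5],
    "img_c": [1, 4, 4, 3],
}
print(retrieve([1, 2, 3, 3], db, top_k=2))  # -> ['img_a', 'img_c']
```

Because the labels are globally unique rather than category-level, a histogram already carries strong location information, which is what makes this cheap first stage selective enough to hand only ~100 candidates to the explicit 3-DoF alignment.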

    Semi-Supervised Video Segmentation Using Tree Structured Graphical Models

    We present a novel, implementation-friendly and occlusion-aware semi-supervised video segmentation algorithm using tree-structured graphical models, which delivers pixel labels along with their uncertainty estimates. Our motivation for employing supervision is to tackle a task-specific segmentation problem where the semantic objects are pre-defined by the user. The video model we propose for this problem is based on a tree-structured approximation of a patch-based undirected mixture model, which includes a novel time-series component and a soft-label Random Forest classifier participating in a feedback mechanism. We demonstrate the efficacy of our model on cutting out foreground objects and on multi-class segmentation problems in lengthy and complex road scene sequences. Our results have wide applicability, including harvesting labelled video data for training discriminative models, shape/pose/articulation learning and large-scale statistical analysis to develop priors for video segmentation. © 2011 IEEE