Cascaded Scene Flow Prediction using Semantic Segmentation
Given two consecutive frames from a pair of stereo cameras, 3D scene flow
methods simultaneously estimate the 3D geometry and motion of the observed
scene. Many existing approaches use superpixels for regularization, but may
predict inconsistent shapes and motions inside rigidly moving objects. We
instead assume that scenes consist of foreground objects rigidly moving in
front of a static background, and use semantic cues to produce pixel-accurate
scene flow estimates. Our cascaded classification framework accurately models
3D scenes by iteratively refining semantic segmentation masks, stereo
correspondences, 3D rigid motion estimates, and optical flow fields. We
evaluate our method on the challenging KITTI autonomous driving benchmark, and
show that accounting for the motion of segmented vehicles leads to
state-of-the-art performance.
Comment: International Conference on 3D Vision (3DV), 2017 (oral presentation).
Cross Modal Distillation for Flood Extent Mapping
The increasing intensity and frequency of floods is one of the many
consequences of our changing climate. In this work, we explore ML techniques
that improve the flood detection module of an operational early flood warning
system. Our method exploits an unlabelled dataset of paired multi-spectral and
Synthetic Aperture Radar (SAR) imagery to reduce the labeling requirements of a
purely supervised learning method. Prior works have used unlabelled data by
creating weak labels from it. However, our experiments showed that such a
model still ends up learning the mistakes present in those weak labels.
Motivated by knowledge distillation and semi-supervised learning, we explore
the use of a teacher to train a student with the help of a small hand-labelled
dataset and a large unlabelled dataset. Unlike the conventional
self-distillation setup, we propose a cross-modal distillation framework that
transfers supervision from a teacher trained on a richer modality
(multi-spectral images) to a student model trained on SAR imagery. The trained
models are then
tested on the Sen1Floods11 dataset. Our model outperforms the Sen1Floods11
baseline model trained on weakly labeled SAR imagery by an absolute margin of
6.53% Intersection-over-Union (IoU) on the test split.
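The teacher-student transfer described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the softmax cross-entropy form of the distillation term, and the mixing weight `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    # targets may be hard one-hot labels or soft teacher probabilities
    return -(targets * np.log(probs + eps)).sum(axis=-1).mean()

def distillation_loss(student_logits, teacher_logits, labels_onehot=None, alpha=0.5):
    """Soft supervision from a multi-spectral teacher for a SAR student,
    optionally mixed with hard labels from the small hand-labelled subset.
    alpha is an illustrative trade-off weight, not a value from the paper."""
    soft = cross_entropy(softmax(student_logits), softmax(teacher_logits))
    if labels_onehot is None:
        return soft
    hard = cross_entropy(softmax(student_logits), labels_onehot)
    return alpha * soft + (1.0 - alpha) * hard
```

On the large unlabelled set only the `soft` term applies; on the hand-labelled subset both terms contribute.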
Towards Label-free Scene Understanding by Vision Foundation Models
Vision foundation models such as Contrastive Vision-Language Pre-training
(CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot
performance on image classification and segmentation tasks. However, the
incorporation of CLIP and SAM for label-free scene understanding has yet to be
explored. In this paper, we investigate the potential of vision foundation
models in enabling networks to comprehend 2D and 3D worlds without labelled
data. The primary challenge lies in effectively supervising networks under
extremely noisy pseudo labels, which are generated by CLIP and further
exacerbated during the propagation from the 2D to the 3D domain. To tackle
these challenges, we propose a novel Cross-modality Noisy Supervision (CNS)
method that leverages the strengths of CLIP and SAM to supervise 2D and 3D
networks simultaneously. In particular, we introduce a prediction consistency
regularization to co-train 2D and 3D networks, then further enforce
consistency in the networks' latent spaces using SAM's robust feature
representation. Experiments conducted on diverse indoor and outdoor datasets
demonstrate the superior performance of our method in understanding 2D and 3D
open environments. Our 2D and 3D networks achieve label-free semantic
segmentation with 28.4% and 33.5% mIoU on ScanNet, improvements of 4.7% and
7.9%, respectively. On the nuScenes dataset, our method reaches 26.8% mIoU, an
improvement of 6%. Code will be released at
https://github.com/runnanchen/Label-Free-Scene-Understanding.
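The prediction-consistency regularization between the co-trained 2D and 3D networks could take a form like the following sketch, where a symmetric KL divergence is computed over points with known 2D-3D correspondences. The symmetric-KL choice is an assumption for illustration; the paper's exact regularizer may differ.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) per sample, with a small epsilon for numerical safety
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

def consistency_loss(logits_2d, logits_3d):
    """Symmetric KL between the 2D and 3D networks' class predictions for
    corresponding points: zero when the two networks agree exactly."""
    p2, p3 = softmax(logits_2d), softmax(logits_3d)
    return 0.5 * (kl(p2, p3) + kl(p3, p2)).mean()
```

Minimizing this term pulls the two networks toward agreement even when neither has clean labels, which is the co-training effect the abstract describes.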
Facial micro-expression recognition with noisy labels
Abstract. Facial micro-expressions are quick, involuntary, and low-intensity facial movements. Interest in detecting and recognizing micro-expressions arises from the fact that they reveal a person's genuine hidden emotions. The small and rapid facial muscle movements are often difficult for a human not only to spot but also to recognize correctly as an emotion. Recent work on improving micro-expression recognition has focused on models and architectures. We instead take a step back and go to the root of the task: the data.
We thoroughly analyze the input data and notice that some of it is noisy and possibly mislabelled. The authors of the micro-expression datasets have themselves acknowledged possible problems in the data labelling. Despite this, to the best of our knowledge, no attempts have been made to design micro-expression recognition models that account for potentially mislabelled data. In this thesis, we explore new methods that take noisy labels into special account in an attempt to solve the problem. We propose a simple yet efficient label refurbishing method and a data cleaning method for handling noisy labels. We show through both quantitative and qualitative analysis the effectiveness of the methods for detecting noisy samples. The data cleaning method achieves state-of-the-art results, reaching an F1-score of 0.77 on the MEGC2019 composite dataset. Further, we analyze and discuss the results in depth and suggest future work based on our findings.
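The two ideas named above, label refurbishing and data cleaning, admit a minimal sketch along the following lines. The trust weight `beta`, the confidence `threshold`, and both function names are hypothetical choices for illustration, not the thesis's actual method.

```python
import numpy as np

def refurbish_labels(given_onehot, pred_probs, beta=0.7):
    """Label refurbishing: each training target becomes a convex combination
    of the (possibly noisy) annotated label and the model's own prediction,
    softening targets the model disagrees with."""
    return beta * given_onehot + (1.0 - beta) * pred_probs

def flag_noisy(given_onehot, pred_probs, threshold=0.9):
    """Data cleaning rule: flag a sample as likely mislabelled when the model
    confidently predicts a different class than the annotation."""
    pred_cls = pred_probs.argmax(axis=-1)
    given_cls = given_onehot.argmax(axis=-1)
    confident = pred_probs.max(axis=-1) >= threshold
    return confident & (pred_cls != given_cls)
```

Flagged samples can then be removed or relabelled before the final training run, which is the cleaning step the abstract credits with the 0.77 F1-score.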
LiveCap: Real-time Human Performance Capture from Monocular Video
We present the first real-time human performance capture approach that
reconstructs dense, space-time coherent deforming geometry of entire humans in
general everyday clothing from just a single RGB video. We propose a novel
two-stage analysis-by-synthesis optimization whose formulation and
implementation are designed for high performance. In the first stage, a skinned
template model is jointly fitted to background subtracted input video, 2D and
3D skeleton joint positions found using a deep neural network, and a set of
sparse facial landmark detections. In the second stage, dense non-rigid 3D
deformations of skin and even loose apparel are captured based on a novel
real-time capable algorithm for non-rigid tracking using dense photometric and
silhouette constraints. Our novel energy formulation leverages automatically
identified material regions on the template to model the differing non-rigid
deformation behavior of skin and apparel. The two resulting non-linear
optimization problems per frame are solved with specially tailored
data-parallel Gauss-Newton solvers. To achieve real-time performance of over
25 Hz, we design a pipelined parallel architecture using the CPU and two
commodity GPUs. Our method is the first real-time monocular approach for
full-body performance capture. Our method yields comparable accuracy with
off-line performance capture techniques, while being orders of magnitude
faster.
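The data-parallel Gauss-Newton solvers mentioned above repeat updates of the following form on the stacked residuals of the energy. This dense NumPy sketch only shows the mathematical step; the paper's solvers exploit sparsity and GPU parallelism, and the linear residual used in the example is purely illustrative.

```python
import numpy as np

def gauss_newton_step(residual, jacobian, x):
    """One Gauss-Newton update  x <- x - (J^T J)^{-1} J^T r  for a
    non-linear least-squares energy 0.5 * ||r(x)||^2. A real-time solver
    repeats such steps (typically a fixed small number per frame)."""
    r = residual(x)          # residual vector r(x)
    J = jacobian(x)          # Jacobian dr/dx at x
    delta = np.linalg.solve(J.T @ J, J.T @ r)  # normal-equations solve
    return x - delta
```

For a linear residual such as r(x) = x - target (Jacobian = identity), a single step lands exactly on the minimizer, which is a quick sanity check of the update rule.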