Deep face tracking and parsing in the wild
Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks including face detection, face tracking, face parsing, and face recognition. Recently, the performance of methods in all tasks has significantly improved thanks to the employment of Deep Convolutional Neural Networks (DCNNs).
However, existing face analysis algorithms mainly focus on facial images captured in constrained laboratory environments, and their performance on real-world images remains less explored. Compared with the lab environment, in-the-wild settings involve greater diversity in face sizes, poses, facial expressions, background clutter, lighting conditions and imaging quality.
This thesis investigates two fundamental tasks in face analysis under in-the-wild settings: face tracking and face parsing. Both tasks serve as important prerequisites for downstream face analysis applications. However, in-the-wild datasets remain scarce in both fields and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge this data gap, evaluate existing methods in these settings, and develop accurate, robust and efficient deep learning-based methods for the two tasks.
For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, which consists of 80 videos captured by mobile phones during mobile live-streaming. The environment of the live-streaming performance is fully unconstrained and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace presents unique challenges for in-the-wild face tracking that cannot be readily solved by existing methods. Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods in face tracking in the wild.
For face parsing in the wild, we introduce the first large-scale in-the-wild face dataset, iBugMask, which contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in fully unconstrained environments and have not been cropped or preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods and the results show that iBugMask is extremely challenging for all methods.
Through rigorous benchmarking, we observe that pre-processing facial images with bounding boxes introduces bias in face parsing in the wild. When cropping the face with a bounding box, a cropping margin has to be hand-picked. If face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters have to be carefully considered and can have a significant impact on face parsing performance.
To solve this, we propose the Region-of-Interest (RoI) Tanh-polar transform, which warps the whole image to a fixed-size representation. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that, when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar Transformer Network achieves state-of-the-art results on face parsing in the wild.
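Such a warp can be sketched as inverse sampling on a grid whose radial axis is tanh-compressed, so a fixed-size output covers the entire input image. The following is a minimal, hypothetical single-channel re-implementation (the function name, output size and ROI parametrisation are illustrative, not the thesis code):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def roi_tanh_polar_warp(image, roi, out_h=64, out_w=64):
    """Warp the whole image into a fixed-size tanh-polar grid centred on the ROI.

    roi = (cx, cy, rx, ry): centre and radii of the face bounding box.
    Illustrative sketch, not the authors' exact transform.
    """
    cx, cy, rx, ry = roi
    # Target grid: rows = normalised radius in [0, 1), cols = angle in [0, 2*pi)
    u = (np.arange(out_h) + 0.5) / out_h
    v = (np.arange(out_w) + 0.5) / out_w * 2 * np.pi
    uu, vv = np.meshgrid(u, v, indexing="ij")
    # Inverse tanh maps the bounded radius back to an unbounded one, so the
    # outer rows of the output reach arbitrarily far into the input image.
    r = np.arctanh(np.clip(uu, 0.0, 1.0 - 1e-6))
    xs = cx + r * rx * np.cos(vv)
    ys = cy + r * ry * np.sin(vv)
    # Bilinear sampling; points falling outside the image read as 0.
    return map_coordinates(image, [ys, xs], order=1, cval=0.0)
```

Because every output pixel is a bilinear sample of the input, the warp is differentiable with respect to the image, which is what lets it sit inside an end-to-end trained network.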
This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading. The proposed RoI Tanh-polar transform also provides a new perspective on how to preprocess face images and make DCNNs truly end-to-end for real-world face analysis applications.
Methods for Estimating 3D Human Pose and Location
Abstract
The goal of 3D human pose estimation is to estimate the spatial coordinates of a person's main key points from an image or video. The coordinates are given in a suitable camera-centred coordinate system. The problem has many potential applications, including motion capture, behaviour analytics and sports analytics.
After the resurgence of neural networks, the field went through rapid development and error rates fell continuously. However, this was mostly limited to videos recorded in studios; on footage captured in natural conditions, the error could be up to twice as large.
The reason is that building 3D pose datasets with accurate measurements is very difficult and requires special equipment that only works indoors. As a result, there are relatively few outdoor recordings, and these are limited in terms of backgrounds, camera angles and depicted motions. Considering how much data is needed to train a typical deep network, further architectural improvements are needed that reduce the demand for training data.
Another problem stems from a simplification of the task: most algorithms estimate the coordinates only relative to the hip, disregarding the person's position in space. For a recording containing a single person this information may not be necessary, but with multiple actors their positions relative to one another also matter.
In this dissertation I present four methods that aim to address these problems. The first algorithm tackles the small number of cameras in the training data, which can lead to overfitting. I introduce a network with a Siamese architecture that learns an equivariant embedding. With the help of equivariance, we obtain more accurate results on new camera angles, even in the absence of augmentation.
The second method improves on the naive location-estimation algorithms previously available in the literature. These use a PnP (Perspective-n-Point) based approach, which requires accurate 2D and 3D estimates. If either is inaccurate, the error can be multiplied in the result. By estimating the location of the pose directly, we can obtain more stable results.
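The error propagation that motivates direct location estimation can be illustrated with a pinhole-camera sketch (all numbers below are hypothetical, and the size-ratio inversion is a simplified stand-in for a full PnP solve):

```python
import numpy as np

# Pinhole model: a person of true height S (metres) standing at depth z
# projects to roughly s = f * S / z pixels on the image plane.
f = 1000.0       # focal length in pixels (hypothetical)
S_true = 1.7     # assumed true body height in metres
z_true = 5.0     # true depth in metres
s_true = f * S_true / z_true   # ideal projected height: 340 px

# A naive PnP-style localiser inverts the ratio: z = f * S / s.
# A 5% underestimate of the 2D height ...
s_est = 0.95 * s_true
z_est = f * S_true / s_est
rel_err = abs(z_est - z_true) / z_true
# ... already produces a ~5.3% depth error, and any error in the assumed
# 3D size S multiplies on top of it.
print(round(rel_err, 4))  # → 0.0526
```

Because the 2D and 3D errors enter the ratio multiplicatively, a regressor that predicts the location directly can avoid this compounding.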
The next method aims to use RGBD videos (which also contain depth recordings) as a weak supervisory signal. The auxiliary dataset significantly improves the results, especially for localisation.
Finally, I address short-term occlusions occurring in videos. Video-based methods give poor results even under brief occlusion, even when the pose could be inferred from the neighbouring frames. The proposed method can be placed after an arbitrary pose estimator as an extra accuracy-refinement step.
I demonstrate the strength of these methods with detailed quantitative experiments.
Embodied learning for visual recognition
The field of visual recognition in recent years has come to rely on large, expensively curated, and manually labeled "bags of disembodied images". In the wake of this, my focus has been on understanding and exploiting alternate "free" sources of supervision available to visual learning agents that are situated within real environments. For example, even simply moving from orderless image collections to continuous visual observations offers opportunities to understand the dynamics and other physical properties of the visual world. Further, embodied agents may have the abilities to move around their environment and/or effect changes within it, in which case these abilities offer new means to acquire useful supervision. In this dissertation, I present my work along this and related directions.
LiDAR-Based Place Recognition For Autonomous Driving: A Survey
LiDAR-based place recognition (LPR) plays a pivotal role in autonomous
driving, which assists Simultaneous Localization and Mapping (SLAM) systems in
reducing accumulated errors and achieving reliable localization. However,
existing reviews predominantly concentrate on visual place recognition (VPR)
methods. Despite the recent remarkable progress in LPR, to the best of our
knowledge, there is no dedicated systematic review in this area. This paper
bridges the gap by providing a comprehensive review of place recognition
methods employing LiDAR sensors, thus facilitating and encouraging further
research. We commence by delving into the problem formulation of place
recognition, exploring existing challenges, and describing relations to
previous surveys. Subsequently, we conduct an in-depth review of related
research, which offers detailed classifications, strengths and weaknesses, and
architectures. Finally, we summarize existing datasets, commonly used
evaluation metrics, and comprehensive evaluation results from various methods
on public datasets. This paper can serve as a valuable tutorial for newcomers
entering the field of place recognition and for researchers interested in
long-term robot localization. We pledge to maintain an up-to-date project on
our website: https://github.com/ShiPC-AI/LPR-Survey.
Unsupervised Learning of Landmarks by Descriptor Vector Exchange
Equivariance to random image transformations is an effective method to learn
landmarks of object categories, such as the eyes and the nose in faces, without
manual supervision. However, this method does not explicitly guarantee that the
learned landmarks are consistent with changes between different instances of
the same object, such as different facial identities. In this paper, we develop
a new perspective on the equivariance approach by noting that dense landmark
detectors can be interpreted as local image descriptors equipped with
invariance to intra-category variations. We then propose a direct method to
enforce such an invariance in the standard equivariant loss. We do so by
exchanging descriptor vectors between images of different object instances
prior to matching them geometrically. In this manner, the same vectors must
work regardless of the specific object identity considered. We use this
approach to learn vectors that can simultaneously be interpreted as local
descriptors and dense landmarks, combining the advantages of both. Experiments
on standard benchmarks show that this approach can match, and in some cases
surpass, state-of-the-art performance amongst existing methods that learn
landmarks without supervision. Code is available at
www.robots.ox.ac.uk/~vgg/research/DVE/.
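The exchange step can be sketched in a few lines: each dense descriptor of one image is replaced by a similarity-weighted mixture of descriptors from a different object instance before any geometric matching, so only identity-invariant vectors can succeed. A minimal, hypothetical illustration (not the paper's code; array sizes and the temperature are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def exchange(desc_a, desc_b, temp=0.05):
    """Replace each descriptor of image A with a softmax-weighted mixture of
    image B's descriptors (a different object instance).

    desc_a, desc_b: (num_locations, dim) L2-normalised dense descriptors.
    Illustrative sketch of the exchange idea, not the paper's exact loss.
    """
    sim = desc_a @ desc_b.T / temp                      # cosine similarities
    w = np.exp(sim - sim.max(axis=1, keepdims=True))    # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    mixed = w @ desc_b
    return mixed / np.linalg.norm(mixed, axis=1, keepdims=True)

# Toy check: if B holds the same descriptors as A in shuffled order, the
# exchanged vectors should recover A's originals almost exactly.
bank = rng.normal(size=(16, 32))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
mixed = exchange(bank, bank[rng.permutation(16)])
cos = (mixed * bank).sum(axis=1)   # close to 1 at every location
```

In training, `desc_b` would come from a different identity, so the soft matching only recovers the right vectors when the learned descriptors ignore instance-specific appearance.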
Enhanced Capsule-based Networks and Their Applications
Current deep models have achieved human-like accuracy in many computer vision tasks, sometimes even surpassing humans. However, these deep models still suffer from significant weaknesses. To name a few, it is hard to interpret how they reach decisions, and it is easy to attack them with tiny perturbations.
A capsule, usually implemented as a vector, represents an object or object part. Capsule networks and GLOM consist of classic and generalized capsules respectively, where the difference is whether the capsule is limited to representing a fixed thing. Both models are designed to parse their input into a part-whole hierarchy as humans do, where each capsule corresponds to an entity of the hierarchy. That is, the first layer finds the lowest-level vision patterns, and the following layers assemble larger patterns up to the entire object, e.g., from nostril to nose, face, and person.
This design gives capsule networks and GLOM the potential to solve the above problems of current deep models by mimicking how humans overcome them with the part-whole hierarchy. However, current implementations do not fully realise this potential and require further improvements, including intrinsic interpretability, guaranteed equivariance, robustness to adversarial attacks, a more efficient routing algorithm, and compatibility with other models.
In this dissertation, I first briefly introduce the motivations, essential ideas, and existing implementations of capsule networks and GLOM, then focus on addressing some limitations of these implementations. The improvements are briefly summarized as follows. First, a fast non-iterative routing algorithm is proposed for capsule networks, which facilitates their applications in many tasks such as image classification and segmentation. Second, a new architecture, named Twin-Islands, is proposed based on GLOM, which achieves many desired properties such as equivariance, model interpretability, and adversarial robustness. Lastly, the essential idea of capsule networks and GLOM is re-implemented in a small group ensemble block, which can also be used along with other types of neural networks, e.g., CNNs, on various tasks such as image classification, segmentation, and retrieval.
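As background for the capsule-as-vector idea, a capsule's activation is typically passed through the "squash" non-linearity from the original dynamic-routing formulation, so that the vector's length can be read as an existence probability while its direction encodes pose. A generic sketch (this is the standard formulation, not the dissertation's own non-iterative routing):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule 'squash' non-linearity: v = (|s|^2 / (1 + |s|^2)) * s / |s|.

    Keeps the vector's direction but maps its length into [0, 1), so a long
    input capsule signals a confidently present entity and a short one does not.
    """
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

caps = np.array([[3.0, 4.0],      # strong activation, |s| = 5
                 [0.01, 0.0]])    # weak activation
v = squash(caps)
lengths = np.linalg.norm(v, axis=-1)
# The long capsule squashes to length 25/26 ≈ 0.96, the tiny one to ≈ 0.
```

Routing algorithms, iterative or not, then decide how these squashed part capsules vote for whole capsules one layer up in the part-whole hierarchy.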