
    Deep face tracking and parsing in the wild

    Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks, including face detection, face tracking, face parsing, and face recognition. Recently, performance on all of these tasks has improved significantly thanks to Deep Convolutional Neural Networks (DCNNs). However, existing face analysis algorithms mainly target facial images captured in constrained laboratory environments, and their performance on real-world images remains less explored. Compared with the lab environment, in-the-wild settings involve far greater diversity in face size, pose, facial expression, background clutter, lighting conditions and imaging quality. This thesis investigates two fundamental face analysis tasks under in-the-wild settings: face tracking and face parsing. Both serve as important prerequisites for downstream face analysis applications. However, in-the-wild datasets remain scarce in both fields, and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge this data gap, evaluate existing methods in these settings, and develop accurate, robust and efficient deep learning-based methods for the two tasks.

    For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, which consists of 80 videos captured by mobile phones during mobile live-streaming. The live-streaming environment is fully unconstrained, and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace presents unique challenges for face tracking in the wild and cannot be readily solved by existing methods. Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods on face tracking in the wild.

    For face parsing in the wild, we introduce the first large-scale in-the-wild face parsing dataset, iBugMask, which contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in fully unconstrained environments and are not cropped or preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods, and the results show that iBugMask is extremely challenging for all of them. Through rigorous benchmarking, we observe that pre-processing facial images with bounding boxes introduces bias into face parsing in the wild. When cropping the face with a bounding box, a cropping margin has to be hand-picked; if face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters have to be chosen carefully and can have a significant impact on face parsing performance. To address this, we propose the Region-of-Interest (RoI) Tanh-polar transform, which warps the whole image into a fixed-size representation guided by the bounding box of the target face. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that, when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar transformer Network achieves state-of-the-art results on face parsing in the wild.

    This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading. The proposed RoI Tanh-polar transform also provides a new perspective on how to preprocess face images and makes DCNNs truly end-to-end for real-world face analysis applications.
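    The RoI Tanh-polar transform described above is, at its core, a differentiable resampling of the full image onto a fixed-size polar grid centred on the target face, with a tanh compression of the radius so that the face region keeps most of the output resolution. The snippet below is a minimal sketch of such a warp, assuming a single image, a circular RoI given as (centre, radius), and an isotropic tanh radial mapping; the function name and the simplifications are illustrative and not the thesis implementation, which uses a more elaborate formulation.

```python
import math
import torch
import torch.nn.functional as F

def roi_tanh_polar_warp(image, roi, out_h=256, out_w=256):
    # image: (1, C, H, W); roi: (cx, cy, radius) in pixels (illustrative circular RoI).
    _, _, H, W = image.shape
    cx, cy, radius = roi

    # Output grid: rows index the angle, columns a tanh-compressed radius in [0, 1).
    theta = torch.linspace(0.0, 2.0 * math.pi, out_h)
    q = torch.linspace(0.0, 1.0, out_w + 1)[:-1]        # drop 1.0 to avoid atanh(1) = inf
    theta, q = torch.meshgrid(theta, q, indexing="ij")

    # Invert the tanh compression: q = tanh(r / radius)  =>  r = radius * atanh(q).
    r = radius * torch.atanh(q)

    # Polar -> Cartesian source coordinates, normalised to [-1, 1] for grid_sample.
    xs = (cx + r * torch.cos(theta)) / (W - 1) * 2.0 - 1.0
    ys = (cy + r * torch.sin(theta)) / (H - 1) * 2.0 - 1.0
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, out_h, out_w, 2)

    # Bilinear sampling keeps the whole warp differentiable w.r.t. the input image.
    return F.grid_sample(image, grid, align_corners=True)
```

    Because the radius is compressed with tanh, the RoI fills most of the output while the surrounding context is squeezed towards the border, so no hard crop margin has to be hand-picked.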

    Methods for Estimating 3D Human Pose and Location

    The goal of 3D human pose estimation is to estimate the spatial coordinates of a person's main body keypoints from an image or a video. The coordinates are given in a suitable camera-centred coordinate system. The problem has numerous potential applications, including motion capture, behaviour analytics and sports analytics. After the resurgence of neural networks, the field advanced rapidly and error rates decreased steadily. However, this progress was largely confined to videos recorded in studios; on footage captured in natural conditions the error could be as much as twice as large. The reason is that building 3D pose datasets with accurate measurements is very difficult and requires special equipment that only works indoors. As a consequence, relatively few outdoor recordings exist, and they are limited in terms of backgrounds, camera angles and depicted motions. Considering how much data a typical deep network needs for training, further architectural improvements are required that reduce the demand for training data. Another problem stems from a simplification of the task: most algorithms estimate the coordinates only relative to the hip, ignoring the person's location in space. For footage containing a single person this information may not be needed, but with multiple actors their positions relative to one another also matter.

    In this dissertation I present four methods that aim to address these problems. The first algorithm tackles the small number of cameras in the training data, which can lead to overfitting. I introduce a Siamese network architecture that learns an equivariant embedding. Thanks to the equivariance, we obtain more accurate results on novel camera angles, even without augmentation. The second method improves on the naive location-estimation algorithms previously available in the literature. These use a PnP (Perspective-n-Point) based approach, which requires accurate 2D and 3D estimates; if either is inaccurate, the error in the result can multiply. By estimating the location of the pose directly, more stable results can be obtained. The next method uses RGB-D videos (which also contain depth) as a weak supervision signal. The auxiliary dataset improves the results considerably, especially for localization. Finally, I address short-term occlusions occurring in videos. Video-based methods produce poor results even under brief occlusions, even when the pose could be inferred from the neighbouring frames. The proposed method can be attached after an arbitrary pose estimator as an extra refinement step. The effectiveness of the methods is demonstrated with detailed quantitative experiments.
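    For context, the naive location estimation that the second method improves on can be written as a small linear least-squares problem: with a pinhole camera, the unknown translation of the root joint appears linearly once the projection equations are rearranged, so errors in either the 2D detections or the root-relative 3D pose propagate directly into the recovered location. The sketch below is a generic illustration of that baseline, assuming a known focal length and principal point; the names and the exact formulation are not taken from the dissertation.

```python
import numpy as np

def absolute_location_least_squares(pose3d_rel, keypoints2d, f, cx, cy):
    """Recover the camera-space translation t of a root-relative 3D pose.

    Pinhole model: x_i = f * (X_i + tx) / (Z_i + tz) + cx (and analogously for y),
    which rearranges into equations linear in t = (tx, ty, tz).

    pose3d_rel:  (N, 3) root-relative joint coordinates (e.g. metres).
    keypoints2d: (N, 2) detected joint positions in pixels.
    """
    X, Y, Z = pose3d_rel.T
    u = keypoints2d[:, 0] - cx
    v = keypoints2d[:, 1] - cy

    # Each joint contributes two linear equations in (tx, ty, tz):
    #   f*tx        - u_i*tz = u_i*Z_i - f*X_i
    #        f*ty   - v_i*tz = v_i*Z_i - f*Y_i
    n = len(X)
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    A[0::2, 0] = f; A[0::2, 2] = -u; b[0::2] = u * Z - f * X
    A[1::2, 1] = f; A[1::2, 2] = -v; b[1::2] = v * Z - f * Y

    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t  # (tx, ty, tz): absolute location of the root joint
```

    In contrast, the dissertation's second method regresses the location directly, which, as the abstract notes, yields more stable results when the intermediate 2D or 3D estimates are noisy.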

    LiDAR-Based Place Recognition For Autonomous Driving: A Survey

    LiDAR-based place recognition (LPR) plays a pivotal role in autonomous driving, which assists Simultaneous Localization and Mapping (SLAM) systems in reducing accumulated errors and achieving reliable localization. However, existing reviews predominantly concentrate on visual place recognition (VPR) methods. Despite the recent remarkable progress in LPR, to the best of our knowledge, there is no dedicated systematic review in this area. This paper bridges the gap by providing a comprehensive review of place recognition methods employing LiDAR sensors, thus facilitating and encouraging further research. We commence by delving into the problem formulation of place recognition, exploring existing challenges, and describing relations to previous surveys. Subsequently, we conduct an in-depth review of related research, which offers detailed classifications, strengths and weaknesses, and architectures. Finally, we summarize existing datasets, commonly used evaluation metrics, and comprehensive evaluation results from various methods on public datasets. This paper can serve as a valuable tutorial for newcomers entering the field of place recognition and for researchers interested in long-term robot localization. We pledge to maintain an up-to-date project on our website https://github.com/ShiPC-AI/LPR-Survey. (26 pages, 13 figures, 5 tables)

    Unsupervised Learning of Landmarks by Descriptor Vector Exchange

    Equivariance to random image transformations is an effective method for learning landmarks of object categories, such as the eyes and the nose in faces, without manual supervision. However, this method does not explicitly guarantee that the learned landmarks are consistent with changes between different instances of the same object, such as different facial identities. In this paper, we develop a new perspective on the equivariance approach by noting that dense landmark detectors can be interpreted as local image descriptors equipped with invariance to intra-category variations. We then propose a direct method to enforce such an invariance in the standard equivariant loss. We do so by exchanging descriptor vectors between images of different object instances prior to matching them geometrically. In this manner, the same vectors must work regardless of the specific object identity considered. We use this approach to learn vectors that can simultaneously be interpreted as local descriptors and dense landmarks, combining the advantages of both. Experiments on standard benchmarks show that this approach can match, and in some cases surpass, state-of-the-art performance amongst existing methods that learn landmarks without supervision. Code is available at www.robots.ox.ac.uk/~vgg/research/DVE/. (ICCV 2019)
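    A rough sketch of the exchange idea: descriptors sampled from image A are first re-expressed as similarity-weighted mixtures of descriptors from an auxiliary image B showing a different instance, and only then matched against a synthetically warped copy of A, so the learned vectors must transfer across identities. The snippet below is a simplified illustration under assumed tensor shapes and a cross-entropy matching loss; the paper's actual objective uses a softer probabilistic matching, so treat this as a sketch rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def dve_style_loss(feat_a, feat_a_warp, feat_b, coords_a, coords_a_in_warp, temp=0.1):
    """feat_a, feat_a_warp, feat_b: (C, H, W) L2-normalised dense descriptor maps
    for image A, a warped copy of A, and an auxiliary image B of a different instance.
    coords_a / coords_a_in_warp: (N, 2) integer (x, y) pixel correspondences known
    from the synthetic warp."""
    C, H, W = feat_a.shape

    # Descriptors of A at the sampled locations: (N, C).
    d_a = feat_a[:, coords_a[:, 1], coords_a[:, 0]].T

    # Exchange step: rewrite each descriptor as a similarity-weighted mixture of B's.
    f_b = feat_b.reshape(C, -1)                      # (C, H*W)
    attn = F.softmax(d_a @ f_b / temp, dim=1)        # (N, H*W)
    d_a_via_b = F.normalize(attn @ f_b.T, dim=1)     # (N, C) exchanged vectors

    # Match the exchanged vectors against every position of the warped copy...
    f_w = feat_a_warp.reshape(C, -1)                 # (C, H*W)
    logits = d_a_via_b @ f_w / temp                  # (N, H*W)

    # ...and require the best match to be the known corresponding pixel.
    target = coords_a_in_warp[:, 1] * W + coords_a_in_warp[:, 0]
    return F.cross_entropy(logits, target.long())
```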

    Enhanced Capsule-based Networks and Their Applications

    Current deep models have achieved human-like accuracy in many computer vision tasks, sometimes even defeating humans. However, these deep models still suffer from significant weaknesses: to name a few, it is hard to interpret how they reach decisions, and it is easy to attack them with tiny perturbations. A capsule, usually implemented as a vector, represents an object or object part. Capsule networks and GLOM consist of classic and generalized capsules respectively, the difference being whether a capsule is limited to representing a fixed thing. Both models are designed to parse their input into a part-whole hierarchy, as humans do, where each capsule corresponds to an entity of the hierarchy. That is, the first layer finds the lowest-level visual patterns, and the following layers assemble ever larger patterns up to the entire object, e.g., from nostril to nose, face, and person. This design gives capsule networks and GLOM the potential to address the above problems of current deep models, by mimicking how humans overcome them with the part-whole hierarchy. However, current implementations do not yet fulfil this potential and require further improvements, including intrinsic interpretability, guaranteed equivariance, robustness to adversarial attacks, a more efficient routing algorithm, compatibility with other models, etc. In this dissertation, I first briefly introduce the motivations, essential ideas, and existing implementations of capsule networks and GLOM, then focus on addressing some limitations of these implementations. The improvements are summarized as follows. First, a fast non-iterative routing algorithm is proposed for capsule networks, which facilitates their application to many tasks such as image classification and segmentation. Second, a new architecture, named Twin-Islands, is proposed based on GLOM, which achieves many desired properties such as equivariance, model interpretability, and adversarial robustness. Lastly, the essential idea of capsule networks and GLOM is re-implemented in a small group ensemble block, which can also be used together with other types of neural networks, e.g., CNNs, on various tasks such as image classification, segmentation, and retrieval.
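    To make the routing discussion concrete, the snippet below shows the classic iterative "routing-by-agreement" of the original capsule networks (Sabour et al., 2017), which is the kind of procedure a fast non-iterative routing replaces; it is included as background only and is not the dissertation's algorithm.

```python
import torch

def dynamic_routing(u_hat, num_iters=3):
    """Iterative routing-by-agreement.

    u_hat: (B, n_in, n_out, D) prediction vectors - each lower-level capsule's
    'vote' for each higher-level capsule.
    Returns: (B, n_out, D) higher-level capsule outputs.
    """
    B, n_in, n_out, D = u_hat.shape
    b = torch.zeros(B, n_in, n_out, device=u_hat.device)     # routing logits

    for _ in range(num_iters):
        c = torch.softmax(b, dim=2)                           # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum of votes
        # squash(): keep the direction, shrink the length into [0, 1)
        norm2 = (s ** 2).sum(dim=-1, keepdim=True)
        v = (norm2 / (1 + norm2)) * s / (norm2.sqrt() + 1e-8)
        # increase logits where votes agree with the current higher-level output
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)

    return v
```

    Each iteration re-weights lower-level capsules towards the higher-level capsules whose current outputs agree with their votes; replacing this loop with a single feed-forward step is what makes a non-iterative routing scheme fast.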