931 research outputs found

    Unmanned Aerial Systems for Wildland and Forest Fires

    Full text link
    Wildfires represent an important natural risk causing economic losses, human death and important environmental damage. In recent years, we witness an increase in fire intensity and frequency. Research has been conducted towards the development of dedicated solutions for wildland and forest fire assistance and fighting. Systems were proposed for the remote detection and tracking of fires. These systems have shown improvements in the area of efficient data collection and fire characterization within small scale environments. However, wildfires cover large areas making some of the proposed ground-based systems unsuitable for optimal coverage. To tackle this limitation, Unmanned Aerial Systems (UAS) were proposed. UAS have proven to be useful due to their maneuverability, allowing for the implementation of remote sensing, allocation strategies and task planning. They can provide a low-cost alternative for the prevention, detection and real-time support of firefighting. In this paper we review previous work related to the use of UAS in wildfires. Onboard sensor instruments, fire perception algorithms and coordination strategies are considered. In addition, we present some of the recent frameworks proposing the use of both aerial vehicles and Unmanned Ground Vehicles (UV) for a more efficient wildland firefighting strategy at a larger scale.Comment: A recent published version of this paper is available at: https://doi.org/10.3390/drones501001

    Going Deeper into Action Recognition: A Survey

    Full text link
    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis evolved from earlier schemes that are often limited to controlled environments to nowadays advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved more rapidly, eventually leading to the demise of what used to be good in a short time. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then, navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader

    Towards accurate multi-person pose estimation in the wild

    Get PDF
    In this thesis we are concerned with the problem of articulated human pose estimation and pose tracking in images and video sequences. Human pose estimation is a task of localising major joints of a human skeleton in natural images and is one of the most important visual recognition tasks in the scenes containing humans with numerous applications in robotics, virtual and augmented reality, gaming and healthcare among others. Articulated human pose tracking requires tracking multiple persons in the video sequence while simultaneously estimating full body poses. This task is important for analysing surveillance footage, activity recognition, sports analytics, etc. Most of the prior work focused on the pose estimation of single pre-localised humans whereas here we address a case with multiple people in real world images which entails several challenges such as person-person overlaps in highly crowded scenes, unknown number of people or people entering and leaving video sequences. The first contribution is a multi-person pose estimation algorithm based on the bottom-up detection-by-grouping paradigm. Unlike the widespread top-down approaches our method detects body joints and pairwise relations between them in a single forward pass of a convolutional neural network. Multi-person parsing is performed by optimizing a joint objective based on a multicut graph partitioning framework. Secondly, we extend our pose estimation approach to articulated multi-person pose tracking in videos. Our approach performs multi-target tracking and pose estimation in a holistic manner by optimising a single objective. We further simplify and refine the formulation which allows us to reach close to the real-time performance. Thirdly, we propose a large scale dataset and a benchmark for articulated multi-person tracking. It is the first dataset of video sequences comprising complex multi-person scenes and fully annotated tracks with 2D keypoints. Our fourth contribution is a method for estimating 3D body pose using on-body wearable cameras. Our approach uses a pair of downward facing, head-mounted cameras and captures an entire body. This egocentric approach is free of limitations of traditional setups with external cameras and can estimate body poses in very crowded environments. Our final contribution goes beyond human pose estimation and is in the field of deep learning of 3D object shapes. In particular, we address the case of reconstructing 3D objects from weak supervision. Our approach represents objects as 3D point clouds and is able to learn them with 2D supervision only and without requiring camera pose information at training time. We design a differentiable renderer of point clouds as well as a novel loss formulation for dealing with camera pose ambiguity.In dieser Arbeit behandeln wir das Problem der Schätzung und Verfolgung artikulierter menschlicher Posen in Bildern und Video-Sequenzen. Die Schätzung menschlicher Posen besteht darin die Hauptgelenke des menschlichen Skeletts in natürlichen Bildern zu lokalisieren und ist eine der wichtigsten Aufgaben der visuellen Erkennung in Szenen, die Menschen beinhalten. Sie hat zahlreiche Anwendungen in der Robotik, virtueller und erweiterter Realität, in Videospielen, in der Medizin und weiteren Bereichen. Die Verfolgung artikulierter menschlicher Posen erfordert die Verfolgung mehrerer Personen in einer Videosequenz bei gleichzeitiger Schätzung vollständiger Körperhaltungen. Diese Aufgabe ist besonders wichtig für die Analyse von Video-Überwachungsaufnahmen, Aktivitätenerkennung, digitale Sportanalyse etc. Die meisten vorherigen Arbeiten sind auf die Schätzung einzelner Posen vorlokalisierter Menschen fokussiert, wohingegen wir den Fall mehrerer Personen in natürlichen Aufnahmen betrachten. Dies bringt einige Herausforderungen mit sich, wie die Überlappung verschiedener Personen in dicht gedrängten Szenen, eine unbekannte Anzahl an Personen oder Personen die das Sichtfeld der Video-Sequenz verlassen oder betreten. Der erste Beitrag ist ein Algorithmus zur Schätzung der Posen mehrerer Personen, welcher auf dem Paradigma der Erkennung durch Gruppierung aufbaut. Im Gegensatz zu den verbreiteten Verfeinerungs-Ansätzen erkennt unsere Methode Körpergelenke and paarweise Beziehungen zwischen ihnen in einer einzelnen Vorwärtsrechnung eines faltenden neuronalen Netzwerkes. Die Gliederung in mehrere Personen erfolgt durch Optimierung einer gemeinsamen Zielfunktion, die auf dem Mehrfachschnitt-Problem in der Graphenzerlegung basiert. Zweitens erweitern wir unseren Ansatz zur Posen-Bestimmung auf das Verfolgen mehrerer Personen und deren Artikulation in Videos. Unser Ansatz führt eine Verfolgung mehrerer Ziele und die Schätzung der zugehörigen Posen in ganzheitlicher Weise durch, indem eine einzelne Zielfunktion optimiert wird. Desweiteren vereinfachen und verfeinern wir die Formulierung, was unsere Methode nah an Echtzeit-Leistung bringt. Drittens schlagen wir einen großen Datensatz und einen Bewertungsmaßstab für die Verfolgung mehrerer artikulierter Personen vor. Dies ist der erste Datensatz der Video-Sequenzen von komplexen Szenen mit mehreren Personen beinhaltet und deren Spuren komplett mit zwei-dimensionalen Markierungen der Schlüsselpunkte versehen sind. Unser vierter Beitrag ist eine Methode zur Schätzung von drei-dimensionalen Körperhaltungen mittels am Körper tragbarer Kameras. Unser Ansatz verwendet ein Paar nach unten gerichteter, am Kopf befestigter Kameras und erfasst den gesamten Körper. Dieser egozentrische Ansatz ist frei von jeglichen Limitierungen traditioneller Konfigurationen mit externen Kameras und kann Körperhaltungen in sehr dicht gedrängten Umgebungen bestimmen. Unser letzter Beitrag geht über die Schätzung menschlicher Posen hinaus in den Bereich des tiefen Lernens der Gestalt von drei-dimensionalen Objekten. Insbesondere befassen wir uns mit dem Fall drei-dimensionale Objekte unter schwacher Überwachung zu rekonstruieren. Unser Ansatz repräsentiert Objekte als drei-dimensionale Punktwolken and ist im Stande diese nur mittels zwei-dimensionaler Überwachung und ohne Informationen über die Kamera-Ausrichtung zur Trainingszeit zu lernen. Wir entwerfen einen differenzierbaren Renderer für Punktwolken sowie eine neue Formulierung um mit uneindeutigen Kamera-Ausrichtungen umzugehen

    Computational Modeling of Human Dorsal Pathway for Motion Processing

    Get PDF
    Reliable motion estimation in videos is of crucial importance for background iden- tification, object tracking, action recognition, event analysis, self-navigation, etc. Re- constructing the motion field in the 2D image plane is very challenging, due to variations in image quality, scene geometry, lighting condition, and most importantly, camera jit- tering. Traditional optical flow models assume consistent image brightness and smooth motion field, which are violated by unstable illumination and motion discontinuities that are common in real world videos. To recognize observer (or camera) motion robustly in complex, realistic scenarios, we propose a biologically-inspired motion estimation system to overcome issues posed by real world videos. The bottom-up model is inspired from the infrastructure as well as functionalities of human dorsal pathway, and the hierarchical processing stream can be divided into three stages: 1) spatio-temporal processing for local motion, 2) recogni- tion for global motion patterns (camera motion), and 3) preemptive estimation of object motion. To extract effective and meaningful motion features, we apply a series of steer- able, spatio-temporal filters to detect local motion at different speeds and directions, in a way that\u27s selective of motion velocity. The intermediate response maps are cal- ibrated and combined to estimate dense motion fields in local regions, and then, local motions along two orthogonal axes are aggregated for recognizing planar, radial and circular patterns of global motion. We evaluate the model with an extensive, realistic video database that collected by hand with a mobile device (iPad) and the video content varies in scene geometry, lighting condition, view perspective and depth. We achieved high quality result and demonstrated that this bottom-up model is capable of extracting high-level semantic knowledge regarding self motion in realistic scenes. Once the global motion is known, we segment objects from moving backgrounds by compensating for camera motion. For videos captured with non-stationary cam- eras, we consider global motion as a combination of camera motion (background) and object motion (foreground). To estimate foreground motion, we exploit corollary dis- charge mechanism of biological systems and estimate motion preemptively. Since back- ground motions for each pixel are collectively introduced by camera movements, we apply spatial-temporal averaging to estimate the background motion at pixel level, and the initial estimation of foreground motion is derived by comparing global motion and background motion at multiple spatial levels. The real frame signals are compared with those derived by forward predictions, refining estimations for object motion. This mo- tion detection system is applied to detect objects with cluttered, moving backgrounds and is proved to be efficient in locating independently moving, non-rigid regions. The core contribution of this thesis is the invention of a robust motion estimation system for complicated real world videos, with challenges by real sensor noise, complex natural scenes, variations in illumination and depth, and motion discontinuities. The overall system demonstrates biological plausibility and holds great potential for other applications, such as camera motion removal, heading estimation, obstacle avoidance, route planning, and vision-based navigational assistance, etc

    Practical and Rich User Digitization

    Full text link
    A long-standing vision in computer science has been to evolve computing devices into proactive assistants that enhance our productivity, health and wellness, and many other facets of our lives. User digitization is crucial in achieving this vision as it allows computers to intimately understand their users, capturing activity, pose, routine, and behavior. Today's consumer devices - like smartphones and smartwatches provide a glimpse of this potential, offering coarse digital representations of users with metrics such as step count, heart rate, and a handful of human activities like running and biking. Even these very low-dimensional representations are already bringing value to millions of people's lives, but there is significant potential for improvement. On the other end, professional, high-fidelity comprehensive user digitization systems exist. For example, motion capture suits and multi-camera rigs that digitize our full body and appearance, and scanning machines such as MRI capture our detailed anatomy. However, these carry significant user practicality burdens, such as financial, privacy, ergonomic, aesthetic, and instrumentation considerations, that preclude consumer use. In general, the higher the fidelity of capture, the lower the user's practicality. Most conventional approaches strike a balance between user practicality and digitization fidelity. My research aims to break this trend, developing sensing systems that increase user digitization fidelity to create new and powerful computing experiences while retaining or even improving user practicality and accessibility, allowing such technologies to have a societal impact. Armed with such knowledge, our future devices could offer longitudinal health tracking, more productive work environments, full body avatars in extended reality, and embodied telepresence experiences, to name just a few domains.Comment: PhD thesi
    corecore