106 research outputs found

    Development of a privacy-preserving computer vision method for automatic monitoring of physical activity in schools (Privaatsust säilitava raalnägemise meetodi arendamine kehalise aktiivsuse automaatseks jälgimiseks koolis)

    The electronic version of the dissertation does not contain the publications.
How to observe people without seeing them? They say it's not polite to stare. The right to privacy is considered a human right. However, there is much in human behavior that scientists would like to study via observation. For example, we want to know whether children start moving more during recess if smartphones are banned at school. To find out, a researcher would have to ask parents for consent to carry out the observation. Assuming parents grant permission, classical observation would require a huge amount of labor: several observers in the school building every day for a sufficiently long period before and after the smartphone ban. With my doctoral thesis, I tried to solve both the privacy problem and the labor problem by replacing the human observer with artificial intelligence (AI). Modern machine learning methods allow training models that automatically detect objects and their properties in images or video. If we want an AI that recognizes people in images, we need to assemble a machine learning dataset with pictures of people and pictures without people. If we want an AI that differentiates between low and high physical activity in video, we need a corresponding video dataset.
In my doctoral thesis, I collected such a dataset, in which video of children's movement is synchronized with hip-worn accelerometers, in order to train a model that differentiates between lower and higher levels of movement intensity in video pixels. In collaboration with the iCV lab at the Institute of Technology, we developed a prototype video analysis sensor that can estimate, at real-time speed, the physical activity level of people in the camera's field of view. Precisely because the AI can derive physical activity information from the video without recording the footage or showing it to anyone at all, it becomes possible to observe people without seeing them. The method is designed for measuring physical activity in school-based research and therefore privacy protection and research ethics were emphasized throughout its development. More broadly, however, the thesis illustrates the potential of computer vision technologies for processing visual information in urban spaces and workplaces, and not only for measuring physical activity under high research-ethics standards. This warrants wider public discussion: under what conditions, or whether at all, is it OK to have a robot staring at you?
https://www.ester.ee/record=b555972
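At the heart of the data preparation described above is aligning the accelerometer stream with the video timeline so that each short clip receives an intensity label. The sketch below illustrates that alignment step with a synthetic accelerometer stream and a hypothetical intensity cut-point; it is an illustration of the general idea, not the thesis's actual pipeline.

```python
from bisect import bisect_left

# Hypothetical cut-point (accelerometer counts per epoch) separating lower
# from higher movement intensity; real studies calibrate this per device.
HIGH_INTENSITY_THRESHOLD = 2296.0

def label_clip(samples, clip_start, clip_end):
    """Assign an intensity label to a video clip by averaging the counts of
    all accelerometer epochs whose timestamps fall inside the clip window.
    `samples` is a time-sorted list of (timestamp_seconds, counts) pairs."""
    times = [t for t, _ in samples]
    lo = bisect_left(times, clip_start)
    hi = bisect_left(times, clip_end)
    window = [c for _, c in samples[lo:hi]]
    if not window:
        return None  # no overlapping accelerometer data for this clip
    mean_counts = sum(window) / len(window)
    return "higher" if mean_counts >= HIGH_INTENSITY_THRESHOLD else "lower"

if __name__ == "__main__":
    # Synthetic 1 Hz accelerometer stream: low counts for five minutes,
    # then higher activity for the next five minutes.
    acc = [(float(t), 500.0) for t in range(300)] + \
          [(float(t), 4000.0) for t in range(300, 600)]
    print(label_clip(acc, 0.0, 10.0))     # -> lower
    print(label_clip(acc, 300.0, 310.0))  # -> higher
```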

    Data Processing for Perception of Autonomous Vehicles in Urban Traffic Environments

    The perception system of autonomous vehicles (AVs), consisting of various onboard sensors including cameras and LiDAR scanners, is crucial for perceiving the environment, localizing the vehicle, and recognizing the semantics of traffic scenes. Despite recent advances in computer vision, AV perception has remained challenging, especially in urban environments. One of the reasons is that multiple types of dynamic agents usually coexist in an urban area. The interactions between agents and their surroundings are complicated to model explicitly, and the agents' unpredictable behaviors further increase the problem complexity. Moreover, ground truth data with a rich set of labels is not sufficient to cover diverse scenarios, and it is challenging to obtain data from real scenes. This dissertation presents research contributions to overcome the challenges of AV perception in urban traffic environments, from data collection and labeling to 3D reconstruction and analysis of intrinsic properties. Focusing mainly on unsignalized urban intersections, we discuss (1) how to obtain valuable data from real urban traffic scenes, (2) how to efficiently process the raw data to produce meaningful labels for dynamic road agents, (3) how to augment the data with semantic labels for the scenes, and (4) what factors make the reconstruction more realistic. We first built a data capture system with a multi-modal sensor suite to simulate actual AV perception. We then introduced a 3D model-fitting algorithm that fits parametrized human mesh models to the pedestrians in a scene. The generated 3D models provide free labels, such as human pose and trajectories, at no manual labeling cost. We proposed modeling the entire scene by densely reconstructing it and expanding the scope of automatic labeling to scene elements, including dynamic vehicles and static components such as roads, buildings, and traffic signs. To do this, we built a simulator that can generate a rich set of labels using virtual sensors. Finally, we tackled the problem of estimating intrinsic properties and discuss ways to achieve realistic 3D reconstruction. This dissertation examines the AV perception pipeline, explores data preparation for urban traffic scenes, and discusses experiments and applications relevant to tackling related problems. We conclude with future research directions for further augmenting the data and improving the realism of the reconstructed scene models.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169648/1/wonhui_1.pd
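A step underlying the "free labels" idea above is projecting the fitted 3D model (for example, its joints or trajectory points) back into the camera image to obtain 2D annotations without manual labeling. The sketch below shows generic pinhole projection with made-up camera intrinsics; the thesis's actual calibration and mesh model are not reproduced here.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world-coordinate points into pixel coordinates using a
    pinhole camera with intrinsics K and extrinsics (R, t)."""
    cam = R @ points_3d.T + t.reshape(3, 1)   # world -> camera frame
    uvw = K @ cam                             # camera -> image plane
    return (uvw[:2] / uvw[2]).T               # divide by depth -> pixels

# Made-up intrinsics for a 1280x720 camera and an identity extrinsic pose.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0,    0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)

# Toy "skeleton joints" of a fitted human model, 5 m in front of the camera.
joints_3d = np.array([[0.0, -0.9, 5.0],   # head
                      [0.0,  0.0, 5.0],   # hip
                      [0.2,  0.8, 5.0]])  # foot
print(project_points(joints_3d, K, R, t))  # 2D joint labels in pixels
```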

    Visual scene context in emotion perception

    Psychological studies show that scene context, in addition to facial expression and body pose, provides important information for our perception of people's emotions. However, the processing of context for automatic emotion recognition has not been explored in depth, partly due to the lack of suitable data. In this thesis we present EMOTIC, a dataset of images of people in diverse natural situations annotated with their apparent emotion. The EMOTIC database combines two different types of emotion representation: (1) a set of 26 emotion categories, and (2) the continuous dimensions of valence, arousal and dominance. We also present a detailed statistical and algorithmic analysis of the dataset, along with an analysis of annotator agreement. CNN models are trained on EMOTIC, combining features of the person with features of the scene (context). Our results show how scene context contributes important information for automatically recognizing emotional states and motivate further research in this direction.
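As an illustration of the kind of model the abstract describes, the sketch below shows a two-branch CNN that fuses features of the cropped person with features of the whole image and predicts both the 26 discrete categories and the three continuous dimensions. The ResNet-18 backbones and head sizes are assumptions for the sketch, not the exact architecture used in the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

class PersonContextNet(nn.Module):
    """Two-branch network: one branch encodes the cropped person, the other
    encodes the whole image (scene context). Fused features feed two heads:
    26 emotion categories (multi-label logits, trained with BCEWithLogitsLoss)
    and 3 continuous dimensions (valence, arousal, dominance; regression)."""

    def __init__(self, n_categories=26, n_dimensions=3):
        super().__init__()
        self.person_branch = models.resnet18()
        self.context_branch = models.resnet18()
        feat_dim = self.person_branch.fc.in_features
        self.person_branch.fc = nn.Identity()
        self.context_branch.fc = nn.Identity()
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Dropout(0.5))
        self.category_head = nn.Linear(256, n_categories)
        self.dimension_head = nn.Linear(256, n_dimensions)

    def forward(self, person_crop, whole_image):
        f = torch.cat([self.person_branch(person_crop),
                       self.context_branch(whole_image)], dim=1)
        f = self.fusion(f)
        return self.category_head(f), self.dimension_head(f)

# Example forward pass with dummy data.
model = PersonContextNet()
cats, dims = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(cats.shape, dims.shape)  # torch.Size([2, 26]) torch.Size([2, 3])
```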

    From pixels to people : recovering location, shape and pose of humans in images

    Humans are at the centre of a significant amount of research in computer vision. Endowing machines with the ability to perceive people from visual data is an immense scientific challenge with a high degree of direct practical relevance. Success in automatic perception can be measured at different levels of abstraction, depending on which intelligent behaviour we are trying to replicate: the ability to localise persons in an image or in the environment, understanding how persons are moving at the skeleton and at the surface level, interpreting their interactions with the environment including with other people, and perhaps even anticipating future actions. In this thesis we tackle different sub-problems of the broad research area referred to as "looking at people", aiming to perceive humans in images at different levels of granularity. We start with bounding box-level pedestrian detection: we present a retrospective analysis of methods published in the decade preceding our work, identifying various strands of research that have advanced the state of the art. With quantitative experiments, we demonstrate the critical role of developing better feature representations and having the right training distribution. We then contribute two methods based on the insights derived from our analysis: one that combines the strongest aspects of past detectors and another that focuses purely on learning representations. The latter method outperforms more complicated approaches, especially those based on hand-crafted features. We conclude our work on pedestrian detection with a forward-looking analysis that maps out potential avenues for future research. We then turn to pixel-level methods: perceiving humans requires us to both separate them precisely from the background and identify their surroundings. To this end, we introduce Cityscapes, a large-scale dataset for street scene understanding, which has since established itself as a go-to benchmark for segmentation and detection. We additionally develop methods that relax the requirement for expensive pixel-level annotations, focusing on the task of boundary detection, i.e. identifying the outlines of relevant objects and surfaces. Next, we make the jump from pixels to 3D surfaces, from localising and labelling to fine-grained spatial understanding. We contribute a method for recovering 3D human shape and pose which marries the advantages of learning-based and model-based approaches. We conclude the thesis with a detailed discussion of benchmarking practices in computer vision. Among other things, we argue that the design of future datasets should be driven by the general goal of combinatorial robustness besides task-specific considerations.

    Action-oriented Scene Understanding

    In order to allow robots to act autonomously, it is crucial that they not only describe their environment accurately but also identify how to interact with their surroundings. While we have witnessed tremendous progress in descriptive computer vision, approaches that explicitly target action are scarcer. This cumulative dissertation approaches the goal of interpreting visual scenes "in the wild" with respect to the actions implied by the scene. We call this approach action-oriented scene understanding. It involves identifying and judging opportunities for interaction with constituents of the scene (e.g. objects and their parts) as well as understanding object functions and how interactions will impact the future. All of these aspects are addressed on three levels of abstraction: elements, perception and reasoning. On the elementary level, we investigate semantic and functional grouping of objects by analyzing annotated natural image scenes. We compare object label-based and visual context definitions with respect to their suitability for generating meaningful object class representations. Our findings suggest that representations generated from visual context are on par, in terms of semantic quality, with those generated from large quantities of text. The perceptive level concerns action identification. We propose a system that identifies possible interactions of robots and humans with the environment (affordances) on a pixel level using state-of-the-art machine learning methods. Pixel-wise part annotations of images are transformed into 12 affordance maps. Using these maps, a convolutional neural network is trained to densely predict affordance maps from unknown RGB images. In contrast to previous work, this approach operates exclusively on RGB images during both training and testing, and yet achieves state-of-the-art performance. At the reasoning level, we extend the question from asking what actions are possible to what actions are plausible. For this, we gathered a dataset of household images associated with human ratings of the likelihood of eight different actions. Based on the judgements provided by the human raters, we train convolutional neural networks to generate plausibility scores for unseen images. Furthermore, having considered only static scenes previously in this thesis, we propose a system that takes video input and predicts plausible future actions. Since this requires careful identification of relevant features in the video sequence, we analyze this particular aspect in detail using a synthetic dataset with several state-of-the-art video models. We identify feature learning as a major obstacle for anticipation in natural video data. The presented projects analyze the role of action in scene understanding from various angles and in multiple settings, while highlighting the advantages of adopting an action-oriented perspective. We conclude that action-oriented scene understanding can augment classic computer vision in many real-life applications, in particular robotics.
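As an illustration of the dense affordance prediction described above, the sketch below shows a minimal fully convolutional network that maps an RGB image to 12 per-pixel affordance maps, treated as independent binary predictions because a pixel can afford several actions at once. The encoder-decoder layout and layer sizes are assumptions; the thesis's actual network is not reproduced here.

```python
import torch
import torch.nn as nn

class AffordanceNet(nn.Module):
    """Minimal fully convolutional network mapping an RGB image to 12
    per-pixel affordance maps. Each output channel is an independent binary
    map, so training would use BCEWithLogitsLoss against 12 ground-truth maps."""

    def __init__(self, n_affordances=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_affordances, 4, stride=2, padding=1))

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))  # logits: (B, 12, H, W)

model = AffordanceNet()
logits = model(torch.randn(1, 3, 256, 256))
maps = torch.sigmoid(logits)   # per-pixel affordance probabilities
print(maps.shape)              # torch.Size([1, 12, 256, 256])
```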

    Robust subspace learning for static and dynamic affect and behaviour modelling

    Machine analysis of human affect and behavior in naturalistic contexts has witnessed growing attention in the last decade from various disciplines, ranging from the social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model and predict, as well as simulate and synthesize, manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally and socially competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as 'outliers'. We present machine learning methods that are robust to such gross, sparse noise. First, we deal with the static analysis of face images, viewing the latter as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes such as facial identity, expressions and the activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior. Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structure of affective and behavioral displays. We present a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on the prediction of dimensional conflict and affect, as well as on multi-object tracking from detection, validate the effectiveness of our predictive framework and demonstrate, for the first time, that complex human behavior and affect can be learned and predicted from small training sets of person(s)-specific observations.
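The dynamic modeling assumption in the abstract, namely that smoothly varying continuous-time annotations can be viewed as outputs of a low-complexity linear dynamical system driven by behavioral features, can be written in standard state-space form. The following is the generic formulation, not the thesis's exact objective:

```latex
\begin{align}
  \mathbf{x}_{t+1} &= \mathbf{A}\,\mathbf{x}_t + \mathbf{B}\,\mathbf{u}_t \\
  \mathbf{y}_t     &= \mathbf{C}\,\mathbf{x}_t + \mathbf{e}_t
\end{align}
```

Here u_t are the behavioral cues (system inputs), x_t is a low-dimensional hidden state, y_t is the continuous annotation (e.g. conflict intensity), and e_t collects the gross, sparse corruptions; "low complexity" corresponds to a low system order, which is what motivates estimating the parameters via robust structured rank minimization.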

    A Robust Object Detection System for Driverless Vehicles through Sensor Fusion and Artificial Intelligence Techniques

    Since the early 1990s, various research domains have been concerned with the concept of autonomous driving, leading to the widespread implementation of numerous advanced driver assistance features. However, fully automated vehicles have not yet been introduced to the market. The process of autonomous driving can be outlined through the following stages: environment perception, ego-vehicle localization, trajectory estimation, path planning, and vehicle control. Environment perception is partially based on computer vision algorithms that detect and track surrounding objects. Object detection performed by autonomous vehicles is considered challenging for several reasons, such as the presence of multiple dynamic objects in the same scene, interaction between objects, real-time speed requirements, and diverse weather conditions (e.g., rain, snow, fog). Although many studies have been conducted on object detection for autonomous vehicles, it remains a challenging task, and improving detection performance in diverse driving scenes is an ongoing field of research. This thesis aims to develop novel methods for the detection and 3D localization of surrounding dynamic objects in driving scenes under different rainy weather conditions. Firstly, owing to the frequent occurrence of rain and its negative effect on object detection, a real-time lightweight deraining network is proposed; it operates on single images in real time. Rain streaks and the accumulation of rain streaks introduce distinct visual degradation effects to captured images, and the proposed deraining network effectively removes both. It makes use of the progressive operation of two main stages: rain streak removal and rain streak accumulation removal. The rain streak removal stage is based on a Residual Network (ResNet) to maintain real-time performance and avoid adding computational complexity; furthermore, recursive computation allows network parameters to be shared. Meanwhile, distant rain streaks accumulate and induce a distortion similar to fogging, so they can be mitigated in a way similar to defogging. This stage relies on a transmission-guided lightweight network (TGL-Net). The proposed deraining network was evaluated on five datasets with synthetic rain of different properties and two further datasets with real rainy scenes. Secondly, an emphasis has been put on proposing a novel sensory system that achieves real-time detection of multiple dynamic objects in driving scenes. The proposed sensory system combines a monocular camera and a 2D Light Detection and Ranging (LiDAR) sensor in a complementary fusion approach. YOLOv3, a baseline real-time object detection algorithm, is used to detect and classify objects in images captured by the camera; detected objects are surrounded by bounding boxes that localize them within the frames. Since objects in a driving scene are dynamic and usually occlude each other, an algorithm was developed to differentiate objects whose bounding boxes overlap. Moreover, the locations of bounding boxes within frames (in pixels) are converted into real-world angular coordinates. A 2D LiDAR was used to obtain depth measurements while maintaining low computational requirements, in order to save resources for other autonomous-driving operations.
A novel technique was developed and tested for processing 2D LiDAR measurements and mapping them to the corresponding bounding boxes. The detection accuracy of the proposed system was manually evaluated in different real-time scenarios. Finally, the effectiveness of the proposed deraining network was validated in terms of its impact on object detection applied to de-rained images. The proposed deraining network was compared against existing baseline deraining networks: its running time is 2.23× faster than their average while achieving a 1.2× improvement when tested on different synthetic datasets. Moreover, tests on the LiDAR measurements showed an average error of ±0.04 m in real driving scenes. Deraining and object detection were also tested jointly, demonstrating that performing deraining ahead of object detection yields a 1.45× improvement in object detection precision.
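The camera-LiDAR fusion step described above hinges on converting a bounding box's pixel position into an angle and then looking up the LiDAR range at that angle. The sketch below shows that conversion for a pinhole camera with a known horizontal field of view and a co-located, aligned 2D LiDAR; the field of view, image size, and sensor alignment are assumptions for the example, not the thesis's calibration.

```python
import math

def pixel_to_azimuth(x_center, image_width, hfov_deg):
    """Convert a bounding-box centre column (pixels) to an azimuth angle
    (degrees) relative to the optical axis, assuming a pinhole camera with
    the given horizontal field of view."""
    f = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)  # focal length in px
    return math.degrees(math.atan((x_center - image_width / 2) / f))

def depth_for_box(x_center, image_width, hfov_deg, lidar_scan):
    """Pick the 2D-LiDAR range whose beam angle is closest to the box's
    azimuth. `lidar_scan` is a list of (angle_deg, range_m) pairs measured
    in the camera frame (sensors assumed co-located and aligned)."""
    azimuth = pixel_to_azimuth(x_center, image_width, hfov_deg)
    angle, rng = min(lidar_scan, key=lambda ar: abs(ar[0] - azimuth))
    return azimuth, rng

# Example: a detection centred at pixel column 980 in a 1280-px-wide image
# taken with a 90-degree horizontal FOV, fused with a toy LiDAR scan.
scan = [(a, 8.0 + 0.02 * abs(a)) for a in range(-90, 91)]  # fake ranges
print(depth_for_box(980, 1280, 90.0, scan))
```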

    Restoring the balance between stuff and things in scene understanding

    Scene understanding is a central field in computer vision that attempts to detect objects in a scene and reason about their spatial, functional and semantic relations. While many works focus on things (objects with a well-defined shape), less attention has been given to stuff classes (amorphous background regions). However, stuff classes are important, as they help explain many aspects of an image, including the scene type, the thing classes likely to be present, and the physical attributes of all objects in the scene. The goal of this thesis is to restore the balance between stuff and things in scene understanding. In particular, we investigate how the recognition of stuff differs from that of things and develop methods that are suitable for dealing with both. We use stuff to find things and annotate a large-scale dataset to study stuff and things in context. First, we present two methods for semantic segmentation of stuff and things. Most methods require manual class weighting to counter imbalanced class-frequency distributions, particularly on datasets with stuff and thing classes. We develop a novel joint calibration technique that takes into account class imbalance, class competition and overlapping regions by calibrating for the pixel-level evaluation criterion. The second method shows how to unify the advantages of region-based approaches (accurately delineated object boundaries) and fully convolutional approaches (end-to-end training). Both are combined in a universal framework that is equally suitable for stuff and things. Second, we propose to help weakly supervised object localization for classes where location annotations are not available by transferring things-and-stuff knowledge from a source set with available annotations. This is particularly important if we want to scale scene understanding to real-world applications with thousands of classes, without having to exhaustively annotate millions of images. Finally, we present COCO-Stuff, the largest existing dataset with dense stuff and thing annotations. Existing datasets are much smaller and were made with expensive polygon-based annotation. We use a very efficient stuff annotation protocol to densely annotate 164K images. We provide a detailed analysis of this new dataset and visualize how stuff and things co-occur spatially in an image. We revisit the question of whether stuff or things are easier to detect, and which is more important, based on visual and linguistic analysis.
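For context, the "manual class weighting" that the abstract says most segmentation methods rely on is typically an inverse-frequency weight per class in the training loss, which the proposed joint calibration is designed to replace. The sketch below shows that conventional baseline with made-up pixel counts; it is not the thesis's calibration technique.

```python
import math

# Hypothetical pixel counts for a few stuff and thing classes.
pixel_counts = {"road": 4_800_000, "building": 3_100_000,
                "person": 95_000, "traffic sign": 18_000}

total = sum(pixel_counts.values())

# Inverse-frequency weights, log-damped so rare classes do not dominate the
# loss; weights are normalized to have mean 1.
raw = {c: total / n for c, n in pixel_counts.items()}
damped = {c: math.log(1.0 + w) for c, w in raw.items()}
mean_w = sum(damped.values()) / len(damped)
weights = {c: w / mean_w for c, w in damped.items()}

for c, w in weights.items():
    print(f"{c:12s} weight {w:.2f}")
# These weights would be passed to a per-pixel cross-entropy loss,
# e.g. torch.nn.CrossEntropyLoss(weight=...), so that rare thing classes
# are not drowned out by frequent stuff classes.
```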

    Efficient Human Activity Recognition in Large Image and Video Databases

    Vision-based human action recognition has attracted considerable interest in recent research for its applications to video surveillance, content-based search, healthcare, and interactive games. Most existing research deals with building informative feature descriptors, designing efficient and robust algorithms, proposing versatile and challenging datasets, and fusing multiple modalities. Often, these approaches build on certain conventions such as the use of motion cues to determine video descriptors, the application of off-the-shelf classifiers, and single-factor classification of videos. In this thesis, we deal with important but overlooked issues such as the efficiency, simplicity, and scalability of human activity recognition in different application scenarios: controlled video environments (e.g. indoor surveillance), unconstrained videos (e.g. YouTube), depth or skeletal data (e.g. captured by Kinect), and person images (e.g. Flickr). In particular, we are interested in answering questions like: (a) is it possible to efficiently recognize human actions in controlled videos without temporal cues? (b) given that large-scale unconstrained video data are often of a high-dimension, low-sample-size (HDLSS) nature, how can human actions be recognized efficiently in such data? (c) considering the rich 3D motion information available from depth or motion capture sensors, is it possible to recognize both the actions and the actors using only the motion dynamics of the underlying activities? and (d) can motion information from monocular videos be used to automatically determine saliency regions for recognizing actions in still images?