
    Pedestrian Attribute Recognition: A Survey

    Recognizing pedestrian attributes is an important task in the computer vision community because it plays a key role in video surveillance, and many algorithms have been proposed to handle it. The goal of this paper is to review existing works, whether based on traditional methods or on deep learning networks. First, we introduce the background of pedestrian attribute recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and the corresponding challenges. Second, we introduce existing benchmarks, including popular datasets and evaluation criteria. Third, we analyse the concepts of multi-task learning and multi-label learning, explain the relations between these two learning paradigms and pedestrian attribute recognition, and review some popular network architectures that have been widely applied in the deep learning community. Fourth, we analyse popular solutions for this task, such as attribute grouping, part-based methods, \emph{etc}. Fifth, we show some applications that take pedestrian attributes into consideration and achieve better performance. Finally, we summarize the paper and give several possible research directions for pedestrian attribute recognition. The project page of this paper can be found at \url{https://sites.google.com/view/ahu-pedestrianattributes/}.
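
    The survey's framing of PAR as a multi-label problem lends itself to a minimal sketch: a hypothetical attribute head on top of a generic CNN backbone, with one sigmoid output per attribute trained under binary cross-entropy. All names and dimensions below are illustrative assumptions, not taken from the paper.

        import torch
        import torch.nn as nn

        class AttributeHead(nn.Module):
            """Multi-label head: one logit (sigmoid at inference) per attribute."""
            def __init__(self, feat_dim: int, num_attributes: int):
                super().__init__()
                self.fc = nn.Linear(feat_dim, num_attributes)

            def forward(self, features: torch.Tensor) -> torch.Tensor:
                return self.fc(features)  # raw logits, one per attribute

        # Hypothetical training step: BCEWithLogitsLoss treats each attribute
        # as an independent binary problem -- the standard multi-label setup.
        head = AttributeHead(feat_dim=2048, num_attributes=35)
        criterion = nn.BCEWithLogitsLoss()
        features = torch.randn(8, 2048)                 # stand-in for backbone features
        targets = torch.randint(0, 2, (8, 35)).float()  # 0/1 attribute labels
        loss = criterion(head(features), targets)
        loss.backward()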

    3D Robotic Sensing of People: Human Perception, Representation and Activity Recognition

    The robots are coming. Their presence will eventually bridge the digital-physical divide and dramatically impact human life by taking over tasks where our current society has shortcomings (e.g., search and rescue, elderly care, and child education). Human-centered robotics (HCR) is a vision to address how robots can coexist with humans and help people live safer, simpler and more independent lives. As humans, we have a remarkable ability to perceive the world around us, perceive people, and interpret their behaviors. Endowing robots with these critical capabilities in highly dynamic human social environments is a significant but very challenging problem in practical human-centered robotics applications. This research focuses on robotic sensing of people, that is, how robots can perceive and represent humans and understand their behaviors, primarily through 3D robotic vision. In this dissertation, I begin with a broad perspective on human-centered robotics by discussing its real-world applications and significant challenges. Then, I introduce a real-time perception system, based on the concept of Depth of Interest, to detect and track multiple individuals using a color-depth camera installed on moving robotic platforms. In addition, I discuss human representation approaches based on local spatio-temporal features, including new “CoDe4D” features that incorporate both color and depth information, a new “SOD” descriptor to efficiently quantize 3D visual features, and the novel AdHuC features, which are capable of representing the activities of multiple individuals. Several new algorithms to recognize human activities are also discussed, including the RG-PLSA model, which allows us to discover activity patterns without supervision; the MC-HCRF model, which can explicitly investigate certainty in latent temporal patterns; and the FuzzySR model, which is used to segment continuous data into events and probabilistically recognize human activities. Cognition models based on the recognition results are also implemented for decision making, allowing robotic systems to react to human activities. Finally, I conclude with a discussion of future directions that will accelerate the upcoming technological revolution of human-centered robotics.
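
    As a deliberately simplified illustration of restricting perception to a depth region before detecting people (the dissertation's Depth of Interest concept is considerably richer than this), consider masking a depth frame to a band of interest. The band limits and frame shape below are arbitrary assumptions.

        import numpy as np

        def depth_band_mask(depth_image, near=0.5, far=4.0):
            """Keep only pixels whose depth (in metres) falls inside a band of
            interest, discarding far background before any person detector runs."""
            valid = np.isfinite(depth_image) & (depth_image > 0)
            return valid & (depth_image >= near) & (depth_image <= far)

        depth = np.random.uniform(0.0, 8.0, size=(480, 640))  # fake depth frame
        mask = depth_band_mask(depth)                          # candidate foreground pixels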

    Understanding Objects in the Visual World

    One way to understand the visual world is by reasoning about the objects present in it: their type, their location, their similarities, their layout, etc. Despite several successes, detailed recognition remains a challenging task for current computer vision systems. This dissertation focuses on building systems that improve on the state-of-the-art on several fronts. On one hand, we propose better representations of visual categories that enable more accurate reasoning about their properties. To learn such representations, we employ machine learning methods that leverage the power of big data. On the other hand, we present solutions to make current frameworks more efficient without losing performance. The first part of the dissertation focuses on improvements in efficiency. We first introduce a fast automated mechanism for selecting a diverse set of discriminative filters and show that one can efficiently learn a universal model of filter "goodness" based on properties of the filter itself. As an alternative to the expensive evaluation of filters, which is often the bottleneck in many techniques, our method has the potential of dramatically altering the trade-off between the accuracy of a filter-based method and the cost of training. Second, we present a method for linear dimensionality reduction which we call composite discriminant factor analysis (CDF). CDF searches for a discriminative but compact feature subspace in which classifiers can be trained, leading to an order-of-magnitude saving in detection time. In the second part, we focus on the problem of person re-identification, an important component of surveillance systems. We present a deep learning architecture that simultaneously learns features and computes their corresponding similarity metric. Given a pair of images as input, our network outputs a similarity value indicating whether the two input images depict the same person. We propose new layers which capture local relationships among mid-level features, produce a high-level summary of these relationships and spatially integrate them to give a holistic representation; a toy sketch of this pairwise set-up follows below. In the final part, we present a semantic object selection framework that uses natural language input to perform image editing. In the general context of interactive object segmentation, many of the methods that utilize user input (such as mouse clicks and mouse strokes) require significant user intervention. In this work, we present a system with a far simpler input method: the user only needs to give the name of the desired object. For this problem we present a solution which borrows ideas from image retrieval, segmentation propagation, object localization and convolutional neural networks.
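
    The pairwise similarity network can be sketched minimally as follows: a shared feature extractor applied to both images, with a classifier over their feature differences. This is a hedged stand-in; the dissertation's actual layers for local relationships among mid-level features are replaced here by a plain subtraction, and all layer sizes are arbitrary.

        import torch
        import torch.nn as nn

        class SimilarityNet(nn.Module):
            """Toy siamese-style network: shared convolutional features for two
            images, then a classifier on their feature differences."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(            # weights shared across inputs
                    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                )
                self.classifier = nn.Sequential(
                    nn.Flatten(),
                    nn.LazyLinear(256), nn.ReLU(),
                    nn.Linear(256, 2),                    # same person vs. different person
                )

            def forward(self, img_a, img_b):
                fa, fb = self.features(img_a), self.features(img_b)
                return self.classifier(fa - fb)           # crude stand-in for the learned
                                                          # local-relationship layers

        net = SimilarityNet()
        a = torch.randn(4, 3, 128, 64)                    # typical pedestrian crop shape
        b = torch.randn(4, 3, 128, 64)
        logits = net(a, b)                                # shape (4, 2)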

    Analyzing Human-Human Interactions: A Survey

    Many videos depict people, and it is their interactions that inform us of their activities, their relation to one another and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording setting, the appearance of the people depicted and the coordinated performance of their interaction. This survey provides a summary of these challenges and of the datasets collected to address them, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on deep learning and convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state-of-the-art to analyze and, eventually, understand social human actions.

    A highly adaptable model-based method for colour image interpretation

    This thesis presents a model-based interpretation of images that can vary greatly in appearance. Rather than seeking characteristic landmarks, we model objects with a smooth boundary by sampling points at regular intervals along that boundary. A statistical model of form is created in the exponent domain of an extended superellipse using the sampled points, and a model of appearance by sampling inside the objects. A colour Maximum Likelihood Ratio (MLR) criterion was used to detect cues to the location of potential pedestrians. The adaptability and specificity of this cue detector was evaluated using over 700 images: a True Positive Rate (TPR) of 0.95 and a False Positive Rate (FPR) of 0.20 were obtained. To detect objects with axes at various orientations, a variant method using an interpolated colour MLR was developed; it had a TPR of 0.94 and an FPR of 0.21 when tested over 700 images of pedestrians. Interpretation was evaluated using over 220 video sequences (640 x 480 pixels per frame) and 1000 images of people alone and people associated with other objects. The objective was not so much to evaluate pedestrian detection as the precision and reliability of object delineation. More than 94% of pedestrians were correctly interpreted.
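
    A likelihood-ratio cue detector of this kind can be illustrated with a toy per-pixel classifier. The thesis's actual colour models and threshold are not given here, so the independent per-channel Gaussian class models below are assumptions for the sketch.

        import numpy as np

        def mlr_cue_mask(pixels, mu_obj, var_obj, mu_bg, var_bg, threshold=1.0):
            """Label pixels as object cues when the likelihood ratio
            p(colour | object) / p(colour | background) exceeds a threshold.
            pixels: (N, 3) colour values; each class is modelled as an
            independent per-channel Gaussian (an illustrative assumption)."""
            def log_gauss(x, mu, var):
                return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)

            log_ratio = log_gauss(pixels, mu_obj, var_obj) - log_gauss(pixels, mu_bg, var_bg)
            return log_ratio > np.log(threshold)

        # Toy usage: a reddish object model against a grey background model.
        pixels = np.random.rand(1000, 3)
        mask = mlr_cue_mask(
            pixels,
            mu_obj=np.array([0.8, 0.2, 0.2]), var_obj=np.array([0.05, 0.05, 0.05]),
            mu_bg=np.array([0.5, 0.5, 0.5]), var_bg=np.array([0.10, 0.10, 0.10]),
        )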

    An AI-based Framework For Parent-child Interaction Analysis

    The quality of parent-child interactions is foundational to children's social-emotional and cognitive development, as well as their lifelong mental health. The Parent-Child Interaction Teaching Scale (PCITS) is a well-established and effective tool used to measure parent-child interaction quality. It is utilized in both public health settings and basic and applied research studies to identify problem areas within parent-child interactions. However, like other observational measures of parent-child interaction quality, the PCITS can be time-consuming to administer and score, which limits its wider implementation. Therefore, the main objective of this research is to organize a framework for the recognition of behavioural symptoms of the child and parent during interventions. Based on the literature on interactive parent-child behaviour analysis, we categorized PCITS labels into three modalities: language, audio, and video. Some labels have dyadic actors, while others have a single actor (either the parent or the child). In addition, within each modality there are technical issues, considerations, and limitations in terms of artificial intelligence. Hence, we divided the problem into three modalities, proposed models for each, and a solution to combine them. First, we proposed a model for recognizing action-related labels (video). These labels are interactive and involve two actors: the parent and the child. We applied a feature extraction algorithm to produce semantic features, passed through a feature selection algorithm to extract the most meaningful semantic features from the video. We chose this method due to its lower data requirement compared to other modalities; also, because 2D video files are used, the proposed feature extraction and selection algorithms are designed to handle occlusion and natural conditions such as camera movement. Second, we proposed a model for recognizing language- and audio-related labels. These labels represent a single-actor role for the parent, as children are not yet capable of producing meaningful text in the intervention videos. To develop this model, we conducted research on a similar dataset to utilize transfer learning between the two problems; the second part of this research is therefore associated with working on this text dataset. Third, we focused on the multi-modal aspects of the work. We conducted experiments to determine how to integrate the prior work into our model. We also provided an ensemble model, which combined the language and audio modalities based on the semantic and syntactic characteristics of the text; a minimal sketch of such late fusion follows below. This ensemble model provides a baseline for developing further models with different aspects and modalities. Finally, we provided a roadmap to support more labels that were not covered in this research because not enough samples were available. Our proposed framework includes a labelling system that we developed in the early stages of the research to gather labelled data. This system is also designed to be integrated with AI modules to provide nurses with automatic recognition of behavioural labels in parent-child interaction videos.
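
    One simple way to realize such an ensemble is late fusion: each modality-specific model produces per-label probabilities, which are combined by weighted averaging. The models, weights, and label count below are placeholders, not the thesis's actual design.

        import numpy as np

        def late_fusion(prob_video, prob_language, prob_audio, weights=(1.0, 1.0, 1.0)):
            """Combine per-label probabilities from three modality-specific models
            by weighted averaging (one ensemble strategy among many)."""
            stacked = np.stack([prob_video, prob_language, prob_audio])  # (3, num_labels)
            w = np.asarray(weights, dtype=float)[:, None]
            return (w * stacked).sum(axis=0) / w.sum()

        # Toy usage: three models each score the same five behavioural labels.
        fused = late_fusion(np.array([0.9, 0.1, 0.4, 0.7, 0.2]),
                            np.array([0.8, 0.3, 0.5, 0.6, 0.1]),
                            np.array([0.7, 0.2, 0.6, 0.8, 0.3]))
        decisions = fused > 0.5  # threshold into multi-label predictions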

    Articulated people detection and pose estimation in challenging real world environments

    In this thesis we are interested in the problem of articulated people detection and pose estimation, which are key ingredients towards understanding visual scenes containing people. Although there have been extensive efforts to address these problems, we identify three promising directions that, in our view, have not received sufficient attention so far. First, we investigate how statistical 3D human shape models from computer graphics can be leveraged to ease training data generation. We propose a family of automatic data-generation techniques that directly represent the relevant variations in the training data: by sampling from the underlying distribution of human shapes and from a large dataset of human poses, we generate a new, task-relevant corpus with controllable variations of shape and pose. In addition, we improve the state-of-the-art 3D human shape model itself by rebuilding it from a large commercially available dataset of 3D bodies. Second, we develop expressive spatial and appearance models for 2D single- and multi-person pose estimation. We propose an expressive single-person model that incorporates higher-order part dependencies while remaining efficient, and we strengthen it with several types of strong appearance representations to substantially improve the body part hypotheses. We then propose an expressive model for joint multi-person pose estimation, built on strong deep-learning-based body part detectors and a fully connected spatial model. This approach treats multi-person pose estimation as a problem of jointly partitioning and labeling a set of body part hypotheses: it infers the number of people in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity to each other (a toy illustration of this grouping follows below). Third, we conduct a thorough evaluation and performance analysis of leading human pose estimation and activity recognition methods. To this end, we introduce a novel benchmark that makes a significant advance in diversity and difficulty over previous datasets and contains over 40,000 annotated body poses and more than 1.5 million frames, together with a rich set of annotations that enables a detailed analysis of competing approaches and yields insights into their successes and failures. Overall, thorough experimental evaluation on standard benchmarks demonstrates significant improvements due to the proposed data augmentation techniques and novel body models, while the detailed performance analysis of competing approaches on our novel benchmark allows us to identify the most promising directions for improvement.
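
    The partitioning-and-labeling formulation can be made concrete with a deliberately simplified sketch. Approaches in this line of work solve a joint optimization (typically an integer linear program over part hypotheses); the toy code below instead groups part detections greedily by pairwise affinity, purely for illustration, and all scores are made up.

        import numpy as np

        def greedy_group_parts(detections, affinity, threshold=0.5):
            """Group body-part hypotheses into people. detections is a list of
            (part_type, score); affinity[i, j] is a pairwise same-person score.
            Each person may hold at most one instance of each part type."""
            people = []                                  # person: dict part_type -> index
            order = np.argsort([-score for _, score in detections])  # strongest first
            for i in order:
                part, _ = detections[i]
                best, best_aff = None, threshold
                for person in people:
                    if part in person:                   # part slot already filled
                        continue
                    aff = np.mean([affinity[i, j] for j in person.values()])
                    if aff > best_aff:
                        best, best_aff = person, aff
                if best is None:                         # start a new person
                    best = {}
                    people.append(best)
                best[part] = i
            return people

        # Toy usage: two heads and two shoulders that should form two people.
        dets = [("head", 0.9), ("head", 0.8), ("shoulder", 0.85), ("shoulder", 0.7)]
        aff = np.array([[1.0, 0.1, 0.9, 0.2],
                        [0.1, 1.0, 0.2, 0.8],
                        [0.9, 0.2, 1.0, 0.1],
                        [0.2, 0.8, 0.1, 1.0]])
        print(greedy_group_parts(dets, aff))  # two people, each a head + a shoulder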