9 research outputs found

    Co-interest Person Detection from Multiple Wearable Camera Videos

    Full text link
    Wearable cameras, such as Google Glass and GoPro, enable video data collection over larger areas and from different views. In this paper, we tackle a new problem of locating the co-interest person (CIP), i.e., the one who draws attention from most camera wearers, from temporally synchronized videos taken by multiple wearable cameras. Our basic idea is to exploit the motion patterns of people and use them to correlate the persons across different videos, instead of performing appearance-based matching as in traditional video co-segmentation/localization. This way, we can identify the CIP even if a group of people with similar appearance are present in the view. More specifically, we detect a set of persons in each frame as CIP candidates and then build a Conditional Random Field (CRF) model to select the one with consistent motion patterns across different videos and high spatio-temporal consistency within each video. We collect three sets of wearable-camera videos for testing the proposed algorithm. All the involved people have similar appearances in the collected videos, and the experiments demonstrate the effectiveness of the proposed algorithm. Comment: ICCV 201
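    To make the selection step concrete, here is a toy sketch of the within-video part of that idea: a chain model over frames whose states are the per-frame candidates, solved by dynamic programming. The unary costs, motion descriptors, and shapes below are assumptions for illustration only, not the paper's actual CRF potentials (which also correlate candidates across videos).

```python
import numpy as np

def viterbi_select(unary, pairwise):
    """Pick one candidate per frame minimizing unary + chain pairwise cost.

    unary:    (T, K) cost of each of K candidates in each of T frames
    pairwise: (T-1, K, K) transition cost between consecutive selections
    """
    T, K = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise[t - 1] + unary[t]  # (K, K)
        back[t] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 5 frames, 3 candidate persons; the pairwise cost penalizes
# motion-pattern disagreement between consecutive selections.
rng = np.random.default_rng(1)
motion = rng.random((5, 3, 8))   # assumed per-candidate motion descriptors
unary = rng.random((5, 3))       # assumed per-candidate detection cost
pairwise = np.linalg.norm(motion[:-1, :, None] - motion[1:, None, :], axis=-1)
print(viterbi_select(unary, pairwise))
```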

    Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos

    Full text link
    We present a semi-supervised approach that localizes multiple unknown object instances in long videos. We start with a handful of labeled boxes and iteratively learn and label hundreds of thousands of object instances. We propose criteria for reliable object detection and tracking that constrain the semi-supervised learning process and minimize semantic drift. Our approach does not assume exhaustive labeling of each object instance in any single frame, or any explicit annotation of negative data. Working in such a generic setting allows us to tackle multiple object instances in video, many of which are static. In contrast, existing approaches either do not consider multiple object instances per video, or rely heavily on the motion of the objects present. The experiments demonstrate the effectiveness of our approach by evaluating the automatically labeled data on a variety of metrics such as quality, coverage (recall), diversity, and relevance to training an object detector. Comment: To appear in CVPR 201
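    As a rough schematic of the iterative learn-and-label loop (a generic self-training skeleton; the classifier, the confidence-threshold "reliability" criterion, and the synthetic data are stand-ins, and the paper additionally constrains the loop with tracking to curb semantic drift):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 16))                  # stand-in box features
y_true = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

labeled = np.zeros(1000, dtype=bool)
labeled[:20] = True                              # "a handful of labeled boxes"
y = np.where(labeled, y_true, -1)

clf = LogisticRegression()
for it in range(10):
    clf.fit(X[labeled], y[labeled])              # retrain on current labels
    proba = clf.predict_proba(X[~labeled])
    reliable = proba.max(axis=1) > 0.95          # assumed reliability criterion
    if not reliable.any():
        break
    idx = np.flatnonzero(~labeled)[reliable]
    y[idx] = clf.predict(X[idx])                 # self-label reliable instances
    labeled[idx] = True
print(f"labeled {labeled.sum()} of {len(X)} instances")
```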

    Temporal Segmentation and Weakly Supervised Action Recognition in Videos

    Get PDF
    Action recognition in videos is, without doubt, one of the most relevant problems in computer vision today. One of the main reasons is the many derived applications that could be developed in diverse areas of science, everyday life, and entertainment. If, in addition to recognizing the actions present in videos, we are able to segment them temporally, that is, to determine the instants at which they start and end, their identification is much more complete: we would not only know that a given action appears in the video, but would also have additional information to analyze it in more detail. This project formulates the problem of temporal segmentation and action recognition in videos through a cost function, or energy function, defined in a weakly supervised way. Unlike existing methods, which use an enormous number of annotated videos to train their algorithms, this project uses a single annotated video for each action to be recognized. This makes the learning phase of the algorithm less costly in human effort and makes the method applicable to almost any video dataset. The formulated energy is composed of a series of terms and parameters that were tuned experimentally, using a dataset of realistic videos extracted from movies and built from the Hollywood2 dataset. Minimizing the energy yields the lowest-cost solution of the problem, i.e., the optimal solution. The quality of the minimization results was evaluated by comparison against a ground truth created from the study videos. The results obtained on our dataset and on the KTH dataset show that good accuracy rates in temporal segmentation and action recognition can be achieved in videos in a weakly supervised way
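    The abstract leaves the energy unspecified; purely as an illustration (the symbols, terms, and weights below are assumptions, not the thesis's actual formulation), a weakly supervised energy of this kind often combines a per-segment cost of matching against the single annotated exemplar with a temporal-consistency penalty:

```latex
% Illustrative form only; D, T, and \lambda are assumed, not taken from the thesis.
E(s, a) \;=\; \sum_{k=1}^{K} D\bigl(V_{[s_k,\, s_{k+1})},\, M_{a_k}\bigr)
         \;+\; \lambda \sum_{k=1}^{K-1} T(a_k, a_{k+1})
```

    Here $s$ are the segment boundaries, $a_k$ the action assigned to segment $k$, $M_a$ the single annotated exemplar for action $a$, and $\lambda$ a weight tuned experimentally; minimizing $E$ jointly yields the segmentation and the labels.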

    CONTENT EXTRACTION BASED ON VIDEO CO-SEGMENTATION

    Get PDF
    Ph.D. thesis (Doctor of Philosophy)

    Interest Detection in Image, Video and Multiple Videos: Model and Applications

    Get PDF
    Interest detection is the task of detecting an object, event, or process that draws attention. In this dissertation, we focus on interest detection in images, videos, and multiple videos. Interest detection in an image or a video is closely related to visual attention; however, interest detection in multiple videos needs to consider all the videos as a whole rather than considering the attention in each single video independently. Visual attention is an important mechanism of human vision. The computational modeling of visual attention has recently attracted much interest in the computer vision community, mainly because it helps find the objects or regions that efficiently represent a scene and thus aids in solving complex vision problems such as scene understanding. In this dissertation, we first introduce a new computational visual-attention model for detecting the region of interest in static images and/or videos. This model constructs a saliency map for each image and takes the region with the highest saliency value as the region of interest. Specifically, we use the Earth Mover's Distance (EMD) to measure the center-surround difference in the receptive field. Furthermore, we propose two steps of biologically inspired nonlinear operations for combining different features: combining subsets of basic features into a set of super features using the Lm-norm, and then combining the super features using the Winner-Take-All mechanism. We then extend the proposed model to construct dynamic saliency maps from videos by computing the center-surround difference in the spatio-temporal receptive field. Motivated by the natural relation between visual saliency and the object/region of interest, we then propose an algorithm to isolate infrequently moving foreground from background with frequent local motions, in which the saliency-detection technique is used to identify the foreground (object/region of interest) and the background. Traditional motion detection usually assumes that the background is static while the foreground objects are moving most of the time. In practice, however, especially in surveillance, the foreground objects may show infrequent motion: a person may stand in the same place most of the time, while the background contains frequent local motions, such as trees and/or grass waving in the breeze. Such complexities may prevent existing background-subtraction algorithms from correctly identifying the foreground objects. In this dissertation, we propose a background-subtraction approach that can detect foreground objects with frequent and/or infrequent motions. Finally, we focus on the task of locating the co-interest person from multiple temporally synchronized videos taken by multiple wearable cameras. More specifically, we propose a co-interest detection algorithm that can find persons who draw attention from most camera wearers, even if multiple similar-appearance persons are present in the videos. Our basic idea is to exploit the motion pattern, location, and size of the persons detected in different synchronized videos and use them to correlate the detected persons across different videos: one person in a video may be the same person in another video at the same time. We use a Conditional Random Field (CRF) to achieve this goal, taking each frame as a node and the detected persons as the states at each node. We collect three sets of wearable-camera videos for testing the proposed algorithm, where each set consists of six temporally synchronized videos.
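    For intuition, a minimal sketch of the two-step feature combination described above (not the dissertation's code; the subset grouping, the exponent m, and the map sizes are assumptions):

```python
import numpy as np

def lm_norm(features, m=3.0, axis=0):
    """Combine a subset of basic feature maps into one super feature via the L_m-norm."""
    return np.power(np.sum(np.power(np.abs(features), m), axis=axis), 1.0 / m)

def winner_take_all(super_features, axis=0):
    """Keep, at each pixel, only the strongest super feature (hard max)."""
    return np.max(super_features, axis=axis)

# Toy example: 8 basic feature maps of a 64x64 image, grouped into 2 subsets.
rng = np.random.default_rng(0)
basic = rng.random((8, 64, 64))
supers = np.stack([lm_norm(basic[:4]), lm_norm(basic[4:])])  # 2 super features
saliency = winner_take_all(supers)                           # final 64x64 map
print(saliency.shape)  # (64, 64)
```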

    Bayesian non-parametrics for multi-modal segmentation

    Get PDF
    Segmentation is a fundamental problem in computer vision research with applications in many tasks, such as object recognition, content-based image retrieval, and semantic labelling. Partitioning data into groups that are coherent in one or more characteristics, such as semantic classes, is often a first step towards understanding the content of the data. As information in the real world is generally perceived in multiple modalities, segmentation performed on multi-modal data to extract the latent structure usually encounters a challenge: how to combine features from multiple modalities and resolve accidental ambiguities. This thesis tackles three main axes of multi-modal segmentation problems: video segmentation and object discovery, activity segmentation and discovery, and segmentation in 3D data. For the first two axes, we introduce non-parametric Bayesian approaches for segmenting multi-modal data collections, including groups of videos and context sensor streams. The proposed methods show benefits in integrating multiple features and data dependencies in a probabilistic formulation, inferring the number of clusters and hierarchical semantic partitions from data, and resolving ambiguities by joint segmentation across videos or streams. The third axis focuses on the robust use of 3D information for various applications, as 3D perception provides richer geometric structure and a holistic observation of the visual scene. The studies covered in this thesis on utilizing various types of 3D data include: 3D object segmentation based on Kinect depth sensing improved by cross-modal stereo, matching 3D CAD models to objects on the 2D image plane by exploiting the differentiability of the HOG descriptor, segmenting stereo videos based on adaptive ensemble models, and fusing 2D object detectors with 3D context information for an augmented-reality application scenario
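    For a flavor of how such non-parametric models infer the number of clusters from data, here is a minimal generative sketch of the Chinese Restaurant Process prior underlying Dirichlet-process mixtures (the concentration alpha is assumed; the thesis's actual models and inference are considerably richer):

```python
import numpy as np

def crp_draw(n, alpha, rng):
    """Sequentially assign n items to tables under a CRP(alpha) prior.

    Item i joins an existing table with probability proportional to its
    size, or opens a new table with probability proportional to alpha,
    so the number of clusters grows with the data instead of being fixed.
    """
    tables = []                       # current table (cluster) sizes
    labels = np.empty(n, dtype=int)
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(tables):
            tables.append(1)          # open a new cluster
        else:
            tables[k] += 1
        labels[i] = k
    return labels

rng = np.random.default_rng(3)
labels = crp_draw(200, alpha=2.0, rng=rng)
print("clusters discovered:", labels.max() + 1)
```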

    Video co-segmentation for meaningful action extraction

    No full text
    DOI: 10.1109/ICCV.2013.278. Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2232-2239.