
    Quick, accurate, smart: 3D computer vision technology helps assessing confined animals' behaviour

    (a) Visual representation of the alignment of two sequences using Dynamic Time Warping (DTW). DTW stretches the sequences in time by matching the same point with several points of the compared time series. (b) The Needleman-Wunsch (NW) algorithm substitutes the temporal stretch with gap elements (red circles in the table), inserting blank spaces instead of forcefully matching points. The alignment is achieved by arranging the two sequences in a table, the first sequence row-wise (T) and the second column-wise (S). The figure shows a score table for two hypothetical sub-sequences (i, j) and the alignment scores (numbers in cells) for each pair of elements forming the sequences (letters in the head row and head column). Arrows show the warping path between the two series and, consequently, the final alignment. The optimal alignment score is in the bottom-right cell of the table.
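    The following is a minimal Python sketch of the NW scoring just described; the match, mismatch and gap scores are illustrative assumptions, not values from the paper.

        # Minimal Needleman-Wunsch global alignment sketch.
        # Match/mismatch/gap scores are illustrative assumptions.
        def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
            n, m = len(s), len(t)
            # Score table: first sequence row-wise, second column-wise.
            table = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                table[i][0] = i * gap
            for j in range(1, m + 1):
                table[0][j] = j * gap
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    diag = table[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
                    up = table[i - 1][j] + gap      # gap element in t
                    left = table[i][j - 1] + gap    # gap element in s
                    table[i][j] = max(diag, up, left)
            # The optimal alignment score sits in the bottom-right cell.
            return table[n][m]

        print(needleman_wunsch("GATTACA", "GCATGCU"))  # prints 0 for these scores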

    Mankind directly controls the environment and lifestyles of several domestic species for purposes ranging from production and research to conservation and companionship. These environments and lifestyles may not offer these animals the best quality of life. Behaviour is a direct reflection of how the animal is coping with its environment, and behavioural indicators are thus among the preferred parameters for assessing welfare. However, behavioural recording (usually from video) can be very time consuming, and the accuracy and reliability of the output rely on the experience and background of the observers. Rapid advances in video technology and computer image processing provide the basis for promising solutions. In this pilot study, we present a new prototype software able to automatically infer the behaviour of dogs housed in kennels from 3D visual data and through structured machine learning frameworks. Depth information acquired through 3D features, body part detection and training are the key elements that allow the machine to recognise postures, trajectories inside the kennel and patterns of movement that can later be labelled at convenience. The main innovation of the software is its ability to automatically cluster frequently observed temporal patterns of movement without any pre-set ethogram. Conversely, when common patterns are defined through training, a deviation from normal behaviour over time or between individuals can be assessed. The software's accuracy in correctly detecting the dogs' behaviour was checked through a validation process. An automatic behaviour recognition system, independent of human subjectivity, could add scientific knowledge on animals' quality of life in confinement as well as save time and resources. This 3D framework was designed to be invariant to the dog's shape and size and could be extended to farm, laboratory and zoo quadrupeds in artificial housing. The computer vision technique applied to this software is innovative in non-human animal behaviour science. Further improvements and validation are needed, and future applications and limitations are discussed.
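    As a rough illustration of clustering recurring movement patterns without a pre-set ethogram, here is a hypothetical Python sketch; the posture features, window length and cluster count are placeholder assumptions, not details of the prototype.

        # Hypothetical sketch: cluster fixed-length windows of per-frame 3D
        # posture features; frequent clusters correspond to recurring patterns
        # that an observer can name afterwards rather than fix in advance.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        frames = rng.normal(size=(600, 8))   # stand-in for 3D posture features

        window = 30                          # ~1 s of video at 30 fps (assumed)
        usable = len(frames) // window * window
        segments = frames[:usable].reshape(-1, window * frames.shape[1])

        labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(segments)
        print(np.bincount(labels))           # how often each pattern occurs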

    Learning by correlation for computer vision applications: from Kernel methods to deep learning

    Learning to spot analogies and differences within and across visual categories is an arguably powerful approach in machine learning and pattern recognition, one directly inspired by human cognition. In this thesis, we investigate a variety of approaches which are primarily driven by correlation and tackle several computer vision applications.
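    As a generic illustration of correlation as a similarity measure between feature vectors (not one of the thesis's actual kernels), consider this minimal Python sketch.

        # Pearson correlation as a similarity "kernel" between feature vectors:
        # centre each vector, then take the normalised dot product.
        import numpy as np

        def correlation_kernel(x, y):
            x = x - x.mean()
            y = y - y.mean()
            return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

        a = np.array([1.0, 2.0, 3.0, 4.0])
        b = np.array([2.0, 4.0, 6.0, 8.0])    # perfectly correlated with a
        print(correlation_kernel(a, b))        # ~1.0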

    Discriminative latent variable models for visual recognition

    Visual recognition is a central problem in computer vision, and it has numerous potential applications in many different fields, such as robotics, human-computer interaction, and entertainment. In this dissertation, we propose two discriminative latent variable models for handling challenging visual recognition problems. In particular, we use latent variables to capture and model various prior knowledge in the training data. In the first model, we address the problem of recognizing human actions from still images. We jointly consider both poses and actions in a unified framework, and treat human poses as latent variables. The learning of this model follows the framework of latent SVM. Secondly, we propose another latent variable model to address the problem of automated tag learning on YouTube videos. In particular, we address the semantic variations (sub-tags) of the videos which have the same tag. In the model, each video is assumed to be associated with a sub-tag label, and we treat this sub-tag label as latent information. This model is trained using a latent learning framework based on LogitBoost, which jointly considers both the latent sub-tag label and the tag label. Moreover, we propose a novel discriminative latent learning framework, kernel latent SVM, which combines the benefit of latent SVM and kernel methods. The framework of kernel latent SVM is general enough to be applied in many applications of visual recognition. It is also able to handle complex latent variables with interdependent structures using composite kernels.
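    A simplified Python sketch of the latent SVM alternation on synthetic data follows; a full latent SVM treats positive and negative examples asymmetrically during training, which this toy version glosses over.

        # Alternate between (1) completing each example's latent variable
        # (e.g. a pose) by maximising the current model's score and
        # (2) retraining a linear SVM on the completed features.
        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(0)
        # 40 examples, each with 3 candidate latent completions of 5-d features.
        candidates = rng.normal(size=(40, 3, 5))
        y = np.repeat([1, -1], 20)
        candidates[y == 1] += 1.0             # make positives separable on average

        w = np.zeros(5)
        for _ in range(5):                    # a few alternation rounds
            best = (candidates @ w).argmax(axis=1)       # latent completion
            X = candidates[np.arange(len(y)), best]
            clf = LinearSVC(C=1.0).fit(X, y)             # retrain the SVM
            w = clf.coef_.ravel()

        print("training accuracy:", clf.score(X, y))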

    IST Austria Thesis

    The human ability to recognize objects in complex scenes has driven research in the computer vision field over the past couple of decades. This thesis focuses on the object recognition task in images. That is, given an image, we want the computer system to be able to predict the class of the object that appears in it. A recent successful attempt to bridge the semantic understanding of images perceived by humans and by computers uses attribute-based models. Attributes are semantic properties of objects shared across different categories, which humans and computers can decide on. To explore attribute-based models we take a statistical machine learning approach, and address two key learning challenges in view of the object recognition task: learning augmented attributes as a mid-level discriminative feature representation, and learning with attributes as privileged information. Our main contributions are parametric and non-parametric models and algorithms to solve these frameworks. In the parametric approach, we explore an autoencoder model combined with the large margin nearest neighbor principle for mid-level feature learning, and linear support vector machines for learning with privileged information. In the non-parametric approach, we propose a supervised Indian Buffet Process for automatic augmentation of semantic attributes, and explore the Gaussian Process classification framework for learning with privileged information. A thorough experimental analysis shows the effectiveness of the proposed models in both the parametric and non-parametric views.
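    An illustrative Python sketch of the large margin nearest neighbor principle mentioned above, with a placeholder linear map standing in for the thesis's autoencoder:

        # LMNN idea: a same-class target neighbour should be closer to the
        # anchor than any differently-labelled "impostor", by a margin.
        import numpy as np

        def lmnn_hinge(anchor, target, impostor, margin=1.0):
            d_target = np.sum((anchor - target) ** 2)
            d_impostor = np.sum((anchor - impostor) ** 2)
            return max(0.0, margin + d_target - d_impostor)

        rng = np.random.default_rng(0)
        W = rng.normal(size=(10, 4))          # placeholder linear "encoder",
        encode = lambda x: x @ W              # not the thesis's autoencoder

        x_anchor, x_target, x_impostor = rng.normal(size=(3, 10))
        print(lmnn_hinge(encode(x_anchor), encode(x_target), encode(x_impostor)))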

    Part based object detection with a flexible context constraint

    This work describes an object detection system which integrates flexible spatial context constraints to improve detection performance. It allows spatial and scale deformation of the object relative to its context. The contextual model extends an existing deformable parts model and is trained on partially labeled data using a latent SVM. The approach can be applied to any object detection problem where the object class always exists in one typical image context, but the context can also appear independently. A new scoring method is used to model the asymmetric relationship between object and context. Furthermore, the system enables the use of contextual non-maximum suppression, a context-sensitive way to discard redundant detections. Trained on our combined dataset of dresses and persons, the system achieves a significant improvement in detection performance when compared with basic deformable parts models.
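    A minimal Python sketch of plain non-maximum suppression, the mechanism the contextual variant builds on (the paper's version additionally consults the context model, which is not reproduced here):

        # Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
        # beyond a threshold, repeat with the next-highest survivor.
        import numpy as np

        def iou(a, b):
            # Boxes are (x1, y1, x2, y2).
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, x2 - x1) * max(0, y2 - y1)
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter + 1e-12)

        def nms(boxes, scores, thresh=0.5):
            keep = []
            for i in np.argsort(scores)[::-1]:
                if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
                    keep.append(i)
            return keep

        boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
        print(nms(boxes, np.array([0.9, 0.8, 0.7])))   # [0, 2]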

    Visual object category discovery in images and videos

    The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori. I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization. To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects. The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions. I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.
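    A hypothetical Python sketch of grouping regions by blending appearance similarity with object-graph similarity, in the spirit of the context-aware discovery above; the descriptors, mixing weight and cluster count are assumptions for illustration.

        # Blend two affinity matrices and cluster regions spectrally.
        import numpy as np
        from sklearn.cluster import SpectralClustering
        from sklearn.metrics.pairwise import rbf_kernel

        rng = np.random.default_rng(0)
        appearance = rng.normal(size=(30, 16))    # stand-in appearance descriptors
        object_graph = rng.normal(size=(30, 8))   # stand-in object-graph descriptors

        alpha = 0.5                               # assumed trade-off parameter
        affinity = alpha * rbf_kernel(appearance) + (1 - alpha) * rbf_kernel(object_graph)

        labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                                    random_state=0).fit_predict(affinity)
        print(np.bincount(labels))                # region group sizes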

    Productive Vision: Methods for Automatic Image Comprehension

    Image comprehension is the ability to summarize, translate, and answer basic questions about images. Using original techniques for scene object parsing, material labeling, and activity recognition, a system can gather information about the objects and actions in a scene. When this information is integrated into a deep knowledge base capable of inference, the system becomes capable of performing tasks that, when performed by students, are considered by educators to demonstrate comprehension. The vision components of the system consist of the following: object scene parsing by means of visual filters, material scene parsing by superpixel segmentation and kernel descriptors, and activity recognition by action grammars. These techniques are characterized and compared with the state of the art in their respective fields. The output of the vision components is a list of assertions in a Cyc microtheory. By reasoning on these assertions and the rest of the Cyc knowledge base, the system is able to perform a variety of tasks, including the following:
    - Recognize that essential parts of objects are likely present in the scene despite not having an explicit detector for them (see the sketch after this list).
    - Recognize the likely presence of objects due to the presence of their essential parts.
    - Improve estimates of both object and material labels by incorporating knowledge about typical pairings.
    - Label ambiguous objects with a more general label that encompasses both possible labelings.
    - Answer questions about the scene that require inference, and give justifications for the answers in natural language.
    - Create a visual representation of the scene in a new medium.
    - Recognize scene similarity even when there is little visual similarity.
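    A toy Python sketch of the part-whole reasoning in the first two tasks, using a hand-written table in place of the Cyc knowledge base the system actually queries:

        # Assumed object -> essential-parts mapping standing in for Cyc.
        ESSENTIAL_PARTS = {
            "bicycle": {"wheel", "frame", "handlebar"},
            "chair":   {"seat", "leg"},
        }

        def infer_parts(detected_objects):
            # A detected object implies its essential parts, even without
            # an explicit detector for them.
            return {p for o in detected_objects for p in ESSENTIAL_PARTS.get(o, ())}

        def infer_objects(detected_parts):
            # Seeing all essential parts of an object suggests the object.
            return {o for o, parts in ESSENTIAL_PARTS.items()
                    if parts <= set(detected_parts)}

        print(infer_parts({"bicycle"}))                   # wheel, frame, handlebar
        print(infer_objects({"seat", "leg", "cushion"}))  # {'chair'}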