156,184 research outputs found

    A bag-to-class divergence approach to multiple-instance learning

    Full text link
    In multi-instance (MI) learning, each object (bag) consists of multiple feature vectors (instances), and is most commonly regarded as a set of points in a multidimensional space. A different viewpoint is that the instances are realisations of random vectors with corresponding probability distribution, and that a bag is the distribution, not the realisations. In MI classification, each bag in the training set has a class label, but the instances are unlabelled. By introducing the probability distribution space to bag-level classification problems, dissimilarities between probability distributions (divergences) can be applied. The bag-to-bag Kullback-Leibler information is asymptotically the best classifier, but the typical sparseness of MI training sets is an obstacle. We introduce bag-to-class divergence to MI learning, emphasising the hierarchical nature of the random vectors that makes bags from the same class different. We propose two properties for bag-to-class divergences, and an additional property for sparse training sets

    OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains

    Full text link
    Nowadays, service robots are appearing more and more in our daily life. For this type of robot, open-ended object category learning and recognition is necessary since no matter how extensive the training data used for batch learning, the robot might be faced with a new object when operating in a real-world environment. In this work, we present OrthographicNet, a Convolutional Neural Network (CNN)-based model, for 3D object recognition in open-ended domains. In particular, OrthographicNet generates a global rotation- and scale-invariant representation for a given 3D object, enabling robots to recognize the same or similar objects seen from different perspectives. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning object recognition performance and scalability in open-ended scenarios. Moreover, OrthographicNet demonstrates the capability of learning new categories from very few examples on-site. Regarding real-time performance, three real-world demonstrations validate the promising performance of the proposed architecture

    Detecting Repeating Objects using Patch Correlation Analysis

    Full text link
    In this paper we describe a new method for detecting and counting a repeating object in an image. While the method relies on a fairly sophisticated deformable part model, unlike existing techniques it estimates the model parameters in an unsupervised fashion thus alleviating the need for a user-annotated training data and avoiding the associated specificity. This automatic fitting process is carried out by exploiting the recurrence of small image patches associated with the repeating object and analyzing their spatial correlation. The analysis allows us to reject outlier patches, recover the visual and shape parameters of the part model, and detect the object instances efficiently. In order to achieve a practical system which is able to cope with diverse images, we describe a simple and intuitive active-learning procedure that updates the object classification by querying the user on very few carefully chosen marginal classifications. Evaluation of the new method against the state-of-the-art techniques demonstrates its ability to achieve higher accuracy through a better user experience

    Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis

    Full text link
    Machine learning (ML) algorithms have made a tremendous impact in the field of medical imaging. While medical imaging datasets have been growing in size, a challenge for supervised ML algorithms that is frequently mentioned is the lack of annotated data. As a result, various methods which can learn with less/other types of supervision, have been proposed. We review semi-supervised, multiple instance, and transfer learning in medical imaging, both in diagnosis/detection or segmentation tasks. We also discuss connections between these learning scenarios, and opportunities for future research.Comment: Submitted to Medical Image Analysi

    Visual Relationship Detection using Scene Graphs: A Survey

    Full text link
    Understanding a scene by decoding the visual relationships depicted in an image has been a long studied problem. While the recent advances in deep learning and the usage of deep neural networks have achieved near human accuracy on many tasks, there still exists a pretty big gap between human and machine level performance when it comes to various visual relationship detection tasks. Developing on earlier tasks like object recognition, segmentation and captioning which focused on a relatively coarser image understanding, newer tasks have been introduced recently to deal with a finer level of image understanding. A Scene Graph is one such technique to better represent a scene and the various relationships present in it. With its wide number of applications in various tasks like Visual Question Answering, Semantic Image Retrieval, Image Generation, among many others, it has proved to be a useful tool for deeper and better visual relationship understanding. In this paper, we present a detailed survey on the various techniques for scene graph generation, their efficacy to represent visual relationships and how it has been used to solve various downstream tasks. We also attempt to analyze the various future directions in which the field might advance in the future. Being one of the first papers to give a detailed survey on this topic, we also hope to give a succinct introduction to scene graphs, and guide practitioners while developing approaches for their applications

    Temporal Cross-Media Retrieval with Soft-Smoothing

    Full text link
    Multimedia information have strong temporal correlations that shape the way modalities co-occur over time. In this paper we study the dynamic nature of multimedia and social-media information, where the temporal dimension emerges as a strong source of evidence for learning the temporal correlations across visual and textual modalities. So far, cross-media retrieval models, explored the correlations between different modalities (e.g. text and image) to learn a common subspace, in which semantically similar instances lie in the same neighbourhood. Building on such knowledge, we propose a novel temporal cross-media neural architecture, that departs from standard cross-media methods, by explicitly accounting for the temporal dimension through temporal subspace learning. The model is softly-constrained with temporal and inter-modality constraints that guide the new subspace learning task by favouring temporal correlations between semantically similar and temporally close instances. Experiments on three distinct datasets show that accounting for time turns out to be important for cross-media retrieval. Namely, the proposed method outperforms a set of baselines on the task of temporal cross-media retrieval, demonstrating its effectiveness for performing temporal subspace learning.Comment: To appear in ACM MM 201

    Felzenszwalb-Baum-Welch: Event Detection by Changing Appearance

    Full text link
    We propose a method which can detect events in videos by modeling the change in appearance of the event participants over time. This method makes it possible to detect events which are characterized not by motion, but by the changing state of the people or objects involved. This is accomplished by using object detectors as output models for the states of a hidden Markov model (HMM). The method allows an HMM to model the sequence of poses of the event participants over time, and is effective for poses of humans and inanimate objects. The ability to use existing object-detection methods as part of an event model makes it possible to leverage ongoing work in the object-detection community. A novel training method uses an EM loop to simultaneously learn the temporal structure and object models automatically, without the need to specify either the individual poses to be modeled or the frames in which they occur. The E-step estimates the latent assignment of video frames to HMM states, while the M-step estimates both the HMM transition probabilities and state output models, including the object detectors, which are trained on the weighted subset of frames assigned to their state. A new dataset was gathered because little work has been done on events characterized by changing object pose, and suitable datasets are not available. Our method produced results superior to that of comparison systems on this dataset

    Evolving inductive generalization via genetic self-assembly

    Full text link
    We propose that genetic encoding of self-assembling components greatly enhances the evolution of complex systems and provides an efficient platform for inductive generalization, i.e. the inductive derivation of a solution to a problem with a potentially infinite number of instances from a limited set of test examples. We exemplify this in simulations by evolving scalable circuitry for several problems. One of them, digital multiplication, has been intensively studied in recent years, where hitherto the evolutionary design of only specific small multipliers was achieved. The fact that this and other problems can be solved in full generality employing self-assembly sheds light on the evolutionary role of self-assembly in biology and is of relevance for the design of complex systems in nano- and bionanotechnology

    Sentence Directed Video Object Codetection

    Full text link
    We tackle the problem of video object codetection by leveraging the weak semantic constraint implied by sentences that describe the video content. Unlike most existing work that focuses on codetecting large objects which are usually salient both in size and appearance, we can codetect objects that are small or medium sized. Our method assumes no human pose or depth information such as is required by the most recent state-of-the-art method. We employ weak semantic constraint on the codetection process by pairing the video with sentences. Although the semantic information is usually simple and weak, it can greatly boost the performance of our codetection framework by reducing the search space of the hypothesized object detections. Our experiment demonstrates an average IoU score of 0.423 on a new challenging dataset which contains 15 object classes and 150 videos with 12,509 frames in total, and an average IoU score of 0.373 on a subset of an existing dataset, originally intended for activity recognition, which contains 5 object classes and 75 videos with 8,854 frames in total

    Sparse Coding with Earth Mover's Distance for Multi-Instance Histogram Representation

    Full text link
    Sparse coding (Sc) has been studied very well as a powerful data representation method. It attempts to represent the feature vector of a data sample by reconstructing it as the sparse linear combination of some basic elements, and a L2L_2 norm distance function is usually used as the loss function for the reconstruction error. In this paper, we investigate using Sc as the representation method within multi-instance learning framework, where a sample is given as a bag of instances, and further represented as a histogram of the quantized instances. We argue that for the data type of histogram, using L2L_2 norm distance is not suitable, and propose to use the earth mover's distance (EMD) instead of L2L_2 norm distance as a measure of the reconstruction error. By minimizing the EMD between the histogram of a sample and the its reconstruction from some basic histograms, a novel sparse coding method is developed, which is refereed as SC-EMD. We evaluate its performances as a histogram representation method in tow multi-instance learning problems --- abnormal image detection in wireless capsule endoscopy videos, and protein binding site retrieval. The encouraging results demonstrate the advantages of the new method over the traditional method using L2L_2 norm distance
    • …
    corecore