156,184 research outputs found
A bag-to-class divergence approach to multiple-instance learning
In multi-instance (MI) learning, each object (bag) consists of multiple
feature vectors (instances), and is most commonly regarded as a set of points
in a multidimensional space. A different viewpoint is that the instances are
realisations of random vectors with a corresponding probability distribution, and
that a bag is the distribution, not the realisations. In MI classification,
each bag in the training set has a class label, but the instances are
unlabelled. By introducing the probability distribution space to bag-level
classification problems, dissimilarities between probability distributions
(divergences) can be applied. The bag-to-bag Kullback-Leibler information is
asymptotically the best classifier, but the typical sparseness of MI training
sets is an obstacle. We introduce bag-to-class divergence to MI learning,
emphasising the hierarchical nature of the random vectors that makes bags from
the same class different. We propose two properties for bag-to-class
divergences, and an additional property for sparse training sets.
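As an illustration of the bag-to-class idea, the sketch below models each bag and each class as a diagonal Gaussian estimated from its instances, and classifies a bag by the smaller bag-to-class Kullback-Leibler divergence. The Gaussian modelling and all function names are assumptions for illustration, not the paper's actual divergences.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL(N(mu0, var0) || N(mu1, var1)) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def bag_to_class_kl(bag, class_instances, eps=1e-6):
    """Divergence from a bag's instance distribution to a class-level
    distribution pooled over all training instances of that class."""
    mu_b, var_b = bag.mean(axis=0), bag.var(axis=0) + eps
    mu_c, var_c = class_instances.mean(axis=0), class_instances.var(axis=0) + eps
    return gaussian_kl(mu_b, var_b, mu_c, var_c)

def classify(bag, pos_instances, neg_instances):
    """Label a bag by the class whose pooled distribution it is closest to."""
    d_pos = bag_to_class_kl(bag, pos_instances)
    d_neg = bag_to_class_kl(bag, neg_instances)
    return 1 if d_pos < d_neg else 0
```

Pooling all instances of a class into one distribution is exactly the move that sidesteps the sparseness of bag-to-bag comparisons mentioned above.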
OrthographicNet: A Deep Transfer Learning Approach for 3D Object Recognition in Open-Ended Domains
Service robots are increasingly present in our daily lives. For
this type of robot, open-ended object category learning and recognition is
necessary since no matter how extensive the training data used for batch
learning, the robot might be faced with a new object when operating in a
real-world environment. In this work, we present OrthographicNet, a
Convolutional Neural Network (CNN)-based model, for 3D object recognition in
open-ended domains. In particular, OrthographicNet generates a global rotation-
and scale-invariant representation for a given 3D object, enabling robots to
recognize the same or similar objects seen from different perspectives.
Experimental results show that our approach yields significant improvements
over the previous state-of-the-art approaches concerning object recognition
performance and scalability in open-ended scenarios. Moreover, OrthographicNet
demonstrates the capability of learning new categories from very few examples
on-site. Regarding real-time performance, three real-world demonstrations
validate the promising performance of the proposed architecture.
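A minimal sketch of the orthographic idea: render a centred, scale-normalised point cloud as three occupancy images, one per coordinate plane. This is an assumed simplification for illustration; OrthographicNet's actual projections, pose normalisation, and CNN feature extraction are not reproduced here.

```python
import numpy as np

def orthographic_views(points, res=32):
    """Render a 3D point cloud as three orthographic occupancy images
    (XY, XZ, YZ planes) after centring and scale normalisation, giving
    a translation- and scale-invariant view-based representation."""
    pts = points - points.mean(axis=0)        # translation invariance
    pts = pts / (np.abs(pts).max() + 1e-9)    # scale invariance
    idx = np.clip(np.rint((pts + 1.0) * 0.5 * (res - 1)).astype(int),
                  0, res - 1)
    views = np.zeros((3, res, res))
    for plane, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):
        views[plane, idx[:, a], idx[:, b]] = 1.0
    return views
```

Each of the three images could then be fed to a CNN and the per-view features pooled into a single object descriptor.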
Detecting Repeating Objects using Patch Correlation Analysis
In this paper we describe a new method for detecting and counting a repeating
object in an image. While the method relies on a fairly sophisticated
deformable part model, unlike existing techniques it estimates the model
parameters in an unsupervised fashion, thus alleviating the need for
user-annotated training data and avoiding the associated specificity. This
automatic fitting process is carried out by exploiting the recurrence of small
image patches associated with the repeating object and analyzing their spatial
correlation. The analysis allows us to reject outlier patches, recover the
visual and shape parameters of the part model, and detect the object instances
efficiently. In order to achieve a practical system which is able to cope with
diverse images, we describe a simple and intuitive active-learning procedure
that updates the object classification by querying the user on very few
carefully chosen marginal classifications. Evaluation of the new method against
the state-of-the-art techniques demonstrates its ability to achieve higher
accuracy through a better user experience.
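The "querying the user on marginal classifications" step can be sketched as uncertainty sampling: ask about the detections whose classifier scores sit closest to the decision boundary. The threshold-update rule and all names below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def query_marginal(scores, k=3):
    """Pick the k detections whose scores are closest to the decision
    boundary (0.5) -- the 'marginal' cases worth asking the user about."""
    margins = np.abs(np.asarray(scores) - 0.5)
    return np.argsort(margins)[:k].tolist()

def update_threshold(scores, queried, labels, default=0.5):
    """Shift the accept threshold to separate the user-labelled queries."""
    pos = [scores[i] for i, y in zip(queried, labels) if y == 1]
    neg = [scores[i] for i, y in zip(queried, labels) if y == 0]
    if pos and neg:
        return 0.5 * (min(pos) + max(neg))
    return default
```

Because only a handful of borderline cases are shown to the user, the labelling effort stays small while the accept/reject boundary improves where it matters.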
Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis
Machine learning (ML) algorithms have made a tremendous impact in the field
of medical imaging. While medical imaging datasets have been growing in size, a
challenge for supervised ML algorithms that is frequently mentioned is the lack
of annotated data. As a result, various methods which can learn with less/other
types of supervision, have been proposed. We review semi-supervised, multiple
instance, and transfer learning in medical imaging, in both diagnosis/detection
and segmentation tasks. We also discuss connections between these learning
scenarios, and opportunities for future research.
Comment: Submitted to Medical Image Analysis
Visual Relationship Detection using Scene Graphs: A Survey
Understanding a scene by decoding the visual relationships depicted in an
image has been a long studied problem. While the recent advances in deep
learning and the use of deep neural networks have achieved near-human
accuracy on many tasks, a considerable gap remains between human- and
machine-level performance on various visual relationship
detection tasks. Building on earlier tasks like object recognition,
segmentation and captioning which focused on a relatively coarser image
understanding, newer tasks have been introduced recently to deal with a finer
level of image understanding. A Scene Graph is one such technique to better
represent a scene and the various relationships present in it. With its wide
number of applications in various tasks like Visual Question Answering,
Semantic Image Retrieval, Image Generation, among many others, it has proved to
be a useful tool for deeper and better visual relationship understanding. In
this paper, we present a detailed survey on the various techniques for scene
graph generation, their efficacy in representing visual relationships, and how
they have been used to solve various downstream tasks. We also analyze
directions in which the field might advance. As one of the first papers to give
a detailed survey on this topic, we hope to provide a succinct introduction to
scene graphs and to guide practitioners in developing approaches for their
applications.
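As a minimal illustration of the data structure itself, a scene graph can be held as labelled (subject, predicate, object) triples over a set of object nodes; the class and its query method are assumptions for illustration, not any particular system's API.

```python
class SceneGraph:
    """Minimal scene-graph container: objects are nodes, visual
    relationships are directed labelled edges (subject, predicate, object)."""

    def __init__(self):
        self.objects = set()
        self.relations = []

    def add(self, subj, pred, obj):
        """Record one visual relationship, registering both endpoints."""
        self.objects.update([subj, obj])
        self.relations.append((subj, pred, obj))

    def query(self, pred):
        """All (subject, object) pairs linked by a predicate -- the shape
        of lookup a Visual Question Answering system might issue."""
        return [(s, o) for s, p, o in self.relations if p == pred]
```

A question such as "what is the man riding?" then reduces to a predicate lookup over the graph rather than a fresh pass over the pixels.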
Temporal Cross-Media Retrieval with Soft-Smoothing
Multimedia information has strong temporal correlations that shape the way
modalities co-occur over time. In this paper we study the dynamic nature of
multimedia and social-media information, where the temporal dimension emerges
as a strong source of evidence for learning the temporal correlations across
visual and textual modalities. So far, cross-media retrieval models have explored
the correlations between different modalities (e.g. text and image) to learn a
common subspace, in which semantically similar instances lie in the same
neighbourhood. Building on such knowledge, we propose a novel temporal
cross-media neural architecture that departs from standard cross-media
methods, by explicitly accounting for the temporal dimension through temporal
subspace learning. The model is softly-constrained with temporal and
inter-modality constraints that guide the new subspace learning task by
favouring temporal correlations between semantically similar and temporally
close instances. Experiments on three distinct datasets show that accounting
for time turns out to be important for cross-media retrieval. Namely, the
proposed method outperforms a set of baselines on the task of temporal
cross-media retrieval, demonstrating its effectiveness for performing temporal
subspace learning.
Comment: To appear in ACM MM 201
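One way to sketch a soft temporal constraint (an assumed simplification of the paper's objective, not its actual loss): weight the distance between two cross-modal embeddings by a Gaussian kernel on their time difference, so semantically similar pairs that are also temporally close dominate the objective.

```python
import numpy as np

def temporal_soft_loss(emb_a, emb_b, t_a, t_b, sigma=1.0):
    """Soft temporal constraint: squared distance between two cross-modal
    embeddings, down-weighted by a Gaussian kernel on the time gap so
    temporally distant pairs contribute less to the subspace learning."""
    w = np.exp(-((t_a - t_b) ** 2) / (2 * sigma ** 2))
    return w * np.sum((emb_a - emb_b) ** 2)
```

The "softness" is in the kernel: pairs are never hard-excluded by time, only attenuated.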
Felzenszwalb-Baum-Welch: Event Detection by Changing Appearance
We propose a method which can detect events in videos by modeling the change
in appearance of the event participants over time. This method makes it
possible to detect events which are characterized not by motion, but by the
changing state of the people or objects involved. This is accomplished by using
object detectors as output models for the states of a hidden Markov model
(HMM). The method allows an HMM to model the sequence of poses of the event
participants over time, and is effective for poses of humans and inanimate
objects. The ability to use existing object-detection methods as part of an
event model makes it possible to leverage ongoing work in the object-detection
community. A novel training method uses an EM loop to simultaneously learn the
temporal structure and object models automatically, without the need to specify
either the individual poses to be modeled or the frames in which they occur.
The E-step estimates the latent assignment of video frames to HMM states, while
the M-step estimates both the HMM transition probabilities and state output
models, including the object detectors, which are trained on the weighted
subset of frames assigned to their state. A new dataset was gathered because
little work has been done on events characterized by changing object pose, and
suitable datasets are not available. Our method produced results superior to
those of the comparison systems on this dataset.
Evolving inductive generalization via genetic self-assembly
We propose that genetic encoding of self-assembling components greatly
enhances the evolution of complex systems and provides an efficient platform
for inductive generalization, i.e. the inductive derivation of a solution to a
problem with a potentially infinite number of instances from a limited set of
test examples. We exemplify this in simulations by evolving scalable circuitry
for several problems. One of them, digital multiplication, has been intensively
studied in recent years, where hitherto the evolutionary design of only
specific small multipliers was achieved. The fact that this and other problems
can be solved in full generality employing self-assembly sheds light on the
evolutionary role of self-assembly in biology and is of relevance for the
design of complex systems in nano- and bionanotechnology.
Sentence Directed Video Object Codetection
We tackle the problem of video object codetection by leveraging the weak
semantic constraint implied by sentences that describe the video content.
Unlike most existing work that focuses on codetecting large objects which are
usually salient both in size and appearance, we can codetect objects that are
small or medium sized. Our method assumes no human pose or depth information
such as is required by the most recent state-of-the-art method. We impose a weak
semantic constraint on the codetection process by pairing each video with
sentences. Although the semantic information is usually simple and weak, it can
greatly boost the performance of our codetection framework by reducing the
search space of the hypothesized object detections. Our experiment demonstrates
an average IoU score of 0.423 on a new challenging dataset which contains 15
object classes and 150 videos with 12,509 frames in total, and an average IoU
score of 0.373 on a subset of an existing dataset, originally intended for
activity recognition, which contains 5 object classes and 75 videos with 8,854
frames in total.
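The IoU score reported above is the standard intersection-over-union between a predicted and a ground-truth box, which for axis-aligned boxes is a few lines:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2) -- the score used to judge whether a detection
    matches ground truth."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

An average IoU of 0.423 therefore means the codetected boxes overlap ground truth by roughly 42% of their union area, averaged over the dataset.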
Sparse Coding with Earth Mover's Distance for Multi-Instance Histogram Representation
Sparse coding (SC) has been widely studied as a powerful data
representation method. It attempts to represent the feature vector of a data
sample by reconstructing it as a sparse linear combination of some basic
elements, and a norm-based distance function is usually used as the loss
function for the reconstruction error. In this paper, we investigate using SC
as the representation method within the multi-instance learning framework,
where a sample is given as a bag of instances and is further represented as a
histogram of the quantized instances. We argue that for the data type of
histogram, a norm-based distance is not suitable, and propose to use the earth
mover's distance (EMD) instead as a measure of the reconstruction error. By
minimizing the EMD between the histogram of a sample and its reconstruction
from some basic histograms, a novel sparse coding method is developed, which we
refer to as SC-EMD. We evaluate its performance as a histogram representation
method in two multi-instance learning problems: abnormal image detection in
wireless capsule endoscopy videos, and protein binding site retrieval. The
encouraging results demonstrate the advantages of the new method over the
traditional method using a norm-based distance.
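For 1-D histograms the EMD has a closed form, the L1 distance between cumulative sums, which makes the abstract's point easy to illustrate: a norm-based distance cannot distinguish a small bin shift from a large one, while EMD grows with the shift. The reconstruction helper is an assumed sketch for illustration, not the SC-EMD algorithm itself.

```python
import numpy as np

def emd_1d(h1, h2):
    """Earth mover's distance between two normalised 1-D histograms:
    in one dimension, EMD equals the L1 distance between the CDFs."""
    return np.abs(np.cumsum(h1) - np.cumsum(h2)).sum()

def reconstruction_emd(h, basis, code):
    """EMD reconstruction error of histogram h against a non-negative
    combination of basis histograms (rows of `basis`), renormalised so
    the reconstruction is itself a histogram."""
    recon = code @ basis
    recon = recon / recon.sum()
    return emd_1d(h, recon)
```

SC-EMD's optimisation would additionally search for a sparse `code` minimising this error; the sketch only evaluates the objective.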