Weakly-supervised Visual Grounding of Phrases with Linguistic Structures
We propose a weakly-supervised approach that takes image-sentence pairs as
input and learns to visually ground (i.e., localize) arbitrary linguistic
phrases, in the form of spatial attention masks. Specifically, the model is
trained with images and their associated image-level captions, without any
explicit region-to-phrase correspondence annotations. To this end, we introduce
an end-to-end model which learns visual groundings of phrases with two types of
carefully designed loss functions. In addition to the standard discriminative
loss, which enforces that attended image regions and phrases are consistently
encoded, we propose a novel structural loss which makes use of the parse tree
structures induced by the sentences. In particular, we ensure complementarity
among the attention masks that correspond to sibling noun phrases, and
compositionality of attention masks among the children and parent phrases, as
defined by the sentence parse tree. We validate the effectiveness of our
approach on the Microsoft COCO and Visual Genome datasets.
Comment: CVPR 201
Learning Behavioural Context
The original publication is available at www.springerlink.co
Machine learning of hierarchical clustering to segment 2D and 3D images
We aim to improve segmentation through the use of machine learning tools
during region agglomeration. We propose an active learning approach for
performing hierarchical agglomerative segmentation from superpixels. Our method
combines multiple features at all scales of the agglomerative process, works
for data with an arbitrary number of dimensions, and scales to very large
datasets. We advocate the use of variation of information to measure
segmentation accuracy, particularly in 3D electron microscopy (EM) images of
neural tissue, and using this metric demonstrate an improvement over competing
algorithms in EM and natural images.
Comment: 15 pages, 8 figures
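The variation of information named in the abstract above can be illustrated with a minimal sketch. It uses the standard definition VI(A, B) = H(A|B) + H(B|A) and assumes each segmentation is given as a flat sequence of region labels, one per pixel or voxel. The function name and input layout are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def variation_of_information(seg_a, seg_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats.

    seg_a and seg_b are equal-length sequences of region labels
    (hypothetical input layout: one label per pixel or voxel).
    """
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must label the same elements")
    n = len(seg_a)
    if n == 0:
        return 0.0

    count_a = Counter(seg_a)                # region sizes in A
    count_b = Counter(seg_b)                # region sizes in B
    count_ab = Counter(zip(seg_a, seg_b))   # overlap sizes

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # log p(a,b)/p(a) contributes to H(B|A); log p(a,b)/p(b) to H(A|B)
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi
```

A value of zero means the two segmentations agree up to a renaming of labels; larger values indicate more over- or under-segmentation relative to each other.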
Smartphone picture organization: a hierarchical approach
We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which are typically pre-processed by the user, who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they present a larger variability than pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate greater user satisfaction with the resulting organization than with state-of-the-art solutions.
Peer Reviewed. Preprint
Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals
Human infants can discover words directly from unsegmented speech signals
without any explicitly labeled data. In this paper, we develop a novel machine
learning method called nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can directly acquire language and acoustic models from observed
continuous speech signals. For this purpose, we propose an integrative
generative model that combines a language model and an acoustic model into a
single generative model called the "hierarchical Dirichlet process hidden
language model" (HDP-HLM). The HDP-HLM is obtained by extending the
hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by
Johnson et al. An inference procedure for the HDP-HLM is derived using the
blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure
enables the simultaneous and direct inference of language and acoustic models
from continuous speech signals. Based on the HDP-HLM and its inference
procedure, we developed a novel double articulation analyzer. By assuming
HDP-HLM as a generative model of observed time series data, and by inferring
latent variables of the model, the method can analyze latent double
articulation structure, i.e., hierarchically organized latent words and
phonemes, of the data in an unsupervised manner. The novel unsupervised double
articulation analyzer is called NPB-DAA.
The NPB-DAA can automatically estimate double articulation structure embedded
in speech signals. We also carried out two evaluation experiments using
synthetic data and actual human continuous speech signals representing Japanese
vowel sequences. In the word acquisition and phoneme categorization tasks, the
NPB-DAA outperformed a conventional double articulation analyzer (DAA) and a
baseline automatic speech recognition system whose acoustic model was trained
in a supervised manner.
Comment: 15 pages, 7 figures, Draft submitted to IEEE Transactions on
Autonomous Mental Development (TAMD)