Nonparametric Bayesian Double Articulation Analyzer for Direct Language Acquisition from Continuous Speech Signals
Human infants can discover words directly from unsegmented speech signals
without any explicitly labeled data. In this paper, we develop a novel machine
learning method called nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can directly acquire language and acoustic models from observed
continuous speech signals. For this purpose, we propose an integrative
generative model that combines a language model and an acoustic model into a
single generative model called the "hierarchical Dirichlet process hidden
language model" (HDP-HLM). The HDP-HLM is obtained by extending the
hierarchical Dirichlet process hidden semi-Markov model (HDP-HSMM) proposed by
Johnson et al. An inference procedure for the HDP-HLM is derived using the
blocked Gibbs sampler originally proposed for the HDP-HSMM. This procedure
enables the simultaneous and direct inference of language and acoustic models
from continuous speech signals. Based on the HDP-HLM and its inference
procedure, we developed a novel double articulation analyzer. By taking the
HDP-HLM as the generative model of observed time series data and inferring
the latent variables of the model, the method can analyze the latent double
articulation structure of the data, i.e., hierarchically organized latent words
and phonemes, in an unsupervised manner. The novel unsupervised double
articulation analyzer is called NPB-DAA.
The NPB-DAA can automatically estimate double articulation structure embedded
in speech signals. We also carried out two evaluation experiments using
synthetic data and actual human continuous speech signals representing Japanese
vowel sequences. In the word acquisition and phoneme categorization tasks, the
NPB-DAA outperformed a conventional double articulation analyzer (DAA) and a
baseline automatic speech recognition system whose acoustic model was trained
in a supervised manner.
Comment: 15 pages, 7 figures, Draft submitted to IEEE Transactions on
Autonomous Mental Development (TAMD
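
The double articulation structure described in the abstract above (words made of phonemes, phonemes emitting stretches of acoustic frames) can be illustrated with a small forward sampler. This is a minimal sketch under loud assumptions, not the HDP-HLM itself: the lexicon, the bigram probabilities, the phoneme means, and the uniform duration distribution are all made up, the inventories are fixed and finite rather than nonparametric, and no inference (the blocked Gibbs sampler) is attempted.

```python
import random

# Hypothetical toy lexicon: each latent word is a sequence of latent phonemes.
LEXICON = {"w0": ["a", "i"], "w1": ["u", "e", "o"], "w2": ["a", "o"]}

# Made-up word-bigram language model; each row sums to 1.
BIGRAM = {
    "w0": {"w0": 0.1, "w1": 0.6, "w2": 0.3},
    "w1": {"w0": 0.5, "w1": 0.1, "w2": 0.4},
    "w2": {"w0": 0.4, "w1": 0.5, "w2": 0.1},
}

# Made-up acoustic model: each phoneme emits 1-D Gaussian frames.
PHONEME_MEAN = {"a": 0.0, "i": 2.0, "u": 4.0, "e": 6.0, "o": 8.0}


def sample_utterance(n_words, rng, max_duration=4, noise=0.3):
    """Sample words, expand them to phonemes, then emit frames per phoneme."""
    words, segments, frames = [], [], []
    word = rng.choice(sorted(LEXICON))  # uniform initial word (assumption)
    for _ in range(n_words):
        words.append(word)
        for phoneme in LEXICON[word]:
            # Semi-Markov step: draw an explicit duration for the segment.
            duration = rng.randint(1, max_duration)
            segments.append((phoneme, duration))
            frames.extend(
                rng.gauss(PHONEME_MEAN[phoneme], noise) for _ in range(duration)
            )
        # Language-model step: choose the next word given the current one.
        candidates = sorted(BIGRAM[word])
        weights = [BIGRAM[word][c] for c in candidates]
        word = rng.choices(candidates, weights=weights)[0]
    return words, segments, frames


words, segments, frames = sample_utterance(3, random.Random(0))
print(words)
print(segments)
print(len(frames), "frames")
```

An unsupervised analyzer faces the inverse problem: given only the frames, it has to recover the segment boundaries, the phoneme labels, and the word sequence.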
Unsupervised Phoneme and Word Discovery from Multiple Speakers using Double Articulation Analyzer and Neural Network with Parametric Bias
This paper describes a new unsupervised machine learning method for
simultaneous phoneme and word discovery from multiple speakers. Human infants
can acquire knowledge of phonemes and words from interactions with their
mothers as well as with others around them. From a computational
perspective, phoneme and word discovery from multiple speakers is a more
challenging problem than that from one speaker because the speech signals from
different speakers exhibit different acoustic features. This paper proposes an
unsupervised phoneme and word discovery method that simultaneously uses a
nonparametric Bayesian double articulation analyzer (NPB-DAA) and a deep sparse
autoencoder with parametric bias in the hidden layer (DSAE-PBHL). We assume that an
infant can recognize and distinguish speakers based on certain other features,
e.g., visual face recognition. DSAE-PBHL aims to subtract
speaker-dependent acoustic features and extract speaker-independent features.
An experiment demonstrated that DSAE-PBHL can subtract distributed
representations of acoustic signals, enabling extraction based on the types of
phonemes rather than on the speakers. Another experiment demonstrated that a
combination of NPB-DAA and DSAE-PBHL outperformed the available methods in
phoneme and word discovery tasks involving speech signals with Japanese vowel
sequences from multiple speakers.
Comment: 21 pages. Submitte
Learning visually grounded meaning representations
Humans possess a rich semantic knowledge of words and concepts which captures the
perceivable physical properties of their real-world referents and their relations. Encoding
this knowledge or some of its aspects is the goal of computational models of
semantic representation and has been the subject of considerable research in cognitive
science, natural language processing, and related areas. Existing models have
placed emphasis on different aspects of meaning, depending ultimately on the task at
hand. Typically, such models have been used in tasks addressing the simulation of behavioural
phenomena, e.g., lexical priming or categorisation, as well as in natural language
applications, such as information retrieval, document classification, or semantic
role labelling. A major strand of research popular across disciplines focuses on models
which induce semantic representations from text corpora. These models are based on
the hypothesis that the meaning of words is established by their distributional relation
to other words (Harris, 1954). Despite their widespread use, distributional models of
word meaning have been criticised as ‘disembodied’ in that they are not grounded in
perception and action (Perfetti, 1998; Barsalou, 1999; Glenberg and Kaschak, 2002).
This lack of grounding contrasts with many experimental studies suggesting that meaning
is acquired not only from exposure to the linguistic environment but also from our
interaction with the physical world (Landau et al., 1998; Bornstein et al., 2004). This
criticism has led to the emergence of new models aiming at inducing perceptually
grounded semantic representations. Essentially, existing approaches learn meaning
representations from multiple views corresponding to different modalities, i.e. linguistic
and perceptual input. To approximate the perceptual modality, previous work has
relied largely on semantic attributes collected from humans (e.g., is round, is sour), or
on automatically extracted image features. Semantic attributes have a long-standing
tradition in cognitive science and are thought to represent salient psychological aspects
of word meaning including multisensory information. However, their elicitation
from human subjects limits the scope of computational models to a small number of
concepts for which attributes are available.
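
The distributional hypothesis cited above (Harris, 1954) can be made concrete with a count-based sketch: each word is represented by the words that occur near it, and two words are compared by the cosine of their count vectors. The function names, the window size, and the three-sentence corpus below are illustrative assumptions and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up toy corpus.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))  # shares "the" and "drinks"
print(cosine(vecs["cat"], vecs["car"]))  # shares only "the"
```

In this toy corpus "cat" comes out closer to "dog" than to "car" purely because of shared textual contexts; nothing in the vectors reflects what the referents look like, which is the grounding gap the abstract describes.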
In this thesis, we present an approach which draws inspiration from the successful
application of attribute classifiers in image classification, and represent images and
the concepts depicted by them by automatically predicted visual attributes. To this
end, we create a dataset comprising nearly 700K images and a taxonomy of 636 visual
attributes and use it to train attribute classifiers. We show that their predictions
can act as a substitute for human-produced attributes without any critical information
loss. In line with the attribute-based approximation of the visual modality, we represent
the linguistic modality by textual attributes which we obtain with an off-the-shelf
distributional model. Having first established this core contribution of a novel modelling
framework for grounded meaning representations based on semantic attributes,
we show that these can be integrated into existing approaches to perceptually grounded
representations. We then introduce a model which is formulated as a stacked autoencoder
(a variant of multilayer neural networks), which learns higher-level meaning representations
by mapping words and images, represented by attributes, into a common
embedding space. In contrast to most previous approaches to multimodal learning using
different variants of deep networks and data sources, our model is defined at a finer
level of granularity—it computes representations for individual words and is unique in
its use of attributes as a means of representing the textual and visual modalities.
We evaluate the effectiveness of the representations learnt by our model by assessing
its ability to account for human behaviour on three semantic tasks, namely word
similarity, concept categorisation, and typicality of category members. With respect to
the word similarity task, we focus on the model’s ability to capture similarity in both
the meaning and appearance of the words’ referents. Since existing benchmark datasets
on word similarity do not distinguish between these two dimensions and often contain
abstract words, we create a new dataset in a large-scale experiment where participants
are asked to give two ratings per word pair expressing their semantic and visual
similarity, respectively. Experimental results show that our model learns meaningful
representations which are more accurate than models based on individual modalities or
different modality integration mechanisms. The presented model is furthermore able to
predict textual attributes for new concepts given their visual attribute predictions only,
which we demonstrate by comparing model output with human-generated attributes.
Finally, we show the model’s effectiveness in an image-based task on visual category
learning, in which images are used as a stand-in for real-world objects.
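
One common way to score a model on a word similarity task with separate semantic and visual ratings is to rank-correlate the model's similarity scores with each set of human ratings. The sketch below assumes Spearman's rank correlation as the measure, which the abstract does not specify, and all numbers are made up for four hypothetical word pairs.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up ratings and model scores for four hypothetical word pairs.
human_semantic = [4.5, 1.2, 3.8, 2.0]
human_visual = [4.0, 1.0, 1.5, 3.5]
model = [0.81, 0.10, 0.64, 0.35]

print(spearman(human_semantic, model))
print(spearman(human_visual, model))
```

Reporting the two correlations separately shows whether a model tracks similarity in meaning, similarity in appearance, or both.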