188 research outputs found
Unsupervised Phoneme and Word Discovery from Multiple Speakers using Double Articulation Analyzer and Neural Network with Parametric Bias
This paper describes a new unsupervised machine learning method for
simultaneous phoneme and word discovery from multiple speakers. Human infants
can acquire knowledge of phonemes and words from interactions with his/her
mother as well as with others surrounding him/her. From a computational
perspective, phoneme and word discovery from multiple speakers is a more
challenging problem than that from one speaker because the speech signals from
different speakers exhibit different acoustic features. This paper proposes an
unsupervised phoneme and word discovery method that simultaneously uses
nonparametric Bayesian double articulation analyzer (NPB-DAA) and deep sparse
autoencoder with parametric bias in hidden layer (DSAE-PBHL). We assume that an
infant can recognize and distinguish speakers based on certain other features,
e.g., visual face recognition. DSAE-PBHL is aimed to be able to subtract
speaker-dependent acoustic features and extract speaker-independent features.
An experiment demonstrated that DSAE-PBHL can subtract distributed
representations of acoustic signals, enabling extraction based on the types of
phonemes rather than on the speakers. Another experiment demonstrated that a
combination of NPB-DAA and DSAE-PB outperformed the available methods in
phoneme and word discovery tasks involving speech signals with Japanese vowel
sequences from multiple speakers.Comment: 21 pages. Submitte
Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme Discovery
Infants acquire words and phonemes from unsegmented speech signals using
segmentation cues, such as distributional, prosodic, and co-occurrence cues.
Many pre-existing computational models that represent the process tend to focus
on distributional or prosodic cues. This paper proposes a nonparametric
Bayesian probabilistic generative model called the prosodic hierarchical
Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM,
an extension of HDP-HLM, considers both prosodic and distributional cues within
a single integrative generative model. We conducted three experiments on
different types of datasets, and demonstrate the validity of the proposed
method. The results show that the Prosodic DAA successfully uses prosodic cues
and outperforms a method that solely uses distributional cues. The main
contributions of this study are as follows: 1) We develop a probabilistic
generative model for time series data including prosody that potentially has a
double articulation structure; 2) We propose the Prosodic DAA by deriving the
inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can
discover words directly from continuous human speech signals using statistical
information and prosodic information in an unsupervised manner; 3) We show that
prosodic cues contribute to word segmentation more in naturally distributed
case words, i.e., they follow Zipf's law.Comment: 11 pages, Submitted to IEEE Transactions on Cognitive and
Developmental System
A Survey on Bayesian Deep Learning
A comprehensive artificial intelligence system needs to not only perceive the
environment with different `senses' (e.g., seeing and hearing) but also infer
the world's conditional (or even causal) relations and corresponding
uncertainty. The past decade has seen major advances in many perception tasks
such as visual object recognition and speech recognition using deep learning
models. For higher-level inference, however, probabilistic graphical models
with their Bayesian nature are still more powerful and flexible. In recent
years, Bayesian deep learning has emerged as a unified probabilistic framework
to tightly integrate deep learning and Bayesian models. In this general
framework, the perception of text or images using deep learning can boost the
performance of higher-level inference and in turn, the feedback from the
inference process is able to enhance the perception of text or images. This
survey provides a comprehensive introduction to Bayesian deep learning and
reviews its recent applications on recommender systems, topic models, control,
etc. Besides, we also discuss the relationship and differences between Bayesian
deep learning and other related topics such as Bayesian treatment of neural
networks.Comment: To appear in ACM Computing Surveys (CSUR) 202
Discovering a Domain Knowledge Representation for Image Grouping: Multimodal Data Modeling, Fusion, and Interactive Learning
In visually-oriented specialized medical domains such as dermatology and radiology, physicians explore interesting image cases from medical image repositories for comparative case studies to aid clinical diagnoses, educate medical trainees, and support medical research. However, general image classification and retrieval approaches fail in grouping medical images from the physicians\u27 viewpoint. This is because fully-automated learning techniques cannot yet bridge the gap between image features and domain-specific content for the absence of expert knowledge. Understanding how experts get information from medical images is therefore an important research topic.
As a prior study, we conducted data elicitation experiments, where physicians were instructed to inspect each medical image towards a diagnosis while describing image content to a student seated nearby. Experts\u27 eye movements and their verbal descriptions of the image content were recorded to capture various aspects of expert image understanding. This dissertation aims at an intuitive approach to extracting expert knowledge, which is to find patterns in expert data elicited from image-based diagnoses. These patterns are useful to understand both the characteristics of the medical images and the experts\u27 cognitive reasoning processes.
The transformation from the viewed raw image features to interpretation as domain-specific concepts requires experts\u27 domain knowledge and cognitive reasoning. This dissertation also approximates this transformation using a matrix factorization-based framework, which helps project multiple expert-derived data modalities to high-level abstractions.
To combine additional expert interventions with computational processing capabilities, an interactive machine learning paradigm is developed to treat experts as an integral part of the learning process. Specifically, experts refine medical image groups presented by the learned model locally, to incrementally re-learn the model globally. This paradigm avoids the onerous expert annotations for model training, while aligning the learned model with experts\u27 sense-making
- …