571 research outputs found
Unsupervised Shift-invariant Feature Learning from Time-series Data
Unsupervised feature learning is one of the key components of machine learningand articial intelligence. Learning features from high dimensional streaming data isan important and dicult problem which is incorporated with number of challenges.Moreover, feature learning algorithms need to be evaluated and generalized for timeseries with dierent patterns and components. A detailed study is needed to clarifywhen simple algorithms fail to learn features and whether we need more complicatedmethods.In this thesis, we show that the systematic way to learn meaningful featuresfrom time-series is by using convolutional or shift-invariant versions of unsupervisedfeature learning. We experimentally compare the shift-invariant versions of clustering,sparse coding and non-negative matrix factorization algorithms for: reconstruction,noise separation, prediction, classication and simulating auditory lters from acousticsignals. The results show that the most ecient and highly scalable clustering algorithmwith a simple modication in inference and learning phase is able to produce meaningfulresults. Clustering features are also comparable with sparse coding and non-negativematrix factorization in most of the tasks (e.g. classication) and even more successful insome (e.g. prediction). Shift invariant sparse coding is also used on a novel application,inferring hearing loss from speech signal and produced promising results.Performance of algorithms with regard to some important factors such as: timeseries components, number of features and size of receptive eld is also analyzed. Theresults show that there is a signicant positive correlation between performance of clusteringwith degree of trend, frequency skewness, frequency kurtosis and serial correlationof data, whereas, the correlation is negative in the case of dataset average bandwidth.Performance of shift invariant sparse coding is aected by frequency skewness, frequencykurtosis and serial correlation of data. Non-Negative matrix factorization is influenced by data characteristics same as clustering
Human-in-the-Loop Optimization for Deep Stimulus Encoding in Visual Prostheses
Neuroprostheses show potential in restoring lost sensory function and
enhancing human capabilities, but the sensations produced by current devices
often seem unnatural or distorted. Exact placement of implants and differences
in individual perception lead to significant variations in stimulus response,
making personalized stimulus optimization a key challenge. Bayesian
optimization could be used to optimize patient-specific stimulation parameters
with limited noisy observations, but is not feasible for high-dimensional
stimuli. Alternatively, deep learning models can optimize stimulus encoding
strategies, but typically assume perfect knowledge of patient-specific
variations. Here we propose a novel, practically feasible approach that
overcomes both of these fundamental limitations. First, a deep encoder network
is trained to produce optimal stimuli for any individual patient by inverting a
forward model mapping electrical stimuli to visual percepts. Second, a
preferential Bayesian optimization strategy utilizes this encoder to optimize
patient-specific parameters for a new patient, using a minimal number of
pairwise comparisons between candidate stimuli. We demonstrate the viability of
this approach on a novel, state-of-the-art visual prosthesis model. We show
that our approach quickly learns a personalized stimulus encoder, leads to
dramatic improvements in the quality of restored vision, and is robust to noisy
patient feedback and misspecifications in the underlying forward model.
Overall, our results suggest that combining the strengths of deep learning and
Bayesian optimization could significantly improve the perceptual experience of
patients fitted with visual prostheses and may prove a viable solution for a
range of neuroprosthetic technologies
Deep Multimodal Feature Encoding for Video Ordering
True understanding of videos comes from a joint analysis of all its
modalities: the video frames, the audio track, and any accompanying text such
as closed captions. We present a way to learn a compact multimodal feature
representation that encodes all these modalities. Our model parameters are
learned through a proxy task of inferring the temporal ordering of a set of
unordered videos in a timeline. To this end, we create a new multimodal dataset
for temporal ordering that consists of approximately 30K scenes (2-6 clips per
scene) based on the "Large Scale Movie Description Challenge". We analyze and
evaluate the individual and joint modalities on three challenging tasks: (i)
inferring the temporal ordering of a set of videos; and (ii) action
recognition. We demonstrate empirically that multimodal representations are
indeed complementary, and can play a key role in improving the performance of
many applications.Comment: IEEE International Conference on Computer Vision (ICCV) Workshop on
Large Scale Holistic Video Understanding. The datasets and code are available
at https://github.com/vivoutlaw/tcb
Probabilistic models of contextual effects in Auditory Pitch Perception
Perception was recognised by Helmholtz as an inferential process whereby learned expectations about the environment combine with sensory experience to give rise to percepts. Expectations are flexible, built from past experiences over multiple time-scales. What is the nature of perceptual expectations? How are they learned? How do they affect perception? These are the questions I propose to address in this thesis. I focus on two important yet simple perceptual attributes of sounds whose perception is widely regarded as effortless and automatic : pitch and frequency. In a first study, I aim to propose a definition of pitch as the solution of a computational goal. Pitch is a fundamental and salient perceptual attribute of many behaviourally important sounds including speech and music. The effortless nature of its perception has led to the search for a direct physical correlate of pitch and for mechanisms to extract pitch from peripheral neural responses. I propose instead that pitch is the outcome of a probabilistic inference of an underlying periodicity in sounds given a learned statistical prior over naturally pitch-evoking sounds, explaining in a single model a wide range of psychophysical results. In two other psychophysical studies I study how and at what time-scales recent sensory history affects the perception of frequency shifts and pitch shifts. (1) When subjects are presented with ambiguous pitch shifts (using octave ambiguous Shepard tone pairs), I show that sensory history is used to leverage the ambiguity in a way that reflects expectations of spectro-temporal continuity of auditory scenes. (2) In delayed 2 tone frequency discrimination tasks, I explore the contraction bias : when asked to report which of two tones separated by brief silence is higher, subjects behave as though they hear the earlier tone ’contracted’ in frequency towards a combination of recently presented stimulus frequencies, and the mean of the overall distribution of tones used in the experiment. I propose that expectations - the statistical learning of the sampled stimulus distribution - are built online and combined with sensory evidence in a statistically optimal fashion. Models derived in the thesis embody the concept of perception as unconscious inference. The results support the view that even apparently primitive acoustic percepts may derive from subtle statistical inference, suggesting that such inferential processes operate at all levels across our sensory systems
Autonomous Discovery of Motor Constraints in an Intrinsically-Motivated Vocal Learner
This work introduces new results on the modeling of early-vocal development using artificial intelligent cognitive architectures and a simulated vocal tract. The problem is addressed using intrinsically-motivated learning algorithms for autonomous sensorimotor exploration, a kind of algorithm belonging to the active learning architectures family. The artificial agent is able to autonomously select goals to explore its own sensorimotor system in regions where its competence to execute intended goals is improved. We propose to include a somatosensory system to provide a proprioceptive feedback signal to reinforce learning through the autonomous discovery of motor constraints. Constraints are represented by a somatosensory model which is unknown beforehand to the learner. Both the sensorimotor and somatosensory system are modeled using Gaussian mixture models. We argue that using an architecture which includes a somatosensory model would reduce redundancy in the sensorimotor model and drive the learning process more efficiently than algorithms taking into account only auditory feedback. The role of this proposed system is to predict whether an undesired collision within the vocal tract under a certain motor configuration is likely to occur. Thus, compromised motor configurations are rejected, guaranteeing that the agent is less prone to violate its own constraints.Peer ReviewedPostprint (author's final draft
Acoustically Inspired Probabilistic Time-domain Music Transcription and Source Separation.
PhD ThesisAutomatic music transcription (AMT) and source separation are important
computational tasks, which can help to understand, analyse and process music
recordings. The main purpose of AMT is to estimate, from an observed
audio recording, a latent symbolic representation of a piece of music (piano-roll).
In this sense, in AMT the duration and location of every note played is
reconstructed from a mixture recording. The related task of source separation
aims to estimate the latent functions or source signals that were mixed
together in an audio recording. This task requires not only the duration and
location of every event present in the mixture, but also the reconstruction
of the waveform of all the individual sounds. Most methods for AMT and
source separation rely on the magnitude of time-frequency representations
of the analysed recording, i.e., spectrograms, and often arbitrarily discard
phase information. On one hand, this decreases the time resolution in AMT.
On the other hand, discarding phase information corrupts the reconstruction
in source separation, because the phase of each source-spectrogram must
be approximated. There is thus a need for models that circumvent phase
approximation, while operating at sample-rate resolution.
This thesis intends to solve AMT and source separation together from
an unified perspective. For this purpose, Bayesian non-parametric signal
processing, covariance kernels designed for audio, and scalable variational
inference are integrated to form efficient and acoustically-inspired probabilistic
models. To circumvent phase approximation while keeping sample-rate
resolution, AMT and source separation are addressed from a Bayesian time-domain
viewpoint. That is, the posterior distribution over the waveform of
each sound event in the mixture is computed directly from the observed data.
For this purpose, Gaussian processes (GPs) are used to define priors over the
sources/pitches. GPs are probability distributions over functions, and its
kernel or covariance determines the properties of the functions sampled from
a GP. Finally, the GP priors and the available data (mixture recording) are
combined using Bayes' theorem in order to compute the posterior distributions
over the sources/pitches.
Although the proposed paradigm is elegant, it introduces two main challenges.
First, as mentioned before, the kernel of the GP priors determines the
properties of each source/pitch function, that is, its smoothness, stationariness,
and more importantly its spectrum. Consequently, the proposed model
requires the design of flexible kernels, able to learn the rich frequency content
and intricate properties of audio sources. To this end, spectral mixture
(SM) kernels are studied, and the Mat ern spectral mixture (MSM) kernel
is introduced, i.e. a modified version of the SM covariance function. The
MSM kernel introduces less strong smoothness, thus it is more suitable for
modelling physical processes. Second, the computational complexity of GP
inference scales cubically with the number of audio samples. Therefore, the
application of GP models to large audio signals becomes intractable. To
overcome this limitation, variational inference is used to make the proposed
model scalable and suitable for signals in the order of hundreds of thousands
of data points.
The integration of GP priors, kernels intended for audio, and variational
inference could enable AMT and source separation time-domain methods to
reconstruct sources and transcribe music in an efficient and informed manner.
In addition, AMT and source separation are current challenges, because
the spectra of the sources/pitches overlap with each other in intricate
ways. Thus, the development of probabilistic models capable of differentiating
sources/pitches in the time domain, despite the high similarity between
their spectra, opens the possibility to take a step towards solving source separation
and automatic music transcription. We demonstrate the utility of our
methods using real and synthesized music audio datasets for various types of
musical instruments
- …