70 research outputs found
Statistical Knowledge and Learning in Phonology
This thesis deals with the theory of the phonetic component of grammar in a formal probabilistic inference framework: (1) it has been recognized since the beginning of generative phonology that some language-specific phonetic implementation is actually context-dependent, and thus it can be said that there are gradient "phonetic processes" in grammar in addition to categorical "phonological processes." However, no explicit theory has been developed to characterize these processes. Meanwhile, (2) it is understood that language acquisition and perception are both really informed guesswork: the result of both types of inference can reasonably be thought of as a less-than-perfect commitment, with multiple candidate grammars or parses considered, each associated with some degree of credence. Previous research has used probability theory to formalize these inferences in implemented computational models, especially in phonetics and phonology. In this role, computational models serve to demonstrate the existence of working learning/perception/parsing systems assuming a faithful implementation of one particular theory of human language, and are not intended to adjudicate whether that theory is correct. The current thesis (1) develops a theory of the phonetic component of grammar and how it relates to the greater phonological system, and (2) uses a formal Bayesian treatment of learning to evaluate this theory of the phonological architecture and to make predictions about how the resulting grammars will be organized. The coarse description of the consequence for linguistic theory is that the processes we think of as "allophonic" are actually language-specific, gradient phonetic processes, assigned to the phonetic component of grammar; strict allophones have no representation in the output of the categorical phonological grammar.
Learning weakly supervised multimodal phoneme embeddings
Recent works have explored deep architectures for learning multimodal speech
representation (e.g. audio and images, articulation and audio) in a supervised
way. Here we investigate the role of combining different speech modalities,
i.e. audio and visual information representing the lips movements, in a weakly
supervised way using Siamese networks and lexical same-different side
information. In particular, we ask whether one modality can benefit from the
other to provide a richer representation for phone recognition in a weakly
supervised setting. We introduce mono-task and multi-task methods for merging
speech and visual modalities for phone recognition. The mono-task learning
consists in applying a Siamese network on the concatenation of the two
modalities, while the multi-task learning receives several different
combinations of modalities at train time. We show that multi-task learning
enhances discriminability for visual and multimodal inputs while minimally
impacting auditory inputs. Furthermore, we present a qualitative analysis of
the obtained phone embeddings, and show that cross-modal visual input can
improve the discriminability of phonological features which are visually
discernible (rounding, open/close, labial place of articulation), resulting in
representations that are closer to abstract linguistic features than those
based on audio only.
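As a rough sketch of the weakly supervised setup described above (the function names, toy vectors, and margin value are illustrative assumptions, not the paper's implementation), the lexical same-different side information can drive a contrastive loss over concatenated audio and visual features:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_different_loss(x1, x2, same_word, margin=0.5):
    # Pull embeddings of the same word together; push embeddings of
    # different words apart until their similarity falls below the margin.
    sim = cosine(x1, x2)
    return (1.0 - sim) if same_word else max(0.0, sim - margin)

# Mono-task input: concatenate the audio and lip-feature vectors
# before feeding them to the Siamese network.
audio, lips = [0.2, 0.9, 0.4], [0.5, 0.1]
x = audio + lips
```

In the multi-task variant, the network would instead receive several combinations (audio-only, visual-only, concatenated) at train time.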
RNNs Implicitly Implement Tensor Product Representations
Recurrent neural networks (RNNs) can learn continuous vector representations
of symbolic structures such as sequences and sentences; these representations
often exhibit linear regularities (analogies). Such regularities motivate our
hypothesis that RNNs that show such regularities implicitly compile symbolic
structures into tensor product representations (TPRs; Smolensky, 1990), which
additively combine tensor products of vectors representing roles (e.g.,
sequence positions) and vectors representing fillers (e.g., particular words).
To test this hypothesis, we introduce Tensor Product Decomposition Networks
(TPDNs), which use TPRs to approximate existing vector representations. We
demonstrate using synthetic data that TPDNs can successfully approximate linear
and tree-based RNN autoencoder representations, suggesting that these
representations exhibit interpretable compositional structure; we explore the
settings that lead RNNs to induce such structure-sensitive representations. By
contrast, further TPDN experiments show that the representations of four models
trained to encode naturally-occurring sentences can be largely approximated
with a bag of words, with only marginal improvements from more sophisticated
structures. We conclude that TPDNs provide a powerful method for interpreting
vector representations, and that standard RNNs can induce compositional
sequence representations that are remarkably well approximated by TPRs; at the
same time, existing training tasks for sentence representation learning may not
be sufficient for inducing robust structural representations.
Comment: Accepted to ICLR 2019.
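The additive filler/role binding that TPDNs assume can be illustrated directly (a minimal sketch with hypothetical toy vectors; real models use learned, high-dimensional filler and role embeddings):

```python
def tpr(fillers, roles):
    # Tensor product representation (Smolensky, 1990): the sum over
    # the outer products of each filler vector with its role vector.
    nf, nr = len(fillers[0]), len(roles[0])
    T = [[0.0] * nr for _ in range(nf)]
    for f, r in zip(fillers, roles):
        for i in range(nf):
            for j in range(nr):
                T[i][j] += f[i] * r[j]
    return T

def unbind(T, role):
    # With orthonormal role vectors, multiplying T by a role vector
    # recovers the filler bound to that role.
    return [sum(T[i][j] * role[j] for j in range(len(role)))
            for i in range(len(T))]

# Toy example: two fillers bound to one-hot (orthonormal) roles
# representing sequence positions.
cat, dog = [1.0, 2.0], [3.0, 4.0]
first, second = [1.0, 0.0], [0.0, 1.0]
T = tpr([cat, dog], [first, second])
```

Because the roles are orthonormal, `unbind(T, first)` returns the `cat` vector and `unbind(T, second)` returns `dog`, which is the structure TPDNs search for in trained RNN representations.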
Extracting binary features from speech production errors and perceptual confusions using Redundancy-Corrected Transmission
We develop a mutual information-based feature extraction method and apply it to English speech production and perception error data. The extracted features show different phoneme groupings than conventional phonological features, especially in the place features. We evaluate how well the extracted features can define natural classes to account for English phonological patterns. The features extracted from production errors had performance close to conventional phonological features, while the features extracted from perception errors performed worse. The study shows that featural information can be extracted from underused sources of data such as confusion matrices of production and perception errors, and the results suggest that phonological patterning is more closely related to natural production errors than to perception errors in noisy speech.
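The core quantity behind such a transmission analysis can be sketched as plain mutual information over a confusion-count matrix (this sketch omits the redundancy correction the method's name refers to, and the toy counts below are invented):

```python
import math

def transmission(confusions):
    # Mutual information (in bits) between intended (rows) and
    # observed (columns) categories in a confusion-count matrix.
    total = sum(sum(row) for row in confusions)
    p_row = [sum(row) / total for row in confusions]
    p_col = [sum(row[j] for row in confusions) / total
             for j in range(len(confusions[0]))]
    mi = 0.0
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            if count:
                p = count / total
                mi += p * math.log2(p / (p_row[i] * p_col[j]))
    return mi
```

A perfectly diagonal two-category matrix (no confusions) transmits 1 bit, while a uniform matrix (responses independent of the intended category) transmits 0 bits.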
Comparing unsupervised speech learning directly to human performance in speech perception
We compare the performance of humans (English and French listeners) with that of an unsupervised speech model in a perception experiment (an ABX discrimination task). Although the ABX task has been used for acoustic model evaluation in previous research, the results have not, until now, been compared directly with human behaviour in an experiment. We show that a standard, well-performing model (DPGMM) has better accuracy at predicting human responses than the acoustic baseline. The model also shows a native language effect, better resembling native listeners of the language on which it was trained. However, the native language effect shown by the models is different from the one shown by the human listeners, and, notably, the models do not show the same overall patterns of vowel confusions.
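Scoring an ABX discrimination task can be sketched as follows (the distance function and toy 1-D stimuli are stand-ins for real acoustic-model distances, e.g. DTW over frame-level representations):

```python
def abx_accuracy(dist, triples):
    # Each triple (a, b, x): a and x belong to the same category,
    # b to a different one. Score 1 when x is closer to a than to b,
    # 0.5 on ties, 0 otherwise; return the mean over all triples.
    score = 0.0
    for a, b, x in triples:
        da, db = dist(a, x), dist(b, x)
        score += 1.0 if da < db else (0.5 if da == db else 0.0)
    return score / len(triples)

# Toy 1-D stimuli with absolute difference as the distance.
d = lambda u, v: abs(u - v)
triples = [(0.0, 1.0, 0.2), (1.0, 0.0, 0.9), (0.0, 1.0, 0.5)]
```

Chance performance is 0.5; a model (or listener) that reliably discriminates the two categories scores close to 1.0.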
Analogies minus analogy test: measuring regularities in word embeddings
Vector space models of words have long been claimed to capture linguistic
regularities as simple vector translations, but problems have been raised with
this claim. We decompose and empirically analyze the classic arithmetic word
analogy test, to motivate two new metrics that address the issues with the
standard test, and which distinguish between class-wise offset concentration
(similar directions between pairs of words drawn from different broad classes,
such as France--London, China--Ottawa, ...) and pairing consistency (the
existence of a regular transformation between correctly-matched pairs such as
France:Paris::China:Beijing). We show that, while the standard analogy test is
flawed, several popular word embeddings do nevertheless encode linguistic
regularities.
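One of the two quantities described above, offset concentration, can be sketched as the mean pairwise cosine similarity among offset vectors (a simplified illustration with toy 2-D vectors, not the paper's exact formulation):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def offset_concentration(pairs):
    # Mean pairwise cosine similarity among the offsets b - a:
    # values near 1 mean the pairs share a common direction.
    offsets = [[bi - ai for ai, bi in zip(a, b)] for a, b in pairs]
    sims = [cosine(offsets[i], offsets[j])
            for i in range(len(offsets))
            for j in range(i + 1, len(offsets))]
    return sum(sims) / len(sims)

# Toy "country -> capital" pairs sharing a common offset direction.
pairs = [([0.0, 0.0], [1.0, 0.1]),
         ([2.0, 1.0], [3.0, 1.1]),
         ([5.0, 4.0], [6.0, 4.1])]
```

High offset concentration alone does not imply pairing consistency: offsets can point the same way without there being a regular transformation that matches each word to its correct partner, which is why the paper separates the two metrics.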
Mouse tracking as a window into decision making
Mouse tracking promises to be an efficient method to investigate the dynamics of cognitive processes: it is easier to deploy than eyetracking, yet in principle it is much more fine-grained than looking at response times. We investigated these claimed benefits directly, asking how the features of decision processes (notably, decision changes) might be captured in mouse movements. We ran two experiments, one in which we explicitly manipulated whether our stimuli triggered a flip in decision, and one in which we replicated more ecological, classical mouse-tracking results on linguistic negation (Dale & Duran, Cognitive Science, 35, 983-996, 2011). We concluded, first, that spatial information (mouse path) is more important than temporal information (speed and acceleration) for detecting decision changes, and we offer a comparison of the sensitivities of various typical measures used in analyses of mouse tracking (area under the trajectory curve, direction flips, etc.). We do so using an "optimal" analysis of our data (a linear discriminant analysis explicitly trained to classify trajectories) and examine what type of data (position, speed, or acceleration) it capitalizes on. We also quantify how its results compare with those based on more standard measures.
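Two of the standard trajectory measures mentioned above can be sketched for a recorded mouse path (toy coordinates; real analyses typically also time-normalize trajectories before comparing them):

```python
def direction_flips(xs):
    # Number of reversals in horizontal movement direction,
    # a classic signature of a change of mind mid-trajectory.
    signs = [1 if dx > 0 else -1
             for dx in (x1 - x0 for x0, x1 in zip(xs, xs[1:]))
             if dx != 0]
    return sum(1 for s0, s1 in zip(signs, signs[1:]) if s0 != s1)

def max_deviation(xs, ys):
    # Largest perpendicular distance from the straight line that
    # joins the start and end points of the trajectory.
    (x0, y0), (x1, y1) = (xs[0], ys[0]), (xs[-1], ys[-1])
    length = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return max(abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / length
               for x, y in zip(xs, ys))
```

Both are spatial (path-based) measures; the paper's comparison pits measures like these against temporal ones (speed, acceleration) as inputs to the trained discriminant analysis.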
Tensor Product Decomposition Networks: Uncovering Representations of Structure Learned by Neural Networks
We introduce an analysis technique for understanding compositional structure present in the vector representations used by neural networks. The inner workings of neural networks are notoriously difficult to understand, and in particular it is far from clear how they manage to perform remarkably well on tasks that depend on compositional structure even though they use continuous vector representations with no obvious compositional structure. Using our analysis technique, we show that the representations of these models can be closely approximated by Tensor Product Representations, a type of interpretable structure that lends significant insight into the workings of these hard-to-interpret models.
The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the followup to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed.
Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017,
Okinawa, Japan.