Modeling unsupervised phonetic and phonological learning in Generative Adversarial Phonology
This paper models phonetic and phonological learning as a dependency between the random space and generated speech data in the Generative Adversarial Network architecture and proposes a methodology to uncover the network's internal representations that correspond to phonetic and phonological features. A Generative Adversarial Network (Goodfellow et al. 2014; implemented as WaveGAN for acoustic data by Donahue et al. 2019) was trained on an allophonic distribution in English, where voiceless stops surface as aspirated word-initially before stressed vowels unless preceded by the sibilant [s]. The network successfully learns the allophonic alternation: its generated speech reproduces the conditional distribution of aspiration duration. Additionally, the network generates innovative outputs for which no evidence is available in the training data, suggesting that it segments the continuous speech signal into units that can be productively recombined. The paper also proposes a technique for establishing the network's internal representations. We identify latent variables that directly correspond to the presence of [s] in the output. By manipulating these variables, we actively control the presence of [s], its frication amplitude, and the spectral shape of the frication noise in the generated outputs.
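The latent-manipulation technique lends itself to a short illustration. Below is a minimal sketch, in PyTorch, of the probing idea: fix one random latent vector, sweep a single variable, and inspect the resulting waveforms for changes in [s]. The variable index `S_DIM`, the latent size, and the generator interface are assumptions for illustration, not the paper's code.

```python
import torch

LATENT_SIZE = 100  # assumed latent dimensionality (WaveGAN's default)
S_DIM = 4          # hypothetical index of a latent variable tied to [s]

def sweep_latent(generator: torch.nn.Module, n_steps: int = 8) -> torch.Tensor:
    """Hold one random latent vector fixed and sweep a single variable,
    so any change in the output is attributable to that variable."""
    z = torch.randn(1, LATENT_SIZE).repeat(n_steps, 1)  # n copies of one base vector
    z[:, S_DIM] = torch.linspace(-3.0, 3.0, n_steps)    # sweep, incl. beyond training range
    with torch.no_grad():
        return generator(z)  # (n_steps, 1, num_samples) waveforms to inspect for [s]
```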
Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no direct access to speech data
Human speakers encode information into raw speech, which is then decoded by the listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how encoding and decoding of lexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine the production and perception principles of speech. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: a network that must learn unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN, arXiv:2006.02951) and test how the networks classify acoustic lexical items in unobserved test data. Strong evidence in favor of lexical learning and a causal relationship between latent codes and meaningful sublexical units emerge. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data without accessing real training data directly. We propose a technique to explore lexical (holistic) and sublexical (featural) learned representations in the classifier network. The results bear implications for unsupervised speech technology, as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics.
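One way to picture the classification test on unobserved data is the sketch below: real test audio is passed through the trained classifier ("Q") network, and because the latent codes are learned without supervision, they are first aligned to lexical labels by majority vote before scoring. The interfaces and the alignment step are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def code_to_label(pred_codes, labels, num_codes):
    """Majority-vote mapping from unsupervised codes to lexical labels."""
    mapping = {}
    for c in range(num_codes):
        mask = pred_codes == c
        if mask.any():
            mapping[c] = labels[mask].mode().values.item()
    return mapping

def lexical_accuracy(q_network, test_audio, labels, num_codes):
    """test_audio: (N, 1, num_samples); labels: (N,) integer lexical identities."""
    with torch.no_grad():
        pred = q_network(test_audio).argmax(dim=1)    # recovered latent codes
    mapping = code_to_label(pred, labels, num_codes)  # align unsupervised codes first
    decoded = torch.tensor([mapping[c.item()] for c in pred])
    return (decoded == labels).float().mean().item()
```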
Large Linguistic Models: Analyzing theoretical linguistic abilities of LLMs
The performance of large language models (LLMs) has recently improved to the point where the models can perform well on many language tasks. We show here that, for the first time, the models can also generate coherent and valid formal analyses of linguistic data and illustrate the vast potential of large language models for analyses of their metalinguistic abilities. LLMs are primarily trained on language data in the form of text; analyzing and evaluating their metalinguistic abilities improves our understanding of their general capabilities and sheds new light on theoretical models in linguistics. In this paper, we probe GPT-4's metalinguistic capabilities by focusing on three subfields of formal linguistics: syntax, phonology, and semantics. We outline a research program for metalinguistic analyses of large language models, propose experimental designs, provide general guidelines, discuss limitations, and offer future directions for this line of research. This line of inquiry also exemplifies behavioral interpretability of deep learning, where models' representations are accessed by explicit prompting rather than by direct inspection of internal representations.
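The behavioral-interpretability setup described here amounts to eliciting analyses by prompting. A minimal sketch, assuming the OpenAI Python client (openai>=1.0); the phonology dataset in the prompt is invented for illustration and is not from the paper's materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative data exhibiting intervocalic voicing before the plural suffix.
prompt = (
    "Below are words from a hypothetical language:\n"
    "  [tap] 'road'   [tabi] 'roads'\n"
    "  [luk] 'house'  [lugi] 'houses'\n"
    "Describe the phonological process relating singulars to plurals "
    "and formalize the analysis as a rule."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```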
AI-assisted coding: Experiments with GPT-4
Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
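The finding that many generated tests fail against the associated code points to an obvious validation step. Below is a minimal sketch of such a human-in-the-loop check, assuming pytest is available; the file name and harness are illustrative, not the paper's setup.

```python
import pathlib
import subprocess

def run_generated_tests(test_source: str, workdir: str = ".") -> bool:
    """Write model-generated tests to disk, run them with pytest, and
    report whether the whole suite passes before the code is trusted."""
    test_file = pathlib.Path(workdir) / "test_generated.py"
    test_file.write_text(test_source)
    result = subprocess.run(
        ["pytest", str(test_file), "-q"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode == 0  # pytest exits 0 only if all tests pass
```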
Articulation GAN: Unsupervised modeling of articulatory learning
Generative deep neural networks are widely used for speech synthesis, but most existing models directly generate waveforms or spectral outputs. Humans, however, produce speech by controlling articulators, which produce speech sounds through the physical properties of sound propagation. We propose a new unsupervised generative model of speech production/synthesis that includes articulatory representations and thus more closely mimics human speech production. We introduce the Articulatory Generator to the Generative Adversarial Network paradigm. The Articulatory Generator needs to learn to generate articulatory representations (electromagnetic articulography, or EMA) in a fully unsupervised manner without ever accessing EMA data. A separate pre-trained physical model (ema2wav) then transforms the generated EMA representations into speech waveforms, which are sent to the Discriminator for evaluation. Articulatory analysis of the generated EMA representations suggests that the network learns to control articulators in a manner that closely follows human articulators during speech production. Acoustic analysis of the outputs suggests that the network learns to generate words that are part of the training data as well as novel innovative words that are absent from the training data. Our proposed architecture thus allows modeling of articulatory learning with deep neural networks from raw audio inputs in a fully unsupervised manner. We additionally discuss the implications of articulatory representations for cognitive models of human language and for speech technology in general.
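The pipeline has a clean modular structure that a few lines can make concrete. A sketch of the forward pass under assumed module interfaces (shapes, latent size, and signatures are illustrative): the generator emits EMA trajectories, the frozen ema2wav model renders them to audio, and only the audio reaches the Discriminator.

```python
import torch

def generator_forward(articulatory_generator, ema2wav, discriminator, batch_size=16):
    """One forward pass: latents -> EMA trajectories -> waveform -> realness score."""
    for p in ema2wav.parameters():
        p.requires_grad_(False)  # the physical model stays frozen; gradients still
                                 # flow through it back to the Articulatory Generator
    z = torch.randn(batch_size, 100)  # latent input (size assumed)
    ema = articulatory_generator(z)   # (B, channels, time) articulator trajectories
    audio = ema2wav(ema)              # (B, 1, num_samples) rendered waveform
    return discriminator(audio)       # the Discriminator only ever sees audio
```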
CiwaGAN: Articulatory information exchange
Humans encode information into sounds by controlling articulators and decode information from sounds using the auditory apparatus. This paper introduces CiwaGAN, a model of human spoken language acquisition that combines unsupervised articulatory modeling with an unsupervised model of information exchange through the auditory modality. While prior research includes unsupervised articulatory modeling and information exchange separately, our model is the first to combine the two components. The paper also proposes an improved articulatory model with more interpretable internal representations. The proposed CiwaGAN model is the most realistic approximation of human spoken language acquisition using deep learning. As such, it is useful for cognitively plausible simulations of the human speech act.
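How the two components might be wired into one objective can be sketched as follows; the loss form, the binary message code, and the weighting constant are assumptions for illustration rather than the paper's exact training objective.

```python
import torch
import torch.nn.functional as F

def generator_loss(gen, ema2wav, disc, q_net, batch_size=16, code_dim=8, lam=1.0):
    """Combine the adversarial term with a code-recovery ("information
    exchange") term: a listener network must decode the message from audio."""
    code = torch.randint(0, 2, (batch_size, code_dim)).float()  # message to convey
    z = torch.randn(batch_size, 100)                            # residual latent noise
    audio = ema2wav(gen(torch.cat([code, z], dim=1)))           # articulate, then render
    adv = -disc(audio).mean()                                   # fool the discriminator
    info = F.binary_cross_entropy_with_logits(q_net(audio), code)  # recover the message
    return adv + lam * info  # lam weights intelligibility against realism
```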