Acoustic absement in detail: Quantifying acoustic differences across time-series representations of speech data
The speech signal is a consummate example of time-series data. The acoustics
of the signal change over time, sometimes dramatically. Yet, the most common
type of comparison we perform in phonetics is between instantaneous acoustic
measurements, such as formant values. In the present paper, I discuss the
concept of absement as a quantification of differences between two time-series.
I then provide an experimental example of absement applied to phonetic analysis
for human and/or computer speech recognition. The experiment is a
template-based speech recognition task, using dynamic time warping to compare
the acoustics between recordings of isolated words. A recognition accuracy of
57.9% was achieved. The results of the experiment are discussed in terms of
using absement as a tool, as well as the implications of using acoustics-only
models of spoken word recognition with the word as the smallest discrete
linguistic unit.
Comment: 5 pages, 1 figure, accepted for ICPhS 202
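The comparison at the heart of such a template-matching experiment can be sketched with a toy dynamic time warping routine. This is not the paper's pipeline (which compares multi-dimensional acoustic features of recorded words); it is a minimal 1-D sketch showing how DTW accumulates frame-wise differences along an optimal alignment, which is the kind of cumulative quantity an absement-style analysis summarizes over time.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences.

    Returns the minimal cumulative frame-wise distance along an
    optimal warping path.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # step pattern: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A template-based recognizer picks the stored template with the
# smallest DTW distance to the unknown utterance.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
query = [0.0, 1.0, 1.0, 2.0, 1.0, 0.0]  # same "word", stretched in time
print(dtw_distance(template, query))  # → 0.0: warping absorbs the stretch
```

Because the warp absorbs the temporal stretch, the two sequences match perfectly despite their different lengths, which is exactly why DTW suits word-level acoustic comparison.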
Speaker independent isolated word recognition
The work presented in this thesis concerns the recognition of
isolated words using a pattern matching approach. In such a system,
an unknown speech utterance, which is to be identified, is
transformed into a pattern of characteristic features. These
features are then compared with a set of pre-stored reference
patterns that were generated from the vocabulary words. The unknown
word is identified as that vocabulary word for which the reference
pattern gives the best match.
One of the major difficulties in the pattern comparison process is
that speech patterns obtained from the same word exhibit non-linear
temporal fluctuations and thus a high degree of redundancy. The
initial part of this thesis considers various dynamic time warping
techniques used for normalizing the temporal differences between
speech patterns. Redundancy removal methods are also considered, and
their effect on the recognition accuracy is assessed.
Although dynamic time warping algorithms provide considerable
improvement in the accuracy of isolated word recognition schemes,
their performance is ultimately limited by a poor ability
to discriminate between acoustically similar words. Methods for
enhancing the identification rate among acoustically similar words,
by using common pattern features for similar sounding regions, are
investigated.
Pattern-matching-based, speaker-independent systems can only operate
with a high recognition rate by using multiple reference patterns
for each of the words included in the vocabulary. These patterns are
obtained from the utterances of a group of speakers. The use of
multiple reference patterns not only leads to a large increase in
the memory requirements of the recognizer, but also to an increase
in the computational load. A recognition system is proposed in this
thesis which overcomes these difficulties by (i) employing vector
quantization techniques to reduce the storage of reference patterns,
and (ii) eliminating the need for dynamic time warping, which
reduces the computational complexity of the system.
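The storage reduction in (i) can be illustrated with a toy vector quantizer. This is a generic k-means-style codebook, not the thesis's specific design; the function names, codebook size, and feature dimensions below are illustrative assumptions.

```python
import numpy as np

def train_codebook(vectors, k, iters=20, seed=0):
    """Toy k-means-style vector quantizer: learn k codewords."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest codeword
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each codeword to the centroid of its members
        for c in range(k):
            members = vectors[labels == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    """Replace each feature vector by the index of its nearest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

# A reference pattern of 100 twelve-dimensional frames shrinks to
# 100 small integers plus one shared codebook, which is where the
# memory saving for stored reference patterns comes from.
rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 12))
codebook = train_codebook(frames, k=8)
indices = quantize(frames, codebook)
print(indices.shape, codebook.shape)  # → (100,) (8, 12)
```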
Finally, a method of identifying the acoustic structure of an
utterance in terms of voiced, unvoiced, and silence segments by using
fuzzy set theory is proposed. The acoustic structure is then
employed to enhance the recognition accuracy of a conventional
isolated word recognizer.
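A voiced/unvoiced/silence decision of this kind can be sketched with fuzzy membership grades over two standard short-time features, frame energy and zero-crossing rate. The membership functions and thresholds below are hypothetical illustrations, not the thesis's own formulation.

```python
import numpy as np

def frame_features(frame):
    """Short-time energy and zero-crossing rate of one frame."""
    energy = float(np.mean(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
    return energy, zcr

def classify_frame(frame, e_low=1e-4, e_high=1e-2, z_mid=0.25):
    """Fuzzy-style voiced/unvoiced/silence decision (hypothetical
    membership functions; defuzzified by taking the largest grade)."""
    energy, zcr = frame_features(frame)
    mu_silence = max(0.0, 1.0 - energy / e_low)         # high only near zero energy
    loudness = min(1.0, energy / e_high)                # saturating loudness grade
    mu_voiced = loudness * max(0.0, 1.0 - zcr / z_mid)  # audible and low ZCR
    mu_unvoiced = loudness * min(1.0, zcr / z_mid)      # audible and high ZCR
    grades = {"silence": mu_silence, "voiced": mu_voiced, "unvoiced": mu_unvoiced}
    return max(grades, key=grades.get)

t = np.arange(256) / 8000.0
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)      # periodic, few zero crossings
rng = np.random.default_rng(0)
unvoiced = 0.05 * rng.standard_normal(256)      # noise-like, many zero crossings
silence = np.zeros(256)
print(classify_frame(voiced), classify_frame(unvoiced), classify_frame(silence))
```

The soft grades, rather than hard thresholds, are what a fuzzy-set formulation contributes: borderline frames carry partial membership in several classes before the final decision.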
A Tutorial on Prototyping Internet of Things Devices and Systems: A Gentle Introduction to Technology that Shapes Our Lives
The Internet of Things, which has been quietly building and evolving over the past decade, now impacts many aspects of society, including homes, battlefields, and medical communities. Research in information systems has traditionally concentrated on exploring the impacts of such technology rather than on how to actually create systems that use it. Although design science research could contribute much to the Internet of Things, this type of research from the Information Systems community has been sparse, most likely because of the knowledge barriers to learning and understanding this kind of technology development. Recognizing the importance of the continued evolution of the Internet of Things, this paper provides a basic tutorial on how to construct Internet of Things prototypes. It is intended to educate Information Systems scholars on how to build their own Internet of Things prototypes so that they can conduct technical research in this area and instruct their students on how to do the same.
A Silent-Speech Interface using Electro-Optical Stomatography
Speech technology is a large and growing industry that enriches the lives of technologically-minded users in numerous ways. Many potential users are, however, excluded: namely, all speakers who can produce speech only with difficulty or not at all.
Silent-Speech Interfaces offer a way to communicate with machines through a convenient speech-driven interface without requiring acoustic speech. In principle, they can also provide a replacement voice by artificially synthesizing the intended utterances that the user only silently articulates. This dissertation presents a new Silent-Speech Interface based on a newly developed measurement system called Electro-Optical Stomatography and a novel parametric vocal tract model that enables real-time synthesis of speech from the measured data. With the hardware, isolated word recognition studies were conducted that reached and surpassed the state of the art in intra- and inter-individual accuracy. In addition, a study was completed in which the hardware was used to control the vocal tract model in a direct articulation-to-speech synthesis. While the intelligibility of synthesized vowels was rated very high, the intelligibility of consonants and continuous speech remained poor. Promising ways to improve the system are discussed in the outlook.
Statement of authorship iii
Abstract v
List of Figures vii
List of Tables xi
Acronyms xiii
1. Introduction 1
1.1. The concept of a Silent-Speech Interface 4
1.2. Structure of this work 4
2. Fundamentals of phonetics 7
2.1. Components of the human speech production system 7
2.2. Vowel sounds 9
2.3. Consonantal sounds 10
2.4. Acoustic properties of speech sounds 15
2.5. Coarticulation 18
2.6. Phonotactics 19
2.7. Summary and implications for the design of a Silent-Speech Interface (SSI) 21
3. Articulatory data acquisition techniques in Silent-Speech Interfaces 25
3.1. Introduction 25
3.2. Scope of the literature review 27
3.3. Video Recordings 27
3.4. Ultrasonography 30
3.5. Electromyography 34
3.6. Permanent-Magnetic Articulography 41
3.7. Electromagnetic Articulography 44
3.8. Radio waves 47
3.9. Palatography 49
3.10. Conclusion and Discussion 52
4. Electro-Optical Stomatography 55
4.1. Contact sensors 55
4.2. Optical distance sensors 57
4.3. Lip sensor 81
4.4. Sensor Unit 84
4.5. Control Unit 89
4.6. Software 93
5. Articulation-to-Text 99
5.1. Introduction 99
5.2. Command word recognition pilot study 99
5.3. Command word recognition small-scale study 102
6. Articulation-to-Speech 109
6.1. Introduction 109
6.2. Articulatory synthesis 109
6.3. The six point vocal tract model 113
6.4. Objective evaluation of the vocal tract model 116
6.5. Perceptual evaluation of the vocal tract model 120
6.6. Direct synthesis using EOS to control the vocal tract model 125
6.7. Pitch and voicing 132
7. Summary and outlook 145
7.1. Summary of the contributions 145
7.2. Outlook 146
A. Overview of the International Phonetic Alphabet 151
B. Mathematical proofs and derivations 153
B.1. Combinatoric calculations illustrating the reduction of possible syllables using phonotactics 153
B.2. Signal Averaging 155
B.3. Effect of the contact sensor area on the conductance 155
B.4. Calculation of the forward current for the OP280V diode 155
C. Schematics and layouts 157
C.1. Schematics of the control unit 158
C.2. Layout of the control unit 163
C.3. Bill of materials of the control unit 164
C.4. Schematics of the sensor unit 165
C.5. Layout of the sensor unit 166
C.6. Bill of materials of the sensor unit 167
D. Sensor unit assembly 169
E. Firmware flow and data protocol 177
F. Palate file format 181
G. Supplemental material regarding the vocal tract model 183
H. Articulation-to-Speech: Optimal hyperparameters 189
Bibliography 191
Speaker Independent Acoustic-to-Articulatory Inversion
Acoustic-to-articulatory inversion, the determination of articulatory parameters from acoustic signals, is a difficult but important problem for many speech processing applications, such as automatic speech recognition (ASR) and computer aided pronunciation training (CAPT). In recent years, several approaches have been successfully implemented for speaker dependent models with parallel acoustic and kinematic training data. However, in many practical applications inversion is needed for new speakers for whom no articulatory data is available. In order to address this problem, this dissertation introduces a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models (HMM). This approach uses a robust normalized articulatory space and palate-referenced articulatory features combined with speaker-weighted adaptation to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette Electromagnetic Articulography - Mandarin Accented English (EMA-MAE) corpus using 20 native English speakers. Cross-speaker inversion results show that given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker independent inversion performance even without kinematic training data.
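The reference-speaker-weighting idea can be sketched as follows. The similarity measure and the direct weighted averaging below are illustrative assumptions: the actual PRSW method derives its weights from parallel acoustic-articulatory HMMs rather than from raw feature distances.

```python
import numpy as np

def reference_weights(target_acoustic, reference_acoustics):
    """Weight each reference speaker by acoustic closeness to the target.

    Hypothetical similarity: inverse Euclidean distance between mean
    acoustic feature vectors, normalized to sum to one.
    """
    dists = np.array([np.linalg.norm(target_acoustic - r)
                      for r in reference_acoustics])
    sims = 1.0 / (dists + 1e-8)  # closer speakers get larger weights
    return sims / sims.sum()

def invert(target_acoustic, reference_acoustics, reference_articulatory):
    """Estimate the target's articulatory vector as a weighted
    combination of the reference speakers' articulatory vectors."""
    w = reference_weights(target_acoustic, reference_acoustics)
    return w @ reference_articulatory

# Three reference speakers: mean acoustic features paired with
# articulatory features (all values illustrative).
ref_ac = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ref_art = np.array([[10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
target = np.array([0.9, 0.1])  # acoustically closest to speaker 0
est = invert(target, ref_ac, ref_art)
print(est)  # dominated by speaker 0's articulatory pattern
```

The point of the weighting is that no articulatory data from the new speaker is needed: the estimate borrows articulatory structure from acoustically similar reference speakers.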