340 research outputs found
MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning
In this paper, we present a methodology for linguistic feature extraction,
focusing particularly on automatically syllabifying words in multiple
languages, with a design to be compatible with a forced-alignment tool, the
Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our
method focuses on the extraction of phonetic transcriptions from text, stress
marks, and a unified automatic syllabification (in text and phonetic domains).
The system was built with open-source components and resources. Through an
ablation study, we demonstrate the efficacy of our approach in automatically
syllabifying words from several languages (English, French and Spanish).
Additionally, we apply the technique to the transcriptions of the CMU ARCTIC
dataset, generating valuable annotations available
online (https://github.com/noetits/MUST_P-SRL) that are ideal for
speech representation learning, speech unit discovery, and disentanglement of
speech factors in several speech-related fields. Comment: Accepted for publication at EMNLP 202
Practical Attacks Against Graph-based Clustering
Graph modeling allows numerous security problems to be tackled in a general
way; however, little work has been done to understand the ability of graph models to
withstand adversarial attacks. We design and evaluate two novel graph attacks
against a state-of-the-art network-level, graph-based detection system. Our
work highlights areas in adversarial machine learning that have not yet been
addressed, specifically: graph-based clustering techniques, and a global
feature space where realistic attackers without perfect knowledge must be
accounted for (by the defenders) in order to be practical. Even though less
informed attackers can evade graph clustering with low cost, we show that some
practical defenses are possible. Comment: ACM CCS 201
Automatic Speech Recognition for Low-Resource and Morphologically Complex Languages
The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases of word error rates, allowing for the use of this technology in smart phones and personal home assistants in high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for most of the world’s languages. In this work, we investigate the applicability of three distinct architectures that have previously been used for ASR in languages with limited training resources. We tested these architectures using publicly available ASR datasets for several typologically and orthographically diverse languages, whose data was produced under a variety of conditions using different speech collection strategies, practices, and equipment. Additionally, we performed data augmentation on this audio, such that the amount of data could increase nearly tenfold, synthetically creating higher resource training. The architectures and their individual components were modified, and parameters explored such that we might find a best-fit combination of features and modeling schemas to fit a specific language morphology. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for resource-constrained languages.
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency. Comment: 27 page technical report
Learning [Voice]
The [voice] distinction between homorganic stops and fricatives is made by a number of acoustic correlates including voicing, segment duration, and preceding vowel duration. The present work looks at [voice] from a number of multidimensional perspectives.
This dissertation's focus is a corpus study of the phonetic realization of [voice] in two English-learning infants aged 1;1--3;5. While preceding vowel duration has been studied before in infants, the other correlates of post-vocalic voicing investigated here --- preceding F1, consonant duration, and closure voicing intensity --- had not been measured before in infant speech. The study makes empirical contributions regarding the development of the production of [voice] in infants, not just from a surface-level perspective but also with implications for the phonetics-phonology interface in the adult and developing linguistic systems. Additionally, several methodological contributions will be made in the use of large corpora and data modeling techniques.
The study revealed that even in infants, F1 at the midpoint of a vowel preceding a voiced consonant was lower by roughly 50 Hz compared to a vowel before a voiceless consonant, which is in line with the effect found in adults. But while the effect has been considered most likely to be a physiological and nonlinguistic phenomenon in adults, it actually appeared to be correlated in the wrong direction with other aspects of [voice] here, casting doubt on a physiological explanation. Some of the consonant pairs had statistically significant differences in duration and closure voicing. Additionally, a preceding vowel duration difference was found, as well as a preliminary indication of a developmental trend suggesting that the preceding vowel duration difference is being learned.
The phonetics of adult speech is also considered. Results are presented from a dialectal corpus study of North American English and a lab speech experiment which clarifies the relationship between preceding vowel duration and flapping and the relationship between [voice] and F1 in preceding vowels. Fluent adult speech is also described and machine learning algorithms are applied to learning the [voice] distinction using multidimensional acoustic input plus some lexical knowledge.
A robust methodology for automated essay grading
None of the available automated essay grading systems can be used to grade essays according to the National Assessment Program – Literacy and Numeracy (NAPLAN) analytic scoring rubric used in Australia. This thesis is a humble effort to address this limitation. The objective of this thesis is to develop a robust methodology for automatically grading essays based on the NAPLAN rubric by using heuristics and rules based on English language and neural network modelling.
An acoustic-phonetic approach in automatic Arabic speech recognition
In a large vocabulary speech recognition system the broad phonetic classification
technique is used instead of detailed phonetic analysis to overcome the variability in the
acoustic realisation of utterances. The broad phonetic description of a word is used as a
means of lexical access, where the lexicon is structured into sets of words sharing the
same broad phonetic labelling.
This approach has been applied to a large vocabulary isolated word Arabic speech
recognition system. Statistical studies have been carried out on 10,000 Arabic words
(converted to phonemic form) involving different combinations of broad phonetic
classes. Some particular features of the Arabic language have been exploited. The results
show that vowels represent about 43% of the total number of phonemes. They also show
that about 38% of the words can uniquely be represented at this level by using eight
broad phonetic classes. When introducing detailed vowel identification the percentage of
uniquely specified words rises to 83%. These results suggest that a fully detailed
phonetic analysis of the speech signal is perhaps unnecessary.
In the adopted word recognition model, the consonants are classified into four broad
phonetic classes, while the vowels are described by their phonemic form. A set of 100
words uttered by several speakers has been used to test the performance of the
implemented approach.
In the implemented recognition model, three procedures have been developed, namely
voiced-unvoiced-silence segmentation, vowel detection and identification, and automatic
spectral transition detection between phonemes within a word. The accuracy of both the
V-UV-S and vowel recognition procedures is almost perfect. A broad phonetic
segmentation procedure has been implemented, which exploits information from the
above mentioned three procedures. Simple phonological constraints have been used to
improve the accuracy of the segmentation process. The resultant sequence of labels is
used for lexical access to retrieve the word or a small set of words sharing the same broad
phonetic labelling. When there is more than one word candidate, a verification
procedure is used to choose the most likely one.
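The lexical-access idea described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the class inventory in `BROAD_CLASS`, the three-vowel set, the toy lexicon, and the helper names are hypothetical and are not taken from the thesis. The sketch only shows how words can be grouped by their broad phonetic labelling and how keeping vowel identity can raise the share of uniquely specified words.

```python
from collections import defaultdict

# Hypothetical broad-class mapping for consonants (illustrative only;
# the thesis's actual classes are not given in the abstract).
BROAD_CLASS = {
    "p": "STOP", "b": "STOP", "t": "STOP", "d": "STOP", "k": "STOP",
    "s": "FRIC", "z": "FRIC", "f": "FRIC", "h": "FRIC",
    "m": "NASAL", "n": "NASAL",
    "l": "LIQUID", "r": "LIQUID",
}
VOWELS = {"a", "i", "u"}  # assumed simplified vowel set


def broad_key(phonemes, keep_vowel_identity):
    """Map a phoneme sequence to its broad phonetic labelling."""
    key = []
    for p in phonemes:
        if p in VOWELS:
            # Either keep the vowel's phonemic form or collapse it to "V".
            key.append(p if keep_vowel_identity else "V")
        else:
            key.append(BROAD_CLASS[p])
    return tuple(key)


def build_index(lexicon, keep_vowel_identity):
    """Group words that share the same broad phonetic labelling."""
    index = defaultdict(list)
    for word, phonemes in lexicon.items():
        index[broad_key(phonemes, keep_vowel_identity)].append(word)
    return index


def unique_fraction(index):
    """Fraction of words that are the only member of their group."""
    total = sum(len(words) for words in index.values())
    unique = sum(1 for words in index.values() if len(words) == 1)
    return unique / total


# Toy lexicon in phonemic form (hypothetical transliterations).
LEXICON = {
    "kataba": ["k", "a", "t", "a", "b", "a"],
    "kutiba": ["k", "u", "t", "i", "b", "a"],
    "darasa": ["d", "a", "r", "a", "s", "a"],
    "salam": ["s", "a", "l", "a", "m"],
}

coarse = build_index(LEXICON, keep_vowel_identity=False)
detailed = build_index(LEXICON, keep_vowel_identity=True)
candidates = coarse[broad_key(LEXICON["kataba"], keep_vowel_identity=False)]
```

In this toy setting, `candidates` contains two words that share one coarse labelling, which is the situation where a verification step would be needed. Comparing `unique_fraction(coarse)` with `unique_fraction(detailed)` mirrors, at a very small scale, the abstract's comparison between broad classes alone and broad classes with detailed vowel identification.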
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.