Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages, which is a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset based
on readings of publicly available religious texts and effectively leveraging
self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.
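The abstract above compares systems by word error rate (WER). As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the number of reference words), and not the MMS evaluation code, the function below uses whitespace tokenization with no text normalization, which is an assumption made here; the name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.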
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
ATLAS: A flexible and extensible architecture for linguistic annotation
We describe a formal model for annotating linguistic artifacts, from which we
derive an application programming interface (API) to a suite of tools for
manipulating these annotations. The abstract logical model provides for a range
of storage formats and promotes the reuse of tools that interact through this
API. We focus first on ``Annotation Graphs,'' a graph model for annotations on
linear signals (such as text and speech) indexed by intervals, for which
efficient database storage and querying techniques are applicable. We note how
a wide range of existing annotated corpora can be mapped to this annotation
graph model. This model is then generalized to encompass a wider variety of
linguistic ``signals,'' including both naturally occurring phenomena (as
recorded in images, video, multi-modal interactions, etc.), as well as the
derived resources that are increasingly important to the engineering of natural
language processing systems (such as word lists, dictionaries, aligned
bilingual corpora, etc.). We conclude with a review of the current efforts
towards implementing key pieces of this architecture.
Comment: 8 pages, 9 figures
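To make the idea of annotations on a linear signal indexed by intervals more concrete, here is a deliberately simplified sketch. It is not the ATLAS API: the class names `Arc` and `AnnotationGraph`, the `tier` field, and the `overlapping` query are all hypothetical, and graph nodes are collapsed into plain time offsets for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Arc:
    """One labeled interval over the signal (hypothetical structure)."""
    start: float  # offset into the signal, e.g. seconds
    end: float
    tier: str     # kind of annotation, e.g. "word" or "speaker"
    label: str


@dataclass
class AnnotationGraph:
    """A flat collection of labeled intervals over one signal."""
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start: float, end: float, tier: str, label: str) -> None:
        if end < start:
            raise ValueError("interval end precedes its start")
        self.arcs.append(Arc(start, end, tier, label))

    def overlapping(self, start: float, end: float,
                    tier: Optional[str] = None) -> List[Arc]:
        """Return arcs that overlap the half-open interval [start, end)."""
        return [
            arc for arc in self.arcs
            if arc.start < end and start < arc.end
            and (tier is None or arc.tier == tier)
        ]


graph = AnnotationGraph()
graph.add(0.0, 0.4, "word", "hello")
graph.add(0.4, 0.9, "word", "world")
graph.add(0.0, 0.9, "speaker", "A")
words_near_boundary = graph.overlapping(0.3, 0.5, tier="word")
```

A linear scan is used here only for clarity; the abstract's point about efficient database storage and querying would correspond to an interval index in a real system.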
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state-of-the-art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application of the i-Vector technique to accent
identification, which is the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insufficient. We demonstrate that very good accent identification performance is possible with
acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can
obtain from the same data.
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription based accent identification. Finally, we demonstrate
that the utilization of our techniques for speech recognition purposes leads to considerably
lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
Thousands of Voices for HMM-Based Speech Synthesis-Analysis and Application of TTS Systems Built on Various ASR Corpora
In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.
The Effect of Silence Removal and Speech Segmentation on Turkish Automatic Speech Recognition
Automatic speech recognition systems
are developed primarily by exploiting acoustic information. Paired speech and
text data are used to derive phoneme information from the acoustic
information. Acoustic models trained on these data cannot model all of the
acoustic information encountered in real life. Certain preprocessing steps are
therefore needed to remove acoustic information that would degrade the
performance of automatic speech recognition systems. This study proposes a
method for removing the silences that occur within speech. The aim of the
proposed method is to eliminate silence information and to split into segments
those utterances that create long dependencies in the acoustic information.
The silence-free, segmented speech produced by the method was given as input
to a Turkish automatic speech recognition system. At the output of the
system, the texts corresponding to the input speech segments were concatenated
and presented. The experiments showed that removing silence and segmenting
speech improves the performance of automatic speech recognition systems.
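The abstract above does not say how silence is detected, so the following is only a generic illustration of the preprocessing idea, using short-time energy thresholding as a stand-in technique. The function name `split_on_silence` and the parameters `frame_len`, `threshold`, and `min_silence_frames` are all hypothetical, and the default values are chosen for the toy example rather than for real audio.

```python
from typing import List, Sequence, Tuple


def split_on_silence(samples: Sequence[float],
                     frame_len: int = 4,
                     threshold: float = 0.01,
                     min_silence_frames: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of the non-silent segments.

    A frame is treated as silent when its mean squared amplitude is
    below `threshold`. A segment is closed once at least
    `min_silence_frames` consecutive silent frames follow it; shorter
    pauses stay inside the segment. Trailing samples that do not fill
    a whole frame are ignored.
    """
    segments: List[Tuple[int, int]] = []
    seg_start = None  # sample index where the open segment began
    silent_run = 0    # consecutive silent frames inside the open segment
    n_frames = len(samples) // frame_len

    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            if seg_start is None:
                seg_start = f * frame_len
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the start of the silent run.
                segments.append((seg_start, (f - silent_run + 1) * frame_len))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        # Close the final segment, dropping any trailing silent frames.
        segments.append((seg_start, (n_frames - silent_run) * frame_len))
    return segments
```

Each returned segment could then be passed to a recognizer separately and the resulting texts concatenated in order, which mirrors the pipeline the abstract describes.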
The use of speech recognition technology by people living with Amyotrophic Lateral Sclerosis: a scoping review
More than 80% of people living with Amyotrophic Lateral Sclerosis (plwALS) develop difficulties with their speech, affecting communication, self-identity and quality of life. Automatic speech recognition technology (ASR) is becoming a common way to interact with a broad range of devices, to find information and control the environment.
ASR can be problematic for people with acquired neurogenic motor speech difficulties (dysarthria). Given that the field is rapidly developing, a scoping review is warranted.
Exploring the effects of accent on cognitive processes: behavioral and electrophysiological insights
Previous research has found that speaker accent can have an impact on a range of offline and online cognitive processes (Baus, Bas, Calabria, & Costa, 2017; McAleer, Todorov, & Belin, 2014; Stevenage, Clarke, & McNeill, 2012; Sporer, 2001). Indeed, previous studies show that there are differences in native and non-native speech processing (Lev-Ari, 2018). Processing foreign-accented speech requires the listener to adapt to an extra range of variability, suggesting that there may be an increase in the amount of attentional and cognitive resources that are needed to successfully interpret the speech signal of a foreign-accented speaker. However, less is known about the differences between processing native and dialectal accents. Is dialectal processing more similar to foreign or native speech? To address this, two theories have been proposed (Clarke & Garrett, 2004; Floccia et al., 2009). Previous studies have contributed to the plausibility of both hypotheses and importantly for the purposes of this project, previous electroencephalography experiments exploring the question have mainly used sentences as material. More studies are needed to elucidate whether foreign accent is processed uniquely from all types of native speech (both native and dialectal accents) or whether dialectal accent is treated differently from native accent, despite both being native speech variations. Accordingly, the central aim of this dissertation is to further investigate processing mechanisms of speech accent across different levels of linguistic analysis using evidence from both behavioral and electrophysiological experiments. An additional aim of this project was to look at the effects of accent on information retention. In addition to fluctuations in attentional demands, it seems that non-native accent can lead to differences in the depth of listeners' memory encoding (Atkinson et al., 2005).
This project further aimed to study how changing the accent of the information delivered may affect how well people remember the information received. Three experiments were carried out to investigate accent processing; results and future directions are discussed.