Search CORE

570 research outputs found

Scaling Speech Technology to 1,000+ Languages

Author: Adi Yossi
Auli Michael
Babu Arun
Baevski Alexei
Conneau Alexis
Elkahky Ali
Fazel-Zarandi Maryam
Hsu Wei-Ning
Kundu Sayani
Ni Zhaoheng
Pratap Vineel
Shi Bowen
Tjandra Andros
Tomasello Paden
Vyas Apoorv
Zhang Xiaohui
Publication venue
Publication date: 22/05/2023
Field of study

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data

arXiv.org e-Print Archive

Spoken content retrieval: A survey of techniques and technologies

Author: Ani Nenkova
C A. Nenkova
K. Mckeown
Kathleen Mckeown
Publication venue: 'Now Publishers'
Publication date: 01/01/2012
Field of study

Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

ATLAS: A flexible and extensible architecture for linguistic annotation

Author: Bird Steven
Day David
Garofolo John
Henderson John
Laprun Christophe
Liberman Mark
Publication venue
Publication date: 01/01/2000
Field of study

We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on ``Annotation Graphs,'' a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic ``signals,'' including both naturally occuring phenomena (as recorded in images, video, multi-modal interactions, etc.), as well as the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture.Comment: 8 pages, 9 figure

arXiv.org e-Print Archive

CiteSeerX

Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

Author: Peter Spyns Jan Odijk
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2020
Field of study

Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie

Directory of Open Access Books (DOAB)

Acoustic Approaches to Gender and Accent Identification

Author: DeMarco Andrea
Publication venue
Publication date: 01/06/2015
Field of study

There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, di�erent accents of a language exhibit more fine-grained di�erences between classes than languages. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state of the art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification, which is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve the standard approaches usually applied for speaker or language identification, which are insu�cient. We demonstrate that very good accent identification performance is possible with acoustic methods by considering di�erent i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can obtain from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than previously reported acoustic-phonotactic based systems on the same corpus, and is very close to performance obtained via transcription based accent identification. Finally, we demonstrate that the utilization of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition

University of East Anglia digital repository

Thousands of Voices for HMM-Based Speech Synthesis-Analysis and Application of TTS Systems Built on Various ASR Corpora

Author: Dines John
Guan Yong
Hu Rile
Karhila Reima
King Simon
Kurimo Mikko
Oura Keiichiro
Tian Jilei
Tokuda Keiichi
Usabaev Bela
Watts Oliver
Wu Yi-Jian
Yamagishi Junichi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2010
Field of study

In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues

Edinburgh Research Archive

Edinburgh Research Explorer

Sessizliğin Kaldırılması ve Konuşmanın Parçalara Ayrılması İşleminin Türkçe Otomatik Konuşma Tanıma Üzerindeki Etkisi

Author: Hayri Sever
Hüseyin Polat
Saadin Oyucu
Publication venue: 'Duzce Universitesi Bilim ve Teknoloji Dergisi'
Publication date: 01/01/2020
Field of study

Otomatik Konuşma Tanıma sistemleri temel olarak akustik bilgiden faydalanılarak geliştirilmektedir. Akustik bilgiden fonem bilgisinin elde edilmesi için eşleştirilmiş konuşma ve metin verileri kullanılmaktadır. Bu veriler ile eğitilen akustik modeller gerçek hayattaki bütün akustik bilgiyi modelleyememektedir. Bu nedenle belirli ön işlemlerin yapılması ve otomatik konuşma tanıma sistemlerinin başarımını düşürecek akustik bilgilerin ortadan kaldırılması gerekmektedir. Bu çalışmada konuşma içerisinde geçen sessizliklerin kaldırılması için bir yöntem önerilmiştir. Önerilen yöntemin amacı sessizlik bilgisinin ortadan kaldırılması ve akustik bilgide uzun bağımlılıklar sağlayan konuşmaların parçalara ayrılmasıdır. Geliştirilen yöntemin sonunda elde edilen sessizlik içermeyen ve parçalara ayrılan konuşma bilgisi bir Türkçe Otomatik Konuşma Tanıma sistemine girdi olarak verilmiştir. Otomatik Konuşma Tanıma sisteminin çıkışında sisteme giriş olarak verilen konuşma parçalarına karşılık gelen metinler birleştirilerek sunulmuştur. Gerçekleştirilen deneylerde sessizliğin kaldırılması ve konuşmanın parçalara ayrılması işleminin Otomatik Konuşma Tanıma sistemlerinin başarımını artırdığı görülmüştür

Directory of Open Access Journals

Duzce University Open Access

The use of speech recognition technology by people living with Amyotrophic Lateral Sclerosis: a scoping review

Author: Bloch S
Cave R
Publication venue: 'Informa UK Limited'
Publication date: 01/08/2022
Field of study

More than 80% of people living with Amyotrophic Lateral Sclerosis (plwALS) develop difficulties with their speech, affecting communication, self-identity and quality of life. Automatic speech recognition technology (ASR) is becoming a common way to interact with a broad range of devices, to find information and control the environment. ASR can be problematic for people with acquired neurogenic motor speech difficulties (dysarthria). Given that the field is rapidly developing, a scoping review is warranted

UCL Discovery

Exploring the effects of accent on cognitive processes: behavioral and electrophysiological insights

Author: Thomas Trisha Breanne
Publication venue
Publication date: 27/08/2023
Field of study

167 p.Previous research has found that speaker accent can have an impact on a range of offline and online cognitive processes (Baus, Bas, Calabria, & Costa, 2017; McAleer, Todorov, & Belin, 2014; Stevenage, Clarke, & McNeill, 2012; Sporer, 2001). Indeed, previous studies show that there are differences in native and non-native speech processing (Lev-Ari, 2018). Processing foreign-accented speech requires the listener to adapt to an extra range of variability, suggesting that there may be an increase in the amount of attentional and cognitive resources that are needed to successfully interpret the speech signal of a foreign-accented speaker. However, less is known about the differences between processing native and dialectal accents. Is dialectal processing more similar to foreign or native speech? To address this, two theories have been proposed (Clarke & Garrett, 2004; Floccia et al, 2009). Previous studies have contributed to the plausibility of both hypotheses and importantly for the purposes of this project, previous electroencephalography experiments exploring the question have mainly used sentences as material. More studies are needed to elucidate whether foreign accent is processed uniquely from all types of native speech (both native and dialectal accents) or whether dialectal accent is treated differently from native accent, despite both being native speech variations. Accordingly, the central aim of this dissertation is to further investigate processing mechanisms of speech accent across different levels of linguistic analysis using evidence from both behavioral and electrophysiological experiments. An additional aim of this project was to look at the effects of accent on information retention. In addition to fluctuations in attentional demands, it seems that non-native accent can lead to differences in the depth of listeners¿ memory encoding (Atkinson et al., 2005). This project further aimed to study how changing the accent of the information delivered may affect how well people remember the information received. Three experiments were carried out to investigate accent processing, results and future directions are discussed

Archivo Digital para la Docencia y la Investigación