Generic Indic Text-to-speech Synthesisers with Rapid Adaptation in an End-to-end Framework
Building text-to-speech (TTS) synthesisers for Indian languages is a
difficult task owing to a large number of active languages. Indian languages
can be classified into a finite set of families, prominent among them being
Indo-Aryan and Dravidian. The proposed work exploits this property to build a
generic TTS system using multiple languages from the same family in an
end-to-end framework. Generic systems are quite robust as they are capable of
capturing a variety of phonotactics across languages. These systems are then
adapted to a new language in the same family using small amounts of adaptation
data. Experiments indicate that good quality TTS systems can be built using
only 7 minutes of adaptation data. An average degradation mean opinion score of
3.98 is obtained for the adapted TTSes.
Extensive analysis of systematic interactions between languages in the
generic TTSes is carried out. x-vectors are included as speaker embedding to
synthesise text in a particular speaker's voice. An interesting observation is
that the prosody of the target speaker's voice is preserved. These results are
quite promising as they indicate the capability of generic TTSes to handle
speaker and language switching seamlessly, along with the ease of adaptation to
a new language.
A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages
We provide a systematic review of past studies that use multilingual data for text-to-speech (TTS) of low-resource languages (LRLs). We focus on the strategies used by these studies for incorporating multilingual data and how they affect output speech quality. To investigate the difference in output quality between corresponding monolingual and multilingual models, we propose a novel measure to compare this difference across the included studies and their various evaluation metrics. This measure, called the Multilingual Model Effect (MLME), is found to be affected by: acoustic model architecture, the difference ratio of target language data between corresponding multilingual and monolingual experiments, the balance ratio of target language data to total data, and the amount of target language data used. These findings can act as a reference for data strategies in future experiments with multilingual TTS models for LRLs. Language family classification, despite being widely used, is not found to be an effective criterion for selecting source languages.
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.
An overview of the research evidence on ethnicity and communication in healthcare
The aim of the present study was to identify and review the available
research evidence on 'ethnicity and communication' in areas relevant to
ensuring effective provision of mainstream services (e.g. via interpreter,
advocacy and translation services); provision of services targeted on
communication (e.g. speech and language therapy, counselling,
psychotherapy); consensual/participatory activities (e.g. consent to
interventions); and procedures for managing and planning for linguistic
diversity.
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading or lipreading is the technique of understanding and getting
phonetic features from a speaker's visual features such as movement of lips,
face, teeth and tongue. It has a wide range of multimedia applications such as
in surveillance, Internet telephony, and as an aid to a person with hearing
impairments. However, most of the work in speechreading has been limited to
text generation from silent videos. Recently, research has started venturing
into generating (audio) speech from silent video sequences but there have been
no developments thus far in dealing with divergent views and poses of a
speaker. Thus, although multiple camera feeds of a user's speech are available,
these feeds have not yet been used to deal with the
different poses. To this end, this paper presents the world's first ever
multi-view speech reading and reconstruction system. This work encompasses the
boundaries of multimedia research by putting forth a model which leverages
silent video feeds from multiple cameras recording the same subject to generate
intelligent speech for a speaker. Initial results confirm the usefulness of
exploiting multiple camera views in building an efficient speech reading and
reconstruction system. It further shows the optimal placement of cameras which
would lead to the maximum intelligibility of speech. Next, it lays out various
innovative applications for the proposed system, focusing on its potentially
prodigious impact not just in the security arena but also in many other multimedia
analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul,
Republic of Korea