Homogenous Ensemble Phonotactic Language Recognition Based on SVM Supervector Reconstruction
Currently, acoustic spoken language recognition (SLR) and phonotactic SLR systems are widely used language recognition systems. To achieve better performance, researchers combine multiple subsystems, with results that are often much better than those of a single SLR system. Phonotactic SLR subsystems may vary in their acoustic feature vectors or include multiple language-specific phone recognizers and different acoustic models. These methods achieve good performance, but usually at high computational cost. In this paper, a new diversification for phonotactic language recognition systems is proposed using vector space models by support vector machine (SVM) supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but model in different vector spaces by using the SSR algorithm without significant additional computation. We term this a homogeneous ensemble phonotactic language recognition (HEPLR) system. The system integrates three different SVM supervector reconstruction algorithms: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbing SVM supervector reconstruction. All of the algorithms are incorporated using a linear discriminant analysis-maximum mutual information (LDA-MMI) backend for improving language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the proposed HEPLR system achieves better performance than a baseline phone recognition-vector space modeling (PR-VSM) system with minimal extra computational cost. The HEPLR system yields equal error rates (EER) of 1.39%, 3.63%, and 14.79%, representing 6.06%, 10.15%, and 10.53% relative improvements over the baseline system, for the 30-, 10-, and 3-s test conditions, respectively.
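The abstract above reports results as equal error rates and relative improvements. As a rough illustration of those two quantities (not the paper's evaluation code), the sketch below approximates the EER by sweeping a threshold over detection scores and recovers the baseline EER implied by a relative improvement. The function names and the toy scores are hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds over all observed scores and
    return the operating point where miss and false-alarm rates are closest."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss: a target trial scored below the threshold.
    miss = np.array([(target < t).mean() for t in thresholds])
    # False alarm: a non-target trial scored at or above the threshold.
    false_alarm = np.array([(nontarget >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return (miss[i] + false_alarm[i]) / 2.0

def relative_improvement(baseline, system):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - system) / baseline

def implied_baseline(system, rel_improvement):
    """Baseline value implied by a system value and a relative improvement."""
    return system / (1.0 - rel_improvement)

# Toy scores (hypothetical), higher means "more likely the target language".
eer = equal_error_rate([2.0, 3.0, 4.0, 0.5], [0.0, 1.0, -1.0, 2.5])
baseline_30s = implied_baseline(1.39, 0.0606)
```

Under this reading, the 1.39% EER and 6.06% relative improvement quoted for the 30-s condition imply a baseline EER of roughly 1.48%; that figure is derived here from the abstract's numbers and is not stated in the source.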
Automatic language identification using deep neural networks
I. López-Moreno, J. González-Domínguez, P. Oldrich, D. R. Martínez, J. González-Rodríguez, "Automatic language identification using deep neural networks", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence (Italy), 2014.
This work studies the use of deep neural networks (DNNs)
to address automatic language identification (LID). Motivated
by their recent success in acoustic modelling, we adapt DNNs
to the problem of identifying the language of a given spoken
utterance from short-term acoustic features. The proposed approach
is compared to state-of-the-art i-vector based acoustic
systems on two different datasets: Google 5M LID corpus and
NIST LRE 2009. Results show how LID can largely benefit
from using DNNs, especially when a large amount of training
data is available. We found relative improvements of up to 70%
in Cavg over the baseline system.
Robust language recognition via adaptive language factor extraction
This paper presents a technique to adapt an acoustically based
language classifier to the background conditions and speaker
accents. This adaptation improves language classification on
a broad spectrum of TV broadcasts. The core of the system
consists of an iVector-based setup in which language and channel
variabilities are modeled separately. The subsequent language
classifier (the backend) operates on the language factors,
i.e. those features in the extracted iVectors that explain the observed
language variability. The proposed technique adapts the
language variability model to the background conditions and
to the speaker accents present in the audio. The effect of the
adaptation is evaluated on a 28-hour corpus composed of documentaries and monolingual as well as multilingual broadcast
news shows. Consistent improvements in the automatic identification
of Flemish (Belgian Dutch), English and French are demonstrated for all broadcast types.
Language Identification Using Visual Features
Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to identify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments), or impossible (no audio signal is available). Research in this field is also beneficial in the related field of automatic lip-reading. This paper introduces several methods for visual language identification (VLID). They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate languages. We show that VLID is possible in a speaker-dependent mode by discriminating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin-tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error-rate of < 10% in discriminating between Arabic and English on 19 speakers and using about 30s of visual speech.
Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary
phonetic information. We present a study on the use of deep neural
network tokenisers. Unsupervised crosslingual adaptation was performed to
adapt the baseline tokeniser trained on English conversational telephone speech
data to different languages. Two training and adaptation approaches, namely
cross-entropy adaptation and state-level minimum Bayes risk adaptation, were
tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems
using the tokenisers adapted to different languages were combined using score
fusion, giving a 7-18% reduction in minimum detection cost function (minDCF)
compared with the baseline configurations without adapted tokenisers. Analysis
of results showed that the ensemble tokenisers gave diverse representations of
phonemes, thus bringing complementary effects when SLR systems with different
tokenisers were combined. SLR performance was also shown to be related
to the quality of the adapted tokenisers.
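The abstract above combines SLR systems by score fusion but does not say how the fusion is carried out. A minimal sketch of one common form, a weighted linear combination of per-language scores, is given below. The function name, the equal default weights, and the toy score matrices are assumptions made for illustration; in practice the weights are usually trained on held-out data.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights=None):
    """Linearly fuse score matrices from several subsystems.

    Each element of subsystem_scores has shape (n_utterances, n_languages).
    Equal weights are assumed when none are given.
    """
    # Stack into shape (n_systems, n_utterances, n_languages).
    scores = np.stack([np.asarray(s, dtype=float) for s in subsystem_scores])
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the subsystem axis.
    return np.tensordot(weights, scores, axes=1)

# Two hypothetical subsystems scoring two utterances against two languages.
system_a = [[1.0, 0.0], [0.2, 0.8]]
system_b = [[0.0, 1.0], [0.4, 0.6]]
fused = fuse_scores([system_a, system_b])
decisions = fused.argmax(axis=1)  # index of the best-scoring language per utterance
```

Fusion of this kind tends to help most when the subsystems make different errors, which matches the abstract's observation that tokenisers adapted to different languages bring complementary effects.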