Person Name Recognition in ASR Outputs Using Continuous Context Models
The detection and characterization of speech utterances in which person names are pronounced in audiovisual documents is an important cue for spoken content analysis. This paper tackles the problem of retrieving spoken person names in the 1-best ASR outputs of broadcast TV shows. Our assumption is that a person name is a latent variable produced by the lexical context in which it appears. A spoken name can therefore be derived from ASR outputs even if it was not proposed by the speech recognition system. A new context model is proposed to capture the lexical and structural information surrounding a spoken name. The fundamental hypothesis of this study has been validated on broadcast TV documents available in the context of the REPERE challenge.
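The core assumption above can be made concrete with a toy sketch: score each token position as a candidate name slot using only the surrounding words, so that a name slot can be detected even when the recognizer never hypothesized the name itself. All function names and the training data here are hypothetical illustrations, not the paper's actual model.

```python
from collections import Counter

def context_window(tokens, i, width=2):
    """Left/right words around position i (the candidate name slot)."""
    left = tokens[max(0, i - width):i]
    right = tokens[i + 1:i + 1 + width]
    return left + right

def train_context_counts(examples):
    """Count context words observed next to known name slots.
    `examples` are (tokens, name_index) pairs; the names themselves
    are ignored, only their lexical context is modeled."""
    counts = Counter()
    for tokens, idx in examples:
        counts.update(context_window(tokens, idx))
    return counts

def name_slot_score(counts, tokens, i):
    """Score position i by how often its context words co-occur with names."""
    ctx = context_window(tokens, i)
    return sum(counts[w] for w in ctx) / max(len(ctx), 1)

# Toy training data: contexts such as "minister <NAME> said" are typical.
training = [
    ("the minister X said today".split(), 2),
    ("interview with minister Y said".split(), 3),
]
counts = train_context_counts(training)
utt = "the minister Z said yesterday".split()
scores = [name_slot_score(counts, utt, i) for i in range(len(utt))]
best = max(range(len(utt)), key=lambda i: scores[i])  # position of "Z"
```

Even though "Z" was never seen in training, its slot gets the highest score because the surrounding words match contexts that previously co-occurred with names.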
Embedding-Based Speaker Adaptive Training of Deep Neural Networks
An embedding-based speaker adaptive training (SAT) approach is proposed and
investigated in this paper for deep neural network acoustic modeling. In this
approach, speaker embedding vectors, which are constant for a given
speaker, are mapped through a control network to layer-dependent element-wise
affine transformations to canonicalize the internal feature representations at
the output of hidden layers of a main network. The control network for
generating the speaker-dependent mappings is jointly estimated with the main
network for the overall speaker adaptive acoustic modeling. Experiments on
large vocabulary continuous speech recognition (LVCSR) tasks show that the
proposed SAT scheme can yield superior performance over the widely-used
speaker-aware training using i-vectors with speaker-adapted input features.
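The mechanism described above can be sketched in a few lines: a control network maps a fixed speaker embedding to per-layer element-wise affine parameters (a scale and a shift) applied to the main network's hidden activations. This is a minimal forward-pass sketch with made-up dimensions and a single hidden layer, not the paper's exact architecture; in training, both networks would be estimated jointly by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HID_DIM = 8, 16  # hypothetical sizes

# Control network: a linear layer producing [gamma; beta] for an
# element-wise affine transform of one hidden layer's output.
W_ctrl = rng.normal(scale=0.1, size=(2 * HID_DIM, EMB_DIM))

# Main-network hidden-layer weights (one layer here for brevity).
W_main = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))

def adapted_hidden(h, spk_emb):
    """Canonicalize a hidden activation with a speaker-dependent affine map."""
    params = W_ctrl @ spk_emb             # control-network forward pass
    gamma, beta = params[:HID_DIM], params[HID_DIM:]
    z = np.tanh(W_main @ h)               # main-network hidden activation
    return (1.0 + gamma) * z + beta       # element-wise scale and shift

spk_emb = rng.normal(size=EMB_DIM)        # constant for a given speaker
h_in = rng.normal(size=HID_DIM)
h_out = adapted_hidden(h_in, spk_emb)
```

Because the embedding is fixed per speaker, the affine parameters need to be computed only once per speaker at decoding time.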
Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems
Voice Processing Systems (VPSes), now widely deployed, have been made
significantly more accurate through the application of recent advances in
machine learning. However, adversarial machine learning has similarly advanced
and has been used to demonstrate that VPSes are vulnerable to the injection of
hidden commands - audio obscured by noise that is correctly recognized by a VPS
but not by human beings. Such attacks, though, are often highly dependent on
white-box knowledge of a specific machine learning model and limited to
specific microphones and speakers, making their use across different acoustic
hardware platforms (and thus their practicality) limited. In this paper, we
break these dependencies and make hidden command attacks more practical through
model-agnostic (black-box) attacks, which exploit knowledge of the signal
processing algorithms commonly used by VPSes to generate the data fed into
machine learning systems. Specifically, we exploit the fact that multiple
source audio samples have similar feature vectors when transformed by acoustic
feature extraction algorithms (e.g., FFTs). We develop four classes of
perturbations that create unintelligible audio and test them against 12 machine
learning models, including 7 proprietary models (e.g., Google Speech API, Bing
Speech API, IBM Speech API, Azure Speaker API, etc.), and demonstrate successful
attacks against all targets. Moreover, we successfully use our maliciously
generated audio samples in multiple hardware configurations, demonstrating
effectiveness across both models and real systems. In so doing, we demonstrate
that domain-specific knowledge of audio signal processing represents a
practical means of generating successful hidden voice command attacks.
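The key observation underlying these attacks, that feature extraction maps multiple distinct waveforms to nearly identical feature vectors, can be illustrated with a simplified stand-in for the paper's perturbations: magnitude-spectrum features discard phase, so randomizing the phase spectrum changes the waveform while leaving the magnitude features essentially unchanged. This toy example is an assumption-laden sketch, not one of the paper's four perturbation classes.

```python
import numpy as np

rng = np.random.default_rng(1)
sr, dur = 8000, 0.1
t = np.arange(int(sr * dur)) / sr
clean = np.sin(2 * np.pi * 440 * t)      # a plain 440 Hz tone

# Keep the magnitude spectrum, randomize the phase spectrum.
spectrum = np.fft.rfft(clean)
random_phase = rng.uniform(-np.pi, np.pi, size=spectrum.shape)
perturbed = np.fft.irfft(np.abs(spectrum) * np.exp(1j * random_phase),
                         n=len(clean))

# Magnitude features match closely even though the waveforms differ.
mag_clean = np.abs(np.fft.rfft(clean))
mag_pert = np.abs(np.fft.rfft(perturbed))
```

A recognizer built on magnitude features would treat both signals alike, while the time-domain waveforms, and hence what a listener hears, differ substantially.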
Error analysis in automatic speech recognition and machine translation
Automatic speech recognition and machine translation are now well-known terms in
the translation world. Systems that carry out these processes are increasingly taking over
work from humans, largely because of the speed at which they perform the tasks and their
lower cost. However, the quality of these systems is debatable: they are not yet capable of
delivering the same performance as human transcribers or translators. A lack of creativity,
of the ability to interpret texts, and of a sense of language is often cited as the reason why
machine performance is not yet at the level of human translation or transcription work.
Despite this, there are companies that use these machines in their production pipelines.
Unbabel, an online translation platform powered by artificial intelligence, is one of these
companies. Through a combination of human translators and machines, Unbabel tries to
provide its customers with a translation of good quality. This internship report was written with
the aim of gaining an overview of the performance of these systems and the errors they produce.
Based on this work, we try to get a picture of possible error patterns produced by both systems.
The present work consists of an extensive analysis of errors produced by automatic speech
recognition and machine translation systems after automatically transcribing and translating 10
English videos into Dutch. Different videos were deliberately chosen to see if there were
significant differences in the error patterns between videos. The generated data and results from
this work aim to provide possible ways of improving the quality of the services already
mentioned.
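The ASR side of the error analysis described above is conventionally quantified with word error rate (WER), the edit distance between reference and hypothesis transcripts normalized by reference length. The following is a minimal, generic implementation of that standard metric; the function name and example sentences are illustrative, not taken from the report.

```python
def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("cat" -> "hat") in a four-word reference: WER 0.25.
score = wer("the cat sat down", "the hat sat down")
```

Machine-translation quality is typically measured with different metrics (e.g., BLEU or human post-editing effort), since a single edit-distance count correlates poorly with translation adequacy.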