Exploring Transfer Learning For End-to-End Spoken Language Understanding
Voice Assistants such as Alexa, Siri, and Google Assistant typically use a
two-stage Spoken Language Understanding pipeline; first, an Automatic Speech
Recognition (ASR) component to process customer speech and generate text
transcriptions, followed by a Natural Language Understanding (NLU) component to
map transcriptions to an actionable hypothesis. An end-to-end (E2E) system that
goes directly from speech to a hypothesis is a more attractive option. These
systems have been shown to be smaller, faster, and better optimized. However,
they require massive amounts of end-to-end training data and, moreover, do not
take advantage of the already available ASR and NLU training data.
In this work, we propose an E2E system that is designed to jointly train on
multiple speech-to-text tasks, such as ASR (speech-transcription) and SLU
(speech-hypothesis), and text-to-text tasks, such as NLU (text-hypothesis). We
call this the Audio-Text All-Task (AT-AT) Model and we show that it beats the
performance of E2E models trained on individual tasks, especially ones trained
on limited data. We show this result on an internal music dataset and two
public datasets, FluentSpeech and SNIPS Audio, where we achieve
state-of-the-art results. Since our model can process both speech and text
input sequences and learn to predict a target sequence, it also allows us to do
zero-shot E2E SLU by training on only text-hypothesis data (without any speech)
from a new domain. We evaluate this ability of our model on the Facebook TOP
dataset and set a new benchmark for zero-shot E2E performance. We will soon
release the audio data collected for the TOP dataset for future research.
Comment: AAAI 202
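The abstract does not describe how the joint training is implemented; as an illustration only, one common way to organize such multi-task training is to interleave examples from all tasks in each batch so a shared sequence-to-sequence model sees every objective. The task names match the abstract, but the example data, field names, and batching scheme below are assumptions, not details from the paper.

```python
import random

# Task inventory from the abstract: each task maps an input sequence to a
# target sequence, so a single shared seq2seq model can consume all of them.
TASKS = {
    "ASR": ("speech", "transcription"),  # speech-to-text
    "SLU": ("speech", "hypothesis"),     # speech-to-hypothesis
    "NLU": ("text", "hypothesis"),       # text-to-hypothesis
}

def make_mixed_batch(examples_by_task, batch_size, rng):
    """Sample a batch that interleaves examples from all available tasks,
    so each gradient update trains the shared model on several objectives."""
    tasks = [t for t, ex in examples_by_task.items() if ex]
    batch = []
    for _ in range(batch_size):
        task = rng.choice(tasks)
        src, tgt = rng.choice(examples_by_task[task])
        batch.append({"task": task, "modality": TASKS[task][0],
                      "source": src, "target": tgt})
    return batch

rng = random.Random(0)
# Toy examples; "<audio:...>" stands in for an actual audio feature sequence.
examples_by_task = {
    "ASR": [("<audio:play_song>", "play a song")],
    "SLU": [("<audio:play_song>", "PlayMusic(song)")],
    # Text-only NLU data: this is what enables zero-shot E2E SLU for a new
    # domain, since no speech is required to add a new text-to-hypothesis task.
    "NLU": [("play a song", "PlayMusic(song)")],
}
batch = make_mixed_batch(examples_by_task, batch_size=8, rng=rng)
```

In this sketch the model itself is elided; the point is only that speech-input and text-input examples share one target-sequence format, which is what lets text-only (NLU) data contribute to a speech-to-hypothesis capability.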
Smooth inverse frequency based text data selection for medical dictation
The under-resourced domain problem is significant in automatic speech recognition, especially for smaller languages such as Hungarian, and in fields where data is often confidential, such as finance and medicine. We introduce a method that uses word embeddings and smooth inverse frequency (SIF) based distance measurement to filter public-domain web corpora. The selection of (medical) domain-matching documents can be scaled. The resulting text is used to train an augmented language model for a medical dictation system. We show that an appropriately scaled selection leads to optimal performance of the ASR system over baselines where no data augmentation was applied or where all of the augmentation data was added.
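The abstract does not give implementation details; as a rough illustration, SIF-weighted sentence embeddings (in the sense of Arora et al.'s "A Simple but Tough-to-Beat Baseline") down-weight frequent words by a/(a + p(w)), and a distance to a domain centroid can then rank documents for selection. The toy vocabulary, vectors, frequencies, and smoothing parameter `a` below are all hypothetical, and the common-component-removal step of full SIF is omitted for brevity.

```python
import numpy as np

def sif_embedding(tokens, word_vectors, word_freq, a=1e-3):
    """Weighted average of word vectors using SIF weights a / (a + p(w)),
    where p(w) is the word's relative frequency in the corpus."""
    total = sum(word_freq.values())
    vecs, weights = [], []
    for w in tokens:
        if w in word_vectors:
            p = word_freq.get(w, 0) / total
            vecs.append(word_vectors[w])
            weights.append(a / (a + p))
    if not vecs:
        return np.zeros_like(next(iter(word_vectors.values())))
    return np.average(np.array(vecs), axis=0, weights=weights)

def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vocabulary: 2-d embeddings and corpus frequencies.
word_vectors = {
    "patient": np.array([1.0, 0.1]),
    "diagnosis": np.array([0.9, 0.2]),
    "football": np.array([0.0, 1.0]),
    "the": np.array([0.5, 0.5]),
}
word_freq = {"patient": 5, "diagnosis": 3, "football": 4, "the": 100}

# Centroid of in-domain (medical) terms; candidate web documents are then
# ranked by SIF distance, and the closest ones are kept for LM training.
domain_centroid = sif_embedding(["patient", "diagnosis"], word_vectors, word_freq)
medical_doc = sif_embedding(["the", "patient", "diagnosis"], word_vectors, word_freq)
other_doc = sif_embedding(["the", "football"], word_vectors, word_freq)

d_medical = cosine_distance(medical_doc, domain_centroid)
d_other = cosine_distance(other_doc, domain_centroid)
```

Scaling the selection, as the abstract describes, would correspond to moving the distance threshold (or the number of top-ranked documents kept) up or down.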
Computer interfaces for the visually impaired
Information access via computer terminals extends to blind and low-vision persons employed in many technical and nontechnical disciplines. Two aspects of providing computer technology for persons with a vision-related handicap are detailed. First, research was conducted into the most effective means of integrating existing adaptive technologies into information systems, with the goal of combining off-the-shelf products with adaptive equipment into cohesive, integrated information processing systems. Details are included that describe the type of functionality required in software to facilitate its incorporation into a speech and/or braille system. The second aspect is research into providing audible and tactile interfaces to graphics-based interfaces. Parameters are included for the design and development of the Mercator Project, which will develop a prototype system for audible access to graphics-based interfaces. The system is being built within the public-domain architecture of X Windows to show that it is possible to provide access to text-based applications within a graphical environment. This information will be valuable to suppliers of ADP equipment, since new legislation requires manufacturers to provide electronic access for the visually impaired.
Exploration of audiovisual heritage using audio indexing technology
This paper discusses audio indexing tools that have been implemented for the disclosure of Dutch audiovisual cultural heritage collections. It explains the role of language models and their adaptation to historical settings, as well as the adaptation of acoustic models for homogeneous audio collections. In addition to the benefits of cross-media linking, the requirements for successful tuning and improvement of the available tools for indexing heterogeneous A/V collections from the cultural heritage domain are reviewed. Finally, the paper argues that research is needed to cope with the varying information needs of different types of users.
Golan v. Holder: Copyright in the Image of the First Amendment
Does copyright violate the First Amendment? Professor Melville Nimmer asked this question forty years ago, and then answered it by concluding that copyright itself is affirmatively speech protective. Despite ample reason to doubt Nimmer's response, the Supreme Court has avoided an independent, thoughtful, plenary review of the question. Copyright has come to enjoy an all-but-categorical immunity to First Amendment constraints. Now, however, the Court faces a new challenge to its back-of-the-hand treatment of this vital conflict. In Golan v. Holder the Tenth Circuit considered legislation (enacted pursuant to the Berne Convention and TRIPS) "restoring" copyright protection to millions of foreign works previously thought to belong to the public domain. The Tenth Circuit upheld the legislation, but not without noting that it appeared to raise important First Amendment concerns. The Supreme Court granted certiorari. This article addresses the issues in the Golan case, literally on the eve of oral argument before the Court. This article first considers the Copyright and Treaty Clauses, and then addresses the relationship between copyright and the First Amendment. The discussion endorses an understanding of that relationship in which the Amendment is newly seen as paramount, and copyright is newly seen in the image of the Amendment.