Generating Punctuation for Spoken Text
Generally, the present disclosure is directed to generating punctuation for spoken text from a speech-to-text system. In particular, in some implementations, the systems and methods of the present disclosure can include or otherwise leverage one or more machine-learned models to generate punctuation data for spoken text based on speech data for the spoken text.
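As a loose illustration of the idea (not the disclosure's actual model), punctuation generation can be framed as predicting, for each spoken word, which punctuation mark, if any, follows it. The sketch below learns per-word punctuation frequencies from a tiny invented corpus; the function names and data are hypothetical, and a real system would use a far richer machine-learned model over acoustic and lexical features:

```python
from collections import Counter, defaultdict

def train_punctuation_model(sentences):
    """Count, for each word, how often it is followed by each
    punctuation mark (or by no punctuation) in the training text."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split():
            word = tok.rstrip(".,?!")
            punct = tok[len(word):] or "<none>"
            counts[word.lower()][punct] += 1
    return counts

def punctuate(words, model):
    """Attach the most frequent punctuation mark after each word."""
    out = []
    for w in words:
        dist = model.get(w.lower())
        punct = dist.most_common(1)[0][0] if dist else "<none>"
        out.append(w + ("" if punct == "<none>" else punct))
    return " ".join(out)

corpus = [
    "hello, how are you?",
    "i am fine, thank you.",
    "how was your day?",
    "where are you?",
]
model = train_punctuation_model(corpus)
print(punctuate(["hello", "how", "are", "you"], model))
```

Unigram counts are of course far too weak for production use; the point is only the shape of the task: unpunctuated token stream in, punctuated text out.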
Post-Processing of Machine Classifier Output for Object Classification
Machine classifiers are typically trained using labeled data sets. If the training data set has categories of objects that naturally co-occur, the machine classifier may have difficulty distinguishing those categories. For example, audio streams often contain sounds that occur simultaneously, such as speech and laughter; here, the different sounds are the objects to be classified. A machine classifier trained with such audio streams may generate false positives (e.g., conflating speech with laughter) if the training data set does not label speech separately from laughter. The difficulty of obtaining well-labeled training sets compounds the problem of misclassification. For example, most transcriptions of audio streams containing laughter also include speech in close proximity, since laughter often occurs just after speech (e.g., at the end of a joke). Furthermore, humans who produce training data typically annotate rather long audio segments at once, without specifying precise times for each word or audio event, so segments that contain laughter typically carry both "speech" and "laughter" labels without indicating exactly when each occurred. This disclosure describes techniques to improve classification accuracy that are applicable to machine classifiers acting on any type of data, e.g., video, documents, or images.
Captions Based On Speaker Identification
Disclosed herein is a mechanism for generating and providing captions based on speaker identification. In some instances, the mechanism can be used to determine intervals where a single speaker is speaking within particular image frames, to assist the task of manual captioning or manual transcription. In some instances, the mechanism can be used to provide an awareness or indication of speaker turn-changes in captions, where particular words or phrases can be grouped by speaker. In some instances, the mechanism can be used to provide an awareness or indication of speaker position and identity information corresponding to the speaker.
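To illustrate the turn-change grouping in a minimal, hedged way (the speaker IDs and word stream here are invented, and a real system would obtain them from a speaker-identification model), consecutive words with the same speaker can be merged into one caption line, with a new line started at each turn-change:

```python
from itertools import groupby

def group_by_speaker(words):
    """Group (speaker_id, word) pairs into caption lines, starting a
    new line at each speaker turn-change."""
    captions = []
    for speaker, run in groupby(words, key=lambda w: w[0]):
        captions.append((speaker, " ".join(word for _, word in run)))
    return captions

# Hypothetical diarized word stream.
words = [("A", "hello"), ("A", "there"), ("B", "hi"),
         ("A", "how"), ("A", "are"), ("A", "you")]
print(group_by_speaker(words))
# → [('A', 'hello there'), ('B', 'hi'), ('A', 'how are you')]
```

`itertools.groupby` groups only consecutive equal keys, which is exactly the turn-change behavior wanted here: speaker A's two turns stay separate because B speaks in between.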
CNN Architectures for Large-Scale Audio Classification
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

Comment: Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision; changed wording to fit the 4-page limit with the new additions.
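The embedding-transfer result above follows a common pattern: take fixed vectors from a pretrained classifier's penultimate layer and fit a shallow classifier on top. As a toy, hedged sketch (the 2-D "embeddings" and labels below are invented stand-ins, not Audio Set data or the paper's actual pipeline), a nearest-centroid classifier over such embeddings looks like this:

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid(x, centroids):
    """Return the label whose class centroid is closest (Euclidean) to x."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Toy "embeddings": in practice these would come from the penultimate
# layer of a CNN pretrained on a large audio corpus.
train = {
    "dog_bark": [[1.0, 0.1], [0.9, 0.2]],
    "siren":    [[0.1, 1.0], [0.2, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
print(nearest_centroid([0.95, 0.15], centroids))  # → dog_bark
```

The design point is that the expensive part (learning the embedding) is amortized over the large pretraining corpus, so the downstream AED task needs only a cheap classifier and comparatively little labeled data.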