
    Generating Punctuation for Spoken Text

    Generally, the present disclosure is directed to generating punctuation for spoken text from a speech-to-text system. In particular, in some implementations, the systems and methods of the present disclosure can include or otherwise leverage one or more machine-learned models to generate punctuation data for spoken text based on speech data for the spoken text.
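    As a rough illustration of the word-level framing (not the disclosure's actual models), the sketch below appends punctuation to each transcribed word from a single acoustic cue, the pause that follows it; a real system would replace these hand-set thresholds with a machine-learned model over richer lexical and acoustic features.

```python
# Illustrative sketch only: punctuation restoration framed as per-word labeling.
# The Word fields and pause thresholds are assumptions; the disclosure's systems
# would use machine-learned models rather than fixed thresholds.
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    pause_after_s: float  # silence following the word, taken from ASR timestamps

def punctuate(words: List[Word], comma_pause: float = 0.3, period_pause: float = 0.7) -> str:
    """Append a comma or period to each word based on the pause that follows it."""
    tokens, sentence_start = [], True
    for i, w in enumerate(words):
        token = w.text.capitalize() if sentence_start else w.text
        if i == len(words) - 1 or w.pause_after_s >= period_pause:
            token += "."
            sentence_start = True
        elif w.pause_after_s >= comma_pause:
            token += ","
            sentence_start = False
        else:
            sentence_start = False
        tokens.append(token)
    return " ".join(tokens)

print(punctuate([Word("hello", 0.1), Word("world", 0.9),
                 Word("how", 0.05), Word("are", 0.05), Word("you", 1.0)]))
# -> "Hello world. How are you."
```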

    Post-Processing of Machine Classifier Output for Object Classification

    Machine classifiers are typically trained using labeled data sets. If the training data set has categories of objects that naturally co-occur, the machine classifier may have difficulty distinguishing those categories. For example, audio streams often contain sounds that occur simultaneously, e.g., speech and laughter; in this example, the different sounds are the objects to be classified. A machine classifier trained with such audio streams generates false positives, e.g., conflates speech with laughter, if the training data set does not label speech separately from laughter. The difficulty of obtaining well-labeled training sets compounds the problem of misclassification. For example, most transcriptions of audio streams containing laughter also include speech in close proximity, since laughter often occurs just after speech, e.g., at the end of a joke. Furthermore, the humans who produce training data typically annotate rather long audio segments at once, without specifying precise times for each word or audio event, so segments that contain laughter are typically labeled with both “speech” and “laughter” without indicating exactly when each occurred. This disclosure describes techniques to improve classification accuracy that are applicable to machine classifiers acting on any type of data, e.g., video, documents, images, etc.
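    The abstract does not spell out the specific techniques, but one simple post-processing idea in this spirit (an assumption for illustration, not necessarily the disclosed method) is to rescore each label by discounting the evidence that could be explained by a frequently co-occurring label, using confusion statistics measured on a small, cleanly labeled held-out set. The leak values below are made-up numbers.

```python
# Hypothetical post-processing sketch: discount each label's score by the amount
# explainable by co-occurring labels. The leak matrix values are assumed numbers,
# estimated (in a real system) from a small, cleanly labeled held-out set.
import numpy as np

LABELS = ["speech", "laughter"]

# leak[i][j]: average score the detector for label i emits on held-out segments
# that contain ONLY label j. Row 1 shows the conflation: the laughter detector
# fires at 0.6 on pure speech.
leak = np.array([[1.0, 0.1],
                 [0.6, 1.0]])

def rescore(scores: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Subtract a fraction of the score attributable to other, co-occurring labels."""
    adjusted = scores.astype(float).copy()
    for i in range(len(scores)):
        borrowed = max(leak[i, j] * scores[j] for j in range(len(scores)) if j != i)
        adjusted[i] = max(0.0, scores[i] - alpha * borrowed)
    return adjusted

raw = np.array([0.95, 0.60])                     # both detectors fire on a speech-only clip
print(dict(zip(LABELS, rescore(raw).round(2))))  # laughter is pushed down, speech barely changes
```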

    Captions Based On Speaker Identification

    Disclosed herein is a mechanism for generating and providing captions based on speaker identification. In some instances, the mechanism can be used to determine intervals where a single speaker is speaking within particular image frames to assist the task of manual captioning or manual transcription. In some instances, the mechanism can be used to provide an awareness or indication of speaker turn changes in captions, where particular words or phrases can be grouped by speaker. In some instances, the mechanism can be used to provide an awareness or indication of speaker position and identity information corresponding to the speaker.
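    A minimal sketch of the turn-change grouping idea, under assumed input formats (word timestamps from ASR and speaker segments from diarization; neither format is taken from the disclosure): each word is assigned to the speaker whose segment covers its timestamp, and a new caption block starts whenever the speaker changes.

```python
# Illustrative sketch: group transcript words into caption blocks that break at
# speaker-turn changes, given word timestamps and diarization segments.
from typing import List, Tuple

Word = Tuple[float, str]            # (start_time_s, word) from the ASR transcript
Turn = Tuple[float, float, str]     # (start_s, end_s, speaker_id) from diarization

def speaker_at(t: float, turns: List[Turn]) -> str:
    for start, end, spk in turns:
        if start <= t < end:
            return spk
    return "unknown"

def build_captions(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    captions, current_spk, buf = [], None, []
    for t, w in words:
        spk = speaker_at(t, turns)
        if buf and spk != current_spk:          # speaker turn change: flush the block
            captions.append((current_spk, " ".join(buf)))
            buf = []
        current_spk = spk
        buf.append(w)
    if buf:
        captions.append((current_spk, " ".join(buf)))
    return captions

turns = [(0.0, 2.0, "Speaker 1"), (2.0, 4.0, "Speaker 2")]
words = [(0.2, "hello"), (0.8, "there"), (2.1, "hi"), (2.6, "back")]
for spk, text in build_captions(words, turns):
    print(f"[{spk}] {text}")
# [Speaker 1] hello there
# [Speaker 2] hi back
```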

    CNN Architectures for Large-Scale Audio Classification

    Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and that larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
    Comment: Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision; changed wording to fit the 4-page limit with the new additions.
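    The general recipe, a CNN over log-mel spectrogram patches with a multi-label (sigmoid) output whose penultimate layer doubles as an embedding for downstream tasks such as the Audio Set AED task, can be sketched roughly as below; the layer sizes, patch shape, and framework choice are placeholders, not the paper's exact configurations.

```python
# Rough PyTorch sketch of the overall recipe, not the paper's architectures:
# a small CNN over log-mel spectrogram patches, a multi-label sigmoid output,
# and a penultimate "embedding" layer reusable for downstream classification.
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, embed_dim)    # embedding layer ("embeddings from these classifiers")
        self.classify = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                        # x: (batch, 1, mel_bins, frames)
        h = self.features(x).flatten(1)
        emb = torch.relu(self.embed(h))
        return self.classify(emb), emb           # logits for the multi-label loss, plus the embedding

model = AudioCNN(num_classes=30871)              # video-level label vocabulary size from the paper
logits, emb = model(torch.randn(2, 1, 64, 96))   # e.g. 64 mel bins x 96 frames per patch (assumed)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))  # multi-label objective
```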