
    Audio Caption: Listen and Tell

    An increasing amount of research has shed light on machine perception of audio events, most of it concerning detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying sounds, but also summarizing the relationships between different audio events. Comparable research exists for image captioning, yet the audio field is still quite barren. This paper introduces a manually annotated dataset for audio captioning. The purpose is to automatically generate natural sentences that describe audio scenes and to bridge the gap between machine perception of audio and images. The whole dataset is labelled in Mandarin, and translated English annotations are also included. A baseline encoder-decoder model is provided for both English and Mandarin, and similar BLEU scores are obtained for both languages: the model can generate understandable, data-related captions from the dataset. (Comment: accepted by ICASSP 2019)
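
    The baseline mentioned above is an encoder-decoder model. As a rough illustration of that general architecture (not the authors' exact configuration), the sketch below uses a GRU audio encoder whose final state conditions a GRU caption decoder; the feature type, dimensions, and vocabulary size are assumptions made for the example.

    # Minimal encoder-decoder captioning sketch (PyTorch); all sizes are illustrative.
    import torch
    import torch.nn as nn

    class AudioCaptionBaseline(nn.Module):
        def __init__(self, n_mels=64, hidden=256, vocab_size=5000, embed_dim=256):
            super().__init__()
            # Encoder: GRU over (assumed) log-mel frames -> audio embedding
            self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
            # Decoder: GRU language model conditioned on the audio embedding
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, mel, captions):
            # mel: (batch, time, n_mels); captions: (batch, seq) token ids
            _, h = self.encoder(mel)           # h: (1, batch, hidden)
            emb = self.embed(captions)         # (batch, seq, embed_dim)
            dec_out, _ = self.decoder(emb, h)  # decoder starts from the audio state
            return self.out(dec_out)           # (batch, seq, vocab) logits

    # Forward pass on random data, just to show the shapes involved
    model = AudioCaptionBaseline()
    logits = model(torch.randn(2, 500, 64), torch.randint(0, 5000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 5000])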

    A Comprehensive Survey of Automated Audio Captioning

    Automated audio captioning, a task that mimics human perception and innovatively links audio processing with natural language processing, has seen much progress over the last few years. Audio captioning requires recognizing the acoustic scene, the primary audio events, and sometimes the spatial and temporal relationships between events in an audio clip, and describing these elements in fluent and vivid sentences. Deep learning-based approaches are widely adopted to tackle this problem. This paper presents a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.
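
    Among the caption metrics such a review typically covers, BLEU (also used for the dataset above) is straightforward to compute from tokenized references and a candidate sentence. The snippet below is a self-contained, hypothetical example using NLTK; the sentences are invented for illustration and are not taken from any benchmark.

    # Sentence-level BLEU-4 with NLTK; smoothing avoids zero scores when
    # higher-order n-grams have no overlap with the references.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [
        "a dog barks while cars pass by".split(),
        "a dog is barking near a busy road".split(),
    ]
    candidate = "a dog barks as cars drive past".split()

    score = sentence_bleu(references, candidate,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4: {score:.3f}")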

    Strategies for Improving Chinese Writing Exercises of Senior Grade Students in the Primary School of the Yi Ethnic Group

    There is a strong consensus among previous researchers that a first language interferes with second language acquisition, yet only a limited number of studies have examined this interference in Chinese writing. Chinese writing exercises for primary school students of the Yi ethnic group in rural areas tend to encourage students to carry their native language habits into Chinese writing, so that they fit modern Chinese word order into the Yi language paradigm. Consequently, they are prone to grammatical errors, word order errors, and collocation errors in their compositions. This study focuses on elementary school students' writing abilities in order to alleviate their fears associated with Chinese writing, help them grasp the second-language writing paradigm, and let them enjoy writing. In addition, by analyzing the students' composition errors in the corpus, we propose practical recommendations from three perspectives: educators, students, and institutions. These recommendations aim to enhance language and writing education within ethnic communities through theoretical and practical strategies.

    Enhance Temporal Relations in Audio Captioning with Sound Event Detection

    Automated audio captioning aims to generate natural language descriptions for given audio clips, not only detecting and classifying sounds but also summarizing the relationships between audio events. Recent advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events have received little attention, even though revealing such relations is a key component of summarizing audio content. This paper therefore aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events' timestamps. We investigate the best approach to integrating temporal information into a captioning model and propose a temporal tag system that transforms timestamps into comprehensible relations. Results evaluated by the proposed temporal metrics suggest that a large improvement is achieved in temporal relation generation.
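
    The temporal tag system described here maps SED timestamps to human-readable relations between events. The sketch below shows one simple way such a mapping could work; the relation names, tolerance, and example detections are assumptions for illustration, not the paper's actual tag set.

    # Turn (label, onset, offset) detections into coarse pairwise relation tags.
    from itertools import combinations

    def temporal_tag(a, b, tol=0.5):
        """Relate two events given as (label, onset_seconds, offset_seconds)."""
        (la, on_a, off_a), (lb, on_b, off_b) = a, b
        if abs(on_a - on_b) <= tol and abs(off_a - off_b) <= tol:
            return f"{la} simultaneous with {lb}"
        if off_a <= on_b:
            return f"{la} before {lb}"
        if off_b <= on_a:
            return f"{lb} before {la}"
        return f"{la} overlaps with {lb}"

    # Hypothetical detections a SED model might output for one clip
    events = [("dog barking", 0.0, 2.1), ("car passing", 1.5, 4.0), ("horn", 4.2, 4.6)]
    for a, b in combinations(events, 2):
        print(temporal_tag(a, b))
    # dog barking overlaps with car passing
    # dog barking before horn
    # car passing before horn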