
    Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

    This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated with regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best AUC is 95.5% on five cross-validations of the development data and 88.1% on the unseen evaluation data.
    Comment: Accepted for European Signal Processing Conference 201
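
    A minimal sketch of the stacked convolutional-recurrent idea described above, assuming log mel-band energy frames as input: a small CNN front-end learns local time-frequency patterns and a bidirectional GRU models temporal structure, ending in a clip-level bird/no-bird score. The layer sizes, pooling, and PyTorch framing are illustrative assumptions; the abstract does not specify the paper's exact architecture.

    ```python
    # Sketch of a stacked convolutional-recurrent bird detector (illustrative sizes).
    import torch
    import torch.nn as nn

    class CRNNDetector(nn.Module):
        def __init__(self, n_mels=40, conv_channels=64, rnn_hidden=64):
            super().__init__()
            # Convolutional front-end: local time-frequency patterns.
            self.conv = nn.Sequential(
                nn.Conv2d(1, conv_channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d((1, 4)),  # pool along the frequency axis only
            )
            # Recurrent back-end: temporal structure across frames.
            self.rnn = nn.GRU(conv_channels * (n_mels // 4), rnn_hidden,
                              batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * rnn_hidden, 1)  # clip-level score

        def forward(self, log_mel):               # (batch, time, n_mels)
            x = log_mel.unsqueeze(1)              # -> (batch, 1, time, n_mels)
            x = self.conv(x)                      # -> (batch, C, time, n_mels/4)
            x = x.permute(0, 2, 1, 3).flatten(2)  # -> (batch, time, C * n_mels/4)
            x, _ = self.rnn(x)
            return torch.sigmoid(self.head(x.mean(dim=1)))  # average over time

    model = CRNNDetector()
    scores = model(torch.randn(8, 500, 40))  # 8 clips, 500 frames, 40 mel bands
    print(scores.shape)                      # torch.Size([8, 1])
    ```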

    A Mobile Application Framework to Classify Philippine Currency Images to Audio Labels Using Deep Learning

    This research presents a mobile application framework designed to empower visually impaired individuals in Legazpi City by providing real-time audio feedback for currency identification. Leveraging deep learning techniques, the proposed framework employs a robust model trained on a comprehensive dataset of Philippine currency images. The deep learning model accurately classifies various denominations of bills and coins, enabling an inclusive solution for the visually impaired community. The researcher employed a qualitative approach, including a focus group discussion with respondents chosen through purposive sampling; among them were masseuses, chiropractors, herbal street vendors, and students. The selected participants contributed to the focus group discussion through an online meeting, and an in-depth informal interview was conducted to gather additional information for the development of an architectural framework. The results indicate that implementing this architectural framework would enable these groups to identify money more easily, increasing efficiency and reducing errors in cash transactions. Audio labels are particularly helpful for visually impaired individuals, as they provide an accessible way to handle and identify money independently.
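
    A minimal sketch of the classify-then-speak flow this framework describes: an image classifier predicts the denomination, and the predicted label is mapped to a pre-recorded audio clip for playback. The model file, denomination list, clip paths, and preprocessing below are all hypothetical; the abstract does not specify the model or label set.

    ```python
    # Sketch: map a currency photo to an audio label (all names hypothetical).
    import torch
    from torchvision import transforms
    from PIL import Image

    DENOMINATIONS = ["20 pesos", "50 pesos", "100 pesos",
                     "200 pesos", "500 pesos", "1000 pesos"]
    AUDIO_CLIPS = {label: f"audio/{label.replace(' ', '_')}.wav"
                   for label in DENOMINATIONS}

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    def classify_to_audio(image_path, model):
        """Classify a currency photo and return (label, audio clip path)."""
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            logits = model(image)
        label = DENOMINATIONS[logits.argmax(dim=1).item()]
        return label, AUDIO_CLIPS[label]

    # model = torch.jit.load("currency_model.pt")   # hypothetical trained model
    # label, clip = classify_to_audio("photo.jpg", model)
    # The app would then play `clip` through the phone's speaker.
    ```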

    Identifying patterns of human and bird activities using bioacoustic data

    In general, humans and animals often interact within the same environment at the same time, and human activities may disturb or affect some bird activities. It is therefore important to monitor and study the relationships between human and animal activities. This paper proposes a system that not only automatically classifies human and bird activities using bioacoustic data, but also automatically summarizes patterns of events over time. To perform automatic summarization of acoustic events, a frequency–duration graph (FDG) framework is proposed to summarize the patterns of human and bird activities. The system first pre-processes the raw bioacoustic data and then applies a support vector machine (SVM) model and a multi-layer perceptron (MLP) model to classify human and bird chirping activities before using the FDG framework to summarize the results. Both the SVM and MLP models achieved 98% accuracy on average across several day-long recordings. Three case studies with real data show that the FDG framework correctly determined the patterns of human and bird activities over time and provided both statistical and graphical insight into the relationships between the two.
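
    The classification stage could be prototyped as below with scikit-learn, training an SVM and an MLP on pre-extracted acoustic feature vectors. The random features, three placeholder classes, and hyperparameters are assumptions for illustration; the paper's actual features and model settings are not given in the abstract.

    ```python
    # Sketch: SVM and MLP classifiers on placeholder acoustic features.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 20))    # placeholder per-segment feature vectors
    y = rng.integers(0, 3, size=600)  # placeholder labels: human / bird / other

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    mlp = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(64,),
                                      max_iter=500, random_state=0))
    for name, clf in [("SVM", svm), ("MLP", mlp)]:
        clf.fit(X_tr, y_tr)  # classified segments would then feed the FDG summary
        print(name, "accuracy:", clf.score(X_te, y_te))
    ```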

    BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics

    The ability of a machine learning model to cope with differences between training and deployment conditions (e.g., distribution shift or generalization to entirely new classes) is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain, with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively recorded datasets, given focal recordings from a large citizen-science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and nearest-centroid search. Our thorough empirical evaluation and analysis surface open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shift and generalization of ML models.
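
    A minimal sketch of the nearest-centroid baseline named above: class centroids are computed from embeddings of the focal (training) recordings, and a query embedding from a passive recording is ranked against them by distance. The random embeddings and Euclidean metric are placeholders for the learned representation; the benchmark's exact setup is not given in the abstract.

    ```python
    # Sketch: nearest-centroid retrieval over placeholder embeddings.
    import numpy as np

    def class_centroids(embeddings, labels):
        """Mean embedding per class; returns (class ids, centroid matrix)."""
        classes = np.unique(labels)
        return classes, np.stack([embeddings[labels == c].mean(axis=0)
                                  for c in classes])

    def retrieve(query, centroids, classes):
        """Rank classes for one query embedding by Euclidean distance."""
        dists = np.linalg.norm(centroids - query, axis=1)
        return classes[np.argsort(dists)]

    rng = np.random.default_rng(0)
    train_emb = rng.normal(size=(500, 128))    # focal-recording embeddings
    train_lab = rng.integers(0, 10, size=500)  # species labels
    classes, centroids = class_centroids(train_emb, train_lab)
    query = rng.normal(size=128)               # embedding of a passive recording
    print(retrieve(query, centroids, classes)[:3])  # top-3 candidate species
    ```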

    Consecutive Decoding for Speech-to-text Translation

    Speech-to-text translation (ST), which directly translates source-language speech into target-language text, has attracted intensive attention recently. However, combining speech recognition and machine translation in a single model places a heavy burden on the direct cross-modal, cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate the source transcript and the target translation text with a single decoder. This benefits model training, since large additional parallel text corpora can be fully exploited to enhance speech translation training. Our method is verified on three mainstream datasets: the Augmented LibriSpeech English-French dataset, the TED English-German dataset, and the TED English-Chinese dataset. Experiments show that the proposed COSTT outperforms previous state-of-the-art methods. The code is available at https://github.com/dqqcasia/st.
    Comment: Accepted by AAAI 2021. arXiv admin note: text overlap with arXiv:2009.0970
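
    One way to picture the single-decoder idea is that the training target concatenates the source transcript and the target translation, so the decoder emits both consecutively. The sketch below shows that sequence layout with illustrative separator tokens; COSTT's actual special tokens and training details are not given in the abstract.

    ```python
    # Sketch: target-sequence layout for consecutive transcription + translation.
    def build_target(transcript_tokens, translation_tokens,
                     sep="<sep>", eos="<eos>"):
        """Concatenate transcript and translation into one decoder target."""
        return transcript_tokens + [sep] + translation_tokens + [eos]

    def split_output(decoded, sep="<sep>", eos="<eos>"):
        """Recover transcript and translation from a decoded token stream."""
        if sep in decoded:
            i = decoded.index(sep)
            transcript, rest = decoded[:i], decoded[i + 1:]
        else:
            transcript, rest = decoded, []
        translation = rest[:rest.index(eos)] if eos in rest else rest
        return transcript, translation

    target = build_target(["the", "cat", "sat"],
                          ["le", "chat", "s'est", "assis"])
    print(target)         # transcript, <sep>, translation, <eos>
    print(split_output(target))
    ```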