
    Automated Audio Captioning with Recurrent Neural Networks

    We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, and the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder is a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with weights shared across timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated with metrics used in the machine translation and image captioning fields. The results show that the proposed method can predict words appearing in the original caption, but not always in the correct order. Comment: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017
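
    To make the architecture concrete, below is a minimal PyTorch sketch of this kind of GRU encoder-decoder. The layer sizes, vocabulary size, and the mean-pooled "alignment" step are illustrative placeholders, not the exact configuration used in the paper.

```python
# Minimal sketch of a GRU encoder-decoder for audio captioning (PyTorch).
# Layer sizes, vocabulary size, and the simple alignment step are
# illustrative placeholders, not the paper's exact configuration.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, n_mels=64, enc_hidden=256, dec_hidden=256, vocab_size=5000):
        super().__init__()
        # Multi-layered bi-directional GRU encoder over log mel-band energies.
        self.encoder = nn.GRU(n_mels, enc_hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        # Alignment model: a fully connected layer shared across timesteps.
        self.align = nn.Linear(2 * enc_hidden, dec_hidden)
        # Multi-layered GRU decoder followed by a shared classification layer.
        self.decoder = nn.GRU(dec_hidden, dec_hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(dec_hidden, vocab_size)

    def forward(self, mel, caption_len):
        # mel: (batch, time, n_mels) log mel-band energies.
        enc_out, _ = self.encoder(mel)                 # (batch, time, 2*enc_hidden)
        aligned = torch.tanh(self.align(enc_out))      # (batch, time, dec_hidden)
        # Crude stand-in for the learned alignment: mean-pool the encoder
        # states and repeat one context vector per output word.
        context = aligned.mean(dim=1, keepdim=True).repeat(1, caption_len, 1)
        dec_out, _ = self.decoder(context)             # (batch, caption_len, dec_hidden)
        return self.classifier(dec_out)                # word logits per output step

logits = CaptionModel()(torch.randn(4, 2000, 64), caption_len=10)
print(logits.shape)  # torch.Size([4, 10, 5000])
```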

    Information retrieval in YouTube videos

    Undergraduate final-year project. This work helps users by providing a simple, visual, interactive, and dynamic tool that makes it easier to search for information in videos. Development was divided into requirements gathering, a study of the technologies needed to deploy the system, and implementation of the system itself. The project resulted in a web application built with JavaScript and Python, with the help of Bootstrap, Flask, and IBM Watson, that searches for information in videos. The system offers the possibility of adding new videos to the database and running queries in order to retrieve relevant documents. The tool can be used by anyone who wants to search for information in videos.
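
    Below is a minimal sketch of what such a Flask-based search service could look like. The route names, the in-memory index, and the transcribe() placeholder are hypothetical; the actual system relies on IBM Watson for speech-to-text and has its own storage and Bootstrap/JavaScript front end.

```python
# Hypothetical minimal Flask service in the spirit of the system described
# above: index a video's transcript, then answer keyword queries over it.
from flask import Flask, request, jsonify

app = Flask(__name__)
index = {}  # video_id -> transcript text


def transcribe(video_url: str) -> str:
    # Placeholder for the speech-to-text step (IBM Watson in the original work).
    raise NotImplementedError


@app.post("/videos")
def add_video():
    data = request.get_json()
    index[data["id"]] = transcribe(data["url"])
    return jsonify({"status": "indexed", "id": data["id"]})


@app.get("/search")
def search():
    query = request.args.get("q", "").lower()
    hits = [vid for vid, text in index.items() if query in text.lower()]
    return jsonify({"query": query, "results": hits})


if __name__ == "__main__":
    app.run(debug=True)
```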

    Sequence Temporal Sub-Sampling for Automated Audio Captioning

    Audio captioning is a novel task in machine learning which involves generating a textual description for an audio signal. For example, an audio captioning method must be able to generate descriptions like “two people talking about football” or “college clock striking” from the corresponding audio signals. Audio captioning is one of the tasks in the Detection and Classification of Acoustic Scenes and Events 2020 (DCASE2020) challenge. Most audio captioning methods use the encoder-decoder deep neural network architecture as a function that maps the features extracted from the input audio sequence to the output caption. However, the output caption is considerably shorter than the input audio sequence, for example, 10 words versus 2000 audio feature vectors. This thesis reports an attempt to take advantage of this difference in length by employing temporal sub-sampling in the encoder-decoder neural network. The method is evaluated using the Clotho audio captioning dataset and the DCASE2020 evaluation metrics. Experimental results show that temporal sequence sub-sampling improves all considered metrics, as well as the memory and time complexity of training and inference.
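
    The following is a minimal PyTorch sketch of temporal sub-sampling between recurrent encoder layers. The sub-sampling factor and layer sizes are placeholders, not the thesis configuration; the point is simply that the time axis shrinks between layers, so the encoder output is far shorter than the input feature sequence.

```python
# Minimal sketch of temporal sub-sampling between GRU encoder layers (PyTorch).
import torch
import torch.nn as nn

class SubSamplingEncoder(nn.Module):
    def __init__(self, n_mels=64, hidden=256, factor=2):
        super().__init__()
        self.factor = factor
        self.gru1 = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.gru3 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, time, n_mels) audio features.
        out, _ = self.gru1(x)
        out = out[:, ::self.factor, :]   # keep every `factor`-th timestep
        out, _ = self.gru2(out)
        out = out[:, ::self.factor, :]   # shorten the sequence again
        out, _ = self.gru3(out)
        return out

enc = SubSamplingEncoder()
features = torch.randn(4, 2000, 64)      # ~2000 audio feature vectors
print(enc(features).shape)               # torch.Size([4, 500, 512])
```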

    Deep Learning Based Sound Event Detection and Classification

    The sense of hearing plays an important role in our daily lives. In recent years, there have been many studies aiming to transfer this capability to computers. In this dissertation, we design and implement deep learning based algorithms to improve the ability of computers to recognize different sound events. In the first topic, we investigate sound event detection, which identifies the time boundaries of sound events in addition to the type of the events. For sound event detection, we propose a new method, AudioMask, which benefits from object-detection techniques in computer vision. In this method, we convert the problem of identifying time boundaries for sound events into the problem of identifying objects in images, by treating the spectrograms of the sound as images. AudioMask first applies Mask R-CNN, an algorithm for detecting objects in images, to the log-scaled mel-spectrograms of the sound files. Then we use a frame-based sound event classifier, trained independently from Mask R-CNN, to analyze each individual frame in the candidate segments. Our experiments show that this approach has promising results and can successfully identify the exact time boundaries of the sound events. The code for this study is available at https://github.com/alireza-nasiri/AudioMask. In the second topic, we present SoundCLR, a supervised contrastive learning based method for effective environmental sound classification with state-of-the-art performance, which works by learning representations that disentangle the samples of each class from those of other classes. We also exploit transfer learning and strong data augmentation to improve the results. Our extensive benchmark experiments show that our hybrid deep network models, trained with a combined contrastive and cross-entropy loss, achieve state-of-the-art performance on three benchmark datasets, ESC-10, ESC-50, and US8K, with validation accuracies of 99.75%, 93.4%, and 86.49% respectively. The ensemble version of our models also outperforms other top ensemble methods. Finally, we analyze the acoustic emissions generated during the degradation process of SiC composites. The aim here is to identify the state of degradation in the material by classifying its emitted acoustic signals. As our baseline, we use a random forest method on expert-defined features. We also propose a deep neural network of convolutional layers to identify patterns in the raw sound signals. Our experiments show that both of our methods reliably identify the degradation state of the composite, and on average the convolutional model significantly outperforms the random forest technique.
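
    As an illustration of the contrastive objective that SoundCLR builds on, below is a minimal PyTorch sketch of a supervised contrastive (SupCon-style) loss. The temperature and embedding size are placeholders, and the full method additionally combines this loss with a cross-entropy head, transfer learning, and strong data augmentation.

```python
# Minimal sketch of a supervised contrastive (SupCon-style) loss (PyTorch).
# Embeddings of the same class are pulled together, others pushed apart.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # embeddings: (batch, dim) projections; labels: (batch,) class ids.
    z = F.normalize(embeddings, dim=1)                   # unit-length embeddings
    sim = z @ z.t() / temperature                        # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    # Positives: same class, excluding the anchor itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))      # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)      # keep the diagonal out of the sum
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                               # anchors that have positives
    mean_log_prob_pos = (log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

# Toy usage: 8 random embeddings from 3 classes.
loss = supervised_contrastive_loss(torch.randn(8, 128),
                                   torch.tensor([0, 1, 2, 0, 1, 2, 0, 1]))
print(loss.item())
```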