
    Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer

    In this work, we propose to study the performance of a model trained with a sentence embedding regression loss component for the Automated Audio Captioning task. This task aims to build systems that can describe audio content with a single sentence written in natural language. Most systems are trained with the standard Cross-Entropy loss, which does not take into account the semantic closeness of sentences. We found that adding a sentence embedding loss term not only reduces overfitting, but also increases SPIDEr from 0.397 to 0.418 in our first setting on the AudioCaps corpus. When we increased the weight decay value, our model came much closer to the current state-of-the-art methods, with a SPIDEr score of up to 0.444 compared to 0.475, while using eight times fewer trainable parameters than the current state-of-the-art method Multi-TTA. In this training setting, the sentence embedding loss no longer has an impact on model performance.
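
    A minimal sketch of such a multitask objective: token-level Cross-Entropy plus a sentence-embedding regression term. The MSE form, the weight lambda_emb, and the variable names below are assumptions, since the abstract does not give the exact formulation.

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits, target_tokens, pred_embedding, ref_embedding,
                    lambda_emb=0.5, pad_idx=0):
    """Cross-Entropy plus sentence-embedding regression (assumed MSE form).

    logits:         (batch, seq_len, vocab) decoder outputs
    target_tokens:  (batch, seq_len) ground-truth token ids
    pred_embedding: (batch, dim) sentence embedding of the generated caption
    ref_embedding:  (batch, dim) sentence embedding of the reference caption
    lambda_emb:     weight of the auxiliary term (assumed value)
    """
    # Standard token-level Cross-Entropy; cross_entropy expects (N, C, L).
    ce = F.cross_entropy(logits.transpose(1, 2), target_tokens,
                         ignore_index=pad_idx)
    # Regression term pulling the predicted caption's embedding toward the
    # reference embedding, capturing semantic closeness of whole sentences.
    emb = F.mse_loss(pred_embedding, ref_embedding)
    return ce + lambda_emb * emb
```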

    Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

    Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording with a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek the best matching audio recording(s) for a given textual query (Text-to-Audio), or vice versa (Audio-to-Text). These tasks call for different types of systems: AAC employs a sequence-to-sequence model, while ATR uses a ranking model that compares audio and text representations within a shared projection subspace. In this work, we investigate the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a transformer decoder responsible for generating sentences. For AAC, it achieves a high SPIDEr-FL score of 0.298 on Clotho and 0.472 on AudioCaps on average. For ATR, we propose using the standard Cross-Entropy loss values obtained for any audio/caption pair. Experimental results on the Clotho and AudioCaps datasets demonstrate decent recall values with this simple approach. For instance, we obtained a Text-to-Audio R@1 value of 0.382 on AudioCaps, which is above the current state-of-the-art method without external data. Interestingly, we observe that normalizing the loss values was necessary for Audio-to-Text retrieval.
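
    A hedged sketch of the retrieval procedure: every audio/caption pair is scored by the decoder's Cross-Entropy loss, lower meaning a better match, and Audio-to-Text scores are normalized per caption before ranking. The helper model.caption_loss and the z-score normalization are assumptions; the abstract only states that some normalization was necessary.

```python
import torch

@torch.no_grad()
def retrieval_rankings(model, audios, captions):
    """Rank audio/caption pairs by Cross-Entropy loss (lower = better match).

    model.caption_loss(audio, caption) is a hypothetical helper returning
    the mean CE loss of teacher-forcing `caption` given `audio`.
    """
    losses = torch.empty(len(audios), len(captions))
    for i, audio in enumerate(audios):
        for j, caption in enumerate(captions):
            losses[i, j] = model.caption_loss(audio, caption)

    # Text-to-Audio: for each caption (column), rank audios by ascending loss.
    t2a_ranking = losses.argsort(dim=0)

    # Audio-to-Text: raw losses are biased by the caption itself (length,
    # word frequency), so normalize each caption's losses across all audios
    # before ranking; z-scoring is one plausible choice.
    z = (losses - losses.mean(dim=0, keepdim=True)) / losses.std(dim=0, keepdim=True)
    a2t_ranking = z.argsort(dim=1)
    return t2a_ranking, a2t_ranking
```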

    Temperature-modulation of protein phosphorylation in cell-free extracts of alfalfa

    The effects of temperature on a 58-kDa phosphoprotein (PP58) were examined in cell-free extracts of two alfalfa (Medicago sativa L.) cultivars, Apica and Trek. In extracts prepared without Triton X-100, PP58 is present in the 12,000 × g (P12), 28,000 × g (P28) and 100,000 × g (P100) pellets, but is enriched in the P28 fraction. In these fractions, PP58 is substantially and equally phosphorylated at both 4°C and 24°C. When extracts are prepared in the presence of Triton X-100, the PP58 present in the 28,000 × g supernatant (TXS fraction) is extensively dephosphorylated at 24°C but highly phosphorylated at 4°C. The phosphorylation of this protein increased sharply as temperature declined below 12°C, and was 15 times greater at 0°C than at 24°C. The phosphorylation level doubled between 12°C and 8°C, and again between 8°C and 4°C. The temperature effect is therefore not mediated by a simple Q10 effect. Interestingly, the temperature-response curve of PP58 phosphorylation is similar to that of the reported cold-induced calcium influx (Plant Cell 7: 321-331). Labeling reactions carried out in the presence of [gamma-35S]thioATP indicated that low temperature inhibited the dephosphorylation reaction. These results were not mimicked at room temperature by okadaic acid, an inhibitor of protein phosphatases 1 and 2A. In reactions performed at 4°C, addition of calcium caused a 2-fold increase in the phosphorylation of PP58. A decrease in phosphorylation was observed when equimolar amounts of EGTA were added in the presence of MgCl2 or MnCl2, but not in the presence of CaCl2, suggesting that this protein is phosphorylated by a calcium-dependent protein kinase. These results are consistent with the suggestion that PP58 and its putative kinase are membrane-localized, whereas the putative PP58 phosphatase is a loosely associated membrane peripheral protein lost to the supernatant during fractionation. We suggest that PP58 could be involved in low temperature…
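
    To make the Q10 claim explicit: assuming the standard temperature-coefficient formula (the abstract only names the effect), the reported doubling of phosphorylation for each 4°C drop corresponds to an apparent coefficient roughly an order of magnitude away from the values of 2 to 3 typical of simple enzyme kinetics.

```latex
% Apparent Q10 from the reported rates (rate halves as T rises from 8 to 12 C):
Q_{10} = \left(\frac{R_2}{R_1}\right)^{10/(T_2 - T_1)}
       = \left(\tfrac{1}{2}\right)^{10/4} \approx 0.18
```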


    Multilingual Audio Captioning using machine translated data

    Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording in terms of sound events. Existing datasets provide audio-caption pairs with captions written in English only. In this work, we explore multilingual AAC using machine-translated captions. We automatically translated two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish, then trained and evaluated monolingual systems in the four languages on AudioCaps and Clotho. In all cases, the models achieved similar performance: about 75% CIDEr on AudioCaps and 43% on Clotho. For French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine-translated version of AudioCaps, achieved significantly better results on this manual eval subset than the English system whose outputs we automatically translated to French. This advocates for building systems directly in a target language rather than translating the English captions of an English system into that language. Finally, we built a multilingual model that achieved results in each language comparable to each monolingual system, while using far fewer parameters than a collection of monolingual systems.
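
    The abstract does not name the machine translation system used, so the sketch below stands in a public MarianMT checkpoint from Hugging Face to show what translating a caption set to French might look like.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate_captions(captions, model_name="Helsinki-NLP/opus-mt-en-fr"):
    """Machine-translate English captions to French in one batch.

    The MarianMT checkpoint is a stand-in; the paper does not state which
    translation system produced its training captions.
    """
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(captions, return_tensors="pt", padding=True)
    outputs = model.generate(**batch)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Example: produce a French training caption from an English AudioCaps one.
print(translate_captions(["A dog barks while cars pass by."]))
```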

    Comparison of Semi-supervised Deep Learning Algorithms for Audio Classification

    In this article, we adapted five recent semi-supervised learning (SSL) methods to the task of audio classification. The first two methods, Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The three other algorithms, MixMatch (MM), ReMixMatch (RMM), and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide-ResNet-28-2 architecture in all our experiments, with 10% of the data labeled and the remaining 90% used as unlabeled training data, we first compare the error rates of the five methods on three standard benchmark audio datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). In all but one case, MM, RMM, and FM significantly outperformed MT and DCT, with MM and RMM being the best methods in most experiments. On UBS8K and GSC, MM achieved error rates (ER) of 18.02% and 3.25% respectively, outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94%. RMM achieved the best result on ESC-10 (12.00% ER), followed by FM at 13.33%. Second, we explored adding the mixup augmentation, used in MM and RMM, to DCT, MT and FM. In almost all cases, mixup brought consistent gains. For instance, on GSC, FM reached 4.44% ER without mixup and 3.31% with it. Our PyTorch code will be made available upon paper acceptance at https://github.com/Labbeti/SSLH.
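
    Since mixup is central to the second set of experiments, a minimal sketch of the standard augmentation (Zhang et al., 2018) follows; the Beta parameter alpha=0.4 is an assumed value, not one reported in the article.

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.4):
    """Standard mixup: train on convex combinations of example pairs.

    x: (batch, ...) inputs, e.g. log-mel spectrograms
    y: (batch, num_classes) one-hot or soft targets
    alpha: Beta distribution parameter (assumed value)
    """
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))          # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]   # mixed inputs
    y_mix = lam * y + (1.0 - lam) * y[perm]   # correspondingly mixed targets
    return x_mix, y_mix
```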

    Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates

    Automatic Audio Captioning (AAC) is the task that aims to describe an audio signal using natural language. AAC systems take as input an audio signal and output a free-form text sentence, called a caption. Evaluating such systems is not trivial, since there are many ways to express the same idea. For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption to one or several reference captions produced by human annotators. Nevertheless, an automatic system can produce several caption candidates, either through randomness in the sentence generation process, or by considering the various competing hypothesized captions during beam-search decoding, for instance. From the perspective of an end-user of an AAC system, presenting several captions instead of a single one seems relevant to provide some diversity, similarly to information retrieval systems. In this work, we explore the possibility of considering several predicted captions in the evaluation process instead of one. To this end, we propose SPIDEr-max, a metric that takes the maximum SPIDEr value among the scores of several caption candidates. To advocate for our metric, we report experiments on Clotho v2.1 and AudioCaps with a transformer-based system. On AudioCaps, for example, this system reached a SPIDEr-max value (with 5 candidates) close to the SPIDEr human score of reference.
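
    The metric itself is simple to state in code; in this sketch, spider_fn is a stand-in for an actual SPIDEr implementation, which the abstract does not specify.

```python
def spider_max(candidates, references, spider_fn):
    """SPIDEr-max: the best SPIDEr score among several caption candidates.

    candidates: list of predicted captions for one audio clip
    references: list of human reference captions for the same clip
    spider_fn:  callable scoring one candidate against the references
                (stand-in for a real SPIDEr implementation)
    """
    return max(spider_fn(c, references) for c in candidates)

# Example: score the top-5 beam-search outputs for one clip.
# score = spider_max(beam_outputs[:5], human_captions, spider_fn=my_spider)
```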
