
    JRLV at SemEval-2022 Task 5: The Importance of Visual Elements for Misogyny Identification in Memes

    Gender discrimination is a serious and widespread problem on social media and online in general. Besides offensive messages, memes are one of the main vehicles for disseminating such content. On these premises, the MAMI task was proposed at SemEval-2022: identifying memes with misogynous characteristics. In this work, we propose a solution to this problem based on Mask R-CNN and VisualBERT that leverages the multimodal nature of the task. Our study focuses on how the two sources of data in memes (text and image) and their possible combinations impact performance. Our best result only slightly exceeds the stronger baseline, but the experiments allowed us to draw important conclusions about correctly exploiting the visual information and about the relevance of the elements present in the meme images.
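
    Below is a minimal sketch of the fusion the abstract describes: pre-extracted detector region features fed to HuggingFace's VisualBertModel with a binary head. The checkpoint names, the 36-region/2048-d feature shape, and the linear head are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: classify a meme with VisualBERT given pre-extracted region features.
# Extracting features from the Mask R-CNN ROI head is elided; a stand-in
# tensor of shape (1, 36, 2048) takes their place here.
import torch
import torch.nn as nn
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
backbone = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
classifier = nn.Linear(backbone.config.hidden_size, 2)  # misogynous vs. not

meme_text = "example meme caption"
region_feats = torch.randn(1, 36, 2048)  # stand-in for Mask R-CNN ROI features

inputs = tokenizer(meme_text, return_tensors="pt")
outputs = backbone(
    **inputs,
    visual_embeds=region_feats,
    visual_attention_mask=torch.ones(region_feats.shape[:-1]),
    visual_token_type_ids=torch.ones(region_feats.shape[:-1], dtype=torch.long),
)
logits = classifier(outputs.pooler_output)  # two-way meme classification
```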

    PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for Multimodal Italian Fake News Detection

    The MULTI-Fake-DetectiVE challenge addresses the automatic detection of Italian fake news in a multimodal setting, where both textual and visual components contribute as potential sources of fake content. This paper describes the PoliTO approach to the tasks of fake news detection and analysis of the modality contributions. Our solution turns out to be the best performer on both tasks. It leverages the established FND-CLIP multimodal architecture and proposes ad hoc extensions, including sentiment-based text encoding, image transformation in the frequency domain, and data augmentation via back-translation. Thanks to its effectiveness in combining visual and textual content, our solution contributes to fighting the spread of disinformation in the Italian news flow.
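
    As a hedged illustration of the back-translation augmentation mentioned above, the sketch below round-trips an Italian headline through English with public MarianMT checkpoints. The checkpoints are real Helsinki-NLP models, but their use here shows the general recipe, not the paper's exact pipeline.

```python
# Back-translation augmentation: Italian -> English -> Italian paraphrase.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in out]

headline = ["Il governo annuncia nuove misure economiche."]
english = translate(headline, "Helsinki-NLP/opus-mt-it-en")
augmented = translate(english, "Helsinki-NLP/opus-mt-en-it")  # paraphrased Italian
```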

    How Much Attention Should we Pay to Mosquitoes?

    Mosquitoes are a major global health problem. They are responsible for the transmission of diseases and can have a large impact on local economies. Monitoring mosquitoes therefore helps prevent outbreaks of mosquito-borne diseases. In this paper, we propose a novel data-driven approach that leverages Transformer-based models to identify mosquitoes in audio recordings. The task consists of detecting the time intervals corresponding to acoustic mosquito events in an audio signal. We formulate the problem as a sequence tagging task and train a Transformer-based model on a real-world dataset of mosquito recordings. By leveraging the sequential nature of the recordings, we formulate the training objective so that the input recordings do not require fine-grained annotations. We show that our approach outperforms baseline methods on standard evaluation metrics, albeit suffering from an unexpectedly high false negative rate. In view of the achieved results, we propose future directions for the design of more effective mosquito detection models.
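
    A minimal sketch of the sequence-tagging formulation follows: each spectrogram frame receives a binary tag (mosquito vs. background) from a Transformer encoder. All dimensions are illustrative assumptions, and the weakly supervised objective the abstract mentions is elided.

```python
# Per-frame tagging of audio: mel frames in, mosquito/background logits out.
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    def __init__(self, n_mels=64, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.tagger = nn.Linear(d_model, 2)  # per-frame: mosquito vs. background

    def forward(self, mel):               # mel: (batch, frames, n_mels)
        h = self.encoder(self.proj(mel))  # contextualize frames over time
        return self.tagger(h)             # (batch, frames, 2) tag logits

model = FrameTagger()
logits = model(torch.randn(2, 1000, 64))  # two 1000-frame recordings
```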

    PoliToHFI at SemEval-2023 Task 6: Leveraging Entity-Aware and Hierarchical Transformers For Legal Entity Recognition and Court Judgment Prediction

    The use of Natural Language Processing techniques in the legal domain has become established for supporting attorneys and domain experts in content retrieval and decision-making. However, understanding legal text poses relevant challenges in the recognition of domain-specific entities and in the adaptation and explanation of predictive models. This paper addresses the Legal Entity Name Recognition (L-NER) and Court Judgment Prediction (CJP) and Explanation (CJPE) tasks. The L-NER solution explores various transformer-based models, including an entity-aware method attending to domain-specific entities. The proposed CJPE method relies on hierarchical BERT-based classifiers combined with local input attribution explainers. We propose a broad comparison of eXplainable AI methodologies along with a novel approach based on NER. For the L-NER task, the experimental results highlight the importance of domain-specific pre-training. For CJP, our lightweight solution performs in line with existing approaches, and our NER-boosted explanations show promising CJPE results in terms of the conciseness of the prediction explanations.
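
    The sketch below renders the hierarchical classifier idea: a BERT encoder summarizes fixed-size chunks of a long judgment, a second Transformer attends across chunk embeddings, and a linear head predicts the outcome. The legal-bert checkpoint, chunking scheme, and head sizes are assumptions, not the paper's exact setup.

```python
# Hierarchical document classifier for long court judgments.
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalJudgmentClassifier(nn.Module):
    def __init__(self, base="nlpaueb/legal-bert-base-uncased", num_labels=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        d = self.bert.config.hidden_size
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, num_labels)

    def forward(self, chunk_ids, chunk_mask):
        # chunk_ids: (n_chunks, seq_len) token ids for one document
        cls = self.bert(chunk_ids, attention_mask=chunk_mask).last_hidden_state[:, 0]
        doc = self.doc_encoder(cls.unsqueeze(0))  # attend across chunks
        return self.head(doc.mean(dim=1))         # document-level logits
```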

    Transformer-based Non-Verbal Emotion Recognition: Exploring Model Portability across Speakers’ Genders

    Recognizing emotions in non-verbal audio tracks requires a deep understanding of their underlying features. Traditional classifiers relying on excitation, prosodic, and vocal tract features are not always capable of generalizing effectively across speakers' genders. In the ComParE 2022 vocalisation sub-challenge, we explore the use of a Transformer architecture trained on contrastive audio examples. We leverage augmented data to learn robust non-verbal emotion classifiers. We also investigate the impact of different audio transformations, including neural voice conversion, on the classifier's ability to generalize across speakers' genders. The empirical findings indicate that neural voice conversion is beneficial in the pre-training phase, yielding improved model generality, whereas it is harmful at the fine-tuning stage, as it hinders model specialization for the task of non-verbal emotion recognition.
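
    As a hedged sketch of the contrastive-pretraining ingredient, the NT-Xent loss below pulls together embeddings of two augmented views (e.g. pitch shift or voice conversion) of the same clip. The encoder and the augmentations themselves are stand-ins; the loss is a standard formulation, not necessarily the authors' exact variant.

```python
# NT-Xent contrastive loss over paired views of the same audio clips.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same clips."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a view is not its own positive
    n = z1.size(0)
    # row i's positive is the other view of the same clip (offset by n)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```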

    Designing Logic Tensor Networks for Visual Sudoku puzzle classification

    Given the increasing importance of the neurosymbolic (NeSy) approach in artificial intelligence, there is growing interest in benchmarks specifically designed to emphasize the ability of AI systems to combine low-level representation learning with high-level symbolic reasoning. One such recent benchmark is Visual Sudoku Puzzle Classification, which combines visual perception with relational constraints. In this work, we investigate the application of Logic Tensor Networks (LTNs) to the Visual Sudoku Classification task and discuss various alternatives in terms of logical constraint formulation, integration with the perceptual module, and training procedure.
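
    To give a feel for the Real Logic idea behind LTNs (this toy is plain PyTorch, not the ltn library's API), the sketch below treats digit-classifier softmax outputs as fuzzy truth values and turns the "no repeated digit in a row" Sudoku constraint into a differentiable satisfaction degree under a product t-norm.

```python
# Toy fuzzy-logic constraint: rows of a Sudoku grid should hold distinct digits.
import torch

def row_distinct_satisfaction(probs):
    """probs: (9, 9, 9) softmax over digits for each cell of one grid.
    For each row and digit, P(digit appears twice) is approximated by the sum
    of pairwise products; satisfaction is the product-t-norm conjunction of
    1 - that probability over all rows and digits.
    """
    sat = 1.0
    for row in probs:                        # row: (9 cells, 9 digits)
        for d in range(9):
            p = row[:, d]
            pair = (p.sum() ** 2 - (p ** 2).sum()) / 2  # sum over cell pairs
            sat = sat * torch.clamp(1 - pair, min=0)    # fuzzy AND (product)
    return sat  # differentiable degree to which all rows are "distinct"

loss = 1 - row_distinct_satisfaction(torch.softmax(torch.randn(9, 9, 9), -1))
```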

    Learning Confidence Intervals for Feature Importance: A Fast Shapley-based Approach

    Inferring feature importance is a well-known machine learning problem. Assigning importance scores to the input data features is particularly helpful for explaining black-box models. Existing approaches rely on either statistical or neural network-based methods. Among them, Shapley Value estimates are among the most widely used scores for explaining individual classification models or ensemble methods. As a drawback, state-of-the-art neural network-based approaches neglect the uncertainty of the input predictions when computing the confidence intervals of the feature importance scores. This paper extends a state-of-the-art neural method for Shapley Value estimation to handle the uncertain predictions made by ensemble methods and to estimate a confidence interval for the feature importances. The results show that (1) the estimated confidence intervals are coherent with expectations and more reliable than those of baseline methods; (2) the efficiency of the Shapley Value estimator is comparable to that of traditional models; and (3) the level of uncertainty of the Shapley Value estimates decreases as ensembles with larger numbers of predictors are used.
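
    For the statistical backdrop only (the paper's contribution is a neural estimator, which this is not), the sketch below shows the classical Monte Carlo permutation estimator of Shapley Values, where the spread of sampled marginal contributions yields a normal-approximation confidence interval per feature.

```python
# Monte Carlo Shapley estimates with 95% normal-approximation intervals.
import numpy as np

def shapley_with_ci(value_fn, n_features, n_samples=2000):
    """value_fn(subset: set) -> model payoff for that feature subset."""
    rng = np.random.default_rng(0)
    contrib = np.zeros((n_samples, n_features))
    for s in range(n_samples):
        perm = rng.permutation(n_features)
        prev, coalition = value_fn(set()), set()
        for f in perm:
            coalition.add(f)
            cur = value_fn(coalition)
            contrib[s, f] = cur - prev  # marginal contribution of feature f
            prev = cur
    mean = contrib.mean(axis=0)
    half = 1.96 * contrib.std(axis=0, ddof=1) / np.sqrt(n_samples)
    return mean, mean - half, mean + half  # estimate and 95% CI bounds
```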

    PoliTo at SemEval-2023 Task 1: CLIP-based Visual-Word Sense Disambiguation Based on Back-Translation

    Visual-Word Sense Disambiguation (V-WSD) entails resolving the linguistic ambiguity in a text by selecting a clarifying image from a set of (potentially misleading) candidates. In this paper, we address V-WSD using a state-of-the-art image-text retrieval system, namely CLIP. We propose to alleviate the linguistic ambiguity across multiple domains and languages via text and image augmentation. To augment the textual content, we rely on back-translation with the aid of a variety of auxiliary languages. The approach based on fine-tuning CLIP on the full phrases is effective in accurately disambiguating words, and incorporating back-translation enhances the system's robustness and performance on the test samples written in Indo-European languages.
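
    The retrieval step itself is standard CLIP scoring, sketched below with the public openai/clip-vit-base-patch32 checkpoint: the (possibly back-translated) phrase is scored against each candidate image and the best match is selected. File names and the phrase are placeholders.

```python
# CLIP-based image selection for V-WSD: pick the candidate that best matches
# the context phrase.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

phrase = "andromeda tree"  # ambiguous context phrase (placeholder)
candidates = [Image.open(p) for p in ("img0.jpg", "img1.jpg", "img2.jpg")]

inputs = processor(text=[phrase], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text  # (1, n_candidate_images)
best = logits.argmax(dim=-1).item()           # index of the clarifying image
```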

    Leveraging multimodal content for podcast summarization

    Podcasts are becoming an increasingly popular way to share streaming audio content. Podcast summarization aims at improving the accessibility of podcast content by automatically generating a concise summary consisting of text/audio extracts. Existing approaches either extract short audio snippets by means of speech summarization techniques or produce abstractive summaries of the speech transcription, disregarding the podcast audio. To leverage the multimodal information hidden in podcast episodes, we propose an end-to-end architecture for extractive summarization that encodes both acoustic and textual content. It learns to attend to relevant multimodal features using an ad hoc deep feature fusion network. The experimental results achieved on a real benchmark dataset show the benefits of integrating audio encodings into the extractive summarization process. The quality of the generated summaries is superior to that of existing extractive methods.
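
    A hedged sketch of the fusion idea closes the list: per-sentence text embeddings and time-aligned audio-segment embeddings are combined through a learned gate, and a linear head scores each sentence for inclusion in the extractive summary. The gating choice and all dimensions are assumptions, not the paper's exact network.

```python
# Gated text/audio fusion for extractive sentence scoring.
import torch
import torch.nn as nn

class GatedFusionExtractor(nn.Module):
    def __init__(self, d_text=768, d_audio=128, d=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d)
        self.audio_proj = nn.Linear(d_audio, d)
        self.gate = nn.Linear(2 * d, d)
        self.score = nn.Linear(d, 1)

    def forward(self, text_emb, audio_emb):
        # text_emb: (n_sents, d_text); audio_emb: (n_sents, d_audio), aligned
        t, a = self.text_proj(text_emb), self.audio_proj(audio_emb)
        g = torch.sigmoid(self.gate(torch.cat([t, a], dim=-1)))
        fused = g * t + (1 - g) * a           # modality-weighted combination
        return self.score(fused).squeeze(-1)  # per-sentence selection logits
```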