9 research outputs found
Time-of-Flight Cameras in Space: Pose Estimation with Deep Learning Methodologies
Recently introduced 3D Time-of-Flight (ToF) cameras have shown huge potential for mobile robotic applications, offering a smart and fast technology that outputs 3D point clouds but lacks measurement precision and robustness. With the development of this low-cost sensing hardware, 3D perception is gaining importance in robotics as well as in many other fields, and object registration continues to gain momentum. Registration is a transformation estimation problem between a source and a target point cloud, seeking the transformation that best aligns them. This work aims at building a full pipeline, from data acquisition to transformation identification, to robustly detect known objects observed by a ToF camera within a short range, estimating their 6-degrees-of-freedom pose. We focus on demonstrating the capability of detecting a part of a satellite floating in space, to support in-orbit servicing missions (e.g. for space debris removal). Experiments reveal that deep learning techniques can obtain higher accuracy and robustness than classical methods, handling a significant amount of noise while still keeping real-time performance and low complexity of the models themselves.
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Recent advances in eXplainable AI (XAI) have provided new insights into how
models for vision, language, and tabular data operate. However, few approaches
exist for understanding speech models. Existing work focuses on a few spoken
language understanding (SLU) tasks, and explanations are difficult to interpret
for most users. We introduce a new approach to explain speech classification
models. We generate easy-to-interpret explanations via input perturbation on
two information levels. 1) Word-level explanations reveal how each word-related
audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody
and background noise) answer the counterfactual: "What would the model
prediction be if we edited the audio signal in this way?" We validate our
approach by explaining two state-of-the-art SLU models on two speech
classification tasks in English and Italian. Our findings demonstrate that the
explanations are faithful to the model's inner workings and plausible to
humans. Our method and findings pave the way for future research on
interpreting speech models.
How Much Attention Should we Pay to Mosquitoes?
Mosquitoes are a major global health problem. They are responsible for the transmission of diseases and can have a large impact on local economies. Monitoring mosquitoes is therefore helpful in preventing the outbreak of mosquito-borne diseases. In this paper, we propose a novel data-driven approach that leverages Transformer-based models for the identification of mosquitoes in audio recordings. The task aims at detecting the time intervals corresponding to the acoustic mosquito events in an audio signal. We formulate the problem as a sequence tagging task and train a Transformer-based model using a real-world dataset of mosquito recordings. By leveraging the sequential nature of mosquito recordings, we formulate the training objective so that the input recordings do not require fine-grained annotations. We show that our approach is able to outperform baseline methods using standard evaluation metrics, albeit suffering from unexpectedly high false-negative rates. In view of the achieved results, we propose future directions for the design of more effective mosquito detection models.
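As an illustration of the sequence tagging formulation described in this abstract, the following minimal sketch turns per-frame tags into time intervals. It is not the authors' code: the function name `tags_to_intervals`, the binary tag encoding (1 for a mosquito frame, 0 otherwise), and the fixed frame duration are all assumptions made for the example.

```python
from typing import List, Tuple


def tags_to_intervals(tags: List[int], frame_seconds: float) -> List[Tuple[float, float]]:
    """Merge consecutive positive frame tags into (start, end) intervals in seconds.

    Assumption: tags[i] == 1 marks frame i as containing a mosquito event,
    and every frame lasts exactly frame_seconds.
    """
    intervals = []
    start = None  # index of the first frame of the currently open event, if any
    for i, tag in enumerate(tags):
        if tag == 1 and start is None:
            start = i  # a new event begins at this frame
        elif tag != 1 and start is not None:
            # the event ended at the previous frame; close the interval
            intervals.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # the recording ended while an event was still open
        intervals.append((start * frame_seconds, len(tags) * frame_seconds))
    return intervals
```

For example, with half-second frames, `tags_to_intervals([0, 1, 1, 0, 0, 1], 0.5)` yields two intervals, one spanning 0.5 s to 1.5 s and one spanning 2.5 s to 3.0 s.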
PoliToHFI at SemEval-2023 Task 6: Leveraging Entity-Aware and Hierarchical Transformers For Legal Entity Recognition and Court Judgment Prediction
The use of Natural Language Processing techniques in the legal domain has become established for supporting attorneys and domain experts in content retrieval and decision-making. However, understanding legal text poses relevant challenges in the recognition of domain-specific entities and the adaptation and explanation of predictive models. This paper addresses the Legal Named Entity Recognition (L-NER) and Court Judgment Prediction (CJP) and Explanation (CJPE) tasks. The L-NER solution explores the use of various transformer-based models, including an entity-aware method attending to domain-specific entities. The proposed CJPE method relies on hierarchical BERT-based classifiers combined with local input attribution explainers. We propose a broad comparison of eXplainable AI methodologies along with a novel approach based on NER. For the L-NER task, the experimental results highlight the importance of domain-specific pre-training. For CJP, our lightweight solution shows performance in line with existing approaches, and our NER-boosted explanations show promising CJPE results in terms of the conciseness of the prediction explanations.
Transformer-based Non-Verbal Emotion Recognition: Exploring Model Portability across Speakers' Genders
Recognizing emotions in non-verbal audio tracks requires a deep understanding of their underlying features. Traditional classifiers relying on excitation, prosodic, and vocal tract features are not always capable of effectively generalizing across speakers' genders. In the ComParE 2022 vocalisation sub-challenge we explore the use of a Transformer architecture trained on contrastive audio examples. We leverage augmented data to learn robust non-verbal emotion classifiers. We also investigate the impact of different audio transformations, including neural voice conversion, on the classifier's capability to generalize across speakers' genders. The empirical findings indicate that neural voice conversion is beneficial in the pretraining phase, yielding improved model generality, whereas it is harmful at the finetuning stage, as it hinders model specialization for the task of non-verbal emotion recognition.
Exploring Subgroup Performance In End-to-End Speech Models
End-to-End Spoken Language Understanding models are generally evaluated according to their overall accuracy, or separately on (a priori defined) data subgroups of interest. We propose a technique for analyzing model performance at the subgroup level, which considers all subgroups that can be defined via a given set of metadata and are above a specified minimum size. The metadata can represent user characteristics, recording conditions, and speech targets. Our technique is based on advances in model bias analysis, enabling efficient exploration of resulting subgroups. A fine-grained analysis reveals how model performance varies across subgroups, identifying modeling issues or bias towards specific subgroups. We compare the subgroup-level performance of models based on wav2vec 2.0 and HuBERT on the Fluent Speech Commands dataset. The experimental results illustrate how subgroup-level analysis reveals a finer and more complete picture of performance changes when models are replaced, automatically identifying the subgroups that most benefit or fail to benefit from the change.
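To make the idea of considering all metadata-defined subgroups above a minimum size concrete, here is a brute-force sketch. It is only an illustration under stated assumptions: the abstract mentions an efficient exploration technique from model bias analysis, which this exhaustive enumeration does not reproduce, and the function name `subgroup_accuracy`, the `correct` field, and the sample layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def subgroup_accuracy(samples, attributes, min_size):
    """Enumerate every subgroup defined by attribute=value combinations.

    Assumption: each sample is a dict holding one value per metadata
    attribute plus a boolean "correct" flag for the model prediction.
    Returns the overall accuracy and, for each subgroup with at least
    min_size samples, a tuple (size, accuracy, accuracy - overall).
    """
    overall = sum(s["correct"] for s in samples) / len(samples)
    results = {}
    # Try every non-empty subset of metadata attributes.
    for r in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, r):
            groups = defaultdict(list)
            for s in samples:
                # A subgroup is identified by its (attribute, value) pairs.
                key = tuple((a, s[a]) for a in attrs)
                groups[key].append(s["correct"])
            for key, outcomes in groups.items():
                if len(outcomes) >= min_size:  # drop subgroups that are too small
                    acc = sum(outcomes) / len(outcomes)
                    results[key] = (len(outcomes), acc, acc - overall)
    return overall, results
```

Sorting `results` by the last element of each tuple would surface the subgroups whose accuracy departs most from the overall value. Running the function once per model and comparing the two result dictionaries is one simple way to see which subgroups gain or lose when a model is replaced.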
Designing Logic Tensor Networks for Visual Sudoku puzzle classification
Given the increasing importance of the neurosymbolic (NeSy) approach in artificial intelligence, there is a growing interest in studying benchmarks specifically designed to emphasize the ability of AI systems to combine low-level representation learning with high-level symbolic reasoning. One such recent benchmark is Visual Sudoku Puzzle Classification, which combines visual perception with relational constraints. In this work, we investigate the application of Logic Tensor Networks (LTNs) to the Visual Sudoku Classification task and discuss various alternatives in terms of logical constraint formulation, integration with the perceptual module, and training procedure.
Reconstructing Atmospheric Parameters of Exoplanets Using Deep Learning
Exploring exoplanets has transformed our understanding of the universe by revealing many planetary systems that defy our current understanding. To study their atmospheres, spectroscopic observations are used to infer essential atmospheric properties that are not directly measurable. Estimating atmospheric parameters that best fit the observed spectrum within a specified atmospheric model is a complex problem that is difficult to model. In this paper, we present a multi-target probabilistic regression approach that combines deep learning and inverse modeling techniques within a multimodal architecture to extract atmospheric parameters from exoplanets. Our methodology overcomes computational limitations and outperforms previous approaches, enabling efficient analysis of exoplanetary atmospheres. This research contributes to advancements in the field of exoplanet research and offers valuable insights for future studies.
baρtti at GeoLingIt: Beyond Boundaries, Enhancing Geolocation Prediction and Dialect Classification on Social Media in Italy
The proliferation of social media platforms has presented researchers with valuable avenues to examine language usage within
diverse sociolinguistic frameworks. Italy, renowned for its rich linguistic diversity, provides a distinctive context for exploring
diatopic variation, encompassing regional languages, dialects, and variations of Standard Italian. This paper presents our
contributions to the GeoLingIt shared task, focusing on predicting the locations of social media posts in Italy based on
linguistic content. For Task A, we propose a novel approach, combining data augmentation and contrastive learning, that
outperforms the baseline in region prediction. For Task B, we introduce a joint multi-task learning approach leveraging the
synergies with Task A and incorporate a post-processing rectification module for improved geolocation accuracy, surpassing
the baseline and achieving first place in the competition.