Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval
The heterogeneity gap is the main challenge in cross-modal retrieval, because
cross-modal data (e.g. audio-visual) have different distributions and
representations that cannot be directly compared. To bridge the gap between
audiovisual modalities, we learn a common subspace for them by utilizing the
intrinsic correlation in the natural synchronization of audio-visual data with
the aid of annotated labels. TNN-CCCA is the best audio-visual cross-modal
retrieval (AV-CMR) model so far, but the model training is sensitive to hard
negative samples when learning the common subspace by applying triplet loss to
predict the relative distance between inputs. In this paper, to reduce the
interference of hard negative samples in representation learning, we propose a
new AV-CMR model to optimize semantic features by directly predicting labels
and then measuring the intrinsic correlation between audio-visual data using
complete cross-triplet loss. In particular, our model projects audio-visual
features into label space by minimizing the distance between the predicted label
features after projection and the ground-truth label representations. Moreover,
we adopt complete cross-triplet loss to optimize the predicted label features
by leveraging the relationship between all possible similarity and
dissimilarity semantic information across modalities. The extensive
experimental results on two audio-visual double-checked datasets have shown an
improvement of approximately 2.1% in terms of average MAP over the current
state-of-the-art method TNN-CCCA for the AV-CMR task, which indicates the
effectiveness of our proposed model.
Comment: 9 pages, 5 figures, 3 tables; accepted by IEEE ISM 2022
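To make the loss design concrete, below is a minimal PyTorch sketch of a cross-modal triplet loss computed in label space, combining a label-prediction term with triplets formed over all cross-modal similar/dissimilar pairs. All names (label_space_loss, margin, the one-hot label encoding) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def label_space_loss(audio_emb, visual_emb, labels, margin=0.2):
    """audio_emb, visual_emb: (N, C) predicted label features for paired
    audio/visual samples; labels: (N, C) one-hot ground-truth labels (float)."""
    # Label-prediction term: pull projected features toward ground-truth labels.
    pred_loss = F.mse_loss(audio_emb, labels) + F.mse_loss(visual_emb, labels)

    # Cross-triplet term: for each audio anchor, every visual sample sharing a
    # class should end up closer than every visual sample that does not.
    same = (labels @ labels.t()) > 0           # (N, N) cross-modal similarity mask
    dist = torch.cdist(audio_emb, visual_emb)  # (N, N) pairwise distances
    triplet = audio_emb.new_zeros(())
    for i in range(audio_emb.size(0)):
        pos, neg = dist[i][same[i]], dist[i][~same[i]]
        if pos.numel() and neg.numel():
            # all positive/negative combinations ("complete" triplets)
            triplet = triplet + F.relu(pos.unsqueeze(1) - neg.unsqueeze(0) + margin).mean()
    return pred_loss + triplet / audio_emb.size(0)

# Hypothetical usage with random stand-in tensors:
a, v = torch.randn(8, 10), torch.randn(8, 10)
y = F.one_hot(torch.randint(0, 10, (8,)), num_classes=10).float()
print(label_space_loss(a, v, y))
```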
Multi-Mode Clustering for Graph-Based Lifelog Retrieval
As part of the 6th Lifelog Search Challenge, this paper presents an approach to arranging Lifelog data in a multi-modal knowledge graph based on cluster hierarchies. We use multiple sequence clustering approaches to address the multi-modal nature of Lifelogs with respect to temporal, spatial, and visual factors. The resulting clusters, along with semantic metadata captions and augmentations based on OpenCLIP, provide the semantic structure of a graph that includes all Lifelogs as entries. Textual queries on this hierarchical graph can retrieve individual Lifelogs as well as clusters of Lifelogs.
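As a rough illustration of the multi-mode clustering idea (assumed features and cluster counts, not the authors' pipeline), the sketch below clusters the same lifelog items independently along temporal, spatial, and visual dimensions and attaches each item to its cluster nodes in a simple graph structure:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_items = 200
features = {                                 # hypothetical per-item features
    "temporal": rng.random((n_items, 1)),    # e.g. normalized time of day
    "spatial":  rng.random((n_items, 2)),    # e.g. normalized lat/lon
    "visual":   rng.random((n_items, 64)),   # e.g. OpenCLIP image embeddings
}

graph = {}  # cluster node -> lifelog item ids attached to it
for mode, X in features.items():
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
    for item_id, cluster in enumerate(labels):
        graph.setdefault(f"{mode}/cluster{cluster}", []).append(item_id)

# Each item now hangs under one temporal, one spatial, and one visual cluster
# node; a textual query can be matched against cluster nodes before items.
print(len(graph), "cluster nodes")
```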
Temporal Success Analyses in Music Collaboration Networks: Brazilian and Global Scenarios
Collaboration is part of the music industry and has increased over recent decades, but little is known about its effects on success and evolution. Our goal is to analyze how success has evolved over collaboration networks and to compare the global scenario to a local, thriving one: the Brazilian music industry. Specifically, we build collaboration networks from data collected from Spotify's Global and Brazilian daily charts, analyze them, and identify collaboration profiles in such networks. Analyses of their topological characteristics reveal collaboration patterns mapped into four different profiles: Standard, Niche, Ephemeral and Absent, where the first two have a higher level of success. Furthermore, we go deeper by evaluating the temporal evolution of such profiles through case studies: pop and k-pop globally, and pop and forró in Brazil. Overall, our findings emphasize the importance of collaboration profiles in assessing success and show differences between the global and Brazilian scenarios.
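For readers unfamiliar with collaboration networks, the following sketch shows the general construction on hypothetical chart rows (not the paper's data or code): artists co-credited on a track are linked, and simple topological features are computed with networkx.

```python
import networkx as nx
from itertools import combinations

# Hypothetical chart rows: (track, credited artists).
chart = [
    ("Track A", ["Artist 1", "Artist 2"]),
    ("Track B", ["Artist 2", "Artist 3", "Artist 4"]),
    ("Track C", ["Artist 1"]),
]

G = nx.Graph()
for _track, artists in chart:
    G.add_nodes_from(artists)
    for a, b in combinations(artists, 2):    # every co-credit becomes an edge
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

print(dict(G.degree()))     # collaboration breadth per artist
print(nx.clustering(G))     # local cohesion of each artist's neighborhood
```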
Multimodal Automated Fact-Checking: A Survey
Misinformation is often conveyed in multiple modalities, e.g. a miscaptioned
image. Multimodal misinformation is perceived as more credible by humans, and
spreads faster than its text-only counterparts. While an increasing body of
research investigates automated fact-checking (AFC), previous surveys mostly
focus on text. In this survey, we conceptualise a framework for AFC including
subtasks unique to multimodal misinformation. Furthermore, we discuss related
terms used in different communities and map them to our framework. We focus on
four modalities prevalent in real-world fact-checking: text, image, audio, and
video. We survey benchmarks and models, and discuss limitations and promising
directions for future research.
Comment: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings
Selected Papers from the First International Symposium on Future ICT (Future-ICT 2019) in Conjunction with 4th International Symposium on Mobile Internet Security (MobiSec 2019)
The International Symposium on Future ICT (Future-ICT 2019) in conjunction with the 4th International Symposium on Mobile Internet Security (MobiSec 2019) was held on 17–19 October 2019 in Taichung, Taiwan. The symposium provided academic and industry professionals with an opportunity to discuss the latest issues and progress in advancing smart applications based on future ICT and its related security. The symposium aimed to publish high-quality papers strictly related to the various theories and practical applications concerning advanced smart applications, future ICT, and related communications and networks. It was expected that the symposium and its publications would stimulate further related research and technology improvements in this field.
Listener Modeling and Context-aware Music Recommendation Based on Country Archetypes
Music preferences are strongly shaped by the cultural and socio-economic
background of the listener, which is reflected, to a considerable extent, in
country-specific music listening profiles. Previous work has already identified
several country-specific differences in the popularity distribution of music
artists listened to. In particular, what constitutes the "music mainstream"
strongly varies between countries. To complement and extend these results, the
article at hand delivers the following major contributions: First, using
state-of-the-art unsupervised learning techniques, we identify and thoroughly
investigate (1) country profiles of music preferences on the fine-grained level
of music tracks (in contrast to earlier work that relied on music preferences
on the artist level) and (2) country archetypes that subsume countries sharing
similar patterns of listening preferences. Second, we formulate four user
models that leverage the user's country information on music preferences. Among
others, we propose a user modeling approach to describe a music listener as a
vector of similarities over the identified country clusters or archetypes.
Third, we propose a context-aware music recommendation system that leverages
implicit user feedback, where context is defined via the four user models. More
precisely, it is a multi-layer generative model based on a variational
autoencoder, in which contextual features can influence recommendations through
a gating mechanism. Fourth, we thoroughly evaluate the proposed recommendation
system and user models on a real-world corpus of more than one billion
listening records of users around the world (out of which we use 369 million in
our experiments) and show its merits vis-a-vis state-of-the-art algorithms that
do not exploit this type of context information.
Comment: 30 pages, 3 tables, 12 figures
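A minimal sketch of the similarity-based user model described above, under assumed shapes and names (archetype_similarity_user_model is hypothetical): a listener's track-preference vector is compared against the centroid of each country archetype, and the resulting similarity vector serves as the context input to the recommender.

```python
import numpy as np

def archetype_similarity_user_model(user_pref, archetype_centroids):
    """user_pref: (T,) listening counts over T tracks; archetype_centroids:
    (K, T) mean preference vectors of K country archetypes. Returns a (K,)
    vector of cosine similarities used as the recommender's context input."""
    u = user_pref / (np.linalg.norm(user_pref) + 1e-12)
    C = archetype_centroids / (
        np.linalg.norm(archetype_centroids, axis=1, keepdims=True) + 1e-12)
    return C @ u

# Example: 4 archetypes over a 1000-track vocabulary, random stand-in data.
rng = np.random.default_rng(42)
context = archetype_similarity_user_model(rng.random(1000), rng.random((4, 1000)))
print(context)  # one similarity per archetype
```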
A Closer Look into Recent Video-based Learning Research: A Comprehensive Review of Video Characteristics, Tools, Technologies, and Learning Effectiveness
People increasingly use videos on the Web as a source for learning. To
support this way of learning, researchers and developers are continuously
developing tools, proposing guidelines, analyzing data, and conducting
experiments. However, it is still not clear what characteristics a video should
have to be an effective learning medium. In this paper, we present a
comprehensive review of 257 articles on video-based learning for the period
from 2016 to 2021. One of the aims of the review is to identify the video
characteristics that have been explored by previous work. Based on our
analysis, we suggest a taxonomy which organizes the video characteristics and
contextual aspects into eight categories: (1) audio features, (2) visual
features, (3) textual features, (4) instructor behavior, (5) learner
activities, (6) interactive features (quizzes, etc.), (7) production style, and
(8) instructional design. Also, we identify four representative research
directions: (1) proposals of tools to support video-based learning, (2) studies
with controlled experiments, (3) data analysis studies, and (4) proposals of
design guidelines for learning videos. We find that the most explored
characteristics are textual features followed by visual features, learner
activities, and interactive features. Text of transcripts, video frames, and
images (figures and illustrations) are most frequently used by tools that
support learning through videos. The learner activity is heavily explored
through log files in data analysis studies, and interactive features have been
frequently scrutinized in controlled experiments. We complement our review by
contrasting research findings that investigate the impact of video
characteristics on learning effectiveness, report on tasks and technologies
used to develop tools that support learning, and summarize trends in design
guidelines for producing learning videos.
SwimmerNET: Underwater 2D Swimmer Pose Estimation Exploiting Fully Convolutional Neural Networks
Professional swimming coaches make use of videos to evaluate their athletes' performances. Specifically, the videos are manually analyzed in order to observe the movements of all parts of the swimmer's body during the exercise and to give indications for improving swimming technique. This operation is time-consuming, laborious and error-prone. In recent years, alternative technologies have been introduced in the literature, but they still have severe limitations that prevent their correct and effective use. In fact, the currently available techniques based on image analysis only apply to certain swimming styles; moreover, they are strongly influenced by disturbing elements (i.e., the presence of bubbles, splashes and reflections), resulting in poor measurement accuracy. The use of wearable sensors (accelerometers or photoplethysmographic sensors) or optical markers, although they can guarantee high reliability and accuracy, disturbs the performance of the athletes, who tend to dislike these solutions. In this work we introduce swimmerNET, a new marker-less 2D swimmer pose estimation approach based on the combined use of computer vision algorithms and fully convolutional neural networks. Using a single 8 Mpixel wide-angle camera, the proposed system is able to estimate the pose of a swimmer during exercise while guaranteeing adequate measurement accuracy. The method has been successfully tested on several athletes (i.e., with different physical characteristics and different swimming techniques), obtaining an average error and a standard deviation (worst-case scenario for the dataset analyzed) of approximately 1 mm and 10 mm, respectively.
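Although the paper's architecture is not reproduced here, fully convolutional pose estimators typically emit one heatmap per keypoint and read each joint position off the heatmap's argmax; the sketch below illustrates that standard decoding step with assumed shapes (decode_keypoints and the joint count are hypothetical):

```python
import numpy as np

def decode_keypoints(heatmaps, stride=4):
    """heatmaps: (K, H, W) network output, one channel per keypoint.
    Returns (K, 2) pixel coordinates (x, y) in the input image."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (H, W))
    # Scale back to input resolution: FCNs usually predict at a lower stride.
    return np.stack([xs, ys], axis=1) * stride

hm = np.random.rand(13, 128, 160)   # e.g. 13 swimmer joints, random stand-in
print(decode_keypoints(hm).shape)   # (13, 2)
```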
Machine Learning for Multimedia Communications
Machine learning is revolutionizing the way multimedia information is processed and transmitted to users. After intensive and powerful training, some impressive efficiency/accuracy improvements have been made all over the transmission pipeline. For example, the high model capacity of learning-based architectures enables us to accurately model image and video behavior such that tremendous compression gains can be achieved. Similarly, error concealment, streaming strategies and even user perception modeling have widely benefited from recent learning-oriented developments. However, learning-based algorithms often imply drastic changes to the way data are represented or consumed, meaning that the overall pipeline can be affected even when only a subpart of it is optimized. In this paper, we review the recent major advances that have been proposed all across the transmission chain, and we discuss their potential impact and the research challenges that they raise.
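As a toy illustration of the learned-compression idea mentioned above (not any specific codec from the survey), the following sketch shows an autoencoder that maps an image to a compact, quantized latent and reconstructs it; the entropy coder is omitted:

```python
import torch
import torch.nn as nn

class TinyImageCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(  # downsample 4x into 8 latent channels
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 8, 5, stride=2, padding=2))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(8, 32, 5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 5, stride=2, padding=2, output_padding=1))

    def forward(self, x):
        y = self.enc(x)
        y_hat = torch.round(y)  # quantize the latent before (omitted) entropy coding
        return self.dec(y_hat)

x = torch.rand(1, 3, 64, 64)
print(TinyImageCodec()(x).shape)  # torch.Size([1, 3, 64, 64])
```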