27 research outputs found
Unraveling the Contribution of Image Captioning and Neural Machine Translation for Multimodal Machine Translation
Recent work on multimodal machine translation has attempted to address the problem of producing target language image descriptions based on both the source language description and the corresponding image. However, existing work has not been conclusive on the contribution of visual information. This paper presents an in-depth study of the problem by examining the differences and complementarities of two related but distinct approaches to this task: textonly neural machine translation and image captioning. We analyse the scope for improvement and the effect of different data and settings to build models for these tasks. We also propose ways of combining these two approaches for improved translation quality
CUNI System for the WMT17 Multimodal Translation Task
In this paper, we describe our submissions to the WMT17 Multimodal
Translation Task. For Task 1 (multimodal translation), our best scoring system
is a purely textual neural translation of the source image caption to the
target language. The main feature of the system is the use of additional data
that was acquired by selecting similar sentences from parallel corpora and by
data synthesis with back-translation. For Task 2 (cross-lingual image
captioning), our best submitted system generates an English caption which is
then translated by the best system used in Task 1. We also present negative
results, which are based on ideas that we believe have potential of making
improvements, but did not prove to be useful in our particular setup.Comment: 8 pages; Camera-ready submission to WMT1
Deep Architectures for Visual Recognition and Description
In recent times, digital media contents are inherently of multimedia type, consisting of the form text, audio, image and video. Several of the outstanding computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the field of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches available today and attempts their classification and analysis based on different parameters, viz., type of captioning methods (generation/retrieval), type of learning models employed, the desired output description length generated, etc. This dissertation also attempts to critically analyze the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the resultant video descriptions generated. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages are also included.
In this study a novel approach for video captioning on the Microsoft Video Description (MSVD) dataset and Microsoft Video-to-Text (MSR-VTT) dataset is proposed using supervised learning techniques to train a deep combinational framework, for achieving better quality video captioning via predicting semantic tags. We develop simple shallow CNN (2D and 3D) as feature extractors, Deep Neural Networks (DNNs and Bidirectional LSTMs (BiLSTMs) as tag prediction models and Recurrent Neural Networks (RNNs) (LSTM) model as the language model. The aim of the work was to provide an alternative narrative to generating captions from videos via semantic tag predictions and deploy simpler shallower deep model architectures with lower memory requirements as solution so that it is not very memory extensive and the developed models prove to be stable and viable options when the scale of the data is increased.
This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up automation process of hand gesture recognition and classification of the sign languages of the Indian classical dance form, ‘Bharatnatyam’. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single hand gestures belonging to 27 classes that were collected from (i) Google search engine (Google images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds). 2) exploring the effectiveness of CNNs for identifying and classifying the single hand gestures by optimizing the hyperparameters, and 3) evaluating the impacts of transfer learning and double transfer learning, which is a novel concept explored for achieving higher classification accuracy
The role of image representations in vision to language tasks
Tasks that require modeling of both language and visual information such as image captioning have become very popular in recent years. Most state-of-the-art approaches make use of image representations obtained from a deep neural network, which are used to generate language information in a variety of ways with end-to-end neural network-based models. However, it is not clear how different image representations contribute to language generation tasks. In this paper, we probe the representational contribution of the image features in an end-to-end neural modeling framework and study the properties of different types of image representations. We focus on two popular vision to language problems: the task of image captioning and the task of multimodal machine translation. Our analysis provides interesting insights into the representational properties and suggests that end-to-end approaches implicitly learn a visual-semantic subspace and exploit the subspace to generate captions
Progress and Opportunities of Foundation Models in Bioinformatics
Bioinformatics has witnessed a paradigm shift with the increasing integration
of artificial intelligence (AI), particularly through the adoption of
foundation models (FMs). These AI techniques have rapidly advanced, addressing
historical challenges in bioinformatics such as the scarcity of annotated data
and the presence of data noise. FMs are particularly adept at handling
large-scale, unlabeled data, a common scenario in biological contexts due to
the time-consuming and costly nature of experimentally determining labeled
data. This characteristic has allowed FMs to excel and achieve notable results
in various downstream validation tasks, demonstrating their ability to
represent diverse biological entities effectively. Undoubtedly, FMs have
ushered in a new era in computational biology, especially in the realm of deep
learning. The primary goal of this survey is to conduct a systematic
investigation and summary of FMs in bioinformatics, tracing their evolution,
current research status, and the methodologies employed. Central to our focus
is the application of FMs to specific biological problems, aiming to guide the
research community in choosing appropriate FMs for their research needs. We
delve into the specifics of the problem at hand including sequence analysis,
structure prediction, function annotation, and multimodal integration,
comparing the structures and advancements against traditional methods.
Furthermore, the review analyses challenges and limitations faced by FMs in
biology, such as data noise, model explainability, and potential biases.
Finally, we outline potential development paths and strategies for FMs in
future biological research, setting the stage for continued innovation and
application in this rapidly evolving field. This comprehensive review serves
not only as an academic resource but also as a roadmap for future explorations
and applications of FMs in biology.Comment: 27 pages, 3 figures, 2 table
Towards Video Transformers for Automatic Human Analysis
[eng] With the aim of creating artificial systems capable of mirroring the nuanced understanding and interpretative powers inherent to human cognition, this thesis embarks on an exploration of the intersection between human analysis and Video Transformers. The objective is to harness the potential of Transformers, a promising architectural paradigm, to comprehend the intricacies of human interaction, thus paving the way for the development of empathetic and context-aware intelligent systems. In order to do so, we explore the whole Computer Vision pipeline, from data gathering, to deeply analyzing recent developments, through model design and experimentation.
Central to this study is the creation of UDIVA, an expansive multi-modal, multi-view dataset capturing dyadic face-to-face human interactions. Comprising 147 participants across 188 sessions, UDIVA integrates audio-visual recordings, heart-rate measurements, personality assessments, socio- demographic metadata, and conversational transcripts, establishing itself as the largest dataset for dyadic human interaction analysis up to this date. This dataset provides a rich context for probing the capabilities of Transformers within complex environments. In order to validate its utility, as well as to elucidate Transformers' ability to assimilate diverse contextual cues, we focus on addressing the challenge of personality regression within interaction scenarios. We first adapt an existing Video Transformer to handle multiple contextual sources and conduct rigorous experimentation. We empirically observe a progressive enhancement in model performance as more context is added, reinforcing the potential of Transformers to decode intricate human dynamics. Building upon these findings, the Dyadformer emerges as a novel architecture, adept at long-range modeling of dyadic interactions. By jointly modeling both participants in the interaction, as well as embedding multi- modal integration into the model itself, the Dyadformer surpasses the baseline and other concurrent approaches, underscoring Transformers' aptitude in deciphering multifaceted, noisy, and challenging tasks such as the analysis of human personality in interaction.
Nonetheless, these experiments unveil the ubiquitous challenges when training Transformers, particularly in managing overfitting due to their demand for extensive datasets. Consequently, we conclude this thesis with a comprehensive investigation into Video Transformers, analyzing topics ranging from architectural designs and training strategies, to input embedding and tokenization, traversing through multi-modality and specific applications. Across these, we highlight trends which optimally harness spatio-temporal representations that handle video redundancy and high dimensionality. A culminating performance comparison is conducted in the realm of video action classification, spotlighting strategies that exhibit superior efficacy, even compared to traditional CNN-based methods.[cat] Aquesta tesi busca crear sistemes artificials que reflecteixin les habilitats de comprensió i interpretació humanes a través de l'ús de Transformers per a vÃdeo. L'objectiu és utilitzar aquestes arquitectures per comprendre millor la interacció humana i desenvolupar sistemes intel·ligents i conscients de l'entorn. Això implica explorar à mplies à rees de la Visió per Computador, des de la recopilació de dades fins a l'anà lisi de l'estat de l'art i la prova experimental d'aquests models.
Una part essencial d'aquest estudi és la creació d'UDIVA, un ampli conjunt de dades multimodal i multivista que enregistra interaccions humanes cara a cara. Amb 147 participants i 188 sessions, UDIVA inclou contingut audiovisual, freqüència cardÃaca, perfils de personalitat, dades sociodemogrà fiques i transcripcions de les converses. És el conjunt de dades més gran conegut per a l'anà lisi de la interacció humana dià dica i proporciona un context ric per a l'estudi de les capacitats dels Transformers en entorns complexos. Per tal de validar la seva utilitat i les habilitats dels Transformers, ens centrem en la regressió de la personalitat. Inicialment, adaptem un Transformer de vÃdeo per integrar diverses fonts de context. Mitjançant experiments exhaustius, observem millores progressives en els resultats amb la inclusió de més context, confirmant la capacitat dels Transformers. Motivats per aquests resultats, desenvolupem el Dyadformer, una arquitectura per interaccions dià diques de llarga duració. Aquesta nova arquitectura considera simultà niament els dos participants en la interacció i incorpora la multimodalitat en un sol model. El Dyadformer supera la nostra proposta inicial i altres treballs similars, destacant la capacitat dels Transformers per abordar tasques complexes.
No obstant això, aquestos experiments revelen reptes d'entrenament dels Transformers, com el sobreajustament, per la seva necessitat de grans conjunts de dades. La tesi conclou amb una anà lisi profunda dels Transformers per a vÃdeo, incloent dissenys arquitectònics, estratègies d'entrenament, preprocessament de vÃdeos, tokenització i multimodalitat. S'identifiquen tendències per gestionar la redundà ncia i alta dimensionalitat de vÃdeos i es realitza una comparació de rendiment en la classificació d'accions a vÃdeo, destacant estratègies d'eficà cia superior als mètodes tradicionals basats en convolucions
Detecting Multimedia Generated by Large AI Models: A Survey
The rapid advancement of Large AI Models (LAIMs), particularly diffusion
models and large language models, has marked a new era where AI-generated
multimedia is increasingly integrated into various aspects of daily life.
Although beneficial in numerous fields, this content presents significant
risks, including potential misuse, societal disruptions, and ethical concerns.
Consequently, detecting multimedia generated by LAIMs has become crucial, with
a marked rise in related research. Despite this, there remains a notable gap in
systematic surveys that focus specifically on detecting LAIM-generated
multimedia. Addressing this, we provide the first survey to comprehensively
cover existing research on detecting multimedia (such as text, images, videos,
audio, and multimodal content) created by LAIMs. Specifically, we introduce a
novel taxonomy for detection methods, categorized by media modality, and
aligned with two perspectives: pure detection (aiming to enhance detection
performance) and beyond detection (adding attributes like generalizability,
robustness, and interpretability to detectors). Additionally, we have presented
a brief overview of generation mechanisms, public datasets, and online
detection tools to provide a valuable resource for researchers and
practitioners in this field. Furthermore, we identify current challenges in
detection and propose directions for future research that address unexplored,
ongoing, and emerging issues in detecting multimedia generated by LAIMs. Our
aim for this survey is to fill an academic gap and contribute to global AI
security efforts, helping to ensure the integrity of information in the digital
realm. The project link is
https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey