Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
PERICLES Deliverable 4.3: Content Semantics and Use Context Analysis Techniques
The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied in the existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of digital objects, as well as their ability to be accurately interpreted as initially intended.
Attribute Learning for Image/Video Understanding
For the past decade, computer vision research has achieved increasing success in visual recognition,
including object detection and video classification. Nevertheless, these achievements still
cannot meet the urgent needs of image and video understanding. The recent rapid development
of social media sharing has created a huge demand for automatic media classification and annotation
techniques. In particular, these types of media data usually contain very complex social
activities of a group of people (e.g. YouTube video of a wedding reception) and are captured
by consumer devices with poor visual quality. Thus it is extremely challenging to automatically
understand such a high number of complex image and video categories, especially when these
categories have never been seen before.
One way to understand categories with no or few examples is by transfer learning which
transfers knowledge across related domains, tasks, or distributions. In particular, lifelong
learning has recently become popular, aiming to transfer information to tasks without any
observed data. In computer vision, transfer learning often takes the form of attribute learning.
The key underpinning idea of attribute learning is to exploit transfer learning via intermediate-level
semantic representations – attributes. Semantic attributes are most commonly used as a
semantically meaningful bridge between low-level feature data and higher-level class concepts, since
they can be used both descriptively (e.g., ’has legs’) and discriminatively (e.g., ’cats have it but
dogs do not’). Previous works have proposed many different attribute learning models for image and
video understanding. However, several intrinsic limitations and problems exist in
previous attribute learning work. The limitations discussed in this thesis include limitations of
user-defined attributes, projection domain-shift problems, prototype sparsity problems, inability
to combine multiple semantic representations and noisy annotations of relative attributes. To
tackle these limitations, this thesis explores attribute learning on image and video understanding
from the following three aspects.
Firstly, to break the limitations of user-defined attributes, a framework for learning latent
attributes is presented for the automatic classification and annotation of unstructured group social activity
in videos, which enables attribute learning for understanding complex multimedia
data with sparse and incomplete labels. We investigate the learning of latent attributes
for content-based understanding, which aims to model and predict classes and tags relevant to
objects, sounds and events – anything likely to be used by humans to describe or search for
media. Secondly, we propose a transductive multi-view embedding hypergraph
label propagation framework and solve three inherent limitations of most previous attribute learning work,
i.e., the projection domain-shift problems, the prototype sparsity problems, and the inability to
combine multiple semantic representations. We explore the manifold structure of the data distributions
of different views projected onto the same embedding space via label propagation on
a graph. Thirdly, a novel framework for robust learning is presented to effectively learn relative
attributes from extremely noisy and sparse annotations. Relative attributes are increasingly
learned from pairwise comparisons collected via crowdsourcing tools, which are more economical
and scalable than conventional laboratory-based data annotation. However, a major challenge
for taking a crowdsourcing strategy is the detection and pruning of outliers. We thus propose
a principled way to identify annotation outliers by formulating the relative attribute prediction
task as a unified robust learning to rank problem, tackling both the outlier detection and relative
attribute prediction tasks jointly.
In summary, this thesis studies and solves the key challenges and limitations of attribute
learning in image/video understanding. We show the benefits of addressing these challenges and
limitations in our approaches, which thus achieve better performance than previous methods.
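The label propagation step described in the second contribution can be sketched generically. The routine below is a standard graph label-propagation scheme with symmetrically normalized affinities and a trade-off parameter alpha between graph smoothness and the initial labels; the thesis's multi-view hypergraph construction is not reproduced here, and the toy graph is invented for illustration.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.9, iterations=50):
    """Spread labels over a similarity graph.

    W: (n, n) symmetric affinity matrix with zero diagonal.
    Y: (n, c) initial labels; unlabeled rows are all zeros.
    alpha: trust in the graph structure vs. the initial labels.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt            # symmetrically normalized affinity
    F = Y.astype(float).copy()
    for _ in range(iterations):
        F = alpha * (S @ F) + (1 - alpha) * Y  # propagate, then re-anchor labels
    return F.argmax(axis=1)

# Toy graph: two 2-node clusters, one labeled node per cluster.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1  # node 0 labeled class 0
Y[2, 1] = 1  # node 2 labeled class 1
labels = label_propagation(W, Y)  # labels spread within each cluster
```

Each unlabeled node ends up with the label of the cluster it is connected to, which is the transductive behaviour the abstract relies on.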
Web archives: the future
This report is structured, first, to engage in some speculative thought about the possible futures of the web, as an exercise in prompting us to think about what we need to do now in order to make sure that we can reliably and fruitfully use archives of the web in the future. Next, we turn to considering the methods and tools being used to research the live web, as a pointer to the types of things that can be developed to help understand the archived web. Then, we turn to a series of topics and questions that researchers want, or may want, to address using the archived web. In this final section, we identify some of the challenges individuals, organizations, and international bodies can target to increase our ability to explore these topics and answer these questions. We end the report with some conclusions based on what we have learned from this exercise.
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision on the development
path of human-machine symbiotic art creation. We propose a classification
of creative systems with a hierarchy of 5 classes, showing the pathway of
creativity evolving from mimic-human artists (Turing Artists) to Machine
Artists in their own right. We begin with an overview of the limitations of
Turing Artists, then focus on the top two-level systems, Machine Artists,
emphasizing machine-human communication in art creation. In art creation, it is
necessary for machines to understand humans' mental states, including desires,
appreciation, and emotions; humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environments
and their further evolution into the new concept of the metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate a novel way of collecting art data
to constitute the base of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
AI-based symbiotic art form and community with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
PersoNER: Persian named-entity recognition
Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
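The max-margin idea behind such a classifier can be illustrated, in much simplified form, as a multi-class margin perceptron over per-token embedding features. The tag set, toy 4-dimensional "embeddings", and update rule below are illustrative assumptions, not the published PersoNER pipeline.

```python
import numpy as np

# Illustrative max-margin tagger: each token is a (pretrained) embedding
# scored against per-tag weight vectors; we update whenever the gold tag
# fails to beat its best rival by a fixed margin. The tags and embeddings
# below are invented for the example.
TAGS = ["O", "B-PER", "B-LOC"]

def train(X, y, n_tags, epochs=20, margin=1.0, lr=0.1):
    W = np.zeros((n_tags, X.shape[1]))
    for _ in range(epochs):
        for x, gold in zip(X, y):
            scores = W @ x
            rivals = [t for t in range(n_tags) if t != gold]
            rival = max(rivals, key=lambda t: scores[t])
            if scores[gold] - scores[rival] < margin:  # margin violated
                W[gold] += lr * x                      # pull gold tag closer
                W[rival] -= lr * x                     # push rival away
    return W

# Toy token vectors standing in for pretrained word embeddings.
X = np.array([[1.0, 0.0, 0.0, 0.1],   # function-word-like token -> O
              [0.0, 1.0, 0.1, 0.0],   # person-like token        -> B-PER
              [0.0, 0.1, 1.0, 0.0]])  # location-like token      -> B-LOC
y = np.array([0, 1, 2])

W = train(X, y, n_tags=len(TAGS))
pred = [int((W @ x).argmax()) for x in X]  # recovers the gold tags here
```

A real pipeline would add sequence-level structure (transition scores between adjacent tags) and train on an annotated corpus rather than three toy vectors.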
GeXSe (Generative Explanatory Sensor System): An Interpretable Deep Generative Model for Human Activity Recognition in Smart Spaces
We introduce GeXSe (Generative Explanatory Sensor System), a novel framework
designed to extract interpretable sensor-based and vision domain features from
non-invasive smart space sensors. We combine these to provide a comprehensive
explanation of sensor-activation patterns in activity recognition tasks. This
system leverages advanced machine learning architectures, including transformer
blocks, Fast Fourier Convolution (FFC), and diffusion models, to provide a more
detailed understanding of sensor-based human activity data. A standout feature
of GeXSe is our unique Multi-Layer Perceptron (MLP) with linear, ReLU, and
normalization layers, specially devised for optimal performance on small
datasets. It also yields meaningful activation maps to explain sensor-based
activation patterns. The standard approach is based on a CNN model, which our
MLP model outperforms. GeXSe offers two types of explanations: sensor-based
activation maps and visual domain explanations using short videos. These
methods offer a comprehensive interpretation of the output from
non-interpretable sensor data, thereby augmenting the interpretability of our
model. Utilizing the Fréchet Inception Distance (FID) for evaluation, it
outperforms established methods, improving baseline performance by about 6%.
GeXSe also achieves a high F1 score of up to 0.85, demonstrating precision,
recall, and noise resistance, marking significant progress in reliable and
explainable smart space sensing systems.
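The MLP of linear, ReLU, and normalization layers named in the abstract can be sketched generically. The layer sizes, the ordering (linear, then layer normalization, then ReLU), and the random weights below are assumptions for illustration, not the GeXSe architecture itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each feature row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_block(x, W, b):
    """One linear -> normalization -> ReLU block."""
    return np.maximum(layer_norm(x @ W + b), 0.0)

# Toy batch: 4 samples of 8 sensor-channel features, 3 activity classes.
x = rng.random((4, 8))
W1, b1 = 0.1 * rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((16, 3)), np.zeros(3)

h = mlp_block(x, W1, b1)                  # hidden activations, all >= 0
logits = h @ W2 + b2
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
```

On small datasets such a compact block has few parameters to fit, which is one plausible reason the abstract reports it outperforming a CNN baseline there.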