196 research outputs found
Applications of Large Scale Foundation Models for Autonomous Driving
Since DARPA Grand Challenges (rural) in 2004/05 and Urban Challenges in 2007,
autonomous driving has been the most active field of AI applications. Recently
powered by large language models (LLMs), chat systems, such as chatGPT and
PaLM, emerge and rapidly become a promising direction to achieve artificial
general intelligence (AGI) in natural language processing (NLP). There comes a
natural thinking that we could employ these abilities to reformulate autonomous
driving. By combining LLM with foundation models, it is possible to utilize the
human knowledge, commonsense and reasoning to rebuild autonomous driving
systems from the current long-tailed AI dilemma. In this paper, we investigate
the techniques of foundation models and LLMs applied for autonomous driving,
categorized as simulation, world model, data annotation and planning or E2E
solutions etc.Comment: 23 pages. A survey pape
Pathway to Future Symbiotic Creativity
This report presents a comprehensive view of our vision on the development
path of the human-machine symbiotic art creation. We propose a classification
of the creative system with a hierarchy of 5 classes, showing the pathway of
creativity evolving from a mimic-human artist (Turing Artists) to a Machine
artist in its own right. We begin with an overview of the limitations of the
Turing Artists then focus on the top two-level systems, Machine Artists,
emphasizing machine-human communication in art creation. In art creation, it is
necessary for machines to understand humans' mental states, including desires,
appreciation, and emotions, humans also need to understand machines' creative
capabilities and limitations. The rapid development of immersive environment
and further evolution into the new concept of metaverse enable symbiotic art
creation through unprecedented flexibility of bi-directional communication
between artists and art manifestation environments. By examining the latest
sensor and XR technologies, we illustrate the novel way for art data collection
to constitute the base of a new form of human-machine bidirectional
communication and understanding in art creation. Based on such communication
and understanding mechanisms, we propose a novel framework for building future
Machine artists, which comes with the philosophy that a human-compatible AI
system should be based on the "human-in-the-loop" principle rather than the
traditional "end-to-end" dogma. By proposing a new form of inverse
reinforcement learning model, we outline the platform design of machine
artists, demonstrate its functions and showcase some examples of technologies
we have developed. We also provide a systematic exposition of the ecosystem for
AI-based symbiotic art form and community with an economic model built on NFT
technology. Ethical issues for the development of machine artists are also
discussed
Learning Equivariant Representations
State-of-the-art deep learning systems often require large amounts of data
and computation. For this reason, leveraging known or unknown structure of the
data is paramount. Convolutional neural networks (CNNs) are successful examples
of this principle, their defining characteristic being the shift-equivariance.
By sliding a filter over the input, when the input shifts, the response shifts
by the same amount, exploiting the structure of natural images where semantic
content is independent of absolute pixel positions. This property is essential
to the success of CNNs in audio, image and video recognition tasks. In this
thesis, we extend equivariance to other kinds of transformations, such as
rotation and scaling. We propose equivariant models for different
transformations defined by groups of symmetries. The main contributions are (i)
polar transformer networks, achieving equivariance to the group of similarities
on the plane, (ii) equivariant multi-view networks, achieving equivariance to
the group of symmetries of the icosahedron, (iii) spherical CNNs, achieving
equivariance to the continuous 3D rotation group, (iv) cross-domain image
embeddings, achieving equivariance to 3D rotations for 2D inputs, and (v)
spin-weighted spherical CNNs, generalizing the spherical CNNs and achieving
equivariance to 3D rotations for spherical vector fields. Applications include
image classification, 3D shape classification and retrieval, panoramic image
classification and segmentation, shape alignment and pose estimation. What
these models have in common is that they leverage symmetries in the data to
reduce sample and model complexity and improve generalization performance. The
advantages are more significant on (but not limited to) challenging tasks where
data is limited or input perturbations such as arbitrary rotations are present
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, the first autoregressive multimodal model that is
capable of understanding and generating image, text, audio, and action. To
unify different modalities, we tokenize inputs and outputs -- images, text,
audio, action, bounding boxes, etc., into a shared semantic space and then
process them with a single encoder-decoder transformer model. Since training
with such diverse modalities is challenging, we propose various architectural
improvements to stabilize model training. We train our model from scratch on a
large multimodal pre-training corpus from diverse sources with a multimodal
mixture of denoisers objective. To learn an expansive set of skills, such as
following multimodal instructions, we construct and finetune on an ensemble of
120 datasets with prompts and augmentations. With a single unified model,
Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and
strong results in more than 35 benchmarks, including image generation and
understanding, natural language understanding, video and audio understanding,
and robotic manipulation. We release all our models to the research community.Comment: 38 pages, 20 figure
Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics
Three recent breakthroughs due to AI in arts and science serve as motivation:
An award winning digital image, protein folding, fast matrix multiplication.
Many recent developments in artificial neural networks, particularly deep
learning (DL), applied and relevant to computational mechanics (solid, fluids,
finite-element technology) are reviewed in detail. Both hybrid and pure machine
learning (ML) methods are discussed. Hybrid methods combine traditional PDE
discretizations with ML methods either (1) to help model complex nonlinear
constitutive relations, (2) to nonlinearly reduce the model order for efficient
simulation (turbulence), or (3) to accelerate the simulation by predicting
certain components in the traditional integration methods. Here, methods (1)
and (2) relied on Long-Short-Term Memory (LSTM) architecture, with method (3)
relying on convolutional neural networks. Pure ML methods to solve (nonlinear)
PDEs are represented by Physics-Informed Neural network (PINN) methods, which
could be combined with attention mechanism to address discontinuous solutions.
Both LSTM and attention architectures, together with modern and generalized
classic optimizers to include stochasticity for DL networks, are extensively
reviewed. Kernel machines, including Gaussian processes, are provided to
sufficient depth for more advanced works such as shallow networks with infinite
width. Not only addressing experts, readers are assumed familiar with
computational mechanics, but not with DL, whose concepts and applications are
built up from the basics, aiming at bringing first-time learners quickly to the
forefront of research. History and limitations of AI are recounted and
discussed, with particular attention at pointing out misstatements or
misconceptions of the classics, even in well-known references. Positioning and
pointing control of a large-deformable beam is given as an example.Comment: 275 pages, 158 figures. Appeared online on 2023.03.01 at
CMES-Computer Modeling in Engineering & Science
Multimodal and disentangled representation learning for medical image analysis
Automated medical image analysis is a growing research field with various applications in
modern healthcare. Furthermore, a multitude of imaging techniques (or modalities) have been
developed, such as Magnetic Resonance (MR) and Computed Tomography (CT), to attenuate
different organ characteristics. Research on image analysis is predominately driven by deep
learning methods due to their demonstrated performance. In this thesis, we argue that their success and generalisation relies on learning good latent representations. We propose methods for
learning spatial representations that are suitable for medical image data, and can combine information coming from different modalities. Specifically, we aim to improve cardiac MR segmentation, a challenging task due to varied images and limited expert annotations, by considering
complementary information present in (potentially unaligned) images of other modalities.
In order to evaluate the benefit of multimodal learning, we initially consider a synthesis task
on spatially aligned multimodal brain MR images. We propose a deep network of multiple
encoders and decoders, which we demonstrate outperforms existing approaches. The encoders
(one per input modality) map the multimodal images into modality invariant spatial feature
maps. Common and unique information is combined into a fused representation, that is robust
to missing modalities, and can be decoded into synthetic images of the target modalities. Different experimental settings demonstrate the benefit of multimodal over unimodal synthesis,
although input and output image pairs are required for training. The need for paired images can
be overcome with the cycle consistency principle, which we use in conjunction with adversarial
training to transform images from one modality (e.g. MR) to images in another (e.g. CT). This
is useful especially in cardiac datasets, where different spatial and temporal resolutions make
image pairing difficult, if not impossible.
Segmentation can also be considered as a form of image synthesis, if one modality consists of
semantic maps. We consider the task of extracting segmentation masks for cardiac MR images,
and aim to overcome the challenge of limited annotations, by taking into account unannanotated images which are commonly ignored. We achieve this by defining suitable latent spaces,
which represent the underlying anatomies (spatial latent variable), as well as the imaging characteristics (non-spatial latent variable). Anatomical information is required for tasks such as
segmentation and regression, whereas imaging information can capture variability in intensity
characteristics for example due to different scanners. We propose two models that disentangle
cardiac images at different levels: the first extracts the myocardium from the surrounding information, whereas the second fully separates the anatomical from the imaging characteristics.
Experimental analysis confirms the utility of disentangled representations in semi-supervised
segmentation, and in regression of cardiac indices, while maintaining robustness to intensity
variations such as the ones induced by different modalities.
Finally, our prior research is aggregated into one framework that encodes multimodal images
into disentangled anatomical and imaging factors. Several challenges of multimodal cardiac
imaging, such as input misalignments and the lack of expert annotations, are successfully handled in the shared anatomy space. Furthermore, we demonstrate that this approach can be used
to combine complementary anatomical information for the purpose of multimodal segmentation. This can be achieved even when no annotations are provided for one of the modalities.
This thesis creates new avenues for further research in the area of multimodal and disentangled learning with spatial representations, which we believe are key to more generalised deep
learning solutions in healthcare
Multimodal sentiment analysis in real-life videos
This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target.
The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far.
This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level.
The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated.
A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above.
The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos
Text Detection and Recognition in the Wild
Text detection and recognition (TDR) in highly structured environments with a clean background and consistent fonts (e.g., office documents, postal addresses and bank cheque) is a well understood problem (i.e., OCR), however this is not the case for unstructured environments.
The main objective for scene text detection is to locate text within images captured in the wild.
For scene text recognition, the techniques map each detected or cropped word image into string.
Nowadays, convolutional neural networks (CNNs) and Recurrent Neural Networks (RNN) deep learning architectures dominate most of the recent state-of-the-art (SOTA) scene TDR methods.
Most of the reported respective accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, those detecting and/or recognizing results drastically deteriorate 10% and 30% - in terms of F-measure detection and word recognition accuracy performances with irregular or occluded text images.
Transformers and their variations are new deep learning architectures that mitigate the above-mentioned issues for CNN and RNN-based pipelines.Unlike Recurrent Neural Networks (RNNs),
transformers are models that learn how to encode and decode data by looking not only backward but also forward in order to extract relevant information from a whole sequence.
This thesis utilizes the transformer architecture to address the irregular (multi-oriented and arbitrarily shaped) and occluded text challenges in the wild images. Our main contributions are as follows:
(1) We first targeted solving the irregular TDR in two separate architectures as follows:
In Chapter 4, unlike the SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually more straightforward and trainable end-to-end architecture of transformer-based detector for multi-oriented scene text detection, which can directly predict the set of detections (i.e., text and box regions) of the input image. A central contribution to our work is introducing a loss function tailored to the rotated text detection problem that leverages a rotated version of a generalized intersection over union score to capture the rotated text instances adequately.
In Chapter 5, we extend our previous architecture to arbitrary shaped scene text detection.
We design a new text detection technique that aims to better infer n-vertices of a polygon or the degree of a Bezier curve to represent irregular-text instances.
We also propose a loss function that combines a generalized-split-intersection-over union loss defined over the piece-wise polygons.
In Chapter 6, we show that our transformer-based architecture without rectifying the input curved text instances is more suitable than SOTA RNN-based frameworks equipped with rectification modules for irregular text recognition in the wild images.
Our main contribution to this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in the irregular text instances.
(2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that can detect and recognize individual characters of text instances in an end-to-end manner. Reading individual characters later makes a robust occlusion and arbitrarily shaped text spotting model without needing polygon annotation or multiple stages of detection and recognition modules used in SOTA text spotting architectures.
In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for a complete text reading, we utilize our text detection framework by leveraging a recent transformer-based technique, namely Deformable Patch-based Transformer (DPT), as a feature extracting backbone, to robustly read the class and box coordinates of irregular characters in the wild images.
(3) Finally, we address the occlusion problem by using a multi-task end-to-end scene text spotting framework.
In Chapter 8, we leverage a recent transformer-based framework in deep learning, namely Masked Auto Encoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multitask End-to-End transformer network that directly outputs characters, word instances, and their bounding box representations, saving the computational overhead as it eliminates multiple processing steps. The unified proposed framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations
On Modeling Long-Range Dependencies for Visual Perception
One of the ultimate goals of computer vision is to extract useful information from visual inputs. An example is to recognize and segment objects from natural images. Recently, deep networks enable us to do a wide range of these tasks better than ever. These are mostly achieved with convolutional neural networks that model pixel relations within a small convolution kernel. Despite such success of convolution, the local window approximation makes it challenging to capture long-range relations. This limitation results in problems, such as unsatisfactory generalization and robustness to out-of-distribution examples.
In this dissertation, I aim to model long-range dependencies in the context of natural image perception. The first part of the dissertation is focused on designing neural architectures that are flexible enough to capture long-range relations. We start by improving convolutional networks with dynamic scaling policies. Then, we explore an alternative solution that completely replaces convolution with global self-attention to capture more context. The attention mechanism is further extended to modeling relations between the pixels and the objects with a transformer, enabling panoptic segmentation in an end-to-end manner. These flexible long-range models usually require a large amount of labeled data to train. In order to address this issue, we discuss self-supervised techniques that learn representation effectively without human annotation in the second part of the dissertation. We regularize the contrastive learning framework with a consistency term that refines self-supervision signals. We also study a more general pretext task, masked image modeling, and train transformers to learn better representations with an online semantic tokenizer
- …