Self-supervised learning for transferable representations
Machine learning has undeniably achieved remarkable advances thanks to large labelled datasets and supervised learning. However, this progress is constrained by the labour-intensive annotation process. It is not feasible to generate extensive labelled datasets for every problem we aim to address. Consequently, there has been a notable shift in recent times toward approaches that leverage only raw data. Among these, self-supervised learning has emerged as a particularly powerful approach, offering scalability to massive datasets and showcasing considerable potential for effective knowledge transfer. This thesis investigates self-supervised representation learning with a strong focus on computer vision applications. We provide a comprehensive survey of self-supervised methods across various modalities, introducing a taxonomy that categorises them into four distinct families while also highlighting practical considerations for real-world implementation. We then focus on the computer vision modality, where we perform a comprehensive benchmark evaluation of state-of-the-art self-supervised models across many diverse downstream transfer tasks. Our findings reveal that self-supervised models often outperform supervised learning across a spectrum of tasks, albeit with correlations weakening as tasks transition beyond classification, particularly for datasets with distribution shifts. Digging deeper, we investigate the influence of data augmentation on the transferability of contrastive learners, uncovering a trade-off between spatial and appearance-based invariances that generalises to real-world transformations. This begins to explain the differing empirical performance achieved by self-supervised learners on different downstream tasks, and it showcases the advantages of specialised representations produced with tailored augmentation. Finally, we introduce a novel self-supervised pre-training algorithm for object detection, aligning pre-training with the downstream architecture and objectives, leading to reduced localisation errors and improved label efficiency. In conclusion, this thesis contributes a comprehensive understanding of self-supervised representation learning and its role in enabling effective transfer across computer vision tasks.
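The augmentation trade-off described above is easiest to see in code. Below is a minimal sketch of a SimCLR-style contrastive setup in which the augmentation pipeline is split into a spatial policy and an appearance policy; this split, and every name in it, is our illustration of the idea, not the thesis's actual benchmark code.

    import torch
    import torch.nn.functional as F
    from torchvision import transforms

    # Spatial invariances: cropping and flipping.
    spatial_aug = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
    ])
    # Appearance invariances: colour jitter and greyscale.
    appearance_aug = transforms.Compose([
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
        transforms.RandomGrayscale(p=0.2),
    ])

    def info_nce(z1, z2, temperature=0.1):
        # z1, z2: (N, D) embeddings of two augmented views of the same N images.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature      # (N, N) cosine similarities
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)  # positives on the diagonal

Emphasising one policy over the other biases the learned representation toward the corresponding invariance, which is one way to produce the specialised representations the abstract refers to.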
Multidisciplinary perspectives on Artificial Intelligence and the law
This open access book presents an interdisciplinary, multi-authored, edited collection of chapters on Artificial Intelligence (‘AI’) and the Law. AI technology has come to play a central role in the modern data economy. Through a combination of increased computing power, the growing availability of data and the advancement of algorithms, AI has now become an umbrella term for some of the most transformational technological breakthroughs of this age. The importance of AI stems from both the opportunities that it offers and the challenges that it entails. While AI applications hold the promise of economic growth and efficiency gains, they also create significant risks and uncertainty. The potential and perils of AI have thus come to dominate modern discussions of technology and ethics – and although AI was initially allowed to largely develop without guidelines or rules, few would deny that the law is set to play a fundamental role in shaping the future of AI. As the debate over AI is far from over, the need for rigorous analysis has never been greater. This book thus brings together contributors from different fields and backgrounds to explore how the law might provide answers to some of the most pressing questions raised by AI. An outcome of the Católica Research Centre for the Future of Law and its interdisciplinary working group on Law and Artificial Intelligence, it includes contributions by leading scholars in the fields of technology, ethics and the law.
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Novel View Synthesis of Humans using Differentiable Rendering
We present a new approach for synthesizing novel views of people in new poses. Our novel differentiable renderer enables the synthesis of highly realistic images from any viewpoint. Rather than operating over mesh-based structures, our renderer makes use of diffuse Gaussian primitives that directly represent the underlying skeletal structure of a human. Rendering these primitives results in a high-dimensional latent image, which is then transformed into an RGB image by a decoder network. The formulation gives rise to a fully differentiable framework that can be trained end-to-end. We demonstrate the effectiveness of our approach to image reconstruction on both the Human3.6M and Panoptic Studio datasets. We show how our approach can be used for motion transfer between individuals; for novel view synthesis of individuals captured from just a single camera; to synthesize individuals from any virtual viewpoint; and to re-render people in novel poses. Code and video results are available at https://github.com/GuillaumeRochette/HumanViewSynthesis.
Comment: Accepted at IEEE Transactions on Biometrics, Behavior, and Identity Science, 10 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:2111.1273
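To make the rendering step concrete, here is a hedged sketch of splatting diffuse Gaussian primitives into a latent feature image that a decoder network would then translate to RGB; the function name, the 2D centres, and the isotropic spread are our simplifications, not the paper's released code.

    import torch

    def render_gaussian_latents(centers, sigmas, features, H=64, W=64):
        # centers:  (N, 2) projected 2D positions of the skeletal primitives
        # sigmas:   (N,)   per-primitive spread
        # features: (N, C) latent feature carried by each primitive
        ys = torch.arange(H, dtype=torch.float32).view(H, 1, 1)
        xs = torch.arange(W, dtype=torch.float32).view(1, W, 1)
        d2 = (ys - centers[:, 1]) ** 2 + (xs - centers[:, 0]) ** 2  # (H, W, N)
        weights = torch.exp(-d2 / (2 * sigmas ** 2))  # diffuse Gaussian falloff
        latent = weights @ features                   # (H, W, C) latent image
        return latent  # a decoder network maps this to an RGB image

Because every operation here is differentiable, gradients can flow from image-space losses back to the primitive parameters, which is what enables end-to-end training.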
Fast Learning Radiance Fields by Shooting Much Fewer Rays
Learning radiance fields has shown remarkable results for novel view synthesis. The learning procedure is usually time-consuming, which motivates the latest methods to speed it up by learning without neural networks or by using more efficient data structures. However, these specially designed approaches do not work for most radiance-field-based methods. To resolve this issue, we introduce a general strategy to speed up the learning procedure for almost all radiance-field-based methods. Our key idea is to reduce redundancy by shooting far fewer rays in the multi-view volume rendering procedure, which is the basis of almost all radiance-field-based methods. We find that shooting rays at pixels with dramatic color change not only significantly reduces the training burden but also barely affects the accuracy of the learned radiance fields. In addition, we adaptively subdivide each view into a quadtree according to the average rendering error in each node of the tree, which lets us dynamically shoot more rays in more complex regions with larger rendering error. We evaluate our method with different radiance-field-based methods under widely used benchmarks. Experimental results show that our method achieves accuracy comparable to the state of the art with much faster training.
Comment: Accepted by IEEE Transactions on Image Processing 2023. Project page: https://zparquet.github.io/Fast-Learning. Code: https://github.com/zParquet/Fast-Learning
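The key idea, shooting rays where the color changes sharply, can be sketched in a few lines; this is our generic illustration of gradient-weighted pixel sampling, not the authors' implementation (which also maintains the quadtree described above).

    import torch

    def select_ray_pixels(image, n_rays):
        # image: (H, W, 3) ground-truth view in [0, 1]
        # Color-change magnitude: L1 difference to right/down neighbours.
        dx = (image[:, 1:] - image[:, :-1]).abs().sum(-1)   # (H, W-1)
        dy = (image[1:, :] - image[:-1, :]).abs().sum(-1)   # (H-1, W)
        grad = torch.zeros(image.shape[:2])
        grad[:, :-1] += dx
        grad[:-1, :] += dy
        probs = grad.flatten() + 1e-6       # floor so every pixel is reachable
        idx = torch.multinomial(probs / probs.sum(), n_rays, replacement=False)
        rows = torch.div(idx, image.shape[1], rounding_mode="floor")
        return torch.stack((rows, idx % image.shape[1]), dim=-1)  # (n_rays, 2)

Rays are then cast only through the sampled pixels during volume rendering, and a per-node rendering error such as the quadtree's can be used to reweight the sampling probabilities so complex regions receive more rays.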
OpenAGI: When LLM Meets Domain Experts
Human intelligence has the remarkable ability to assemble basic skills into
complex ones so as to solve complex tasks. This ability is equally important
for Artificial Intelligence (AI), and thus, we assert that in addition to the
development of large, comprehensive intelligent models, it is equally crucial
to equip such models with the capability to harness various domain-specific
expert models for complex task-solving in the pursuit of Artificial General
Intelligence (AGI). Recent developments in Large Language Models (LLMs) have
demonstrated remarkable learning and reasoning abilities, making them promising
as a controller to select, synthesize, and execute external models to solve
complex tasks. In this project, we develop OpenAGI, an open-source AGI research
platform, specifically designed to offer complex, multi-step tasks and
accompanied by task-specific datasets, evaluation metrics, and a diverse range
of extensible models. OpenAGI formulates complex tasks as natural language
queries, serving as input to the LLM. The LLM subsequently selects,
synthesizes, and executes models provided by OpenAGI to address the task.
Furthermore, we propose a Reinforcement Learning from Task Feedback (RLTF)
mechanism, which uses the task-solving result as feedback to improve the LLM's
task-solving ability. Thus, the LLM is responsible for synthesizing various
external models for solving complex tasks, while RLTF provides feedback to
improve its task-solving ability, enabling a feedback loop for self-improving
AI. We believe that the paradigm of LLMs operating various expert models for
complex task-solving is a promising approach towards AGI. To facilitate the
community's long-term improvement and evaluation of AGI's ability, we
open-source the code, benchmark, and evaluation methods of the OpenAGI project
at https://github.com/agiresearch/OpenAGI.
Comment: 18 pages, 6 figures, 7 tables
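The control loop the abstract describes, an LLM selecting and chaining expert models with task feedback, can be sketched as follows. Everything here (llm_plan, EXPERT_MODELS, evaluate) is a hypothetical stand-in rather than the actual OpenAGI API, and for brevity the RLTF signal is reduced to a feedback string, whereas the paper uses reinforcement learning to update the LLM itself.

    # Hypothetical sketch of the LLM-as-controller loop; not the OpenAGI API.
    EXPERT_MODELS = {
        "caption": lambda x: f"caption({x})",      # placeholder expert models
        "translate": lambda x: f"translate({x})",
    }

    def solve(task_query, llm_plan, evaluate, n_rounds=3):
        result, feedback = None, None
        for _ in range(n_rounds):
            # The LLM selects and orders expert models for the task,
            # conditioned on feedback from the previous attempt.
            plan = llm_plan(task_query, feedback)  # e.g. ["caption", "translate"]
            result = task_query
            for step in plan:
                result = EXPERT_MODELS[step](result)
            score = evaluate(task_query, result)   # task-specific metric
            feedback = f"plan={plan}, score={score:.2f}"
        return result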
Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer
This paper presents the first approach to old photo modernization using multiple references, performing stylization and enhancement in a unified manner. In order to modernize old photos, we propose a novel multi-reference-based old photo modernization (MROPM) framework consisting of a network, MROPM-Net, and a novel synthetic data generation scheme. MROPM-Net stylizes old photos using multiple references via photorealistic style transfer (PST) and further enhances the results to produce modern-looking images. Meanwhile, the synthetic data generation scheme trains the network to effectively utilize multiple references to perform modernization. To evaluate the performance, we propose a new old-photo benchmark dataset (CHD) consisting of diverse natural indoor and outdoor scenes. Extensive experiments show that the proposed method outperforms other baselines in performing modernization on real old photos, even though no old photos were used during training. Moreover, our method can appropriately select styles from multiple references for each semantic region in the old photo to further improve modernization performance.
Comment: Accepted to CVPR 2023. Website: https://kaist-viclab.github.io/old-photo-modernization
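Region-wise style selection of the kind described above can be approximated with a simple per-region AdaIN, a common building block of photorealistic style transfer; the sketch below is our stand-in under that assumption, not MROPM-Net itself.

    import torch

    def adain(x, style_mean, style_std, eps=1e-5):
        # Shift region features to the chosen reference's channel statistics.
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True) + eps
        return (x - mean) / std * style_std + style_mean

    def modernize_features(feats, masks, refs):
        # feats: (C, H*W) old-photo features; masks: list of (H*W,) bool masks
        # refs:  list of (mean, std) channel statistics, one per reference
        out = feats.clone()
        for mask in masks:
            region = feats[:, mask]                        # (C, K) region
            r_mean = region.mean(dim=1, keepdim=True)
            # Pick the reference whose statistics best match this region.
            s_mean, s_std = min(refs,
                                key=lambda r: (r[0] - r_mean).norm().item())
            out[:, mask] = adain(region, s_mean, s_std)
        return out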
Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSetCap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets.
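The tokenization step, turning continuous frame and mel-spectrogram features into discrete tokens, follows the standard VQGAN recipe; a generic vector-quantization sketch of that step (not SVG-VQGAN's exact implementation, and without its hybrid contrastive loss) looks like this:

    import torch

    def vector_quantize(z, codebook):
        # z: (N, D) encoder outputs; codebook: (K, D) learned code vectors.
        d = torch.cdist(z, codebook)   # (N, K) pairwise distances
        idx = d.argmin(dim=1)          # nearest code per vector
        z_q = codebook[idx]
        # Straight-through estimator: gradients bypass the discrete lookup.
        z_q = z + (z_q - z).detach()
        return z_q, idx                # idx are the discrete tokens

The resulting visual and audio token sequences, together with text tokens, are what the Transformer-based decoder models auto-regressively.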
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
The Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data. Unlike conventional neural networks or updated versions of Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM), transformer models excel at handling long-range dependencies between input sequence elements and enable parallel processing. As a result, transformer-based models have attracted substantial interest among researchers in the field of artificial intelligence. This can be attributed to their immense potential and remarkable achievements, not only in Natural Language Processing (NLP) tasks but also in a wide range of domains, including computer vision, audio and speech processing, healthcare, and the Internet of Things (IoT). Although several survey papers have been published highlighting the transformer's contributions in specific fields, architectural differences, or performance evaluations, there is still a significant absence of a comprehensive survey encompassing its major applications across various domains. We therefore undertook the task of filling this gap by conducting an extensive survey of proposed transformer models from 2017 to 2022. Our survey identifies the top five application domains for transformer-based models, namely NLP, Computer Vision, Multi-Modality, Audio and Speech Processing, and Signal Processing. We analyze the impact of highly influential transformer-based models in these domains and subsequently classify them based on their respective tasks using a proposed taxonomy. Our aim is to shed light on the existing potential and future possibilities of transformers for enthusiastic researchers, thus contributing to a broader understanding of this groundbreaking technology.
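Since the survey is organized around the self-attention mechanism, a minimal single-head version (no masking, no multi-head split) is worth stating explicitly; this is the textbook formulation rather than code from the survey.

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (T, D) input sequence; w_q, w_k, w_v: (D, D) projections.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # Every position attends to every other in one matrix product,
        # which is what permits parallel processing over the sequence.
        scores = q @ k.t() / (k.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

The single scores matrix also makes the long-range-dependency claim concrete: any two positions interact directly, regardless of their distance in the sequence.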
Hyperbolic Image-Text Representations
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept such as "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision-and-language models such as CLIP do not explicitly capture this hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text data. Our results show that MERU learns a highly interpretable representation space while being competitive with CLIP's performance on multi-modal tasks like image classification and image-text retrieval.
Comment: Technical report
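The hyperbolic geometry is the load-bearing ingredient here. MERU works in the Lorentz (hyperboloid) model, where the distance between two embeddings follows from the Lorentzian inner product; the sketch below is the standard formula under that assumption, not MERU's released code.

    import torch

    def lorentz_distance(x, y, curv=1.0):
        # x, y: (D+1,) points on the hyperboloid <x, x>_L = -1/curv, where
        # <x, y>_L = -x0*y0 + x1*y1 + ... is the Lorentzian inner product.
        inner = -x[0] * y[0] + (x[1:] * y[1:]).sum()
        arg = torch.clamp(-curv * inner, min=1.0 + 1e-7)  # acosh needs arg >= 1
        return torch.acosh(arg) / curv ** 0.5

Distances grow rapidly toward the boundary of the space, which is what lets tree-like hierarchies (generic text concepts near the root, specific images near the leaves) embed with low distortion.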