CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2D
images to 3D world understanding is a promising direction that researchers have
only recently started to explore. In this work, we design a novel 3D
pre-training Vision-Language method that helps a model learn semantically
meaningful and transferable 3D scene point cloud representations. We inject the
representational power of the popular CLIP model into our 3D encoder by
aligning the encoded 3D scene features with the corresponding 2D image and text
embeddings produced by CLIP. To assess our model's 3D world reasoning
capability, we evaluate it on the downstream task of 3D Visual Question
Answering. Experimental quantitative and qualitative results show that our
pre-training method outperforms state-of-the-art works in this task and leads
to an interpretable representation of 3D scene features.
Comment: CVPRW 2023. Code will be made publicly available:
https://github.com/AlexDelitzas/3D-VQ
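The alignment step lends itself to a short sketch. Below is a minimal, hypothetical illustration of how a 3D scene encoder could be pulled toward frozen CLIP image and text embeddings with an InfoNCE-style contrastive loss; the encoder architecture, loss choice, and dimensions are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' code) of CLIP-guided alignment:
# a toy point-cloud encoder is aligned with frozen CLIP image/text embeddings
# via an InfoNCE-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scene3DEncoder(nn.Module):
    """Toy point-cloud encoder: per-point MLP followed by max pooling."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, points):                      # points: (B, N, 3)
        return self.mlp(points).max(dim=1).values   # (B, dim)

def alignment_loss(feat_3d, clip_emb, temperature=0.07):
    """Contrastive loss pulling each 3D scene feature toward its CLIP embedding."""
    feat_3d = F.normalize(feat_3d, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = feat_3d @ clip_emb.t() / temperature   # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(feat_3d.size(0)))

encoder = Scene3DEncoder()
points = torch.randn(4, 1024, 3)        # batch of 4 scenes, 1024 points each
clip_image_emb = torch.randn(4, 512)    # stand-in for frozen CLIP image features
clip_text_emb = torch.randn(4, 512)     # stand-in for frozen CLIP text features
feat = encoder(points)
loss = alignment_loss(feat, clip_image_emb) + alignment_loss(feat, clip_text_emb)
loss.backward()
```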
UniverSeg: Universal Medical Image Segmentation
While deep learning models have become the predominant method for medical
image segmentation, they are typically not capable of generalizing to unseen
segmentation tasks involving new anatomies, image modalities, or labels. Given
a new segmentation task, researchers generally have to train or fine-tune
models, which is time-consuming and poses a substantial barrier for clinical
researchers, who often lack the resources and expertise to train neural
networks. We present UniverSeg, a method for solving unseen medical
segmentation tasks without additional training. Given a query image and example
set of image-label pairs that define a new segmentation task, UniverSeg employs
a new Cross-Block mechanism to produce accurate segmentation maps without the
need for additional training. To achieve generalization to new tasks, we have
gathered and standardized a collection of 53 open-access medical segmentation
datasets with over 22,000 scans, which we refer to as MegaMedical. We used this
collection to train UniverSeg on a diverse set of anatomies and imaging
modalities. We demonstrate that UniverSeg substantially outperforms several
related methods on unseen tasks, and thoroughly analyze and draw insights about
important aspects of the proposed system. The UniverSeg source code and model
weights are freely available at https://universeg.csail.mit.edu
Comment: Victor and Jose Javier contributed equally to this work. Project
Website: https://universeg.csail.mit.edu
Technical Dimensions of Programming Systems
Programming requires much more than just writing code in a programming language. It is usually done in the context of a stateful environment, by interacting with a system through a graphical user interface. Yet, this wide space of possibilities lacks a common structure for navigation. Work on programming systems fails to form a coherent body of research, making it hard to improve on past work and advance the state of the art.
In computer science, much has been said and done to allow comparison of programming languages, yet no similar theory exists for programming systems; we believe that programming systems deserve a theory too.
We present a framework of technical dimensions which capture the underlying characteristics of programming systems and provide a means for conceptualizing and comparing them.
We identify technical dimensions by examining past influential programming systems and reviewing their design principles, technical capabilities, and styles of user interaction. Technical dimensions capture characteristics that may be studied, compared and advanced independently. This makes it possible to talk about programming systems in a way that can be shared and constructively debated rather than relying solely on personal impressions.
Our framework is derived using a qualitative analysis of past programming systems. We outline two concrete ways of using our framework. First, we show how it can be used to analyze a recently developed programming system. Then, we use it to identify an interesting unexplored point in the design space of programming systems.
Much research effort focuses on building programming systems that are easier to use, accessible to non-experts, moldable and/or powerful, but such efforts are disconnected: they are informal, guided by the personal visions of their authors, and can therefore be evaluated and compared only on the basis of individual experience with them. By providing foundations for more systematic research, we can help programming systems researchers to stand, at last, on the shoulders of giants.
VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
We introduce VIVE3D, a novel approach that extends the capabilities of
image-based 3D GANs to video editing and is able to represent the input video
in an identity-preserving and temporally consistent way. We propose two new
building blocks. First, we introduce a novel GAN inversion technique
specifically tailored to 3D GANs by jointly embedding multiple frames and
optimizing for the camera parameters. Second, besides traditional semantic face
edits (e.g. for age and expression), we are the first to demonstrate edits that
show novel views of the head enabled by the inherent properties of 3D GANs and
our optical flow-guided compositing technique to combine the head with the
background video. Our experiments demonstrate that VIVE3D generates
high-fidelity face edits at consistent quality from a range of camera
viewpoints which are composited with the original video in a temporally and
spatially consistent manner.
Comment: CVPR 2023. Project webpage and video available at
http://afruehstueck.github.io/vive3
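The first building block, joint multi-frame GAN inversion with camera optimization, can be sketched as a single optimization loop. The generator interface, latent and camera dimensions, and the plain L2 loss below are assumptions for illustration only.

```python
# Hedged sketch of joint multi-frame inversion: one latent per frame plus
# per-frame camera parameters are optimized together to reconstruct the
# input frames. `toy_gen` stands in for a real 3D-aware GAN.
import torch

def invert_frames(generator, frames, steps=500, lr=0.01):
    n = frames.shape[0]
    latents = torch.randn(n, 512, requires_grad=True)   # assumed latent size
    cameras = torch.zeros(n, 25, requires_grad=True)    # assumed pose encoding
    opt = torch.optim.Adam([latents, cameras], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(latents, cameras)             # (n, 3, H, W)
        loss = ((recon - frames) ** 2).mean()           # reconstruction error
        loss.backward()
        opt.step()
    return latents.detach(), cameras.detach()

def toy_gen(z, cam):
    # Trivial differentiable "generator" so the sketch runs end to end.
    base = (z.mean(dim=1) + cam.mean(dim=1)).view(-1, 1, 1, 1)
    return torch.tanh(base + torch.zeros(z.size(0), 3, 64, 64))

frames = torch.rand(5, 3, 64, 64)                       # 5 video frames
latents, cams = invert_frames(toy_gen, frames, steps=10)
```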
Refactoring = Substitution + Rewriting: Towards Generic, Language-Independent Refactorings
Eelco Visser’s work has always encouraged stepping back from the particular to look at the underlying, conceptual problems.
In that spirit we present an approach to describing refactorings that abstracts away from particular refactorings to classes of similar transformations, and presents an implementation of these that works by substitution and subsequent rewriting.
Substitution is language-independent under this approach, while the rewrites embody language-specific aspects. Intriguingly, it also goes back to earlier work on API migration by Huiqing Li and the first author, and sets refactoring in that more general context.
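As a toy illustration of this split (my own, not the paper's formalism), an inline-variable refactoring can be phrased as a language-independent substitution followed by a language-specific simplification rewrite:

```python
# Toy illustration of "refactoring = substitution + rewriting": a generic
# substitution step followed by a language-specific cleanup rewrite.
import re

def substitute(code, name, expr):
    """Language-independent step: replace uses of `name` with `(expr)`."""
    return re.sub(rf"\b{name}\b", f"({expr})", code)

def rewrite(code):
    """Language-specific step: drop redundant parentheses around atoms."""
    return re.sub(r"\((\d+|\w+)\)", r"\1", code)

before = "total = x + x * 2"
after = rewrite(substitute(before, "x", "price"))
print(after)   # total = price + price * 2
```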
Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation with Wordless Training
Text-to-motion generation is an emerging and challenging problem, which aims
to synthesize motion with the same semantics as the input text. However, due to
the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to adapt to the input texts during inference, at the cost of efficiency and stability. In this
paper, we investigate offline open-vocabulary text-to-motion generation in a
zero-shot learning manner that neither requires paired training data nor extra
online optimization to adapt to unseen texts. Inspired by prompt learning
in NLP, we pretrain a motion generator that learns to reconstruct the full
motion from the masked motion. During inference, instead of changing the motion
generator, our method reformulates the input text into a masked motion as the
prompt for the motion generator to ``reconstruct'' the motion. In constructing
the prompt, the unmasked poses of the prompt are synthesized by a text-to-pose
generator. To supervise the optimization of the text-to-pose generator, we
propose the first text-pose alignment model for measuring the alignment between
texts and 3D poses. To prevent the pose generator from overfitting to the
limited training texts, we further propose a novel wordless training mechanism
that optimizes the text-to-pose generator without any training texts. The
comprehensive experimental results show that our method obtains a significant
improvement against the baseline methods. The code is available at
https://github.com/junfanlin/oohmg
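The inference pipeline can be sketched schematically: the text is converted into a masked-motion prompt whose visible key poses come from the text-to-pose generator, and the frozen masked-motion model fills in the rest. All modules, shapes, and keyframe positions below are hypothetical stand-ins.

```python
# Schematic sketch of the inference pipeline; nothing here is the authors'
# implementation, and all shapes/keyframes are assumed for illustration.
import torch

def build_prompt(text_to_pose, text, seq_len=60, pose_dim=72, keyframes=(0, 30, 59)):
    prompt = torch.zeros(seq_len, pose_dim)
    mask = torch.ones(seq_len, dtype=torch.bool)     # True = masked frame
    for t in keyframes:
        prompt[t] = text_to_pose(text)               # visible pose from the text
        mask[t] = False
    return prompt, mask

def generate_motion(motion_model, text_to_pose, text):
    prompt, mask = build_prompt(text_to_pose, text)
    return motion_model(prompt, mask)                # "reconstructed" full motion

# Toy stand-ins so the sketch executes; trained models would replace these.
toy_pose = lambda text: torch.randn(72)
toy_motion = lambda prompt, mask: torch.where(mask[:, None], prompt.mean(0), prompt)
motion = generate_motion(toy_motion, toy_pose, "a person waves")
print(motion.shape)   # torch.Size([60, 72])
```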
Deep Transfer Learning Applications in Intrusion Detection Systems: A Comprehensive Review
Globally, the external Internet is increasingly being connected to the
contemporary industrial control system. As a result, there is an immediate need
to protect the network from several threats. The key infrastructure of
industrial activity may be protected from harm by using an intrusion detection
system (IDS), a preventive measure mechanism, to recognize new kinds of
dangerous threats and hostile activities. The most recent artificial
intelligence (AI) techniques used to create IDS in many kinds of industrial
control networks are examined in this study, with a particular emphasis on IDS based on deep transfer learning (DTL). DTL can be seen as a type of information fusion that merges and/or adapts knowledge from multiple domains to
enhance the performance of the target task, particularly when the labeled data
in the target domain is scarce. Publications issued after 2015 were taken into
account. The selected publications were divided into three categories: DTL-only and IDS-only papers, which inform the introduction and background, and DTL-based IDS papers, which form the core of this review.
By reading this review, researchers will gain a better grasp of the current state of DTL approaches used in IDS across many different types of networks. Other useful information is also covered, such as the datasets used, the type of DTL employed, the pre-trained networks, the IDS techniques, the evaluation metrics (including accuracy/F-score and false alarm rate (FAR)), and the improvements gained. The algorithms and methods used in several studies, which clearly illustrate the principles of each DTL-based IDS subcategory, are also presented to the reader.
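For readers new to the topic, the basic DTL recipe surveyed here, reusing a source-domain network and fine-tuning only a new head on scarce target-domain traffic, looks roughly like the following; the architecture and feature dimensions are arbitrary and not drawn from any specific paper.

```python
# Generic illustration of deep transfer learning for IDS: freeze a network
# pretrained on a source domain, then fine-tune a new classification head on
# a small labeled target-domain batch. Dimensions and layers are arbitrary.
import torch
import torch.nn as nn

pretrained = nn.Sequential(              # stand-in for a source-domain model
    nn.Linear(40, 128), nn.ReLU(),       # e.g. 40 flow-level features
    nn.Linear(128, 64), nn.ReLU(),
)
for p in pretrained.parameters():        # freeze the transferred layers
    p.requires_grad = False

head = nn.Linear(64, 2)                  # new head: benign vs. attack
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(32, 40)                  # small labeled target-domain batch
y = torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(head(pretrained(x)), y)
loss.backward()
opt.step()
```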
Sparks of GPTs in Edge Intelligence for Metaverse: Caching and Inference for Mobile AIGC Services
Aiming to achieve artificial general intelligence (AGI) for the Metaverse,
pretrained foundation models (PFMs), e.g., generative pretrained transformers
(GPTs), can effectively provide various AI services, such as autonomous
driving, digital twins, and AI-generated content (AIGC) for extended reality.
With the advantages of low latency and privacy preservation, edge intelligence is a viable way to serve PFMs for mobile AI services by caching and executing them on edge servers, which have limited computing resources and GPU memory.
However, PFMs typically consist of billions of parameters that are computation
and memory-intensive for edge servers during loading and execution. In this
article, we investigate the edge PFM serving problem for mobile AIGC services in the Metaverse. First, we introduce the fundamentals of PFMs and discuss their
characteristic fine-tuning and inference methods in edge intelligence. Then, we
propose a novel framework of joint model caching and inference for managing
models and allocating resources to satisfy users' requests efficiently.
Furthermore, considering the in-context learning ability of PFMs, we propose a new metric, the Age of Context (AoC), to evaluate the freshness and relevance of the examples in demonstrations with respect to the tasks being executed. Finally,
we propose a least context algorithm for managing cached models at edge servers
by balancing the tradeoff among latency, energy consumption, and accuracy.
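A minimal sketch of what a least-context eviction rule could look like, assuming each cached model carries the ages of its in-context examples and an AoC-style score that decays with age; the scoring function and data layout are my assumptions, not the article's specification.

```python
# Assumed sketch of "least context" eviction: the cached PFM whose in-context
# examples are stalest (lowest context score) is evicted first when GPU
# memory runs out.
from dataclasses import dataclass, field

@dataclass
class CachedModel:
    name: str
    size_gb: float
    example_ages: list = field(default_factory=list)  # steps since each demo arrived

    def context_score(self, decay=0.9):
        # Fresher in-context examples (lower age) contribute more.
        return sum(decay ** age for age in self.example_ages)

def evict_until_fits(cache, new_size_gb, capacity_gb):
    used = sum(m.size_gb for m in cache)
    while cache and used + new_size_gb > capacity_gb:
        victim = min(cache, key=lambda m: m.context_score())
        cache.remove(victim)
        used -= victim.size_gb

cache = [CachedModel("gpt-small", 8, [0, 1, 2]), CachedModel("diffusion", 12, [9, 12])]
evict_until_fits(cache, new_size_gb=10, capacity_gb=24)
print([m.name for m in cache])   # the model with the stalest context goes first
```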
SViTT: Temporal Learning of Sparse Video-Text Transformers
Do video-text transformers learn to model temporal relationships across
frames? Despite their immense capacity and the abundance of multimodal training
data, recent work has revealed the strong tendency of video-text models towards
frame-based spatial representations, while temporal reasoning remains largely
unsolved. In this work, we identify several key challenges in temporal learning
of video-text transformers: the spatiotemporal trade-off from limited network
size; the curse of dimensionality for multi-frame modeling; and the diminishing
returns of semantic information by extending clip length. Guided by these
findings, we propose SViTT, a sparse video-text architecture that performs
multi-frame reasoning with significantly lower cost than naive transformers
with dense attention. Analogous to graph-based networks, SViTT employs two
forms of sparsity: edge sparsity that limits the query-key communications
between tokens in self-attention, and node sparsity that discards uninformative
visual tokens. Trained with a curriculum which increases model sparsity with
the clip length, SViTT outperforms dense transformer baselines on multiple
video-text retrieval and question answering benchmarks, with a fraction of
computational cost. Project page: http://svcl.ucsd.edu/projects/svitt
Comment: CVPR 2023
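The two sparsity forms can be illustrated with a small sketch (not the authors' implementation): a local attention window for edge sparsity and a top-k saliency filter for node sparsity.

```python
# Illustrative sketch of edge sparsity (windowed query-key attention) and
# node sparsity (dropping uninformative tokens); shapes are arbitrary.
import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, window=4):
    """Edge sparsity: each query attends only to keys within a local window."""
    idx = torch.arange(q.size(0))
    keep = (idx[:, None] - idx[None, :]).abs() <= window    # (n, n) mask
    scores = (q @ k.t()) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def prune_tokens(x, saliency, keep=16):
    """Node sparsity: keep only the top-k most salient visual tokens."""
    return x[saliency.topk(keep).indices]

tokens = torch.randn(64, 32)             # 64 visual tokens of dimension 32
tokens = prune_tokens(tokens, saliency=tokens.norm(dim=-1), keep=16)
out = windowed_attention(tokens, tokens, tokens, window=4)
print(out.shape)   # torch.Size([16, 32])
```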
Generalized Weak Supervision for Neural Information Retrieval
Neural ranking models (NRMs) have demonstrated effective performance in
several information retrieval (IR) tasks. However, training NRMs often requires
large-scale training data, which is difficult and expensive to obtain. To
address this issue, one can train NRMs via weak supervision, where a large
dataset is automatically generated using an existing ranking model (called the
weak labeler) for training NRMs. Weakly supervised NRMs can generalize from the
observed data and significantly outperform the weak labeler. This paper
generalizes this idea through an iterative re-labeling process, demonstrating
that weakly supervised models can iteratively play the role of weak labeler and
significantly improve ranking performance without using manually labeled data.
The proposed Generalized Weak Supervision (GWS) solution is generic and
orthogonal to the ranking model architecture. This paper offers four
implementations of GWS: self-labeling, cross-labeling, joint cross- and
self-labeling, and greedy multi-labeling. GWS also benefits from a query
importance weighting mechanism based on query performance prediction methods to
reduce noise in the generated training data. We further draw a theoretical
connection between self-labeling and Expectation-Maximization. Our experiments
on two passage retrieval benchmarks suggest that all implementations of GWS
lead to substantial improvements compared to weak supervision in all cases.
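The iterative re-labeling loop at the heart of GWS can be sketched schematically, here in the spirit of the self-labeling variant; all components below are placeholders, not the paper's models.

```python
# Schematic sketch of iterative re-labeling: the ranker trained on weak
# labels becomes the weak labeler for the next round.
def generalized_weak_supervision(queries, docs, weak_labeler, train_ranker, rounds=3):
    labeler, ranker = weak_labeler, None
    for _ in range(rounds):
        labels = {q: labeler(q, docs) for q in queries}  # (re-)label the data
        ranker = train_ranker(labels)                    # fit a new ranker
        labeler = ranker                                 # it labels next round
    return ranker

# Toy usage with stand-in components.
queries, docs = ["q1", "q2"], ["d1", "d2", "d3"]
bm25 = lambda q, ds: sorted(ds)                          # stand-in weak labeler
train = lambda labels: (lambda q, ds: sorted(ds, reverse=True))
model = generalized_weak_supervision(queries, docs, bm25, train)
```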