424,181 research outputs found
Model-driven performance analysis of rule-based domain specific visual models
Context: Domain-Specific Visual Languages (DSVLs) play a crucial role in Model-Driven Engineering
(MDE). Most DSVLs already allow the specification of the structure and behavior of systems. However,
there is also an increasing need to model, simulate and reason about their non-functional properties.
In particular, QoS usage and management constraints (performance, reliability, etc.) are essential characteristics
of any non-trivial system.
Objective: Very few DSVLs currently offer support for modeling these kinds of properties, and those
that do tend to require expert knowledge of specialized notations, which clashes with the intuitive
nature of DSVLs. In this paper we present an alternative approach to specify QoS properties in a high-level
and platform-independent manner.
Method: We propose the use of special objects (observers) that can be added to the graphical specification
of a system for describing and monitoring some of its non-functional properties.
Results: Observers extend the global state of the system with the variables that the designer
wants to analyze, making it possible to capture the performance properties of interest. A performance evaluation
tool has also been developed as a proof of concept for the proposal.
Conclusion: The results show how non-functional properties can be specified in DSVLs using observers,
and how the performance of systems specified in this way can be evaluated in a flexible and effective
way.
Ministerio de Ciencia e Innovación TIN2008-031087; Ministerio de Ciencia e Innovación TIN2011-2379
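As a rough illustration of the observer idea (not the paper's actual notation or tooling), the following sketch shows a hypothetical response-time observer that extends a simulated system's state with the variable a designer wants to analyze; all names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResponseTimeObserver:
    """Records the time each request spends in the system."""
    samples: List[float] = field(default_factory=list)

    def on_request_completed(self, arrival_time: float, completion_time: float) -> None:
        # The simulation engine notifies the observer whenever a request finishes.
        self.samples.append(completion_time - arrival_time)

    @property
    def mean_response_time(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


# Usage: attach the observer to the simulated system and query it afterwards.
observer = ResponseTimeObserver()
observer.on_request_completed(arrival_time=0.0, completion_time=1.5)
observer.on_request_completed(arrival_time=1.0, completion_time=2.2)
print(f"mean response time: {observer.mean_response_time:.2f} time units")
```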
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering
The emergence of multimodal large models (MLMs) has significantly advanced
the field of visual understanding, offering remarkable capabilities in the
realm of visual question answering (VQA). Yet, the true challenge lies in the
domain of knowledge-intensive VQA tasks, which necessitate not just recognition
of visual elements, but also a deep comprehension of the visual information in
conjunction with a vast repository of learned knowledge. To uncover such
capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an
in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which
assesses how well models can understand visual cues and connect to general
knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in
reasoning out specific knowledge from images, showcasing their proficiency
across various specialized fields; 3) Comprehensive Knowledge with
Decision-making Rationales, which examines the model's capability to provide
logical explanations for its inference, facilitating a deeper analysis from the
interpretability perspective. Extensive experiments indicate that GPT-4V
achieves SOTA performance on the above three tasks. Interestingly, we find that: a)
GPT-4V demonstrates enhanced reasoning and explanation when using composite
images as few-shot examples; b) GPT-4V produces severe hallucinations when dealing with
world knowledge, highlighting the future need for advancements in this research
direction.
Comment: 18 pages, 13 figures; work in progress
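For readers who want to try this style of evaluation, a minimal sketch of querying a GPT-4V-class model through the OpenAI Python SDK is shown below; the model name, prompt text, and image URLs are placeholders, and the composite few-shot image is assumed to have been prepared beforehand.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COMPOSITE_FEWSHOT_URL = "https://example.com/fewshot_composite.png"  # hypothetical
TEST_IMAGE_URL = "https://example.com/test_image.png"                # hypothetical

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The first image contains solved examples (question, image, "
                     "answer) stitched into one composite. Answer the question "
                     "about the second image and explain your reasoning."},
            {"type": "image_url", "image_url": {"url": COMPOSITE_FEWSHOT_URL}},
            {"type": "image_url", "image_url": {"url": TEST_IMAGE_URL}},
            {"type": "text", "text": "Question: which landmark is shown, and in which country is it?"},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```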
From Data to Knowledge Graphs: A Multi-Layered Method to Model User's Visual Analytics Workflow for Analytical Purposes
The importance of knowledge generation drives much of Visual Analytics (VA).
User-tracking and behavior graphs have shown the value of understanding users'
knowledge generation while performing VA workflows. Works in theoretical
models, ontologies, and provenance analysis have greatly described means to
structure and understand the connection between knowledge generation and VA
workflows. Yet, two concepts are typically intermixed: the temporal aspect,
which indicates sequences of events, and the atemporal aspect, which indicates
the workflow state space. Works that separate these concepts do not
discuss how to analyze the user's recorded knowledge-gathering process against
the VA workflow itself. This paper presents the Visual Analytic
Knowledge Graph (VAKG), a conceptual framework that generalizes existing
knowledge models and ontologies by focusing on how humans relate to computer
processes temporally and how this relationship maps onto the workflow's state space. Our
proposal structures this relationship as a 4-way temporal knowledge graph with
specific emphasis on modeling the human and computer aspects of VA as separate
but interconnected graphs for analytical purposes, among others. We compare
VAKG with relevant literature to show that VAKG's contribution allows VA
applications to use it as a provenance model and a state space graph, allowing
for analytics of domain-specific processes, usage patterns, and users'
knowledge gain performance. We also interviewed two domain experts to check, in
the wild, whether real practice and our contributions are aligned.
Comment: 9 pages, submitted to VIS 202
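A rough sketch of the VAKG idea, assuming a plain networkx graph rather than the authors' implementation: human knowledge events and computer (VA tool) states form separate node sets, with temporal edges within each side and cross-links between them. Node names and attributes below are illustrative.

```python
import networkx as nx

vakg = nx.DiGraph()

# Computer-side nodes: states of the VA workflow (the atemporal state space).
vakg.add_node("C1", kind="computer", state="overview scatterplot")
vakg.add_node("C2", kind="computer", state="filtered by region")

# Human-side nodes: knowledge/insight events gathered while using the tool.
vakg.add_node("H1", kind="human", insight="outlier cluster noticed")
vakg.add_node("H2", kind="human", insight="outliers tied to one region")

# Temporal edges inside each subgraph (sequence of events).
vakg.add_edge("C1", "C2", relation="temporal")
vakg.add_edge("H1", "H2", relation="temporal")

# Cross-links between computer states and the human insights they triggered.
vakg.add_edge("C1", "H1", relation="triggered")
vakg.add_edge("H1", "C2", relation="led-to-interaction")

# Simple provenance query: which computer state preceded each insight?
for node, data in vakg.nodes(data=True):
    if data["kind"] == "human":
        preceding = [u for u, _, d in vakg.in_edges(node, data=True) if d["relation"] == "triggered"]
        print(node, "triggered by", preceding)
```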
Self-trained Panoptic Segmentation
Panoptic segmentation is an important computer vision task which combines
semantic and instance segmentation. It plays a crucial role in domains such as
medical image analysis, self-driving vehicles, and robotics by providing a
comprehensive understanding of visual environments. Traditionally, deep
learning panoptic segmentation models have relied on dense and accurately
annotated training data, which is expensive and time consuming to obtain.
Recent advancements in self-supervised learning approaches have shown great
potential in leveraging synthetic and unlabelled data to generate pseudo-labels
using self-training to improve the performance of instance and semantic
segmentation models. The three available methods for self-supervised panoptic
segmentation use proposal-based transformer architectures which are
computationally expensive, complicated and engineered for specific tasks. The
aim of this work is to develop a framework to perform embedding-based
self-supervised panoptic segmentation using self-training in a
synthetic-to-real domain adaptation problem setting.
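As a generic illustration of the self-training ingredient mentioned above (not the specific framework developed in this work), the sketch below has a teacher segmentation model produce pseudo-labels on unlabeled real images, keeps only high-confidence pixels, and trains a student on them; the model definitions and threshold value are assumed.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.9  # pixels below this confidence are ignored


@torch.no_grad()
def make_pseudo_labels(teacher, images):
    """Return per-pixel pseudo-labels, with low-confidence pixels set to the ignore index."""
    logits = teacher(images)                        # (B, num_classes, H, W)
    probs = F.softmax(logits, dim=1)
    confidence, pseudo = probs.max(dim=1)           # (B, H, W) each
    pseudo[confidence < CONFIDENCE_THRESHOLD] = -1  # -1 = ignore index
    return pseudo


def self_training_step(student, teacher, images, optimizer):
    # Student learns from the teacher's confident predictions on unlabeled images.
    pseudo = make_pseudo_labels(teacher, images)
    logits = student(images)
    loss = F.cross_entropy(logits, pseudo, ignore_index=-1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```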
Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery
Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction,
surgical navigation and augmented reality visualization. Although the
foundation model exhibits outstanding performance in many vision tasks,
including depth estimation (e.g., DINOv2), recent works observed its
limitations in medical and surgical domain-specific applications. This work
presents a low-ranked adaptation (LoRA) of the foundation model for surgical
depth estimation. Methods: We design a foundation model-based depth estimation
method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for
depth estimation in endoscopic surgery. We build LoRA layers and integrate them
into DINO to adapt it to surgery-specific domain knowledge instead of using
conventional fine-tuning. During training, we freeze the DINO image encoder,
which shows excellent visual representation capacity, and only optimize the
LoRA layers and depth decoder to integrate features from the surgical scene.
Results: Our model is extensively validated on the MICCAI challenge dataset
SCARED, which was collected from da Vinci Xi endoscopic surgery. We empirically
show that Surgical-DINO significantly outperforms all the state-of-the-art
models in endoscopic depth estimation tasks. The analysis with ablation studies
has shown evidence of the remarkable effect of our LoRA layers and adaptation.
Conclusion: Surgical-DINO sheds some light on the successful adaptation of
foundation models to the surgical domain for depth estimation. There is clear
evidence in the results that zero-shot prediction with weights pre-trained on
computer vision datasets, or naive fine-tuning, is not sufficient to use a
foundation model in the surgical domain directly. Code is available at
https://github.com/BeileiCui/SurgicalDINO.
Comment: Accepted by IPCAI 2024 (IJCARS Special Issue)
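A simplified, hypothetical sketch of the LoRA-style adaptation described in the abstract (not the released Surgical-DINO code): a frozen pre-trained linear projection is augmented with a small trainable low-rank update, so only the LoRA parameters (and, in the full method, the depth decoder) receive gradients.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as an identity adaptation
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus scaled low-rank update.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Wrapping one frozen encoder projection: only lora_a/lora_b are trainable.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(1, 197, 768))           # e.g. ViT token embeddings
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable, "trainable params")
```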
Understanding of Visual Domains via the Lens of Natural Language
A joint understanding of vision and language can enable intelligent systems to perceive, act, and communicate with humans for a wide range of applications. For example, they can assist a human to navigate in an environment, edit the content of an image through natural language commands, or search through image collections using natural language queries. In this thesis, we aim to improve our understanding of visual domains through the lens of natural language. We specifically look into (1) images of categories within a fine-grained taxonomy such as species of birds or variants of aircraft, (2) images of textures that describe local color, shape, and patterns, and (3) regions in images that correspond to objects, materials, and textures.
In one line of work, we investigate ways to discover a domain-specific language by asking annotators to describe visual differences between instances within a fine-grained taxonomy. We show that a system trained to describe these differences leads to an accurate and interpretable basis for categorization. In another line of work, we investigate the effectiveness of language and vision models for describing textures, a problem that, despite the ubiquity of textures, has not been sufficiently studied in the literature. Textures are diverse, yet their local nature allows for the description of appearance of a wide range of visual categories. The locality also allows us to systematically generate synthetic variations to investigate how disentangled visual representations are for properties such as shape, color, and figure-ground segmentation. Finally, instead of modeling an image as a whole, we design a system that allows descriptions of regions within an image. A challenge is to handle the long-tail distribution of names and appearances of concepts within natural scenes. We design a modular framework that integrates object detection, semantic segmentation, and contextual reasoning with language that leads to better performance. In addition to methods and analysis, we contribute datasets and benchmarks to evaluate the performance of models in each of these domains.
The availability of large-scale pre-trained models for vision (e.g., ResNet) and language (e.g., BERT) has catalyzed improvements and novel applications in computer vision and natural language processing, but until recently similar models that could jointly reason about language and vision were not available. This has changed with the availability of models such as CLIP, which have been trained on a massive number of images with associated texts. Therefore, we analyze the effectiveness of CLIP-based representations for the tasks posed in our earlier work. By comparing and contrasting these with the domain-specific ones we presented in the earlier chapters, we shed some light on the nature of the learned representations and the biases they encode.
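As a small example of the kind of CLIP-based analysis discussed above, the sketch below scores a few texture descriptions against an image using the off-the-shelf openai/CLIP package; the image path and the texture phrases are illustrative and not drawn from the thesis' datasets.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

descriptions = ["a striped texture", "a dotted texture", "a woven texture"]
image = preprocess(Image.open("example_texture.jpg")).unsqueeze(0).to(device)  # hypothetical file
text = clip.tokenize(descriptions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each description, softmax-normalized.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).softmax(dim=-1)

for desc, score in zip(descriptions, similarity[0].tolist()):
    print(f"{desc}: {score:.3f}")
```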
Domain adaptive learning with disentangled features
Recognizing visual information is crucial for many real-world artificial-intelligence-based applications, ranging from domestic robots to autonomous vehicles. However, the success of deep learning methods on visual recognition tasks is highly dependent on access to large-scale labeled datasets, which are expensive and cumbersome to collect. Transfer learning provides a way to alleviate the burden of annotating data by transferring the knowledge learned from a richly labeled source domain to a scarcely labeled target domain. However, the performance of deep learning models degrades significantly when testing on novel domains due to the presence of domain shift. To tackle domain shift, conventional domain adaptation methods diminish the discrepancy between the two domains with a distribution matching loss or an adversarial loss. These models align the domain-specific feature distribution and the domain-invariant feature distribution simultaneously, which is sub-optimal for solving deep domain adaptation tasks, given that deep neural networks are known to extract features in which multiple hidden factors are highly entangled.
This thesis explores how to learn effective transferable features by disentangling the deep features. The following questions are studied: (1) how to disentangle the deep features into domain-invariant and domain-specific features? (2) how would feature disentanglement help to learn transferable features under a synthetic-to-real domain adaptation scenario? (3) how would feature disentanglement facilitate transfer learning with multiple source or target domains? (4) how to leverage feature disentanglement to boost the performance in a federated system?
To address these needs, this thesis proposes deep adversarial feature disentanglement: a class/domain identifier is trained on the labeled source domain and the disentangler generates features to fool the class/domain identifier. Extensive experiments and empirical analysis demonstrate the effectiveness of the feature disentanglement method on many real-world domain adaptation tasks. Specifically, the following three unsupervised domain adaptation scenarios are explored: (1) domain-agnostic learning with disentangled representations, (2) unsupervised federated domain adaptation, and (3) multi-source domain adaptation.
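A condensed sketch of the adversarial feature disentanglement idea as described above (layer sizes, loss terms, and the two-domain setup are placeholders): deep features are split into a domain-invariant and a domain-specific part, and a domain identifier is used adversarially so that the invariant part carries no domain information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Disentangler(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden: int = 256):
        super().__init__()
        self.invariant = nn.Linear(feat_dim, hidden)
        self.specific = nn.Linear(feat_dim, hidden)

    def forward(self, features):
        return self.invariant(features), self.specific(features)


disentangler = Disentangler()
domain_identifier = nn.Linear(256, 2)     # source vs. target
features = torch.randn(8, 2048)           # deep backbone features (placeholder)
domain_labels = torch.randint(0, 2, (8,))

inv, spec = disentangler(features)

# Step 1: the identifier learns to predict the domain from the invariant branch.
identifier_loss = F.cross_entropy(domain_identifier(inv.detach()), domain_labels)

# Step 2: the disentangler is updated to fool the identifier (push it toward
# uniform predictions), driving domain information out of the invariant branch.
# In practice the two steps use separate optimizers in an alternating loop.
log_probs = F.log_softmax(domain_identifier(inv), dim=1)
confusion_loss = -log_probs.mean()
```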
Budget-Aware Adapters for Multi-Domain Learning
Multi-Domain Learning (MDL) refers to the problem of learning a set of models
derived from a common deep architecture, each one specialized to perform a task
in a certain domain (e.g., photos, sketches, paintings). This paper tackles MDL
with a particular interest in obtaining domain-specific models with an
adjustable budget in terms of the number of network parameters and
computational complexity. Our intuition is that, since in real applications the
number of domains and tasks can be very large, an effective MDL approach should
focus not only on accuracy but also on having as few parameters as possible. To
implement this idea we derive specialized deep models for each domain by
adapting a pre-trained architecture but, differently from other methods, we
propose a novel strategy to automatically adjust the computational complexity
of the network. To this aim, we introduce Budget-Aware Adapters that select the
most relevant feature channels to better handle data from a novel domain. Some
constraints on the number of active switches are imposed in order to obtain a
network respecting the desired complexity budget. Experimentally, we show that
our approach leads to recognition accuracy competitive with state-of-the-art
approaches but with much lighter networks both in terms of storage and
computation.
Comment: ICCV 2019
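A rough sketch of the budget-aware adapter idea (the gate parameterization and budget penalty below are illustrative, not the paper's exact formulation): learnable per-channel switches decide which feature channels of a pre-trained layer are kept for a new domain, and a penalty keeps the fraction of active switches near the desired budget.

```python
import torch
import torch.nn as nn


class BudgetAwareAdapter(nn.Module):
    def __init__(self, num_channels: int, budget: float = 0.5):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(num_channels))
        self.budget = budget  # target fraction of active channels

    def forward(self, x):
        # x: (B, C, H, W); soft per-channel switches in [0, 1].
        gates = torch.sigmoid(self.gate_logits)
        return x * gates.view(1, -1, 1, 1)

    def budget_penalty(self):
        # Penalize exceeding the desired fraction of active channels.
        active_fraction = torch.sigmoid(self.gate_logits).mean()
        return (active_fraction - self.budget).clamp(min=0) ** 2


adapter = BudgetAwareAdapter(num_channels=64, budget=0.25)
features = torch.randn(2, 64, 32, 32)      # activations from a frozen backbone layer
out = adapter(features)
loss = out.pow(2).mean() + 0.1 * adapter.budget_penalty()  # task loss is a placeholder
```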