SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
Diffusion models, which have emerged to become popular text-to-image
generation models, can produce high-quality and content-rich images guided by
textual prompts. However, existing models show limited semantic understanding
and commonsense reasoning when the input prompts are concise narratives,
resulting in low-quality image generation. To improve the capabilities
for narrative prompts, we propose a simple yet effective parameter-efficient
fine-tuning approach called the Semantic Understanding and Reasoning adapter
(SUR-adapter) for pre-trained diffusion models. To reach this goal, we first
collect and annotate a new dataset SURD which consists of more than 57,000
semantically corrected multi-modal samples. Each sample contains a simple
narrative prompt, a complex keyword-based prompt, and a high-quality image.
Then, we align the semantic representation of narrative prompts to the complex
prompts and transfer knowledge of large language models (LLMs) to our
SUR-adapter via knowledge distillation so that it can acquire the powerful
semantic understanding and reasoning capabilities to build a high-quality
textual semantic representation for text-to-image generation. We conduct
experiments by integrating multiple LLMs and popular pre-trained diffusion
models to show the effectiveness of our approach in enabling diffusion models
to understand and reason about concise natural language without degrading image
quality. Our approach makes text-to-image diffusion models easier to use, with
a better user experience, demonstrating its potential to further advance the
development of user-friendly text-to-image generation models by bridging the
semantic gap between simple narrative prompts and complex keyword-based
prompts. The code is released at https://github.com/Qrange-group/SUR-adapter.
Comment: accepted by ACM MM 202
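The alignment-plus-distillation objective described above can be sketched as two MSE terms over prompt embeddings. This is a toy illustration only: the linear adapter, the embedding dimensions, and the `alpha` weighting are assumptions for exposition, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding dimension (not taken from the paper).
d_text = 8

def adapter(x, W):
    """Minimal stand-in for the SUR-adapter: a single linear map."""
    return x @ W

def distillation_loss(W, narrative, complex_kw, llm_repr, alpha=0.5):
    """Align adapted narrative embeddings to (a) complex keyword-prompt
    embeddings and (b) LLM representations, as two MSE terms."""
    z = adapter(narrative, W)
    align = np.mean((z - complex_kw) ** 2)   # semantic-alignment term
    distill = np.mean((z - llm_repr) ** 2)   # knowledge-distillation term
    return align + alpha * distill

# Toy batch: embeddings of narrative prompts, keyword prompts, LLM outputs.
narrative = rng.normal(size=(4, d_text))
complex_kw = rng.normal(size=(4, d_text))
llm_repr = rng.normal(size=(4, d_text))

W = np.eye(d_text)  # adapter weights before training
loss = distillation_loss(W, narrative, complex_kw, llm_repr)
```

In training, `W` would be the adapter's learnable parameters, updated by gradient descent while the diffusion model and LLM stay frozen.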
ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models
Large language models (LLMs) have recently demonstrated their potential in
clinical applications, providing valuable medical knowledge and advice. For
example, a large dialog LLM like ChatGPT has successfully passed part of the US
medical licensing exam. However, LLMs currently have difficulty processing
images, making it challenging to interpret information from medical images,
which are rich in information that supports clinical decisions. On the other
hand, computer-aided diagnosis (CAD) networks for medical images have seen
significant success in the medical field by using advanced deep-learning
algorithms to support clinical decision-making. This paper presents a method
for integrating LLMs into medical-image CAD networks. The proposed framework
uses LLMs to enhance the output of multiple CAD networks, such as diagnosis
networks, lesion segmentation networks, and report generation networks, by
summarizing and reorganizing the information presented in natural language text
format. The goal is to merge the strengths of LLMs' medical domain knowledge
and logical reasoning with the vision understanding capability of existing
medical-image CAD models to create a more user-friendly and understandable
system for patients than conventional CAD systems. In the future, LLMs'
medical knowledge could also be used to improve the performance of vision-based
medical-image CAD models.
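The text-level integration described above might look like the following sketch, which serializes structured CAD-network outputs into a natural-language prompt for an LLM. The function name, input fields, and prompt format are hypothetical illustrations, not the paper's implementation.

```python
def cad_to_prompt(diagnosis_probs, lesion_summary, draft_report):
    """Convert structured CAD-network outputs (diagnosis scores, lesion
    segmentation summary, draft report) into one natural-language prompt
    that an LLM can summarize and reorganize for a patient."""
    findings = ", ".join(
        f"{name}: {prob:.0%}"
        for name, prob in sorted(diagnosis_probs.items(), key=lambda kv: -kv[1])
    )
    return (
        "Diagnosis network scores: " + findings + ". "
        "Lesion segmentation: " + lesion_summary + ". "
        "Draft report: " + draft_report + " "
        "Please summarize these findings for a patient in plain language."
    )

prompt = cad_to_prompt(
    {"pneumonia": 0.82, "edema": 0.10},
    "one opacity in the left lower lobe",
    "Findings suggest consolidation.",
)
```

The resulting string would then be sent to the dialog LLM, whose reply serves as the patient-facing summary.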
A multi-INT semantic reasoning framework for intelligence analysis support
Lockheed Martin Corp. has funded research to generate a framework
and methodology for developing semantic reasoning applications to support the
discipline of Intelligence Analysis. This chapter outlines that framework, discusses
how it may be used to advance the information sharing and integrated analytic
needs of the Intelligence Community, and suggests a system/software
architecture for such applications.
High-throughput visual knowledge analysis and retrieval in big data ecosystems
Visual knowledge plays an important role in many highly skilled applications, such as medical diagnosis, geospatial image analysis, and pathology diagnosis. Medical practitioners are able to interpret and reason about diagnostic images based not only on primitive-level image features such as color, texture, and spatial distribution but also on their experience and tacit knowledge, which are seldom articulated explicitly. This reasoning process is dynamic and closely related to real-time human cognition. Due to a lack of visual knowledge management and sharing tools, it is difficult to capture and transfer such tacit and hard-won expertise to novices. Moreover, many mission-critical applications require the ability to process such tacit visual knowledge in real time. Precisely how to index this visual knowledge computationally and systematically still poses a challenge to the computing community.
My dissertation research results in novel computational approaches for high-throughput visual knowledge analysis and retrieval from large-scale databases using the latest technologies in big data ecosystems. To provide a better understanding of visual reasoning, human gaze patterns are qualitatively measured spatially and temporally to model observers' cognitive processes. These gaze patterns are then indexed in a NoSQL distributed database as a visual knowledge repository, which is accessed using various unique retrieval methods developed through this dissertation work. To provide meaningful retrievals in real time, deep-learning methods for automatic annotation of visual activities and streaming similarity comparisons are developed under a gaze-streaming framework using Apache Spark.
This research has several potential applications that offer a broader impact among the scientific community and in the practical world. First, the proposed framework can be adapted for different domains, such as fine arts and the life sciences, with minimal effort to capture human reasoning processes. Second, with its real-time visual knowledge search function, this framework can be used for training novices in the interpretation of domain images, by helping them learn experts' reasoning processes. Third, by helping researchers to understand human visual reasoning, it may shed light on human semantics modeling. Finally, by integrating the reasoning process with multimedia data, future media retrieval could embed human perceptual reasoning for database search, going beyond traditional content-based media retrieval.
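A toy stand-in for the streaming similarity comparison between gaze sequences is sketched below, using cosine similarity over flattened (x, y) coordinates. The real system uses deep-learning annotation and Apache Spark streaming; the function and data here are illustrative assumptions.

```python
import math

def gaze_similarity(seq_a, seq_b):
    """Cosine similarity between two equal-length gaze sequences,
    each a list of (x, y) fixation coordinates flattened into a vector."""
    a = [c for point in seq_a for c in point]
    b = [c for point in seq_b for c in point]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical normalized gaze traces of an expert and a novice.
expert = [(0.1, 0.2), (0.4, 0.5), (0.8, 0.9)]
novice = [(0.1, 0.25), (0.35, 0.5), (0.7, 0.9)]
score = gaze_similarity(expert, novice)
```

In a streaming deployment, each incoming gaze window would be compared against indexed expert patterns and the top matches returned in real time.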
The VEX-93 environment as a hybrid tool for developing knowledge systems with different problem solving techniques
The paper describes VEX-93 as a hybrid environment for developing
knowledge-based and problem solver systems. It integrates methods and
techniques from artificial intelligence, image and signal processing and
data analysis, which can be mixed. Reasoning is organized in two
hierarchical levels: an upper strategic inference engine and four
lower-level engines containing specific reasoning models: truth-functional
(rule-based), probabilistic (causal networks), fuzzy (rule-based), and
case-based (frames). There are image/signal processing and analysis capabilities
in the form of programming languages with more than one hundred primitive
functions.
User-made programs are embeddable within knowledge bases, allowing the
combination of perception and reasoning. The data-analyzer toolbox contains
a collection of numerical classification, pattern recognition, and ordination
methods, with neural network tools and a database query language at
the inference engines' disposal.
VEX-93 is an open system able to communicate with external computer programs
relevant to a particular application. Metaknowledge can be used to
elaborate conclusions, and man-machine interaction includes, besides windows
and graphical interfaces, acceptance of voice commands and production of
speech output.
The system was conceived for real-world applications in general domains;
as an example, a concrete medical diagnostic support system is at present
under completion as a Cuban-Spanish project.
The present version of VEX-93 is a large system composed of about one and a
half million lines of C code and runs on microcomputers under Windows 3.1.
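The two-level reasoning architecture described above might be sketched as follows. This is a hypothetical miniature: the engine names, the dispatch rule, and the fuzzy membership function are illustrative assumptions, not VEX-93's actual design.

```python
# Upper strategic engine dispatching a query to one of several
# lower-level reasoning models (only two of the four are sketched).

def rule_based(query):
    """Truth-functional lower engine: fire a rule on a symptom."""
    return "rule fired" if query.get("symptom") == "fever" else "no rule"

def fuzzy(query):
    """Fuzzy lower engine: membership of 'high fever', rising
    linearly from 37 to 40 degrees Celsius."""
    t = query.get("temperature", 37.0)
    return max(0.0, min(1.0, (t - 37.0) / 3.0))

def strategic_engine(query):
    """Upper-level engine: choose a reasoning model from the query shape."""
    if "temperature" in query:
        return ("fuzzy", fuzzy(query))
    return ("rule-based", rule_based(query))

result = strategic_engine({"temperature": 38.5})
```

A full hybrid system would add the probabilistic (causal-network) and case-based (frame) engines and let the strategic level combine their conclusions.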