
    SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

    Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, existing models are limited in semantic understanding and commonsense reasoning when the input prompts are concise narratives, resulting in low-quality image generation. To improve the capacities for narrative prompts, we propose a simple yet effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to the complex prompts and transfer knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation so that it can acquire the powerful semantic understanding and reasoning capabilities to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason about concise natural language without image quality degradation. Our approach can make text-to-image diffusion models easier to use with a better user experience, which demonstrates that it has the potential to further advance the development of user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts. The code is released at https://github.com/Qrange-group/SUR-adapter. Comment: accepted by ACM MM 202

    ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models

    Large language models (LLMs) have recently demonstrated their potential in clinical applications, providing valuable medical knowledge and advice. For example, a large dialog LLM like ChatGPT has successfully passed part of the US medical licensing exam. However, LLMs currently have difficulty processing images, making it challenging to interpret information from medical images, which are rich in information that supports clinical decisions. On the other hand, computer-aided diagnosis (CAD) networks for medical images have seen significant success in the medical field by using advanced deep-learning algorithms to support clinical decision-making. This paper presents a method for integrating LLMs into medical-image CAD networks. The proposed framework uses LLMs to enhance the output of multiple CAD networks, such as diagnosis networks, lesion segmentation networks, and report generation networks, by summarizing and reorganizing the information presented in natural language text format. The goal is to merge the strengths of LLMs' medical domain knowledge and logical reasoning with the vision understanding capability of existing medical-image CAD models to create a system that is more user-friendly and understandable for patients than conventional CAD systems. In the future, LLMs' medical knowledge can also be used to improve the performance of vision-based medical-image CAD models.

    A multi-INT semantic reasoning framework for intelligence analysis support

    Lockheed Martin Corp. has funded research to generate a framework and methodology for developing semantic reasoning applications to support the discipline of Intelligence Analysis. This chapter outlines that framework, discusses how it may be used to advance the information sharing and integrated analytic needs of the Intelligence Community, and suggests a system/software architecture for such applications.

    High-throughput visual knowledge analysis and retrieval in big data ecosystems

    Visual knowledge plays an important role in many highly skilled applications, such as medical diagnosis, geospatial image analysis and pathology diagnosis. Medical practitioners are able to interpret and reason about diagnostic images based on not only primitive-level image features such as color, texture, and spatial distribution but also their experience and tacit knowledge, which are seldom articulated explicitly. This reasoning process is dynamic and closely related to real-time human cognition. Due to a lack of visual knowledge management and sharing tools, it is difficult to capture and transfer such tacit and hard-won expertise to novices. Moreover, many mission-critical applications require the ability to process such tacit visual knowledge in real time. Precisely how to index this visual knowledge computationally and systematically still poses a challenge to the computing community. My dissertation research results in novel computational approaches for high-throughput visual knowledge analysis and retrieval from large-scale databases using the latest technologies in big data ecosystems. To provide a better understanding of visual reasoning, human gaze patterns are qualitatively measured spatially and temporally to model observers' cognitive processes. These gaze patterns are then indexed in a NoSQL distributed database as a visual knowledge repository, which is accessed using various unique retrieval methods developed through this dissertation work. To provide meaningful retrievals in real time, deep-learning methods for automatic annotation of visual activities and streaming similarity comparisons are developed under a gaze-streaming framework using Apache Spark. This research has several potential applications that offer a broader impact among the scientific community and in the practical world. First, the proposed framework can be adapted with minimal effort to different domains, such as fine arts and life sciences, to capture human reasoning processes. Second, with its real-time visual knowledge search function, this framework can be used for training novices in the interpretation of domain images, by helping them learn experts' reasoning processes. Third, by helping researchers to understand human visual reasoning, it may shed light on human semantics modeling. Finally, by integrating reasoning processes with multimedia data, future media retrieval could embed human perceptual reasoning in database search, going beyond traditional content-based media retrieval.

    The VEX-93 environment as a hybrid tool for developing knowledge systems with different problem solving techniques

    The paper describes VEX-93 as a hybrid environment for developing knowledge-based and problem-solver systems. It integrates methods and techniques from artificial intelligence, image and signal processing, and data analysis, which can be mixed. Two hierarchical levels of reasoning contain an intelligent toolbox with one upper strategic inference engine and four lower ones containing specific reasoning models: truth-functional (rule-based), probabilistic (causal networks), fuzzy (rule-based) and case-based (frames). There are image/signal processing and analysis capabilities in the form of programming languages with more than one hundred primitive functions. User-made programs are embeddable within knowledge bases, allowing the combination of perception and reasoning. The data analyzer toolbox contains a collection of numerical classification, pattern recognition and ordination methods, with neural network tools and a database query language at the inference engines' disposal. VEX-93 is an open system able to communicate with external computer programs relevant to a particular application. Metaknowledge can be used for elaborate conclusions, and man-machine interaction includes, besides windows and graphical interfaces, acceptance of voice commands and production of speech output. The system was conceived for real-world applications in general domains, but an example of a concrete medical diagnostic support system, at present under completion as a Cuban-Spanish project, is mentioned. The present version of VEX-93 is a huge system composed of about one and a half million lines of C code and runs on microcomputers under Windows 3.1. Postprint (published version)