
    Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

    Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a unified vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos. Comment: 26 pages.
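    The abstract describes compressing images and video frames into a fixed budget of dynamic visual tokens. A minimal sketch of one way such a budget could be met, by clustering patch embeddings and averaging each cluster, is shown below; the function name merge_tokens and the simple k-means-style merging step are illustrative assumptions, not Chat-UniVi's actual implementation.

```python
# Hypothetical sketch: reduce a variable number of patch tokens (from an image
# or a stack of video frames) to a fixed budget of "dynamic visual tokens" by
# clustering similar tokens and averaging each cluster.
import numpy as np

def merge_tokens(patch_tokens: np.ndarray, budget: int, iters: int = 10) -> np.ndarray:
    """patch_tokens: (N, D) patch embeddings; returns (budget, D) merged tokens."""
    n, d = patch_tokens.shape
    if n <= budget:
        return patch_tokens
    # Initialize cluster centers from evenly spaced tokens.
    centers = patch_tokens[np.linspace(0, n - 1, budget, dtype=int)].copy()
    for _ in range(iters):
        # Assign every token to its nearest center (squared Euclidean distance).
        dists = ((patch_tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(budget):
            members = patch_tokens[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

# Toy usage: 8 video frames x 196 patches with 64-dim features -> 112 visual tokens.
frames = np.random.rand(8 * 196, 64).astype(np.float32)
visual_tokens = merge_tokens(frames, budget=112)
print(visual_tokens.shape)  # (112, 64)
```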

    Doctor of Philosophy

    Visualization and exploration of volumetric datasets has been an active area of research for over two decades. During this period, volumetric datasets used by domain users have evolved from univariate to multivariate. The volume datasets are typically explored and classified via transfer function design and visualized using direct volume rendering. To improve classification results and to enable the exploration of multivariate volume datasets, multivariate transfer functions have emerged. In this dissertation, we describe our research on multivariate transfer function design. To improve the classification of univariate volumes, various one-dimensional (1D) or two-dimensional (2D) transfer function spaces have been proposed; however, these methods work on only some datasets. We propose a novel transfer function method that provides better classifications by combining different transfer function spaces. Methods have been proposed for exploring multivariate simulations; however, these approaches are not suitable for complex real-world datasets and may be unintuitive for domain users. To this end, we propose a method based on user-selected samples in the spatial domain to make complex multivariate volume data visualization more accessible for domain users. However, this method still requires users to fine-tune transfer functions in parameter-space transfer function widgets, which may not be familiar to them. We therefore propose GuideME, a novel slice-guided semiautomatic multivariate volume exploration approach. GuideME provides the user with an easy-to-use, slice-based user interface that suggests feature boundaries and allows the user to select features via click and drag; an optimal transfer function is then generated automatically by optimizing a response function. Throughout the exploration process, the user does not need to interact with the parameter views at all. Finally, real-world multivariate volume datasets are also usually large, often exceeding the GPU memory and even the main memory of standard workstations. We propose a ray-guided, out-of-core, interactive volume rendering and efficient query method to support large and complex multivariate volumes on standard workstations.
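    A multivariate transfer function maps the several attribute values stored at a voxel to a color and opacity. The sketch below illustrates one simple combination scheme, per-attribute 1D lookup tables whose colors are averaged and whose opacities are multiplied; the combination rule and the function eval_transfer_function are assumptions for illustration, not the dissertation's specific method.

```python
# Hypothetical sketch of a multivariate transfer function: each attribute of a
# voxel goes through its own 1D RGBA lookup table, and the per-attribute
# results are combined into one color and opacity.
import numpy as np

def eval_transfer_function(voxel, luts):
    """voxel: (A,) attribute values in [0, 1]; luts: one (256, 4) RGBA table per attribute."""
    entries = np.array([lut[int(np.clip(v, 0.0, 1.0) * 255)]
                        for v, lut in zip(voxel, luts)])
    color = entries[:, :3].mean(axis=0)   # average the per-attribute colors
    alpha = float(entries[:, 3].prod())   # conjunctive ("AND"-like) combined opacity
    return color, alpha

# Toy usage: a voxel with two attributes, each with its own random RGBA lookup table.
luts = [np.random.rand(256, 4) for _ in range(2)]
color, alpha = eval_transfer_function(np.array([0.3, 0.8]), luts)
print(color, alpha)
```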

    The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

    We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of the two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogous to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the search over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model on learning visual concepts, word representations, and semantic parsing of sentences. Further, our method allows easy generalization to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval. Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu
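    The key mechanism in the abstract is executing a symbolic program over an object-based scene representation. Below is a minimal sketch of that idea, assuming objects carry soft concept scores produced by a perception module; the two operators, their names, and the scoring rule are simplified assumptions, not NS-CL's actual operator set.

```python
# Hypothetical sketch of quasi-symbolic program execution on an object-based
# scene representation with soft concept scores.
def filter_objects(object_mask, scene, concept):
    """Soft-select objects whose score for `concept` is high."""
    return [m * obj["concepts"].get(concept, 0.0) for m, obj in zip(object_mask, scene)]

def count_objects(object_mask):
    """Expected number of selected objects under the soft mask."""
    return sum(object_mask)

# Toy scene: two objects with soft attribute scores from a (hypothetical) perception module.
scene = [
    {"concepts": {"red": 0.9, "cube": 0.8}},
    {"concepts": {"red": 0.1, "cube": 0.7}},
]
# Program for "How many red cubes are there?": filter(red) -> filter(cube) -> count
mask = [1.0] * len(scene)
mask = filter_objects(mask, scene, "red")
mask = filter_objects(mask, scene, "cube")
print(count_objects(mask))  # 0.72 + 0.07 = 0.79
```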

    Structural focus+context rendering of multiclassified volume data

    We present an F+C volume rendering system aimed at outlining structural relationships between different classification criteria of a multiclassified voxel model. We cluster the voxel model into subsets of voxels sharing the same classification criteria and construct an auxiliary voxel model storing, for each voxel, an identifier of its associated cluster. We represent the logical structure of the model as a directed graph whose nodes are the classification criteria and whose edges are the inclusion relationships. We define a mapping function between nodes of the graph and clusters. The rendering process consists of two steps. First, given a user query defined in terms of a boolean expression of classification criteria, a parser computes a set of transfer functions on the cluster domain according to structural F+C rules. Then, we simultaneously render the original voxel model and the labelled one, applying multimodal 3D texture mapping such that the fragment shader uses the computed transfer functions to apply structural F+C shading. The user interface of our system, based on Tulip, provides visual feedback on the structure and the selection. We demonstrate the utility of our approach on several datasets.
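    The first rendering step maps a boolean query over classification criteria to per-cluster transfer functions. A minimal sketch of that mapping, assuming each cluster is labelled with its set of criteria and that focus and context simply receive different fixed opacities, is shown below; cluster_opacities and the opacity constants are illustrative assumptions.

```python
# Hypothetical sketch: evaluate a boolean query over each cluster's criteria
# and assign focus or context opacity per cluster.
FOCUS_ALPHA, CONTEXT_ALPHA = 1.0, 0.05

def cluster_opacities(clusters, query):
    """clusters: {cluster_id: set of criteria}; query: predicate over a criteria set."""
    return {cid: (FOCUS_ALPHA if query(criteria) else CONTEXT_ALPHA)
            for cid, criteria in clusters.items()}

# Toy usage: focus on voxels classified as both "bone" and "fracture".
clusters = {0: {"bone"}, 1: {"bone", "fracture"}, 2: {"soft_tissue"}}
alphas = cluster_opacities(clusters, lambda c: "bone" in c and "fracture" in c)
print(alphas)  # {0: 0.05, 1: 1.0, 2: 0.05}
```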

    Crepuscular Rays for Tumor Accessibility Planning


    A Neuroimaging Web Interface for Data Acquisition, Processing and Visualization of Multimodal Brain Images

    Structural and functional brain images are generated as essential modalities for medical experts to learn about the different functions of the brain. These images are typically visually inspected by experts. Many software packages are available to process medical images, but they are complex and difficult to use. The software packages are also hardware intensive. As a consequence, this dissertation proposes a novel Neuroimaging Web Services Interface (NWSI) as a series of processing pipelines for a common platform to store, process, visualize and share data. The NWSI system is made up of password-protected interconnected servers accessible through a web interface. The web interface driving the NWSI is based on Drupal, a popular open-source content management system. Drupal provides a user-based platform in which the core code for the security and design tools is updated and patched frequently. New features can be added via modules, while keeping the core software secure and intact. The web server architecture allows for the visualization of results and the downloading of tabulated data. Several forms are available to capture clinical data. The processing pipeline starts with a FreeSurfer (FS) reconstruction of T1-weighted MRI images. Subsequently, PET, DTI, and fMRI images can be uploaded. The web server captures uploaded images and performs essential functionalities, while processing occurs in supporting servers. The computational platform is responsive and scalable. The current pipeline for PET processing calculates all regional Standardized Uptake Value ratios (SUVRs). The FS and SUVR calculations have been validated using Alzheimer's Disease Neuroimaging Initiative (ADNI) results posted at the Laboratory of Neuro Imaging (LONI). The NWSI system provides access to a calibration process through the centiloid scale, consolidating Florbetapir and Florbetaben tracers in amyloid PET images. The interface also offers onsite access to machine learning algorithms, and introduces new heat maps that augment expert visual rating of PET images. NWSI has been piloted using data and expertise from Mount Sinai Medical Center, the 1Florida Alzheimer's Disease Research Center (ADRC), Baptist Health South Florida, Nicklaus Children's Hospital, and the University of Miami. All results were obtained using our processing servers in order to maintain data validity, consistency, and minimal processing bias.
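    The PET pipeline computes regional SUVRs, i.e. mean uptake in each labeled region divided by mean uptake in a reference region (often the cerebellum). A minimal sketch of that calculation is given below, assuming a PET volume co-registered with a FreeSurfer-style label volume; regional_suvr, the toy labels, and the choice of reference label are illustrative assumptions, not the NWSI pipeline's exact code.

```python
# Hypothetical sketch of a regional SUVR computation:
#   SUVR(region) = mean PET uptake in region / mean PET uptake in reference region.
import numpy as np

def regional_suvr(pet: np.ndarray, labels: np.ndarray, reference_label: int) -> dict:
    """pet and labels are co-registered 3D volumes; returns {region_label: SUVR}."""
    ref_mean = pet[labels == reference_label].mean()
    return {int(lab): float(pet[labels == lab].mean() / ref_mean)
            for lab in np.unique(labels) if lab != 0}  # 0 = background

# Toy usage: random PET volume with three regions, label 3 used as the reference.
pet = np.random.rand(16, 16, 16)
labels = np.random.randint(0, 4, size=(16, 16, 16))
print(regional_suvr(pet, labels, reference_label=3))
```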