35 research outputs found

    Benchmarking Omni-Vision Representation through the Lens of Visual Realms

    Full text link
    Though impressive performance has been achieved in specific visual realms (e.g. faces, dogs, and places), an omni-vision representation that generalizes to many natural visual domains is highly desirable. However, existing benchmarks are biased and inefficient for evaluating such a representation -- they either include only a few specific realms, or cover most realms at the expense of subsuming numerous datasets with extensive realm overlap. In this paper, we propose the Omni-Realm Benchmark (OmniBenchmark). It includes 21 realm-wise datasets with 7,372 concepts and 1,074,346 images. Without semantic overlap, these datasets cover most visual realms both comprehensively and efficiently. In addition, we propose a new supervised contrastive learning framework, Relational Contrastive learning (ReCo), for better omni-vision representations. Beyond pulling two instances from the same concept closer -- the typical supervised contrastive learning objective -- ReCo also pulls two instances from the same semantic realm closer, encoding the semantic relation between concepts and facilitating omni-vision representation learning. We benchmark ReCo and other advances in omni-vision representation learning that differ in architecture (from CNNs to transformers) and learning paradigm (from supervised to self-supervised learning) on OmniBenchmark. We illustrate the superiority of ReCo over other supervised contrastive learning methods and report multiple practical observations to facilitate future research. Comment: In ECCV 2022; project page at https://zhangyuanhan-ai.github.io/OmniBenchmark
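    As a rough illustration of the idea described above, the sketch below combines a standard supervised contrastive term over concept labels with a second, down-weighted term over realm labels. The weighting, temperature, and function names are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def sup_con_loss(features, labels, temperature=0.1):
        """Standard supervised contrastive loss: same-label pairs are positives."""
        features = F.normalize(features, dim=1)
        sim = features @ features.t() / temperature                  # pairwise similarities
        mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # positive-pair mask
        mask.fill_diagonal_(0)                                       # exclude self-pairs
        logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
        non_self = 1 - torch.eye(len(labels), device=labels.device)
        denom = (torch.exp(logits) * non_self).sum(dim=1, keepdim=True) + 1e-12
        log_prob = logits - torch.log(denom)
        return -(mask * log_prob).sum(dim=1).div(mask.sum(dim=1).clamp(min=1)).mean()

    def reco_style_loss(features, concept_labels, realm_labels, realm_weight=0.5):
        """Pull same-concept pairs closer (the usual SupCon objective) and, with a
        smaller weight, same-realm pairs, encoding relations between concepts."""
        return (sup_con_loss(features, concept_labels)
                + realm_weight * sup_con_loss(features, realm_labels))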

    On-Device Domain Generalization

    Full text link
    We present a systematic study of domain generalization (DG) for tiny neural networks. This problem is critical to on-device machine learning applications but has been overlooked in a literature that has focused almost exclusively on large models. Tiny neural networks have far fewer parameters and lower complexity, and therefore should not be trained the same way as their large counterparts for DG applications. Through extensive experiments, we find that knowledge distillation (KD), a well-known technique for model compression, is much better suited to the on-device DG problem than conventional DG methods. Another interesting observation is that the teacher-student gap on out-of-distribution data is larger than on in-distribution data, which highlights the capacity-mismatch issue as well as a shortcoming of KD. We further propose out-of-distribution knowledge distillation (OKD), whose idea is to teach the student how the teacher handles out-of-distribution data synthesized via disruptive data augmentation. Without adding any extra parameters to the model -- hence keeping the deployment cost unchanged -- OKD significantly improves DG performance for tiny neural networks in a variety of on-device DG scenarios for image and speech applications. We also contribute a scalable approach for synthesizing visual domain shifts, along with a new suite of DG datasets to complement existing testbeds. Comment: Preprint
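    A minimal sketch of this training recipe follows, assuming a frozen teacher, a user-supplied augmentation that synthesizes domain shifts, and illustrative temperature and weighting values; the paper's exact choices may differ.

    import torch
    import torch.nn.functional as F

    def okd_step(student, teacher, x, y, synthesize_shift, T=4.0, alpha=1.0):
        """One training step: cross-entropy on clean data plus distillation of the
        teacher's behaviour on disruptively augmented (pseudo-OOD) views."""
        ce = F.cross_entropy(student(x), y)                 # supervised term on clean data

        x_shift = synthesize_shift(x)                       # e.g. heavy colour/style corruption
        with torch.no_grad():
            t_logits = teacher(x_shift)                     # teacher stays frozen
        s_logits = student(x_shift)
        kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                      F.softmax(t_logits / T, dim=1),
                      reduction="batchmean") * (T * T)      # match teacher on shifted data
        return ce + alpha * kd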

    Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

    Full text link
    Preference modeling techniques such as direct preference optimization (DPO) have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that uses detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence when scoring video question-answering (QA) predictions. Our approach demonstrates robust alignment with the reward mechanism of the OpenAI GPT-4V model, which takes video frames directly as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
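    The sketch below shows the standard DPO objective together with a hypothetical helper that ranks candidate answers with a caption-conditioned judge; judge_score and the pairing heuristic are illustrative assumptions, not the paper's implementation.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Standard DPO objective on (chosen, rejected) response pairs."""
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    def build_preference_pair(caption, question, candidates, judge_score):
        """Rank candidate answers with a language-model judge that sees only the
        detailed caption as a proxy for the video; return (chosen, rejected)."""
        ranked = sorted(candidates, key=lambda a: judge_score(caption, question, a),
                        reverse=True)
        return ranked[0], ranked[-1]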

    MMBench: Is Your Multi-modal Model an All-around Player?

    Full text link
    Large vision-language models have recently achieved remarkable progress, exhibiting strong perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline comprising two elements. The first is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second is a novel CircularEval strategy that incorporates the use of ChatGPT to convert free-form predictions into pre-defined choices, thereby enabling a more robust evaluation of the model's predictions. MMBench is a systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench
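    The sketch below illustrates the circular-evaluation idea on a single question: the answer options are rotated, the model is queried once per rotation, and credit is given only if every rotation is answered correctly. ask_model and extract_choice (e.g., the ChatGPT-based mapping of free-form output to a choice letter) are placeholders, not MMBench's actual code.

    from collections import deque

    def circular_eval(ask_model, extract_choice, question, options, correct_idx):
        """Return True only if the model picks the correct option under every
        circular shift of the answer choices."""
        letters = "ABCDEFGH"[:len(options)]
        order = deque(range(len(options)))
        for _ in range(len(options)):
            shown = list(order)                             # current circular shift
            prompt = question + "\n" + "\n".join(
                f"{letters[i]}. {options[j]}" for i, j in enumerate(shown))
            raw = ask_model(prompt)
            picked = extract_choice(raw, letters)           # e.g. ChatGPT-based matching
            if picked is None or shown[letters.index(picked)] != correct_idx:
                return False
            order.rotate(1)
        return True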

    Mechanisms and Therapeutic Targets of Depression After Intracerebral Hemorrhage

    Get PDF
    The relationship between depression and intracerebral hemorrhage (ICH) is complicated. Post-ICH depression is one of the most common neuropsychiatric comorbidities of hemorrhagic stroke. Depression, as a neuropsychiatric symptom, also negatively impacts the outcome of ICH by increasing morbidity, disability, and mortality. However, the outcome of ICH can be improved by antidepressants such as the frequently used selective serotonin reuptake inhibitors. This review therefore presents the mechanisms of post-ICH depression, grouping them into inflammation, oxidative stress (OS), apoptosis, and autophagy, and explaining them through their associated signaling pathways. Inflammation is mainly related to Toll-like receptors (TLRs), the NF-kB-mediated signaling pathway, the PPAR-γ-dependent pathway, and other signaling pathways. OS is associated with nuclear factor erythroid 2-related factor 2 (Nrf2), the PI3K/Akt pathway, and the MAPK/p38 pathway. Moreover, autophagy is associated with the mTOR signaling cascade and the NF-kB-mediated signaling pathway, while apoptosis is correlated with the death receptor-mediated apoptosis pathway, the mitochondrial apoptosis pathway, caspase-independent pathways, and others. Furthermore, we find that neuroinflammation, oxidative stress, autophagy, and apoptosis interact with one another. This review may thus provide several potential therapeutic targets for patients who suffer from depression after ICH.

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Full text link
    Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) existing metrics do not fully align with human perception; 2) an ideal evaluation system should provide insights to inform the future development of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has three appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). The fine-grained evaluation metrics reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmark's alignment with human perception for each evaluation dimension. 3) Valuable Insights: We examine current models' abilities across the evaluation dimensions and various content types, and investigate the gaps between video and image generation models. We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and will also include more video generation models in VBench to drive the field of video generation forward. Comment: Equal contributions from first four authors. Project page: https://vchitect.github.io/VBench-project/ Code: https://github.com/Vchitect/VBench
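    As a loose illustration of how such a dimension-wise report might be assembled and checked against human annotations, the snippet below aggregates per-dimension scores and correlates one dimension's automatic scores with human preferences; the uniform weighting and the choice of Pearson correlation are assumptions, not VBench's actual configuration.

    import numpy as np

    def aggregate_report(per_dimension_scores, weights=None):
        """Combine per-dimension scores (each in [0, 1]) into a weighted total."""
        weights = weights or {k: 1.0 for k in per_dimension_scores}
        total = sum(per_dimension_scores[k] * weights[k] for k in per_dimension_scores)
        return total / sum(weights[k] for k in per_dimension_scores)

    def human_alignment(metric_scores, human_preferences):
        """Pearson correlation between automatic scores and human preference rates
        for a single evaluation dimension."""
        m = np.asarray(metric_scores, dtype=float)
        h = np.asarray(human_preferences, dtype=float)
        return float(np.corrcoef(m, h)[0, 1])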

    Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

    Full text link
    Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training them requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential for developing large-scale wildlife tracking solutions with markedly less human labor. In this work we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction-tune vision-language models to generate detailed visual descriptions of camera trap images using terminology similar to that of experts. We then match the generated caption to an external knowledge base of descriptions to determine the species in a zero-shot manner. We investigate techniques for building instruction-tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia. Comment: 18 pages, 9 figures
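    A hedged sketch of the matching step: given a caption produced by the instruction-tuned vision-language model, pick the species whose reference description is most similar. TF-IDF cosine similarity is used here purely as a stand-in for whatever matcher WildMatch actually employs.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def match_species(generated_caption, knowledge_base):
        """knowledge_base: dict mapping species name -> expert-style description."""
        names = list(knowledge_base)
        corpus = [knowledge_base[n] for n in names] + [generated_caption]
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
        sims = cosine_similarity(tfidf[len(names)], tfidf[:len(names)]).ravel()
        return names[int(sims.argmax())]                    # most similar description wins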

    Utility of clinical metagenomics in diagnosing malignancies in a cohort of patients with Epstein-Barr virus positivity

    Get PDF
    Background: Differentiating between benign and malignant diseases in EBV-positive patients poses a significant challenge due to the lack of efficient diagnostic tools. Metagenomic next-generation sequencing (mNGS) is commonly used to identify pathogens in patients with fever of unknown origin (FUO). Recent studies have extended the application of next-generation sequencing (NGS) to identifying tumors in body fluids and cerebrospinal fluid. In light of these findings, we conducted this study to develop and apply metagenomic methods and validate their role in identifying EBV-associated malignant disease. Methods: We enrolled 29 patients with positive EBV results from the FUO cohort of the Department of Infectious Diseases of Huashan Hospital, affiliated with Fudan University, from 2018 to 2019. Upon enrollment, these patients were grouped into benign diseases, CAEBV, and malignant diseases according to their final diagnosis, and CNV analysis was retrospectively performed in 2022 using samples from 2018 to 2019. Results: Among the 29 patients, 16 were diagnosed with benign diseases, 3 with CAEBV, and 10 with malignant diseases. Blood samples from all 29 patients were tested with mNGS. Among the 10 patients with a malignant diagnosis, CNV analysis suggested neoplasms in 9. Of the 19 patients with a benign or CAEBV diagnosis, 2 showed abnormal CNV results. The sensitivity and specificity of CNV analysis for the identification of tumors were 90% and 89.5%, respectively. Conclusions: The application of mNGS could assist in the identification of microbial infections and malignancies in EBV-related diseases. Our results demonstrate that CNV detection through mNGS is faster than conventional oncology tests. Moreover, the convenient collection of peripheral blood samples adds to the advantages of this approach.
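    The reported figures follow directly from the counts above; a quick arithmetic check (plain Python, variable names chosen here for illustration):

    true_positives, malignant_total = 9, 10          # CNV flagged 9 of 10 malignant cases
    true_negatives, benign_total = 19 - 2, 19        # 2 of 19 benign/CAEBV cases flagged

    sensitivity = true_positives / malignant_total   # 0.900
    specificity = true_negatives / benign_total      # 0.8947... -> ~89.5%
    print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")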

    Neural Prompt Search

    Full text link
    The size of vision models has grown exponentially over the last few years, especially after the emergence of the Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained while the vast majority, obtained from pre-training, are kept frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as "prompt modules" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. Through extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/Davidzhangyuanhan/NOAH. Comment: Code: https://github.com/Davidzhangyuanhan/NOAH
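    As a toy illustration of the search-space idea, the sketch below wraps one block's features with optional adapter, LoRA, and visual-prompt modules whose use is governed by a sampled configuration; the module dimensions and the random sampler are simplifications, not NOAH's actual supernet or search algorithm.

    import random
    import torch
    import torch.nn as nn

    class PromptModuleChoice(nn.Module):
        """Wraps one block's features with optional adapter, LoRA, and visual-prompt
        modules; which of them is active is decided by a sampled configuration."""
        def __init__(self, dim, adapter_dim=8, lora_rank=4, num_prompts=10):
            super().__init__()
            self.adapter = nn.Sequential(nn.Linear(dim, adapter_dim), nn.GELU(),
                                         nn.Linear(adapter_dim, dim))
            self.lora_a = nn.Linear(dim, lora_rank, bias=False)
            self.lora_b = nn.Linear(lora_rank, dim, bias=False)
            self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))

        def forward(self, x, cfg):                  # x: (batch, tokens, dim)
            if cfg.get("use_prompt"):
                x = torch.cat([self.prompts.unsqueeze(0).expand(x.size(0), -1, -1), x], dim=1)
            if cfg.get("use_lora"):
                x = x + self.lora_b(self.lora_a(x))
            if cfg.get("use_adapter"):
                x = x + self.adapter(x)
            return x

    def sample_config():
        """A random point in the (toy) search space; NOAH instead searches it per dataset."""
        return {k: random.random() < 0.5 for k in ("use_prompt", "use_lora", "use_adapter")}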