Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Though impressive performance has been achieved in specific visual realms
(e.g., faces, dogs, and places), an omni-vision representation that generalizes
to many natural visual domains is highly desirable. However, existing benchmarks
are biased and inefficient for evaluating such omni-vision representations:
they either include only a few specific realms, or cover most realms at the
expense of subsuming numerous datasets with extensive realm overlap. In this
paper, we propose the Omni-Realm Benchmark (OmniBenchmark). It includes 21
realm-wise datasets with 7,372 concepts and 1,074,346 images. Free of semantic
overlap, these datasets cover most visual realms both comprehensively and
efficiently. In addition, we propose a new
supervised contrastive learning framework, namely Relational Contrastive
learning (ReCo), for a better omni-vision representation. Beyond pulling two
instances from the same concept closer -- the typical supervised contrastive
learning framework -- ReCo also pulls two instances from the same semantic
realm closer, encoding the semantic relation between concepts, and facilitating
omni-vision representation learning. We benchmark ReCo and other advances in
omni-vision representation studies that are different in architectures (from
CNNs to transformers) and in learning paradigms (from supervised learning to
self-supervised learning) on OmniBenchmark. We illustrate the superiority of
ReCo over other supervised contrastive learning methods and reveal multiple
practical observations to facilitate future research.
Comment: In ECCV 2022; project page at https://zhangyuanhan-ai.github.io/OmniBenchmark
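A minimal sketch of the two-level contrastive idea described in the abstract, assuming a PyTorch-style batch loss; the exact loss form and the `realm_weight` coefficient for realm-level positives are illustrative choices, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def relational_supcon_loss(features, concept_ids, realm_ids,
                           temperature=0.1, realm_weight=0.5):
    """Sketch of a two-level supervised contrastive loss.

    features:    (N, D) L2-normalized embeddings
    concept_ids: (N,) fine-grained class labels
    realm_ids:   (N,) coarse semantic-realm labels
    """
    n = features.size(0)
    logits = features @ features.t() / temperature          # (N, N) similarities
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)

    # Positive sets: same concept (strong) and same realm only (weaker).
    same_concept = concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)
    same_realm = realm_ids.unsqueeze(0) == realm_ids.unsqueeze(1)
    pos_weight = same_concept.float() + realm_weight * (same_realm & ~same_concept).float()
    pos_weight = pos_weight.masked_fill(self_mask, 0.0)

    # Log-softmax over all other instances in the batch (self excluded).
    logits = logits.masked_fill(self_mask, float('-inf'))
    log_prob = F.log_softmax(logits, dim=1).masked_fill(self_mask, 0.0)

    # Weighted average of positive log-probabilities per anchor.
    loss = -(pos_weight * log_prob).sum(1) / pos_weight.sum(1).clamp(min=1e-6)
    return loss.mean()
```

Instances sharing a concept are treated as strong positives and instances sharing only a realm as weaker ones, which is one plausible way to encode the concept-realm relation the abstract describes.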
On-Device Domain Generalization
We present a systematic study of domain generalization (DG) for tiny neural
networks. This problem is critical to on-device machine learning applications
but has been overlooked in the literature, where research has focused mainly
on large models. Tiny neural networks have far fewer parameters and
lower complexity and therefore should not be trained the same way as their
large counterparts for DG applications. By conducting extensive experiments, we
find that knowledge distillation (KD), a well-known technique for model
compression, is much better for tackling the on-device DG problem than
conventional DG methods. Another interesting observation is that the
teacher-student gap on out-of-distribution data is bigger than that on
in-distribution data, which highlights the capacity mismatch issue as well as
the shortcoming of KD. We further propose a method called out-of-distribution
knowledge distillation (OKD) where the idea is to teach the student how the
teacher handles out-of-distribution data synthesized via disruptive data
augmentation. Without adding any extra parameter to the model -- hence keeping
the deployment cost unchanged -- OKD significantly improves DG performance for
tiny neural networks in a variety of on-device DG scenarios for image and
speech applications. We also contribute a scalable approach for synthesizing
visual domain shifts, along with a new suite of DG datasets to complement
existing testbeds.
Comment: Preprint
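A rough sketch of the distillation-on-synthesized-OOD-data idea, assuming a standard KD setup; the `augment` callable stands in for the paper's disruptive data augmentation, and the temperature and weighting are placeholder hyperparameters:

```python
import torch
import torch.nn.functional as F

def okd_step(student, teacher, images, labels, augment, T=4.0, alpha=0.5):
    """One training step sketching OOD knowledge distillation (OKD).

    `augment` stands in for the paper's disruptive data augmentation:
    it synthesizes out-of-distribution views of the clean batch.
    """
    ood_images = augment(images)

    with torch.no_grad():                       # teacher is frozen
        t_logits = teacher(ood_images)

    s_logits_clean = student(images)
    s_logits_ood = student(ood_images)

    # Supervised loss on clean data + match the teacher's behaviour on OOD data.
    ce = F.cross_entropy(s_logits_clean, labels)
    kd = F.kl_div(F.log_softmax(s_logits_ood / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return (1 - alpha) * ce + alpha * kd
```

Because only the training loss changes, no parameters are added to the student, which is consistent with the abstract's claim of unchanged deployment cost.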
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Preference modeling techniques, such as direct preference optimization (DPO),
have proven effective in enhancing the generalization abilities of large
language models (LLMs). However, in tasks involving video instruction-following,
providing informative feedback, especially for detecting hallucinations in
generated responses, remains a significant challenge. Previous studies have
explored using large multimodal models (LMMs) as reward models to guide preference
modeling, but their ability to accurately assess the factuality of generated
responses compared to corresponding videos has not been conclusively
established. This paper introduces a novel framework that utilizes detailed
video captions as a proxy for video content, enabling language models to
incorporate this information as supporting evidence for scoring video Question
Answering (QA) predictions. Our approach demonstrates robust alignment with the
reward mechanism of OpenAI's GPT-4V model, which directly takes video frames as
input. Furthermore, we show that applying this tailored reward through DPO
significantly improves the performance of video LMMs on video QA tasks.
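A brief sketch of the caption-as-proxy reward plus DPO training described above; the judge prompt wording is invented for illustration, while the loss is the standard DPO objective applied to preference pairs derived from those reward scores:

```python
import torch
import torch.nn.functional as F

def build_judge_prompt(caption, question, answer):
    """Illustrative prompt: the detailed caption stands in for the video
    so a text-only reward model can judge the factuality of the answer."""
    return (
        "Video description (proxy for the actual video):\n"
        f"{caption}\n\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n"
        "Rate the factual consistency of the answer with the description (1-5):"
    )

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over (chosen, rejected) response pairs,
    where the preference labels come from the caption-based reward."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()
```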
MMBench: Is Your Multi-modal Model an All-around Player?
Large vision-language models have recently achieved remarkable progress,
exhibiting great perception and reasoning abilities concerning visual
information. However, how to effectively evaluate these large vision-language
models remains a major obstacle, hindering future model development.
Traditional benchmarks like VQAv2 or COCO Caption provide quantitative
performance measurements but suffer from a lack of fine-grained ability
assessment and non-robust evaluation metrics. Recent subjective benchmarks,
such as OwlEval, offer comprehensive evaluations of a model's abilities by
incorporating human labor, but they are not scalable and display significant
bias. In response to these challenges, we propose MMBench, a novel
multi-modality benchmark. MMBench methodically develops a comprehensive
evaluation pipeline comprising two main elements. The first element is
a meticulously curated dataset that surpasses existing similar benchmarks in
terms of the number and variety of evaluation questions and abilities. The
second element introduces a novel CircularEval strategy and incorporates the
use of ChatGPT. This implementation is designed to convert free-form
predictions into pre-defined choices, thereby facilitating a more robust
evaluation of the model's predictions. MMBench is a systematically-designed
objective benchmark for robustly evaluating the various abilities of
vision-language models. We hope MMBench will assist the research community in
better evaluating their models and encourage future advancements in this
domain. Project page: https://opencompass.org.cn/mmbench
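A small sketch of the CircularEval idea, assuming a hypothetical `ask_model` callable that already maps free-form output to one of the options (MMBench uses ChatGPT for this mapping step):

```python
from typing import Callable, List

def circular_eval(ask_model: Callable[[str, List[str]], str],
                  question: str, choices: List[str], answer: str) -> bool:
    """Sketch of CircularEval: a question counts as solved only if the
    model picks the correct option under every circular shift of the
    choice list. `ask_model` returns the chosen option text."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        prediction = ask_model(question, rotated)
        if prediction != answer:
            return False
    return True
```

Requiring a correct answer under every rotation makes the evaluation robust to positional bias in the model's choice selection.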
Mechanisms and Therapeutic Targets of Depression After Intracerebral Hemorrhage
The relationship between depression and intracerebral hemorrhage (ICH) is complicated. Post-ICH depression is one of the most common neuropsychiatric comorbidities of hemorrhagic stroke. As a neuropsychiatric symptom, depression also negatively impacts the outcome of ICH by increasing morbidity, disability, and mortality. However, the ICH outcome can be improved by antidepressants such as the frequently used selective serotonin reuptake inhibitors. This review therefore presents the mechanisms of post-ICH depression; we grouped the mechanisms according to inflammation, oxidative stress (OS), apoptosis, and autophagy, and explained them through their associated signaling pathways. Inflammation is mainly related to Toll-like receptors (TLRs), the NF-kB-mediated signaling pathway, the PPAR-γ-dependent pathway, as well as other signaling pathways. OS is associated with nuclear factor erythroid-2 related factor 2 (Nrf2), the PI3K/Akt pathway, and the MAPK/P38 pathway. Moreover, autophagy is associated with the mTOR signaling cascade and the NF-kB-mediated signaling pathway, while apoptosis is correlated with the death receptor-mediated apoptosis pathway, the mitochondrial apoptosis pathway, caspase-independent pathways, and others. Furthermore, we found that neuroinflammation, oxidative stress, autophagy, and apoptosis interact with one another. Additionally, this review may provide several potential therapeutic targets for patients who suffer from depression after ICH.
VBench: Comprehensive Benchmark Suite for Video Generative Models
Video generation has witnessed significant advancements, yet evaluating these
models remains a challenge. A comprehensive evaluation benchmark for video
generation is indispensable for two reasons: 1) Existing metrics do not fully
align with human perceptions; 2) An ideal evaluation system should provide
insights to inform future developments of video generation. To this end, we
present VBench, a comprehensive benchmark suite that dissects "video generation
quality" into specific, hierarchical, and disentangled dimensions, each with
tailored prompts and evaluation methods. VBench has three appealing properties:
1) Comprehensive Dimensions: VBench comprises 16 dimensions of video generation
(e.g., subject identity inconsistency, motion smoothness, temporal flickering,
and spatial relationship). Its fine-grained evaluation metrics reveal individual
models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset
of human preference annotations to validate our benchmark's alignment with human
perception for each evaluation dimension. 3) Valuable Insights: We examine
current models' abilities across various evaluation dimensions and content
types. We also investigate the gaps
between video and image generation models. We will open-source VBench,
including all prompts, evaluation methods, generated videos, and human
preference annotations, and also include more video generation models in VBench
to drive forward the field of video generation.
Comment: Equal contributions from the first four authors. Project page: https://vchitect.github.io/VBench-project/ Code: https://github.com/Vchitect/VBench
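A minimal sketch of dimension-wise evaluation as described above; the suite structure, the metric interface, and aggregation by simple averaging are assumptions for illustration:

```python
from typing import Callable, Dict, List

def evaluate_model(generate: Callable[[str], "Video"],
                   suite: Dict[str, Dict]) -> Dict[str, float]:
    """Sketch of a dimension-wise evaluation loop: each dimension carries
    its own tailored prompts and metric, and scores are reported per
    dimension rather than as a single number."""
    report = {}
    for dimension, spec in suite.items():
        prompts: List[str] = spec["prompts"]
        metric: Callable = spec["metric"]          # tailored evaluation method
        scores = [metric(generate(p), p) for p in prompts]
        report[dimension] = sum(scores) / len(scores)
    return report
```

Reporting a per-dimension breakdown, rather than one aggregate score, is what lets such a suite expose individual models' strengths and weaknesses.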
Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images
Due to deteriorating environmental conditions and increasing human activity,
conservation efforts directed towards wildlife are crucial. Motion-activated
camera traps constitute an efficient tool for tracking and monitoring wildlife
populations across the globe. Supervised learning techniques have been
successfully deployed to analyze such imagery; however, training such models
requires annotations from experts. Reducing the reliance on costly labelled
data therefore has immense potential in developing large-scale wildlife
tracking solutions with markedly less human labor. In this work we propose
WildMatch, a novel zero-shot species classification framework that leverages
multimodal foundation models. In particular, we instruction tune
vision-language models to generate detailed visual descriptions of camera trap
images using similar terminology to experts. Then, we match the generated
caption to an external knowledge base of descriptions in order to determine the
species in a zero-shot manner. We investigate techniques to build instruction
tuning datasets for detailed animal description generation and propose a novel
knowledge augmentation technique to enhance caption quality. We demonstrate the
performance of WildMatch on a new camera trap dataset collected in the
Magdalena Medio region of Colombia.
Comment: 18 pages, 9 figures
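A simplified sketch of the describe-then-match pipeline; the caption model, the embedding function, and matching by cosine similarity are illustrative stand-ins for the components named in the abstract:

```python
import numpy as np

def zero_shot_species(image, caption_model, embed, knowledge_base):
    """Sketch of describe-then-match: generate an expert-style caption for
    the camera-trap image, then pick the species whose reference
    description is most similar.

    knowledge_base: dict mapping species name -> reference description
    """
    caption = caption_model(image)                      # detailed visual description
    query = embed(caption)
    best_species, best_score = None, -np.inf
    for species, description in knowledge_base.items():
        ref = embed(description)
        score = np.dot(query, ref) / (np.linalg.norm(query) * np.linalg.norm(ref) + 1e-8)
        if score > best_score:
            best_species, best_score = species, score
    return best_species
```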
Utility of clinical metagenomics in diagnosing malignancies in a cohort of patients with Epstein-Barr virus positivity
Background: Differentiation between benign and malignant diseases in EBV-positive patients poses a significant challenge due to the lack of efficient diagnostic tools. Metagenomic next-generation sequencing (mNGS) is commonly used to identify pathogens in patients with fever of unknown origin (FUO). Recent studies have extended the application of next-generation sequencing (NGS) to identifying tumors in body fluids and cerebrospinal fluid. In light of this, we conducted this study to develop and apply metagenomic methods and validate their role in identifying EBV-associated malignant disease.
Methods: We enrolled 29 patients with positive EBV results from the FUO cohort of the Department of Infectious Diseases of Huashan Hospital, affiliated with Fudan University, from 2018 to 2019. Upon enrollment, these patients were grouped into benign diseases, chronic active EBV infection (CAEBV), and malignant diseases according to their final diagnosis, and copy number variation (CNV) analysis was retrospectively performed in 2022 using samples from 2018 to 2019.
Results: Among the 29 patients, 16 were diagnosed with benign diseases, 3 with CAEBV, and 10 with malignant diseases. Blood samples from all 29 patients were tested with mNGS. Among the 10 patients with a malignant diagnosis, CNV analysis suggested neoplasms in 9. Of the 19 patients with a benign or CAEBV diagnosis, 2 showed abnormal CNV results. The sensitivity and specificity of CNV analysis for the identification of tumors were 90% and 89.5%, respectively.
Conclusions: The application of mNGS could assist in the identification of microbial infections and malignancies in EBV-related diseases. Our results demonstrate that CNV detection through mNGS is faster than conventional oncology tests. Moreover, the convenient collection of peripheral blood samples adds to the advantages of this approach.
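A small worked check of the reported sensitivity and specificity, using only the counts stated in the abstract (9 of 10 malignant cases flagged by CNV analysis; 2 of 19 benign/CAEBV cases flagged):

```python
# Counts reported in the abstract: 9 of 10 malignant cases were flagged by
# CNV analysis, and 2 of 19 benign/CAEBV cases were (falsely) flagged.
true_positives, malignant_total = 9, 10
false_positives, non_malignant_total = 2, 19

sensitivity = true_positives / malignant_total                                # 9/10
specificity = (non_malignant_total - false_positives) / non_malignant_total  # 17/19

print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
# sensitivity = 90.0%, specificity = 89.5%
```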
Neural Prompt Search
The size of vision models has grown exponentially over the last few years,
especially after the emergence of Vision Transformer. This has motivated the
development of parameter-efficient tuning methods, such as learning adapter
layers or visual prompt tokens, which allow a tiny portion of model parameters
to be trained while the vast majority, obtained from pre-training, are kept frozen.
However, designing a proper tuning method is non-trivial: one might need to try
out a lengthy list of design choices, not to mention that each downstream
dataset often requires custom designs. In this paper, we view the existing
parameter-efficient tuning methods as "prompt modules" and propose Neural
prOmpt seArcH (NOAH), a novel approach that learns, for large vision models,
the optimal design of prompt modules through a neural architecture search
algorithm, specifically for each downstream dataset. By conducting extensive
experiments on over 20 vision datasets, we demonstrate that NOAH (i) is
superior to individual prompt modules, (ii) has a good few-shot learning
ability, and (iii) is domain-generalizable. The code and models are available
at https://github.com/Davidzhangyuanhan/NOAH.
Comment: Code: https://github.com/Davidzhangyuanhan/NOAH
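A simplified stand-in for the search over prompt modules; NOAH itself trains a supernet and uses a neural architecture search algorithm, so the random search, the candidate module list, and the dimension choices below are illustrative only:

```python
import random
from typing import Callable, Dict, List

MODULE_CHOICES = ["adapter", "lora", "vpt", "none"]   # candidate prompt modules (illustrative)
DIM_CHOICES = [1, 5, 10, 50]                          # candidate bottleneck / token dims

def search_prompt_design(num_blocks: int,
                         evaluate: Callable[[List[Dict]], float],
                         trials: int = 50,
                         seed: int = 0) -> List[Dict]:
    """Random-search stand-in for searching prompt-module designs:
    sample a per-block configuration, score it on the downstream
    validation set, and keep the best configuration found."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = [{"module": rng.choice(MODULE_CHOICES),
                "dim": rng.choice(DIM_CHOICES)} for _ in range(num_blocks)]
        score = evaluate(cfg)            # e.g. few-shot validation accuracy
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```

The key point, per the abstract, is that the prompt-module design is searched per downstream dataset rather than hand-picked once for all tasks.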