85 research outputs found
Analyzing the influence of oblique incidence on quantitative backscattering tissue polarimetry: a pilot ex vivo study
Significance
Among the available polarimetric techniques, backscattering Mueller matrix (MM) polarimetry provides a promising non-contact, quantitative tool for in vivo tissue detection and clinical diagnosis. To eliminate surface reflection from the sample cost-effectively, the non-collinear backscattering MM imaging setup always uses oblique incidence. Moreover, when organ cavities are imaged with polarimetric gastrointestinal endoscopy, uneven tissue surfaces inevitably induce a range of relative oblique incidence angles, which affect the polarimetry in complicated ways and therefore warrant detailed study.
Aim
The purpose of this study is to systematically analyze the influence of oblique incidence on backscattering tissue polarimetry.
Approach
We measured the MMs of an experimental phantom and ex vivo tissues at different incident angles and adopted a Monte Carlo simulation program based on a cylindrical scattering model for further verification and analysis. The results were quantitatively evaluated using the Fourier transform, basic statistics, and frequency distribution histograms.
Results
Oblique incidence can induce different changes in the non-periodic, two-periodic, and four-periodic MM elements, leading to false-positive and false-negative polarization information in tissue polarimetry. Moreover, a larger oblique incidence produces more pronounced signal variations, such as changes in phase retardance and element transposition.
Conclusions
The findings presented in this study provide crucial criteria for selecting appropriate incident angles in in vivo polarimetric endoscopy and other applications, and they can also serve as valuable references for studying how to further minimize this influence.
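To make the Fourier-based evaluation concrete, the sketch below shows how the non-periodic (DC), two-period, and four-period components of a single MM element sampled over a full azimuth rotation could be extracted with NumPy. The function name, the sampled element `m22`, and the synthetic signal are illustrative assumptions, not the authors' analysis code.

```python
import numpy as np

def mm_element_fourier_components(element_values):
    """Split one Mueller matrix element, sampled at evenly spaced azimuth
    angles over 0..2*pi, into its non-periodic (DC), two-period, and
    four-period Fourier amplitudes (a sketch, not the authors' analysis)."""
    m = np.asarray(element_values, dtype=float)
    n = m.size
    spectrum = np.fft.rfft(m)
    dc = spectrum[0].real / n                    # non-periodic component
    two_period = 2.0 * np.abs(spectrum[2]) / n   # amplitude of the cos/sin(2*theta) term
    four_period = 2.0 * np.abs(spectrum[4]) / n  # amplitude of the cos/sin(4*theta) term
    return dc, two_period, four_period

# Toy usage: a synthetic element with known two- and four-period modulations.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
m22 = 0.6 + 0.1 * np.cos(2 * theta) + 0.02 * np.cos(4 * theta)
print(mm_element_fourier_components(m22))  # ~ (0.6, 0.1, 0.02)
```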
SEPT: Towards Scalable and Efficient Visual Pre-Training
Recently, the self-supervised pre-training paradigm has shown great potential
in leveraging large-scale unlabeled data to improve downstream task
performance. However, increasing the scale of unlabeled pre-training data in
real-world scenarios incurs prohibitive computational costs and faces the
challenge of uncurated samples. To address these issues, we build a
task-specific self-supervised pre-training framework from a data-selection
perspective, based on the simple hypothesis that pre-training on unlabeled
samples whose distribution is similar to that of the target task can bring
substantial performance gains. Guided by this hypothesis, we propose a novel
framework for Scalable and Efficient visual Pre-Training (SEPT) that
introduces a retrieval pipeline for data selection. SEPT first leverages a
self-supervised pre-trained model to extract the features of the entire
unlabeled dataset to initialize the retrieval pipeline. Then, for a specific
target task, SEPT retrieves the most similar samples from the unlabeled
dataset, based on feature similarity to each target instance, for
pre-training. Finally, SEPT pre-trains the target model on the selected
unlabeled samples in a self-supervised manner before fine-tuning on the
target data. By decoupling the pre-training scale from the upstream data
available for a target task, SEPT achieves high scalability of the upstream
dataset and high efficiency of pre-training, resulting in high model
architecture flexibility. Results on various downstream tasks demonstrate
that SEPT can achieve competitive or even better performance compared with
ImageNet pre-training while reducing the number of training samples by one
order of magnitude without resorting to any extra annotations.
Comment: Accepted by AAAI 202
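The retrieval step SEPT describes, picking for each target instance the unlabeled samples most similar in feature space, can be sketched with plain cosine-similarity search. The function and array names below are hypothetical, and the released framework may use an approximate-nearest-neighbour index rather than the brute-force matrix product shown here.

```python
import numpy as np

def select_pretraining_subset(unlabeled_feats, target_feats, k=50):
    """Pick unlabeled samples whose features are most similar to the target
    task, in the spirit of SEPT's retrieval pipeline (a sketch, not the
    authors' exact implementation).

    unlabeled_feats : (N, d) features of the unlabeled pool, extracted once
                      by a self-supervised pre-trained encoder.
    target_feats    : (M, d) features of the target-task instances.
    k               : neighbours retrieved per target instance.
    """
    # L2-normalise so the dot product equals cosine similarity.
    u = unlabeled_feats / np.linalg.norm(unlabeled_feats, axis=1, keepdims=True)
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = t @ u.T                           # (M, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]  # k nearest pool indices per target instance
    return np.unique(topk)                   # de-duplicated indices for pre-training

# Toy usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
subset = select_pretraining_subset(rng.normal(size=(10000, 128)),
                                   rng.normal(size=(200, 128)), k=20)
print(subset.shape)
```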
Parrot Captions Teach CLIP to Spot Text
Despite CLIP being the foundation model in numerous vision-language
applications, it suffers from a severe text spotting bias. This bias causes
CLIP models to 'Parrot' the visual text embedded within images while
disregarding the authentic visual semantics. We uncover that in the most
popular image-text dataset, LAION-2B, the captions also densely parrot
(spell out) the text embedded in images. Our analysis shows that around 50%
of the images contain embedded visual text content, and around 30% of
caption words appear in this embedded visual content. Based on this
observation, we thoroughly inspect the different released versions of CLIP
models and verify that visual text is the dominant factor in measuring the
LAION-style image-text similarity for these models. To examine whether these
parrot captions shape the text spotting bias, we train a series of CLIP
models with LAION subsets curated by different parrot-caption-oriented
criteria. We show that training with parrot captions easily shapes such bias
but harms the expected vision-language representation learning in CLIP
models. This suggests that it is urgent to revisit either the design of
CLIP-like models or the existing image-text dataset curation pipeline built
on CLIP score filtering.
Comment: project page: https://linyq17.github.io/CLIP-Parrot-Bias/. Added more analysis and ablation studies. Updated Figure 3 with a more precise metric.
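One way to operationalize the "30% of caption words appear in the embedded visual text" statistic is a simple per-caption overlap score between caption tokens and OCR-detected words, as sketched below. The matching rules (lower-casing, punctuation stripping) are assumptions; the paper's parrot-caption criteria may differ.

```python
def caption_parrot_fraction(caption, ocr_words):
    """Fraction of caption words that also appear in the text spotted inside
    the image (a sketch of one possible parrot-caption criterion; the paper's
    exact matching rules may differ).

    caption   : the image's caption string.
    ocr_words : iterable of words detected in the image by an OCR model.
    """
    ocr_set = {w.lower() for w in ocr_words}
    words = [w.lower().strip(".,!?\"'") for w in caption.split()]
    if not words:
        return 0.0
    spotted = sum(1 for w in words if w in ocr_set)
    return spotted / len(words)

# Toy usage: a caption that largely repeats the text rendered in the image.
print(caption_parrot_fraction("Best Coffee in Town poster",
                              ["BEST", "COFFEE", "IN", "TOWN"]))  # 0.8
```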
MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models
With the advancement of deep learning technologies, general-purpose large
models such as GPT-4 have demonstrated exceptional capabilities across various
domains. Nevertheless, there remains a demand for high-quality, domain-specific
outputs in areas like healthcare, law, and finance. This paper first evaluates
the existing large models for specialized domains and discusses their
limitations. To cater to the specific needs of certain domains, we introduce
the ``MiChao-HuaFen 1.0'' pre-trained corpus dataset, tailored for the news and
governmental sectors. The dataset, sourced from publicly available internet
data from 2022, underwent multiple rounds of cleansing and processing to ensure
high quality and reliable origins, with provisions for consistent and stable
updates. This dataset not only supports the pre-training of large models for
Chinese vertical domains but also aids in propelling deep learning research and
applications in related fields.
Comment: 4 pages, 2 figures
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
In the realm of large multi-modal models (LMMs), efficient modality alignment
is crucial yet often constrained by the scarcity of high-quality image-text
data. To address this bottleneck, we introduce the ShareGPT4V dataset, a
pioneering large-scale resource featuring 1.2 million highly descriptive
captions, which surpasses existing datasets in diversity and information
content, covering world knowledge, object properties, spatial relationships,
and aesthetic evaluations. Specifically, ShareGPT4V originates from 100K
high-quality captions curated from advanced GPT4-Vision and has been
expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V
first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT)
phase, by substituting an equivalent quantity of detailed captions in existing
SFT datasets with a subset of our high-quality captions, significantly
enhancing LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME
and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and
2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training
and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple
architecture that has remarkable performance across a majority of the
multi-modal benchmarks. This project is available at
https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the
LMM community.
Comment: Project: https://ShareGPT4V.github.io
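The SFT substitution described above, swapping an equal number of existing caption entries for the detailed ShareGPT4V captions so the dataset size stays fixed, can be sketched as follows. The record fields (`image_id`, `caption`) and the budget-based loop are hypothetical conventions, not the released data pipeline.

```python
def substitute_detailed_captions(sft_records, detailed_captions, budget):
    """Swap an equal number of existing SFT caption entries for high-quality
    detailed captions, as ShareGPT4V does for the SFT phase (a sketch with
    hypothetical record fields, not the released data pipeline).

    sft_records       : list of dicts with 'image_id' and 'caption' keys.
    detailed_captions : dict mapping image_id -> detailed caption.
    budget            : number of records to substitute, keeping the SFT
                        dataset size unchanged.
    """
    swapped = 0
    for record in sft_records:
        if swapped >= budget:
            break
        new_caption = detailed_captions.get(record["image_id"])
        if new_caption is not None:
            record["caption"] = new_caption
            swapped += 1
    return sft_records, swapped

# Toy usage: one of two records gets a detailed caption.
records = [{"image_id": "000123", "caption": "a dog"},
           {"image_id": "000456", "caption": "a cat"}]
detailed = {"000456": "A grey tabby cat curled up on a red cushion near a window."}
print(substitute_detailed_captions(records, detailed, budget=1)[1])  # 1 record swapped
```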
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Despite the great advance of Multimodal Large Language Models (MLLMs) in both
instruction dataset building and benchmarking, the separation of training
and evaluation makes it hard for current MLLMs to further improve their
capability under the guidance of evaluation results at a relatively low
human cost. In this paper, we propose MLLM-DataEngine, a novel closed-loop
system that bridges data generation, model training, and evaluation. Within
each loop iteration, MLLM-DataEngine first analyzes the weaknesses of the
model based on the evaluation results, then generates a targeted incremental
dataset for the next training iteration, enhancing the model capability
iteratively. Compared with previous
data collection methods which are separate from the benchmarking, the data
generated by MLLM-DataEngine shows better targeting, quality, and correctness.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts
the ratio of different types of data within each incremental dataset based on
the benchmarking results. For quality, we resort to GPT-4 to generate
high-quality data for each given data type. For correctness, prompt design
is critical to the quality of the generated data. Rather than relying on
hand-crafted prompts, we propose an Interactive Prompt Optimization
strategy, which optimizes the prompt through multi-round interaction between
humans and GPT and greatly improves the correctness of the generated data.
Through extensive experiments, we find that MLLM-DataEngine can boost MLLM
capability in a targeted and automatic manner with minimal human
participation. We hope it can serve as a general solution for building
future MLLMs. The MLLM-DataEngine has been
open-sourced and is now available at
https://github.com/opendatalab/MLLM-DataEngine.
Comment: Code and models are available at https://github.com/opendatalab/MLLM-DataEngine
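The Adaptive Bad-case Sampling idea, allocating more of the next incremental dataset to the data types on which the benchmark shows the model doing worst, can be sketched as a simple error-rate-proportional ratio scheme. The floor parameter and the exact weighting below are assumptions, not the open-sourced module's behaviour.

```python
def adaptive_badcase_ratios(error_rates, floor=0.05):
    """Turn per-type benchmark error rates into sampling ratios for the next
    incremental dataset, in the spirit of Adaptive Bad-case Sampling (a
    sketch; the released module may weight types differently).

    error_rates : dict mapping question/data type -> error rate in [0, 1].
    floor       : minimum share kept for every type so none disappears.
    """
    # Give each type at least `floor`, then distribute the rest in
    # proportion to how badly the model did on that type.
    n = len(error_rates)
    total_error = sum(error_rates.values()) or 1.0
    remaining = 1.0 - floor * n
    return {t: floor + remaining * (e / total_error)
            for t, e in error_rates.items()}

# Toy usage: the model struggles most with counting questions.
print(adaptive_badcase_ratios({"counting": 0.6, "OCR": 0.3, "color": 0.1}))
```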
ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training
We propose ProtLLM, a versatile cross-modal large language model (LLM) for
both protein-centric and protein-language tasks. ProtLLM features a unique
dynamic protein mounting mechanism, enabling it to handle complex inputs where
the natural language text is interspersed with an arbitrary number of proteins.
In addition, we propose the protein-as-word language modeling approach to train
ProtLLM. By developing a specialized protein vocabulary, we equip the model
with the capability to predict not just natural language but also proteins from
a vast pool of candidates. Additionally, we construct a large-scale interleaved
protein-text dataset, named InterPT, for pre-training. This dataset
comprehensively encompasses both (1) structured data sources like protein
annotations and (2) unstructured data sources like biological research papers,
thereby endowing ProtLLM with crucial knowledge for understanding proteins. We
evaluate ProtLLM on classic supervised protein-centric tasks and explore its
novel protein-language applications. Experimental results demonstrate that
ProtLLM not only achieves superior performance against protein-specialized
baselines on protein-centric tasks but also induces zero-shot and in-context
learning capabilities on protein-language tasks.
Comment: https://protllm.github.io/project
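The protein-as-word idea, giving every protein its own entry in the output vocabulary so the model can predict either a text token or a protein, can be sketched as follows. The vocabulary layout, the `<protein:...>` token format, and the example UniProt IDs are illustrative assumptions rather than ProtLLM's actual tokenizer.

```python
def build_joint_vocab(word_tokens, protein_ids):
    """Build a single vocabulary in which each protein is one 'word', so the
    LM head can predict either text tokens or proteins (a sketch of the
    protein-as-word idea, not ProtLLM's actual tokenizer)."""
    vocab = {tok: i for i, tok in enumerate(word_tokens)}
    offset = len(vocab)
    # Proteins get dedicated ids appended after the text vocabulary.
    vocab.update({f"<protein:{pid}>": offset + i
                  for i, pid in enumerate(protein_ids)})
    return vocab

def encode_interleaved(segments, vocab):
    """Encode a sequence that interleaves text tokens and protein references.
    `segments` is a list like ["binds", "to", ("protein", "P69905")]."""
    ids = []
    for seg in segments:
        if isinstance(seg, tuple):   # a protein mention
            ids.append(vocab[f"<protein:{seg[1]}>"])
        else:                        # an ordinary text token
            ids.append(vocab[seg])
    return ids

# Toy usage with two text tokens and two example protein ids.
vocab = build_joint_vocab(["binds", "to"], ["P69905", "P68871"])
print(encode_interleaved(["binds", "to", ("protein", "P69905")], vocab))  # [0, 1, 2]
```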
- …