19 research outputs found
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis
Existing automated dubbing methods are usually designed for Professionally
Generated Content (PGC) production, which requires massive training data and
training time to learn a person-specific audio-video mapping. In this paper, we
investigate an audio-driven dubbing method that is more feasible for User
Generated Content (UGC) production. There are two unique challenges in designing a
method for UGC: 1) the appearances of speakers are diverse and arbitrary, as the
method needs to generalize across users; 2) the available video data of one
speaker are very limited. In order to tackle the above challenges, we first
introduce a new Style Translation Network to integrate the speaking style of
the target and the speaking content of the source via a cross-modal AdaIN
module. It enables our model to quickly adapt to a new speaker. Then, we
further develop a semi-parametric video renderer, which takes full advantage of
the limited training data of the unseen speaker via a video-level
retrieve-warp-refine pipeline. Finally, we propose a temporal regularization
for the semi-parametric renderer, generating more continuous videos. Extensive
experiments show that our method generates videos that accurately preserve
various speaking styles, with a considerably lower amount of training data and
training time than existing methods. Moreover, our method achieves faster
inference than most recent methods.
Comment: TCSVT 202
Multi-modal Queried Object Detection in the Wild
We introduce MQ-Det, an efficient architecture and pre-training strategy
designed to utilize both textual descriptions, with their open-set
generalization, and visual exemplars, with their rich description granularity,
as category queries, namely Multi-modal Queried object Detection, for
real-world detection with both open-vocabulary categories and various
granularities. MQ-Det incorporates vision
queries into existing well-established language-queried-only detectors. A
plug-and-play gated class-scalable perceiver module upon the frozen detector is
proposed to augment category text with class-wise visual information. To
address the learning inertia problem brought by the frozen detector, a vision
conditioned masked language prediction strategy is proposed. MQ-Det's simple
yet effective architecture and training strategy design is compatible with most
language-queried object detectors, thus yielding versatile applications.
Experimental results demonstrate that multi-modal queries largely boost
open-world detection. For instance, MQ-Det significantly improves the
state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS
benchmark and +6.3% AP on average across 13 few-shot downstream tasks, with
merely 3% of the pre-training time required by GLIP. Code is available at
https://github.com/YifanXu74/MQ-Det.
Comment: Under review
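The gated augmentation of a category's text query with visual exemplars can be sketched as a gated cross-attention step. The single-head attention, shapes, and the gate parameterization below are assumptions for illustration, not MQ-Det's actual module:

```python
import numpy as np

def gated_query_fusion(text_feat, vision_feats, gate_init=0.0):
    """Sketch of class-wise gated fusion in the spirit of MQ-Det's
    perceiver module: a category's text embedding (D,) is augmented with
    attended visual exemplar features (K, D). A gate initialized near
    zero preserves the frozen detector's behavior at the start of
    training. All shapes and the tanh gate are assumptions."""
    d = text_feat.shape[0]
    # Attention of the text query over its K visual exemplars.
    scores = vision_feats @ text_feat / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attended = weights @ vision_feats
    gate = np.tanh(gate_init)  # gate == 0 at init -> pure text query
    return text_feat + gate * attended
```

With the gate at zero the fused query equals the original text query, which is one common way to bolt a new module onto a frozen detector without disturbing it initially.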
A Survey on Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have recently emerged as a rising
research hotspot; they use powerful Large Language Models (LLMs) as a brain to
perform multimodal tasks. The surprising emergent capabilities of MLLMs, such
as writing stories based on images and OCR-free math reasoning, are rare in
traditional methods, suggesting a potential path to artificial general
intelligence. In this paper, we aim to trace and summarize the recent progress
of MLLM. First of all, we present the formulation of MLLM and delineate its
related concepts. Then, we discuss the key techniques and applications,
including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning
(M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning
(LAVR). Finally, we discuss existing challenges and point out promising
research directions. Since the era of MLLMs has only just begun, we will keep
updating this survey and hope it inspires more research. An associated GitHub
repository collecting the latest papers is available at
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Comment: Project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Hallucination is a big shadow hanging over the rapidly evolving Multimodal
Large Language Models (MLLMs), referring to the phenomenon in which the
generated text is inconsistent with the image content. To mitigate
hallucinations, existing studies mainly resort to instruction tuning, which
requires retraining the models with specific data. In this paper, we pave
a different way, introducing a training-free method named Woodpecker. Just as a
woodpecker heals trees, it picks out and corrects hallucinations in the
generated text. Concretely, Woodpecker consists of five stages: key concept
extraction, question formulation, visual knowledge validation, visual claim
generation, and hallucination correction. Implemented in a post-remedy manner,
Woodpecker can easily serve different MLLMs, while being interpretable by
accessing intermediate outputs of the five stages. We evaluate Woodpecker both
quantitatively and qualitatively and show the huge potential of this new
paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement
in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released
at https://github.com/BradyFU/Woodpecker.
Comment: 16 pages, 7 figures.
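The five stages above amount to a thin orchestration layer around an instruction-following model and a visual grounding tool. The `mllm` and `detector` callables, prompts, and return types below are hypothetical stand-ins to illustrate the data flow, not the released implementation:

```python
def woodpecker_correct(image, generated_text, mllm, detector):
    """Training-free, five-stage hallucination correction (sketch).
    `mllm` is an instruction-following model; `detector` is an
    open-vocabulary detector returning bounding boxes. Both are
    hypothetical callables used only to illustrate the pipeline."""
    # Stage 1: key concept extraction -- pull object nouns from the answer.
    concepts = mllm("List the main objects mentioned in: " + generated_text)
    # Stage 2: question formulation -- a verification question per concept.
    questions = {c: "How many " + c + " are in the image?" for c in concepts}
    # Stage 3: visual knowledge validation -- ground each questioned
    # concept in the image with the open-vocabulary detector.
    evidence = {c: detector(image, c) for c in questions}
    # Stage 4: visual claim generation -- turn detections into facts.
    claims = ["There are {0} {1}.".format(len(boxes), c)
              for c, boxes in evidence.items()]
    # Stage 5: hallucination correction -- rewrite against the claims.
    prompt = "Rewrite to match these facts {0}: {1}".format(claims, generated_text)
    return mllm(prompt)
```

Because each stage's output is inspectable, a post-remedy design like this stays interpretable and model-agnostic, matching the abstract's claim that it can serve different MLLMs.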
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform
multimodal tasks, showing amazing emergent abilities in recent studies, such as
writing poems based on an image. However, such case studies hardly reflect the
full performance of an MLLM, as a comprehensive evaluation is lacking. In this
paper, we fill this gap by presenting the first MLLM
Evaluation benchmark MME. It measures both perception and cognition abilities
on a total of 14 subtasks. In order to avoid data leakage that may arise from
direct use of public datasets for evaluation, the annotations of
instruction-answer pairs are all manually designed. The concise instruction
design allows us to fairly compare MLLMs instead of struggling with prompt
engineering. Besides, with such instructions, we can also easily carry out
quantitative statistics. A total of 10 advanced MLLMs are comprehensively
evaluated on our MME, which not only suggests that existing MLLMs still have
large room for improvement, but also reveals potential directions for
subsequent model optimization.
Comment: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Perception of pregnant women regarding the information available about alcohol consumption during pregnancy
Clinical utilization of multiple antibodies of Mycobacterium tuberculosis for serodiagnosis evaluation of tuberculosis: a retrospective observational cohort study
Abstract
Objectives: We aimed to address clinical uncertainties by characterizing the accuracy and utility of commercially available antibodies to Mycobacterium tuberculosis in the diagnostic assessment of suspected tuberculosis in high-burden countries.
Methods: We conducted a retrospective, descriptive cohort study among participants aged ≥ 18 years with suspected tuberculosis in Nanning, Guangxi, China. Participants were tested for M. tuberculosis infection using commercially available antibodies against M. tuberculosis. Specificity, sensitivity, negative and positive predictive values, and negative and positive likelihood ratios of the tests were determined. Sputum specimens and bronchoalveolar lavage fluid were sent for mycobacterial culture, the Xpert MTB/RIF assay, and cell-free M. tuberculosis DNA or RNA assays. Blood samples were used for IGRAs, T-cell counts (CD3+CD4+ and CD3+CD8+), and antibody tests for tuberculosis.
Results: Of the 1857 participants enrolled in this study, 1772 were included in the analyses, among whom 1311 were diagnosed with active tuberculosis. The specificity of the antibody against 16kD for active tuberculosis was 92.7% (95% confidence interval [CI]: 89.3–95.4), with a positive likelihood ratio for active tuberculosis of 3.1 (95% CI: 2.1–4.7), which was higher than that of the antibodies to Rv1636 (90.5% [95% CI: 86.6–93.5]), 38kD (89.5% [95% CI: 85.5–92.7]), CFP-10 (82.6% [95% CI: 77.9–86.7]), and LAM (79.3% [95% CI: 74.3–83.7]). Sensitivity ranged from 15.8% (95% CI: 13.9–17.9) for the antibody against Rv1636 to 32.9% (95% CI: 30.4–35.6) for the antibody to LAM.
Conclusions: Commercially available antibodies against Mycobacterium tuberculosis do not have sufficient sensitivity for the diagnostic evaluation of active tuberculosis. However, the antibodies against Rv1636 and 16kD may have sufficiently high specificities, high positive likelihood ratios, and correspondingly high positive predictive values to facilitate the rule-in of active tuberculosis.
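For readers less familiar with these measures, every quantity reported above derives from a 2×2 diagnostic table. The counts in the usage example are made up for illustration and are not the study's data:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic-accuracy measures from a 2x2 table:
    tp/fp/fn/tn are true-positive, false-positive, false-negative, and
    true-negative counts against the reference standard."""
    sensitivity = tp / (tp + fn)          # P(test+ | disease)
    specificity = tn / (tn + fp)          # P(test- | no disease)
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv, "LR+": lr_pos, "LR-": lr_neg}
```

For example, a test with 30% sensitivity and 93% specificity, comparable to the serologic assays discussed above, yields LR+ ≈ 4.3: helpful for ruling in disease, while the low sensitivity makes it unsuitable for ruling it out.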
Optimizing nitrogen application rate and plant density for improving cotton yield and nitrogen use efficiency in the North China Plain - Fig 1
Leaf area index (LAI) of cotton at different growth periods in 2013 (A) and 2014 (B). Note: D1, D2, D3 indicate planting densities of 3.00, 5.25, and 7.50 plants m⁻² respectively, and N0, N1, N2, N3, N4 indicate nitrogen application rates of 0, 112.5, 225.0, 337.5 kg ha⁻¹ respectively. A and B indicate 2013 and 2014. Numbers at the same growth stage followed by the same lowercase letter are not significantly different at the 5% level.
Monthly weather summary during the cotton growing season in 2013 and 2014 at Anyang, Henan, China.