19 research outputs found

    Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

    Full text link
    Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and long training time to learn a person-specific audio-video mapping. In this paper, we investigate an audio-driven dubbing method that is more feasible for User Generated Content (UGC) production. Designing a method for UGC poses two unique challenges: 1) speaker appearances are diverse and arbitrary, as the method must generalize across users; 2) the available video data for any one speaker are very limited. To tackle these challenges, we first introduce a new Style Translation Network that integrates the speaking style of the target and the speaking content of the source via a cross-modal AdaIN module, enabling our model to adapt quickly to a new speaker. We then develop a semi-parametric video renderer, which takes full advantage of the limited training data of the unseen speaker via a video-level retrieve-warp-refine pipeline. Finally, we propose a temporal regularization for the semi-parametric renderer that generates more continuous videos. Extensive experiments show that our method generates videos that accurately preserve various speaking styles, yet with a considerably smaller amount of training data and training time than existing methods. Moreover, our method achieves a faster testing speed than most recent methods. Comment: TCSVT 202
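    The cross-modal AdaIN module can be pictured as ordinary adaptive instance normalization, except that the affine parameters come from a target-speaker style code rather than a style image. Below is a minimal PyTorch sketch of that idea; the class name, shapes, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAdaIN(nn.Module):
    """Illustrative cross-modal AdaIN: normalize audio-driven content
    features, then re-scale/shift them with affine parameters predicted
    from a target-speaker style code (hypothetical shapes)."""
    def __init__(self, content_dim: int, style_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the style code.
        self.affine = nn.Linear(style_dim, 2 * content_dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (B, T, C) audio-driven content features over T frames
        # style:   (B, S)    target-speaker style embedding
        mean = content.mean(dim=1, keepdim=True)
        std = content.std(dim=1, keepdim=True) + 1e-5
        normalized = (content - mean) / std
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        # Broadcast the (B, C) affine parameters over the time dimension.
        return normalized * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Usage: inject a new speaker's style into source speech content.
adain = CrossModalAdaIN(content_dim=256, style_dim=128)
content = torch.randn(2, 40, 256)  # two clips, 40 frames each
style = torch.randn(2, 128)
stylized = adain(content, style)   # (2, 40, 256)
```

    Because only the style code changes per speaker, a module of this shape is what lets a model adapt to a new user without retraining the whole audio-video mapping.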

    Multi-modal Queried Object Detection in the Wild

    Full text link
    We introduce MQ-Det, an efficient architecture and pre-training strategy designed to utilize both textual descriptions, with their open-set generalization, and visual exemplars, with their rich description granularity, as category queries, namely Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularities. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module on top of the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem introduced by the frozen detector, a vision-conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS benchmark and by an average of +6.3% AP on 13 few-shot downstream tasks, with merely 3% of the pre-training time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det. Comment: Under review
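    The gated class-scalable perceiver can be thought of as cross-attention from each category's text query to its visual exemplars, with a learned gate initialized near zero so the frozen language-queried detector is not perturbed early in training. A hedged sketch follows; every name and shape here is an assumption for illustration, not MQ-Det's released code.

```python
import torch
import torch.nn as nn

class GatedVisionPerceiver(nn.Module):
    """Illustrative gated fusion: augment each category's text query with
    information attended from its visual exemplars (shapes are assumptions)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: at the start of training the frozen
        # detector sees its original, unmodified text queries.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_q: torch.Tensor, vis_ex: torch.Tensor) -> torch.Tensor:
        # text_q: (B, K, D)     one query per category
        # vis_ex: (B, K * E, D) visual exemplar tokens, E per category
        attended, _ = self.cross_attn(text_q, vis_ex, vis_ex)
        return text_q + torch.tanh(self.gate) * attended

perceiver = GatedVisionPerceiver(dim=256)
text_q = torch.randn(1, 80, 256)      # 80 category text queries
vis_ex = torch.randn(1, 80 * 5, 256)  # 5 exemplars per category
queries = perceiver(text_q, vis_ex)
```

    The residual-plus-gate design is one standard way to bolt a new modality onto a frozen model without destabilizing it, which matches the plug-and-play framing in the abstract.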

    A Survey on Multimodal Large Language Models

    Full text link
    Multimodal Large Language Models (MLLMs) have recently emerged as a research hotspot; they use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional methods, suggesting a potential path to artificial general intelligence. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the formulation of the MLLM and delineate its related concepts. Then, we discuss the key techniques and applications, including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). Finally, we discuss existing challenges and point out promising research directions. Given that the era of the MLLM has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models. Comment: Project page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
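    As a concrete illustration of what Multimodal In-Context Learning (M-ICL) means in practice, the sketch below assembles an interleaved image-text prompt from a few demonstrations so a frozen MLLM can infer the task format in context. The message schema is a generic assumption, not tied to any specific model in the survey.

```python
from dataclasses import dataclass

@dataclass
class Demo:
    image_path: str
    question: str
    answer: str

def build_micl_prompt(demos: list[Demo], query_image: str,
                      query_question: str) -> list[dict]:
    """Interleave (image, question, answer) demonstrations before the
    final query; the model imitates the demonstrated format in context."""
    messages = []
    for d in demos:
        messages.append({"role": "user", "content": [
            {"type": "image", "path": d.image_path},
            {"type": "text", "text": d.question},
        ]})
        messages.append({"role": "assistant", "content": d.answer})
    messages.append({"role": "user", "content": [
        {"type": "image", "path": query_image},
        {"type": "text", "text": query_question},
    ]})
    return messages

prompt = build_micl_prompt(
    demos=[Demo("cat.jpg", "How many animals?", "One.")],
    query_image="dogs.jpg",
    query_question="How many animals?",
)
```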

    Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Full text link
    Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker healing trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable through access to the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker. Comment: 16 pages, 7 figures. Code website: https://github.com/BradyFU/Woodpecker
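    Because Woodpecker is training-free, the whole method can be read as a five-stage post-processing pipeline around frozen models. The sketch below shows that control flow only; the prompts and the `llm`/`vqa` callables are hypothetical stand-ins (in practice the visual stage would use, e.g., an open-vocabulary detector or VQA model), not the released implementation.

```python
from typing import Callable

def woodpecker_correct(
    image: object,
    generated_text: str,
    llm: Callable[[str], str],            # text-only LLM for text stages
    vqa: Callable[[object, str], str],    # answers a question about the image
) -> str:
    """Illustrative control flow of Woodpecker's five stages."""
    # 1. Key concept extraction: objects/attributes mentioned in the answer.
    concepts = llm(f"List the objects mentioned in: {generated_text}")
    # 2. Question formulation: probing questions about each concept.
    questions = llm(f"Write one verification question per line about: {concepts}")
    # 3. Visual knowledge validation: answer the questions against the image.
    evidence = "\n".join(vqa(image, q) for q in questions.splitlines() if q)
    # 4. Visual claim generation: turn the evidence into grounded claims.
    claims = llm(f"Turn these answers into factual claims:\n{evidence}")
    # 5. Hallucination correction: rewrite the answer to match the claims.
    return llm(
        f"Rewrite the text so it is consistent with the claims.\n"
        f"Text: {generated_text}\nClaims: {claims}"
    )
```

    Keeping each stage's output explicit is what makes the pipeline interpretable: a wrong correction can be traced back to the stage that produced it.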

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Full text link
    A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, such case studies can hardly reflect the full performance of MLLMs in the absence of a comprehensive evaluation. In this paper, we fill this gap, presenting the first comprehensive MLLM evaluation benchmark, MME. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from direct use of public datasets for evaluation, the annotations of the instruction-answer pairs are all manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering. Moreover, with such instructions, we can also easily carry out quantitative statistics. A total of 10 advanced MLLMs are comprehensively evaluated on MME, which not only shows that existing MLLMs still have considerable room for improvement, but also reveals potential directions for subsequent model optimization. Comment: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
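    When every instruction is answered with a plain "yes" or "no", the quantitative statistics mentioned above reduce to string comparison. A hedged sketch of such scoring, computing per-question accuracy plus a stricter per-image score where all questions about one image must be correct; the exact metric definitions here are an assumption, and the MME repository remains the authoritative reference.

```python
from collections import defaultdict

def score_yes_no(records):
    """records: iterable of (image_id, ground_truth, prediction),
    where both answers are the strings 'yes' or 'no'."""
    records = list(records)
    correct = 0
    per_image = defaultdict(list)
    for image_id, gt, pred in records:
        hit = pred.strip().lower() == gt.strip().lower()
        correct += hit
        per_image[image_id].append(hit)
    acc = correct / len(records)
    # Stricter score: every question about an image must be answered correctly.
    acc_plus = sum(all(hits) for hits in per_image.values()) / len(per_image)
    return acc, acc_plus

acc, acc_plus = score_yes_no([
    ("img1", "yes", "yes"),
    ("img1", "no", "yes"),   # one miss drops img1 from the stricter score
    ("img2", "no", "no"),
])
print(acc, acc_plus)  # 0.666..., 0.5
```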

    Clinical utilization of multiple antibodies of Mycobacterium tuberculosis for serodiagnosis evaluation of tuberculosis: a retrospective observational cohort study

    No full text
    Objectives: We aimed to address clinical uncertainties by characterizing the accuracy and utility of commercially available antibodies against Mycobacterium tuberculosis in the diagnostic assessment of suspected tuberculosis in high-burden countries.
    Methods: We conducted a retrospective, descriptive cohort study among participants aged ≥ 18 years with suspected tuberculosis in Nanning, Guangxi, China. Participants were tested for M. tuberculosis infection using commercially available antibodies against M. tuberculosis. The specificity, sensitivity, negative and positive predictive values, and negative and positive likelihood ratios of the tests were determined. Sputum specimens and bronchoalveolar lavage fluid were sent for mycobacterial culture, the Xpert MTB/RIF assay, and cell-free M. tuberculosis DNA or RNA assays. Blood samples were used for IGRAs, T-cell counts (CD3+CD4+ and CD3+CD8+), and the tuberculosis antibody tests.
    Results: Of the 1857 participants enrolled in this study, 1772 were included in the analyses, among whom 1311 were diagnosed with active tuberculosis. The specificity of the antibody against 16kD for active tuberculosis was 92.7% (95% confidence interval [CI]: 89.3–95.4), with a positive likelihood ratio of 3.1 (95% CI: 2.1–4.7), which was higher than that of the antibodies against Rv1636 (90.5% [95% CI: 86.6–93.5]), 38kD (89.5% [95% CI: 85.5–92.7]), CFP-10 (82.6% [95% CI: 77.9–86.7]), and LAM (79.3% [95% CI: 74.3–83.7]). Sensitivity ranged from 15.8% (95% CI: 13.9–17.9) for the antibody against Rv1636 to 32.9% (95% CI: 30.4–35.6) for the antibody against LAM.
    Conclusions: Commercially available antibodies against Mycobacterium tuberculosis do not have sufficient sensitivity for the diagnostic evaluation of active tuberculosis. However, the antibodies against Rv1636 and 16kD may have sufficiently high specificities, high positive likelihood ratios, and correspondingly high positive predictive values to facilitate ruling in active tuberculosis.
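    All of the reported quantities derive from a standard 2x2 table of test result against disease status. The sketch below shows those generic formulas, not the study's statistical code; the counts in the usage example are hypothetical, chosen only to mimic the high-specificity, low-sensitivity pattern the study reports.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard screening-test metrics from a 2x2 table:
    rows = test positive/negative, columns = disease present/absent."""
    sensitivity = tp / (tp + fn)               # P(test+ | disease)
    specificity = tn / (tn + fp)               # P(test- | no disease)
    ppv = tp / (tp + fp)                       # P(disease | test+)
    npv = tn / (tn + fn)                       # P(no disease | test-)
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "PPV": ppv, "NPV": npv, "LR+": lr_pos, "LR-": lr_neg,
    }

# Hypothetical counts illustrating a test that is specific but insensitive.
print(diagnostic_metrics(tp=300, fp=30, fn=1011, tn=431))
```

    A high LR+ with a low sensitivity is exactly the rule-in-but-not-rule-out pattern the conclusions describe: a positive result meaningfully raises the probability of disease, while a negative result excludes little.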

    Optimizing nitrogen application rate and plant density for improving cotton yield and nitrogen use efficiency in the North China Plain - Fig 1

    No full text
    Leaf area index (LAI) of cotton at different growth periods in 2013 (A) and 2014 (B). Note: D1, D2, and D3 indicate planting densities of 3.00, 5.25, and 7.50 plants m⁻², respectively; N0, N1, N2, N3, N4 indicate nitrogen application rates of 0, 112.5, 225.0, 337.5 kg ha⁻¹, respectively. A and B indicate 2013 and 2014. Numbers at the same growth stage followed by the same lowercase letter are not significantly different at the 5% level.