44 research outputs found

    Diffusion Deepfake

    Full text link
    Recent progress in generative AI, primarily through diffusion models, presents significant challenges for real-world deepfake detection. The increased realism of image details, the diversity of content, and the widespread accessibility of these models to the general public complicate the identification of such sophisticated deepfakes. Acknowledging the urgency of addressing the vulnerability of current deepfake detectors to this evolving threat, our paper introduces two extensive deepfake datasets generated by state-of-the-art diffusion models, as existing datasets are less diverse and of lower quality. Our extensive experiments also show that our datasets are more challenging than existing face deepfake datasets. This strategic dataset creation not only challenges deepfake detectors but also sets a new benchmark for further evaluation. Our comprehensive evaluation reveals that existing detection methods, often optimized for specific image domains and manipulations, struggle to adapt to the intricate nature of diffusion deepfakes, limiting their practical utility. To address this critical issue, we investigate the impact of enhancing training data diversity on representative detection methods, expanding the diversity of both manipulation techniques and image domains. Our findings underscore that increasing training data diversity improves generalizability. Moreover, we propose a novel momentum difficulty boosting strategy to tackle the additional challenge posed by training data heterogeneity. This strategy dynamically assigns appropriate sample weights based on learning difficulty, enhancing the model's adaptability to both easy and challenging samples. Extensive experiments on both existing and newly proposed benchmarks demonstrate that our model optimization approach significantly surpasses prior alternatives. Comment: 28 pages including Supplementary material
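
    The momentum difficulty boosting idea can be illustrated with a minimal sketch, under assumptions not taken from the paper: each training sample keeps a momentum-smoothed estimate of its loss as a "difficulty" score, and those scores are turned into per-sample weights within a mini-batch. The class name, softmax weighting, and hyperparameters below are hypothetical choices for illustration only.

```python
import torch


class MomentumDifficultyWeighter:
    """Illustrative sketch (not the paper's code): track a momentum-smoothed
    difficulty score per training sample and turn it into per-sample loss
    weights, so harder samples are emphasised within each mini-batch."""

    def __init__(self, num_samples: int, momentum: float = 0.9, temperature: float = 1.0):
        self.difficulty = torch.zeros(num_samples)  # running difficulty estimate per sample
        self.momentum = momentum
        self.temperature = temperature

    def weighted_loss(self, sample_ids: torch.Tensor, losses: torch.Tensor) -> torch.Tensor:
        # Exponential moving average of each sample's loss acts as its "difficulty".
        with torch.no_grad():
            old = self.difficulty[sample_ids]
            self.difficulty[sample_ids] = self.momentum * old + (1.0 - self.momentum) * losses.detach()
        # A softmax over the batch turns difficulties into weights that sum to one.
        weights = torch.softmax(self.difficulty[sample_ids] / self.temperature, dim=0)
        return (weights * losses).sum()
```

    In a training loop, such a weighted loss would replace the plain batch mean (e.g. `weighter.weighted_loss(idx, per_sample_losses)` instead of `per_sample_losses.mean()`); the momentum term keeps the difficulty estimates stable across noisy mini-batches.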

    Vision-language Assisted Attribute Learning

    Full text link
    Attribute labeling at large scale is typically incomplete and partial, posing significant challenges to model optimization. Existing attribute learning methods often treat the missing labels as negatives or simply ignore them all during training, either of which can hamper model performance to a great extent. To overcome these limitations, in this paper we leverage available vision-language knowledge to explicitly disclose the missing labels and enhance model learning. Given an image, we predict the likelihood of each missing attribute label with the assistance of an off-the-shelf vision-language model, and randomly select those with high scores to be ignored during training. Our strategy strikes a good balance between fully ignoring the missing labels and treating them all as negatives, as these high scores are found to be informative in revealing label ambiguity. Extensive experiments show that our proposed vision-language assisted loss achieves state-of-the-art performance on the newly cleaned VAW dataset. Qualitative evaluation demonstrates the ability of the proposed method to predict more complete attributes. Comment: Accepted by IEEE IC-NIDC 202
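
    A minimal sketch of the described strategy, under assumptions: missing labels are tentatively treated as negatives in a binary cross-entropy loss, except that those assigned a high likelihood by an off-the-shelf vision-language model are randomly dropped from the loss. The function name, threshold, and drop probability are hypothetical, not the paper's actual values.

```python
import torch
import torch.nn.functional as F


def vl_assisted_bce(logits, labels, vl_scores, score_thresh=0.5, drop_prob=0.7):
    """Sketch of a vision-language assisted loss for partially labelled attributes.

    labels: 1 = positive, 0 = negative, -1 = missing.
    vl_scores: per-attribute likelihoods from an off-the-shelf vision-language model.
    Missing labels are treated as negatives unless the VL score is high, in which
    case they are randomly ignored (assumed behaviour, not the paper's code)."""
    missing = labels < 0
    targets = labels.clamp(min=0).float()               # missing -> tentative negative
    ambiguous = missing & (vl_scores > score_thresh)    # high VL score = ambiguous label
    drop = ambiguous & (torch.rand_like(vl_scores) < drop_prob)
    mask = (~drop).float()                              # 0 where the label is ignored
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```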

    Identification and comprehensive analyses of the CBL and CIPK gene families in wheat (Triticum aestivum L.)

    Get PDF
    Interaction analysis of the wheat TaCBL and TaCIPK proteins was performed by the yeast two-hybrid (Y2H) method. (PDF 191 kb)

    Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

    Full text link
    Video Question Answering (VideoQA) has been significantly advanced by the scaling of recent Large Language Models (LLMs). The key idea is to convert the visual information into the language feature space so that the capacity of LLMs can be fully exploited. Existing VideoQA methods typically take one of two paradigms: (1) learning cross-modal alignment, or (2) using an off-the-shelf captioning model to describe the visual data. However, the first design requires costly training on large amounts of extra multi-modal data, whilst the second is further constrained by the captioner's limited domain generalization. To address these limitations, a simple yet effective Retrieving-to-Answer (R2A) framework is proposed. Given an input video, R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model (e.g., CLIP). With both the question and the retrieved texts, an LLM (e.g., DeBERTa) can then be directly used to yield the desired answer. Without the need for cross-modal fine-tuning, R2A allows all the key components (e.g., the LLM, retrieval model, and text corpus) to be plug-and-play. Extensive experiments on several VideoQA benchmarks show that, despite having only 1.3B parameters and no fine-tuning, our R2A can outperform the 61-times-larger Flamingo-80B model even when the latter is additionally trained on nearly 2.1B multi-modal data
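
    The retrieval step as described can be sketched as a cosine-similarity search between a pre-computed multi-modal embedding of the video and embeddings of a generic text corpus, with the retrieved texts then concatenated with the question for a frozen language model. The functions, prompt format, and embedding inputs below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def retrieve_texts(video_embedding: np.ndarray,
                   corpus_embeddings: np.ndarray,
                   corpus_texts: list[str],
                   top_k: int = 5) -> list[str]:
    """Sketch of the retrieval step: cosine similarity between a multi-modal
    embedding of the video (e.g. from CLIP) and embeddings of a generic text
    corpus, keeping the top-k most similar texts (assumed form)."""
    v = video_embedding / (np.linalg.norm(video_embedding) + 1e-8)
    c = corpus_embeddings / (np.linalg.norm(corpus_embeddings, axis=1, keepdims=True) + 1e-8)
    sims = c @ v                                  # cosine similarity per corpus text
    top = np.argsort(-sims)[:top_k]
    return [corpus_texts[i] for i in top]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Concatenate retrieved texts with the question for a frozen LLM;
    the exact prompt format here is a hypothetical placeholder."""
    context = " ".join(retrieved)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```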

    Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

    Full text link
    Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated/atrous convolutions or inserted attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution or resolution reduction) to encode an image as a sequence of patches. With global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieved first position on the highly competitive ADE20K test server leaderboard on the day of submission. Comment: CVPR 2021. Project page at https://fudan-zvg.github.io/SETR
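
    The sequence-to-sequence view can be illustrated with a minimal sketch: the image is split into patches, each patch is linearly embedded into a token, a standard transformer encoder applies global attention at every layer, and a simple head reshapes and upsamples the tokens to per-pixel class logits. This is an illustrative toy version with assumed dimensions, not the official SETR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MinimalSETR(nn.Module):
    """Minimal sketch of segmentation as sequence-to-sequence prediction
    (illustrative only, not the reference SETR code)."""

    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8, num_classes=19):
        super().__init__()
        self.patch = patch
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify + linear projection
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)     # global attention in every layer
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)            # "simple" decoder head

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos      # B x N x dim token sequence
        tokens = self.encoder(tokens)
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        logits = self.head(feat)
        # Upsample coarse patch-level logits back to pixel resolution.
        return F.interpolate(logits, scale_factor=self.patch, mode="bilinear", align_corners=False)
```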

    A miniature multi-functional photoacoustic probe

    Get PDF
    Photoacoustic technology is a promising tool for providing morphological and functional information in biomedical research. To enhance imaging efficiency, previously reported photoacoustic probes have been designed coaxially, involving complicated optical/acoustic prisms to bypass the opaque piezoelectric layer of ultrasound transducers; however, this has led to bulky probes and hindered applications in confined spaces. Although the emergence of transparent piezoelectric materials reduces the effort required for the coaxial design, the reported transparent ultrasound transducers were still bulky. In this work, a miniature photoacoustic probe with an outer diameter of 4 mm was developed, in which an acoustic stack was made by combining a transparent piezoelectric material with a gradient-index lens serving as a backing layer. The transparent ultrasound transducer exhibited a high center frequency of ~47 MHz and a −6 dB bandwidth of 29.4%, and could be easily assembled with the pigtailed ferrule of a single-mode fiber. The multi-functional capability of the probe was successfully validated through fluid-flow sensing and photoacoustic imaging experiments.

    Miniature intravascular photoacoustic endoscopy with coaxial excitation and detection

    Get PDF
    Recent research has pointed out that the degree of inflammation in the adventitia could correlate with the severity of atherosclerotic plaques. Intravascular photoacoustic endoscopy can provide information on arterial morphology and plaque composition, and can even detect inflammation. However, most reported work has used a non-coaxial configuration for the photoacoustic catheter design, which forms a limited light-sound overlap area for imaging and therefore misses information from the adventitia. Here we developed a novel 0.9 mm-diameter intravascular photoacoustic catheter with coaxial excitation and detection to resolve this issue. A miniature hollow ultrasound transducer with a 0.18 mm-diameter orifice in the center was successfully fabricated. To show the significance and merits of our design, phantom and ex vivo imaging experiments were conducted on both coaxial and non-coaxial catheters for comparison. The results demonstrated that the coaxial catheter exhibited much better photoacoustic/ultrasound imaging performance from the intima to the adventitia.

    Unsupervised Person Re-identification by Deep Learning Tracklet Association

    Get PDF
    © 2018, Springer Nature Switzerland AG. Most existing person re-identification (re-id) methods rely on supervised model learning with per-camera-pair, manually labelled pairwise training data. This leads to poor scalability in practical re-id deployment due to the lack of exhaustive identity labelling of positive and negative image pairs for every camera pair. In this work, we address this problem by proposing an unsupervised re-id deep learning approach capable of incrementally discovering and exploiting the underlying re-id discriminative information from person tracklet data automatically generated from videos, in an end-to-end model optimisation. We formulate a Tracklet Association Unsupervised Deep Learning (TAUDL) framework characterised by jointly learning per-camera (within-camera) tracklet association (labelling) and cross-camera tracklet correlation, maximising the discovery of the most likely tracklet relationships across camera views. Extensive experiments demonstrate the superiority of the proposed TAUDL model over the state-of-the-art unsupervised and domain adaptation re-id methods on six person re-id benchmarking datasets.
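
    The per-camera tracklet labelling component can be sketched as a set of per-camera classifiers over automatically generated tracklet IDs, trained with a standard classification loss; the cross-camera correlation term is only hinted at here. The class, its inputs, and the loss aggregation are assumptions for illustration, not the TAUDL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerCameraTrackletHeads(nn.Module):
    """Illustrative sketch of per-camera tracklet labelling: each camera view
    gets its own classifier over the tracklet IDs generated in that view
    (assumed formulation, not the paper's code)."""

    def __init__(self, feat_dim: int, tracklets_per_camera: list[int]):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, n) for n in tracklets_per_camera)

    def forward(self, features: torch.Tensor, camera_ids: torch.Tensor,
                tracklet_labels: torch.Tensor) -> torch.Tensor:
        loss = features.new_zeros(())
        for cam, head in enumerate(self.heads):
            sel = camera_ids == cam
            if sel.any():
                # Within-camera tracklet association as a classification problem,
                # using tracklet indices local to this camera as class labels.
                loss = loss + F.cross_entropy(head(features[sel]), tracklet_labels[sel])
        return loss / len(self.heads)
```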