Diffusion Deepfake
Recent progress in generative AI, primarily through diffusion models,
presents significant challenges for real-world deepfake detection. The
increased realism in image details, diverse content, and widespread
accessibility to the general public complicates the identification of these
sophisticated deepfakes. Acknowledging the urgent need to address the vulnerability
of current deepfake detectors to this evolving threat, our paper introduces two
extensive deepfake datasets generated by state-of-the-art diffusion models, since
existing datasets are less diverse and of lower quality. Our extensive experiments
also show that our datasets are more challenging than other face deepfake datasets.
This strategic dataset creation not only challenges existing deepfake detectors but
also sets a new benchmark for broader evaluation. Our
comprehensive evaluation reveals the struggle of existing detection methods,
often optimized for specific image domains and manipulations, to effectively
adapt to the intricate nature of diffusion deepfakes, limiting their practical
utility. To address this critical issue, we investigate the impact of enhancing
training data diversity on representative detection methods. This involves
expanding the diversity of both manipulation techniques and image domains. Our
findings underscore that increasing training data diversity results in improved
generalizability. Moreover, we propose a novel momentum difficulty boosting
strategy to tackle the additional challenge posed by training data
heterogeneity. This strategy dynamically assigns appropriate sample weights
based on learning difficulty, enhancing the model's adaptability to both easy
and challenging samples. Extensive experiments on both existing and newly
proposed benchmarks demonstrate that our model optimization approach surpasses
prior alternatives significantly. Comment: 28 pages including Supplementary material
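To make the momentum difficulty boosting idea more concrete, here is a minimal sketch of one way such a strategy could look; the class name, momentum and temperature values, and the softmax weighting are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch (assumed details, not the paper's code): track per-sample
# difficulty as an exponential moving average of the loss, then up-weight
# harder samples when aggregating the batch loss.
import torch


class MomentumDifficultyWeighter:
    def __init__(self, num_samples, momentum=0.9, temperature=1.0):
        self.difficulty = torch.zeros(num_samples)  # EMA of each sample's loss
        self.momentum = momentum
        self.temperature = temperature

    def __call__(self, sample_ids, per_sample_loss):
        ids = sample_ids.long().cpu()
        # Momentum update of the running difficulty estimate.
        self.difficulty[ids] = (
            self.momentum * self.difficulty[ids]
            + (1.0 - self.momentum) * per_sample_loss.detach().cpu()
        )
        # Harder samples (higher EMA loss) receive larger weights.
        weights = torch.softmax(self.difficulty[ids] / self.temperature, dim=0)
        weights = weights.to(per_sample_loss.device) * len(ids)  # keep the loss scale
        return (weights * per_sample_loss).mean()


# Hypothetical usage inside a training loop:
# weighter = MomentumDifficultyWeighter(num_samples=len(train_set))
# loss = weighter(batch_ids,
#                 F.binary_cross_entropy_with_logits(logits, labels, reduction="none"))
```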
Vision-language Assisted Attribute Learning
Attribute labeling at large scale is typically incomplete and partial, posing
significant challenges to model optimization. Existing attribute learning
methods often treat the missing labels as negative or simply ignore them all
during training, either of which could hamper the model performance to a great
extent. To overcome these limitations, in this paper we leverage the available
vision-language knowledge to explicitly disclose the missing labels for
enhancing model learning. Given an image, we predict the likelihood of each
missing attribute label assisted by an off-the-shelf vision-language model, and
randomly ignore those with high scores during training. Our strategy
strikes a good balance between fully ignoring the missing labels and treating
them all as negatives, as these high scores are found to be informative in revealing label
ambiguity. Extensive experiments show that our proposed vision-language
assisted loss can achieve state-of-the-art performance on the newly cleaned VAW
dataset. Qualitative evaluation demonstrates the ability of the proposed method
in predicting more complete attributes. Comment: Accepted by IEEE IC-NIDC 202
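As an illustration of the loss described above, the following is a hedged sketch of a criterion that treats missing attribute labels as negatives by default but randomly ignores those the vision-language model scores highly; the threshold and drop probability are assumed values, not the paper's settings.

```python
# Minimal sketch (assumed details, not the paper's implementation): use scores from
# an off-the-shelf vision-language model to decide which missing attribute labels
# to drop from the binary cross-entropy loss.
import torch
import torch.nn.functional as F


def vl_assisted_bce(logits, labels, vl_scores, score_thresh=0.5, ignore_prob=0.7):
    """
    logits, labels, vl_scores: tensors of shape (batch, num_attributes)
    labels: 1 = positive, 0 = negative, -1 = missing
    vl_scores: per-attribute likelihood predicted by the vision-language model.
    """
    targets = labels.clamp(min=0).float()      # missing labels default to negative
    mask = torch.ones_like(targets)
    missing = labels < 0
    # Missing labels that the VL model scores highly are ambiguous:
    # randomly ignore most of them instead of forcing them to be negative.
    ambiguous = missing & (vl_scores > score_thresh)
    drop = ambiguous & (torch.rand_like(vl_scores) < ignore_prob)
    mask[drop] = 0.0
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```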
Identification and comprehensive analyses of the CBL and CIPK gene families in wheat (Triticum aestivum L.)
The interaction analysis of wheat TaCBL and TaCIPK proteins was performed by the Y2H method. (PDF 191 kb)
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Video Question Answering (VideoQA) has been significantly advanced from the
scaling of recent Large Language Models (LLMs). The key idea is to convert the
visual information into the language feature space so that the capacity of LLMs
can be fully exploited. Existing VideoQA methods typically take two paradigms:
(1) learning cross-modal alignment, and (2) using an off-the-shelf captioning
model to describe the visual data. However, the first design requires costly
training on large amounts of extra multi-modal data, whilst the second suffers
from limited domain generalization. To address these limitations, a simple yet
effective Retrieving-to-Answer (R2A) framework is proposed. Given an input
video, R2A first retrieves a set of semantically similar texts from a generic
text corpus using a pre-trained multi-modal model (e.g., CLIP). With both the
question and the retrieved texts, an LLM (e.g., DeBERTa) can be directly used to
yield the desired answer. Without the need for cross-modal fine-tuning, R2A
allows all the key components (e.g., the LLM, retrieval model, and text corpus)
to be used in a plug-and-play manner. Extensive experiments on several VideoQA
benchmarks show that, despite having only 1.3B parameters and requiring no
fine-tuning, our R2A outperforms the 61-times-larger Flamingo-80B model, even
though the latter was additionally trained on nearly 2.1B multi-modal samples
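As a rough sketch of the retrieval step under stated assumptions: video frames and corpus texts are pre-encoded into a shared embedding space (e.g., with CLIP), the top-k most similar texts are retrieved by cosine similarity, and a prompt combining the question with the retrieved texts is handed to the frozen LLM. The function names and prompt format below are illustrative, not the authors' API.

```python
# Minimal sketch (assumed details): cosine-similarity retrieval over pre-computed
# embeddings, followed by prompt construction for a frozen language model.
import numpy as np


def retrieve_texts(frame_embs, corpus_embs, corpus_texts, k=5):
    # Mean-pool frame embeddings into a single video embedding.
    video_emb = frame_embs.mean(axis=0)
    video_emb /= np.linalg.norm(video_emb) + 1e-8
    corpus_norm = corpus_embs / (np.linalg.norm(corpus_embs, axis=1, keepdims=True) + 1e-8)
    sims = corpus_norm @ video_emb                 # cosine similarity to each corpus text
    top_idx = np.argsort(-sims)[:k]
    return [corpus_texts[i] for i in top_idx]


def build_llm_prompt(question, retrieved):
    # The retrieved texts serve as a pseudo-caption context for the frozen LLM.
    context = " ".join(retrieved)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"
```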
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Most recent semantic segmentation methods adopt a fully-convolutional network
(FCN) with an encoder-decoder architecture. The encoder progressively reduces
the spatial resolution and learns more abstract/semantic visual concepts with
larger receptive fields. Since context modeling is critical for segmentation,
the latest efforts have been focused on increasing the receptive field, through
either dilated/atrous convolutions or inserting attention modules. However, the
encoder-decoder based FCN architecture remains unchanged. In this paper, we aim
to provide an alternative perspective by treating semantic segmentation as a
sequence-to-sequence prediction task. Specifically, we deploy a pure
transformer (i.e., without convolution and resolution reduction) to encode an
image as a sequence of patches. With the global context modeled in every layer
of the transformer, this encoder can be combined with a simple decoder to
provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
Extensive experiments show that SETR achieves a new state of the art on ADE20K
(50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on
Cityscapes. Particularly, we achieve the first position in the highly
competitive ADE20K test server leaderboard on the day of submission. Comment: CVPR 2021. Project page at https://fudan-zvg.github.io/SETR
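As a rough illustration of the sequence-to-sequence formulation, the sketch below is a drastically simplified stand-in for SETR: a small transformer encoder over image patches followed by a naive 1x1-convolution-plus-bilinear-upsampling decoder. Sizes, depths, and the class name are illustrative assumptions and do not match the released model.

```python
# Minimal sketch (not the released SETR code): patchify -> transformer encoder
# -> reshape to a 2D feature map -> naive decoder to per-pixel class logits.
import torch
import torch.nn as nn


class TinySETR(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8, num_classes=19):
        super().__init__()
        self.grid = img_size // patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Sequential(            # naive decoder: 1x1 conv + upsampling
            nn.Conv2d(dim, num_classes, kernel_size=1),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
        )

    def forward(self, x):                        # x: (B, 3, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)                   # global context per layer
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.decoder(feat)                                  # (B, classes, H, W)
```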
A miniature multi-functional photoacoustic probe
Photoacoustic technology is a promising tool for providing morphological and functional information in biomedical research. To enhance imaging efficiency, previously reported photoacoustic probes have been designed coaxially, involving complicated optical/acoustic prisms to bypass the opaque piezoelectric layer of ultrasound transducers; this has led to bulky probes and hindered applications in confined spaces. Although the emergence of transparent piezoelectric materials helps simplify the coaxial design, the reported transparent ultrasound transducers were still bulky. In this work, a miniature photoacoustic probe with an outer diameter of 4 mm was developed, in which an acoustic stack was made by combining a transparent piezoelectric material with a gradient-index lens as a backing layer. The transparent ultrasound transducer exhibited a high center frequency of ~47 MHz and a −6 dB bandwidth of 29.4%, and could be easily assembled with a pigtailed ferrule of a single-mode fiber. The multi-functional capability of the probe was successfully validated through experiments on fluid flow sensing and photoacoustic imaging.
Miniature intravascular photoacoustic endoscopy with coaxial excitation and detection
Recent research has pointed out that the degree of inflammation in the adventitia could correlate with the severity of atherosclerotic plaques. Intravascular photoacoustic endoscopy can provide information on arterial morphology and plaque composition, and can even detect inflammation. However, most reported work used a non-coaxial configuration for the photoacoustic catheter design, which forms only a limited light-sound overlap area for imaging and therefore misses information from the adventitia. Here we developed a novel 0.9 mm-diameter intravascular photoacoustic catheter with coaxial excitation and detection to resolve this issue. A miniature hollow ultrasound transducer with a 0.18 mm-diameter orifice in its center was successfully fabricated. To show the significance and merits of our design, phantom and ex vivo imaging experiments were conducted on both coaxial and non-coaxial catheters for comparison. The results demonstrated that the coaxial catheter exhibits much better photoacoustic/ultrasound imaging performance from the intima to the adventitia.
Unsupervised Person Re-identification by Deep Learning Tracklet Association
Most existing person re-identification (re-id) methods rely on supervised model learning from per-camera-pair, manually labelled pairwise training data. This leads to poor scalability in practical re-id deployment, because exhaustive labelling of positive and negative image pairs for every camera pair is infeasible. In this work, we address this problem by proposing an unsupervised re-id deep learning approach capable of incrementally discovering and exploiting the underlying re-id discriminative information from automatically generated person tracklet data extracted from videos, in an end-to-end model optimisation. We formulate a Tracklet Association Unsupervised Deep Learning (TAUDL) framework characterised by jointly learning per-camera (within-camera) tracklet association (labelling) and cross-camera tracklet correlation by maximising the discovery of the most likely tracklet relationships across camera views. Extensive experiments demonstrate the superiority of the proposed TAUDL model over state-of-the-art unsupervised and domain adaptation re-id methods on six person re-id benchmarking datasets.
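As a hedged sketch of the per-camera tracklet association idea: each camera gets its own classifier over its automatically generated tracklet labels, and the per-camera classification losses are averaged; the cross-camera correlation term is omitted here. The class and argument names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed details): within-camera tracklet association treated as
# per-camera classification over automatically generated tracklet labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerCameraTrackletHeads(nn.Module):
    def __init__(self, feat_dim, tracklets_per_camera):
        super().__init__()
        # One classifier per camera, sized by that camera's tracklet count.
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in tracklets_per_camera])

    def forward(self, features, camera_ids, tracklet_labels):
        losses = []
        for cam, head in enumerate(self.heads):
            sel = camera_ids == cam
            if sel.any():
                # Within-camera tracklet association as a classification loss.
                losses.append(F.cross_entropy(head(features[sel]), tracklet_labels[sel]))
        return torch.stack(losses).mean() if losses else features.new_zeros(())
```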