100 research outputs found
Surface Functionalization of Black Phosphorus via Amine Compounds and Its Impacts on the Flame Retardancy and Thermal Decomposition Behaviors of Epoxy Resin.
Considerable effort has recently been devoted to stabilizing black phosphorus (BP) in air and improving its compatibility with polymers. Herein, BP was chemically functionalized with an aliphatic amine (DETA), an aromatic amine (PPDA) and a cyclamine (Pid) via a nucleophilic substitution reaction, aiming to develop a highly reactive BP flame retardant for epoxy resin (EP). The -NH2 groups on BP-DETA, BP-PPDA and BP-Pid reacted with the epoxide group at different temperatures, with BP-DETA reacting at the lowest temperature of about 150 °C. The impacts of the three BP-NH2 variants on the flame retardancy and thermal decomposition of EP were compared. At 5 wt% loading, all EP/BP-NH2 composites passed the UL-94 V-0 rating. The limiting oxygen index (LOI) of EP/BP-PPDA reached 32.3%. Compared with neat EP, the heat release rate (HRR) of EP/BP-DETA decreased by 46% and its char residue increased by 73.8%, whereas the HRR of EP/BP-Pid decreased by 11.5% and its char residue increased by 50.8%. The average effective heat of combustion (av-EHC) of EP/BP-Pid was lower than that of EP/BP-DETA and EP/BP-PPDA. Regarding the flame-retardant mechanism, BP nanosheets functionalized with the aliphatic and aromatic amines acted predominantly in the condensed phase, while BP functionalized with the cyclamine was more effective in the gas phase.
Unified Visual Relationship Detection with Vision and Language Models
This work focuses on training a single visual relationship detector
predicting over the union of label spaces from multiple datasets. Merging
labels that span different datasets is challenging because of inconsistent
taxonomies. The issue is exacerbated in visual relationship detection when
second-order visual semantics are introduced between pairs of objects. To
address this challenge, we propose UniVRD, a novel bottom-up method for Unified
Visual Relationship Detection by leveraging vision and language models (VLMs).
VLMs provide well-aligned image and text embeddings, where similar
relationships are optimized to be close to each other for semantic unification.
Our bottom-up design enables the model to enjoy the benefit of training with
both object detection and visual relationship datasets. Empirical results on
both human-object interaction detection and scene-graph generation demonstrate
the competitive performance of our model. UniVRD achieves 38.07 mAP on
HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP.
More importantly, we show that our unified detector performs as well as
dataset-specific models in mAP, and achieves further improvements when we scale
up the model. Our code will be made publicly available on GitHub.
Comment: Accepted to ICCV 2023. Code is available at
https://github.com/google-research/scenic/tree/main/scenic/projects/univr
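To make the semantic-unification idea concrete, the sketch below embeds relationship labels from two hypothetical datasets with a CLIP text encoder and merges near-duplicate labels by cosine similarity. This is only an illustration of the general mechanism described in the abstract, not the UniVRD code; the label lists, the checkpoint, and the 0.95 threshold are assumptions.

```python
# Illustrative sketch (not the UniVRD implementation): unify relationship
# labels from two datasets by embedding them with a CLIP text encoder and
# merging labels whose embeddings are nearly identical. The label lists and
# the 0.95 threshold are made up for illustration.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

labels_a = ["person riding bicycle", "person holding cup"]   # e.g. HOI labels
labels_b = ["human ride bike", "dog next to person"]         # e.g. scene-graph labels
labels = labels_a + labels_b

with torch.no_grad():
    inputs = tokenizer(labels, padding=True, return_tensors="pt")
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize

sim = emb @ emb.T                            # cosine similarity matrix
# Labels from different datasets whose similarity exceeds a threshold are
# treated as the same relationship in the unified label space.
merge_pairs = [(labels[i], labels[j])
               for i in range(len(labels_a))
               for j in range(len(labels_a), len(labels))
               if sim[i, j] > 0.95]
print(merge_pairs)
```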
Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
Most existing video-language pre-training methods focus on instance-level
alignment between video clips and captions via global contrastive learning but
neglect the rich fine-grained local information that is important for
downstream tasks requiring temporal localization and semantic reasoning. In
this work, we propose a simple yet effective video-language pre-training
framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two
novel designs involving spatiotemporal grounding and temporal grouping promote
learning local region-noun alignment and temporal-aware features
simultaneously. Specifically, spatiotemporal grounding aggregates semantically
similar video tokens and aligns them with noun phrases extracted from the
caption to promote local region-noun correspondences. Moreover, temporal
grouping leverages cut-and-paste to manually create temporal scene changes and
then learns distinguishable features from different scenes. Comprehensive
evaluations demonstrate that G-ViLM performs favorably against existing
approaches on four representative downstream tasks, covering text-video
retrieval, video question answering, video action recognition and temporal
action localization. G-ViLM performs competitively on all evaluated tasks and
in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9%
higher than the state-of-the-art method.
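As a rough illustration of the temporal-grouping design mentioned above, the sketch below splices a segment of one clip into another to manufacture a scene change and keeps per-frame scene labels. The tensor shapes and splice position are placeholders, and this is not the authors' implementation.

```python
# Illustrative sketch (not the G-ViLM code): the "cut-and-paste" temporal
# grouping idea creates an artificial scene change by splicing a segment from
# a second video into the first, recording which frames belong to which scene
# so the model can be trained to separate them.
import torch

def cut_and_paste(video_a: torch.Tensor, video_b: torch.Tensor, start: int, length: int):
    """video_a, video_b: (T, C, H, W) frame tensors; returns the mixed clip
    and a per-frame scene label (0 = from video_a, 1 = pasted from video_b)."""
    t = video_a.shape[0]
    mixed = video_a.clone()
    mixed[start:start + length] = video_b[:length]
    scene_labels = torch.zeros(t, dtype=torch.long)
    scene_labels[start:start + length] = 1
    return mixed, scene_labels

# Toy usage with random "videos" of 16 frames.
va, vb = torch.randn(16, 3, 224, 224), torch.randn(16, 3, 224, 224)
clip, labels = cut_and_paste(va, vb, start=6, length=4)
print(labels)  # tensor([0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0])
```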
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
This paper proposes a method for generating images of customized objects
specified by users. The method is based on a general framework that bypasses
the lengthy optimization required by previous approaches, which often employ a
per-object optimization paradigm. Our framework adopts an encoder to capture
high-level identifiable semantics of objects, producing an object-specific
embedding with only a single feed-forward pass. The acquired object embedding
is then passed to a text-to-image synthesis model for subsequent generation. To
effectively blend an object-aware embedding space into a well-developed
text-to-image model under the same generation context, we investigate different
network designs and training strategies, and propose a simple yet effective
regularized joint training scheme with an object identity preservation loss.
Additionally, we propose a caption generation scheme that becomes a critical
piece in ensuring that the object-specific embedding is faithfully reflected in
the generation process, while keeping control and editing abilities. Once trained,
the network is able to produce diverse content and styles, conditioned on both
texts and objects. We demonstrate through experiments that our proposed method
is able to synthesize images with compelling output quality, appearance
diversity, and object fidelity, without the need for test-time optimization.
Systematic studies are also conducted to analyze our models, providing insights
for future work.
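The regularized joint training scheme is described only at a high level in the abstract; as a hedged sketch of what such an objective can look like, the snippet below combines a denoising loss with a cosine-similarity identity term. The weighting and the exact form of the identity loss are assumptions, not the paper's definition.

```python
# Illustrative sketch (not the paper's training code): a regularized joint
# objective combining a denoising (text-to-image) loss with an object identity
# preservation term. The lambda weight and the cosine-based identity term are
# assumptions made for illustration only.
import torch
import torch.nn.functional as F

def joint_loss(noise_pred: torch.Tensor,
               noise_target: torch.Tensor,
               obj_emb: torch.Tensor,
               obj_emb_ref: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # Standard diffusion denoising loss on the predicted noise.
    denoise = F.mse_loss(noise_pred, noise_target)
    # Identity preservation: keep the object embedding produced during
    # training close to the reference embedding from the frozen encoder.
    identity = 1.0 - F.cosine_similarity(obj_emb, obj_emb_ref, dim=-1).mean()
    return denoise + lam * identity

# Toy usage with random tensors standing in for model outputs.
loss = joint_loss(torch.randn(4, 64), torch.randn(4, 64),
                  torch.randn(4, 512), torch.randn(4, 512))
print(loss)
```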
Twisted van der Waals Quantum Materials: Fundamentals, Tunability and Applications
Twisted van der Waals (vdW) quantum materials have emerged as a rapidly
developing field of 2D semiconductor research. They establish a new central
research area and provide a promising platform for studying quantum phenomena
and investigating
the engineering of novel optoelectronic properties such as single-photon
emission, non-linear optical response, magnon physics, and topological
superconductivity. These captivating electronic and optical properties result
from, and can be tailored by, the interlayer coupling using moiré patterns
formed by vertically stacking atomic layers with controlled angle
misorientation or lattice mismatch. Their outstanding properties and the high
degree of tunability position them as compelling building blocks for both
compact quantum-enabled devices and classical optoelectronics. This article
offers a comprehensive review of recent advancements in the understanding and
manipulation of twisted van der Waals structures and presents a survey of the
state-of-the-art research on moiré superlattices, encompassing
interdisciplinary interests. It delves into fundamental theories; synthesis,
fabrication, and visualization techniques; and the wide range of novel physical
phenomena exhibited by these structures, with a focus on their potential for
practical device integration in applications ranging from quantum information
to biosensors, and including classical optoelectronics such as modulators,
light emitting diodes (LEDs), lasers, and photodetectors. It highlights the
unique ability of moiré superlattices to connect multiple disciplines,
covering chemistry, electronics, optics, photonics, magnetism, topological and
quantum physics. This comprehensive review provides a valuable resource for
researchers interested in moiré superlattices, shedding light on their
fundamental characteristics and their potential for transformative applications
in various fields.
Comment: 179 pages, 42 figures, Chemical Review
Performance over Random
This paper proposes a new evaluation approach for video summarization algorithms. We start by studying the currently established evaluation protocol; defined over the ground-truth annotations of the SumMe and TVSum datasets, this protocol quantifies the agreement between the user-defined and the automatically created summaries with the F-Score and reports the average performance over a few different training/testing splits of the used dataset. We evaluate five publicly available summarization algorithms under a large-scale experimental setting with 50 randomly created data splits. We show that the results reported in the papers are not always congruent with their performance in the large-scale experiment, and that the F-Score cannot be used to compare algorithms evaluated on different splits. We also show that these shortcomings of the established evaluation protocol stem from the significantly varying levels of difficulty among the utilized splits, which affect the outcomes of the evaluations. Further analysis of these findings indicates a noticeable performance correlation between all algorithms and a random summarizer. To mitigate these shortcomings, we propose an evaluation protocol that estimates the difficulty of each used data split and utilizes this information during the evaluation process. Experiments involving different evaluation settings demonstrate the increased representativeness of performance results when using the proposed evaluation approach, and the increased reliability of comparisons when the examined methods have been evaluated on different data splits.
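In the spirit of the split-sensitivity analysis above, the sketch below averages a summarizer's F-Score over many random train/test splits and reports it relative to a random summarizer evaluated on the same splits, so that split difficulty is factored out. This is not the paper's protocol; the `fscore` callable, split sizes, and toy scores are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's evaluation code): average the gain of a
# summarizer over a random summarizer across many random test splits.
# `fscore` maps (method, video) to an F-Score and stands in for the actual
# SumMe/TVSum evaluation routine.
import random
import statistics
from typing import Callable, Sequence

def performance_over_random(fscore: Callable[[str, str], float],
                            videos: Sequence[str],
                            method: str = "ours",
                            baseline: str = "random",
                            n_splits: int = 50,
                            test_ratio: float = 0.2) -> float:
    deltas = []
    for seed in range(n_splits):
        rng = random.Random(seed)
        test = rng.sample(list(videos), max(1, int(len(videos) * test_ratio)))
        ours = statistics.mean(fscore(method, v) for v in test)
        rand = statistics.mean(fscore(baseline, v) for v in test)
        deltas.append(ours - rand)   # gain over the random summarizer on this split
    return statistics.mean(deltas)

# Toy usage with fabricated per-video scores.
toy_scores = {("ours", f"v{i}"): 0.45 + 0.01 * (i % 5) for i in range(20)}
toy_scores.update({("random", f"v{i}"): 0.40 for i in range(20)})
print(performance_over_random(lambda m, v: toy_scores[(m, v)],
                              [f"v{i}" for i in range(20)]))
```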
Adversarial Bipartite Graph Learning for Video Domain Adaptation
Domain adaptation techniques, which focus on adapting models between
distributionally different domains, are rarely explored in the video
recognition area due to the significant spatial and temporal shifts across the
source (i.e., training) and target (i.e., test) domains. As such, recent works
on visual domain adaptation, which leverage adversarial learning to unify the
source and target video representations and strengthen feature
transferability, are not highly effective on videos. To overcome this
limitation, in this paper, we learn a domain-agnostic video classifier instead
of learning domain-invariant representations, and propose an Adversarial
Bipartite Graph (ABG) learning framework which directly models the
source-target interactions with a network topology of the bipartite graph.
Specifically, the source and target frames are sampled as heterogeneous
vertices, while the edges connecting the two types of nodes measure the
affinity between them. Through message passing, each vertex aggregates features
from its heterogeneous neighbors, forcing features from the same class to be
mixed evenly. Explicitly exposing the video classifier to such cross-domain
representations at the training and test stages makes our model less biased to
the labeled source data, which in turn results in better generalization on the
target domain. To further enhance the model capacity and verify the robustness
of the proposed architecture on difficult transfer tasks, we extend our model
to work in a semi-supervised setting using an additional video-level bipartite
graph. Extensive experiments conducted on four benchmarks demonstrate the
effectiveness of the proposed approach over state-of-the-art methods on the
task of video recognition.
Comment: Proceedings of the 28th ACM International Conference on Multimedia
(MM '20)
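To illustrate the cross-domain aggregation idea, the sketch below runs one round of message passing on a bipartite graph whose two vertex sets are source and target frame features; the dot-product affinity and softmax weighting are assumptions for illustration, not the ABG implementation.

```python
# Illustrative sketch (not the ABG implementation): one round of message
# passing on a bipartite graph of source-domain and target-domain frame
# features. Each vertex aggregates features from the other domain, weighted by
# a softmax over pairwise affinities.
import torch
import torch.nn.functional as F

def bipartite_message_passing(src: torch.Tensor, tgt: torch.Tensor):
    """src: (Ns, D) source frame features, tgt: (Nt, D) target frame features.
    Returns cross-domain aggregated features for both sides."""
    affinity = src @ tgt.T                       # (Ns, Nt) edge weights
    src_agg = F.softmax(affinity, dim=1) @ tgt   # each source vertex mixes target features
    tgt_agg = F.softmax(affinity.T, dim=1) @ src
    return src_agg, tgt_agg

# Toy usage with random frame features.
src_feat, tgt_feat = torch.randn(8, 256), torch.randn(6, 256)
src_mixed, tgt_mixed = bipartite_message_passing(src_feat, tgt_feat)
print(src_mixed.shape, tgt_mixed.shape)   # torch.Size([8, 256]) torch.Size([6, 256])
```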
VideoPrism: A Foundational Visual Encoder for Video Understanding
We introduce VideoPrism, a general-purpose video encoder that tackles diverse
video understanding tasks with a single frozen model. We pretrain VideoPrism on
a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M
video clips with noisy parallel text (e.g., ASR transcripts). The pretraining
approach improves upon masked autoencoding by global-local distillation of
semantic video embeddings and a token shuffling scheme, enabling VideoPrism to
focus primarily on the video modality while leveraging the invaluable text
associated with videos. We extensively test VideoPrism on four broad groups of
video understanding tasks, from web video question answering to CV for science,
achieving state-of-the-art performance on 31 out of 33 video understanding
benchmarks.
Comment: Accepted to ICML 2024. v2: added retrieval results on MSRVTT (1K-A),
more data analyses, and ablation studies.
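The token shuffling scheme is only named in the abstract; as a minimal sketch of the general idea, the snippet below permutes the order of video tokens with a random permutation and keeps the permutation so targets can be aligned afterwards. The shapes and where the shuffle sits in the pipeline are assumptions, not VideoPrism's actual pretraining code.

```python
# Illustrative sketch (not VideoPrism's pretraining code): shuffle the order of
# video tokens with a random per-sample permutation, returning the permutation
# so the original order (or the targets) can be recovered later.
import torch

def shuffle_tokens(tokens: torch.Tensor):
    """tokens: (B, N, D). Returns shuffled tokens and the permutation used."""
    b, n, _ = tokens.shape
    perm = torch.stack([torch.randperm(n) for _ in range(b)])          # (B, N)
    shuffled = torch.gather(tokens, 1, perm.unsqueeze(-1).expand_as(tokens))
    return shuffled, perm

# Toy usage.
toks = torch.randn(2, 16, 768)
shuffled, perm = shuffle_tokens(toks)
print(shuffled.shape, perm.shape)   # torch.Size([2, 16, 768]) torch.Size([2, 16])
```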
VideoGLUE: Video General Understanding Evaluation of Foundation Models
We evaluate the video understanding capabilities of existing foundation models
using a carefully designed experimental protocol consisting of three hallmark
tasks (action recognition, temporal localization, and spatiotemporal
localization), eight datasets well received by the community, and four
adaptation methods tailoring a foundation model (FM) for a downstream task.
Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when
adapting to general video understanding tasks. Our main findings are as
follows. First, task-specialized models significantly outperform the six FMs
studied in this work, in sharp contrast to what FMs have achieved in natural
language and image understanding. Second, video-native FMs, whose pretraining
data contains the video modality, are generally better than image-native FMs in
classifying motion-rich videos, localizing actions in time, and understanding a
video of more than one action. Third, the video-native FMs can perform well on
video tasks under light adaptations to downstream tasks (e.g., freezing the FM
backbones), while image-native FMs win in full end-to-end finetuning. The first
two observations reveal the need and tremendous opportunities to conduct
research on video-focused FMs, and the last confirms that both tasks and
adaptation methods matter when it comes to the evaluation of FMs.
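The abstract does not spell out how the VGS is computed, so the following is only a toy illustration of folding per-task scores and an adaptation-cost discount into one scalar; it is not the paper's VGS definition, and the numbers and aggregation rule are placeholders.

```python
# Illustrative only: the actual VideoGLUE score (VGS) is defined in the paper.
# This toy aggregation merely shows the kind of efficacy-vs-efficiency
# trade-off such a scalar can capture; all values below are made up.
from statistics import mean

def toy_adaptation_score(task_scores: dict, adaptation_cost: float) -> float:
    """Average benchmark score discounted by a normalized adaptation cost."""
    return mean(task_scores.values()) / (1.0 + adaptation_cost)

# Hypothetical numbers for one foundation model under a light adaptation.
scores = {"action_recognition": 0.71, "temporal_loc": 0.42, "spatiotemporal_loc": 0.35}
print(toy_adaptation_score(scores, adaptation_cost=0.25))
```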