100 research outputs found

    Surface Functionalization of Black Phosphorus via Amine Compounds and Its Impacts on the Flame Retardancy and Thermal Decomposition Behaviors of Epoxy Resin.

    Recently, much effort has been devoted to stabilizing black phosphorus (BP) in air and improving its compatibility with polymers. Herein, BP was chemically functionalized with an aliphatic amine (DETA), an aromatic amine (PPDA) and a cyclic amine (Pid) via a nucleophilic substitution reaction, aiming to develop a highly reactive BP flame retardant for epoxy resin (EP). The -NH2 groups on BP-DETA, BP-PPDA and BP-Pid reacted with the epoxide group at different temperatures, the lowest being about 150 °C for BP-DETA. The impacts of the three BP-NH2 variants on the flame retardancy and thermal decomposition of EP were compared. At 5 wt% loading, all EP/BP-NH2 composites passed the UL-94 V-0 rating, and the limiting oxygen index (LOI) of EP/BP-PPDA reached 32.3%. Compared with neat EP, the heat release rate (HRR) of EP/BP-DETA decreased by 46% and its char residue increased by 73.8%, whereas the HRR of EP/BP-Pid decreased by 11.5% and its char residue increased by 50.8%. The average effective heat of combustion (av-EHC) of EP/BP-Pid was lower than that of EP/BP-DETA and EP/BP-PPDA. In terms of the flame-retardant mechanism, BP nanosheets functionalized with the aliphatic and aromatic amines acted mainly in the condensed phase, while BP functionalized with the cyclic amine was more effective in the gas phase.

    Unified Visual Relationship Detection with Vision and Language Models

    This work focuses on training a single visual relationship detector that predicts over the union of label spaces from multiple datasets. Merging labels spanning different datasets can be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection that leverages vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefits of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.
    Comment: Accepted to ICCV 2023. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/univr
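
    The abstract describes merging relationship labels from different datasets by relying on a VLM text encoder to place semantically similar relationships near each other in embedding space. Below is a minimal sketch of that idea; the `embed_label` stub (standing in for a real text encoder) and the similarity threshold are illustrative assumptions, not UniVRD's actual components.

```python
# Sketch: unify relationship labels across datasets by grouping labels whose
# text embeddings are highly similar. With a real VLM text encoder, phrases
# like "ride horse" and "riding a horse" would land in the same group; the
# random stub below only demonstrates the mechanics.
import numpy as np

rng = np.random.default_rng(0)

def embed_label(label: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a VLM text encoder; returns a unit-norm vector."""
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

def unify_label_spaces(label_sets, threshold=0.9):
    """Greedily merge labels whose embeddings have cosine similarity >= threshold."""
    groups, group_embs = [], []
    for labels in label_sets:
        for label in labels:
            emb = embed_label(label)
            sims = [float(emb @ g) for g in group_embs]
            if sims and max(sims) >= threshold:
                groups[int(np.argmax(sims))].add(label)  # join closest existing group
            else:
                groups.append({label})                   # start a new group
                group_embs.append(emb)
    return groups

print(unify_label_spaces([["ride horse", "hold cup"], ["riding a horse", "on top of"]]))
```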

    Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding

    Most existing video-language pre-training methods focus on instance-level alignment between video clips and captions via global contrastive learning, but they neglect rich fine-grained local information, which is important for downstream tasks requiring temporal localization and semantic reasoning. In this work, we propose a simple yet effective video-language pre-training framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two novel designs, spatiotemporal grounding and temporal grouping, promote learning of local region-noun alignment and temporally aware features simultaneously. Specifically, spatiotemporal grounding aggregates semantically similar video tokens and aligns them with noun phrases extracted from the caption to promote local region-noun correspondences. Moreover, temporal grouping leverages cut-and-paste to manually create temporal scene changes and then learns distinguishable features from different scenes. Comprehensive evaluations demonstrate that G-ViLM performs favorably against existing approaches on four representative downstream tasks: text-video retrieval, video question answering, video action recognition and temporal action localization. G-ViLM performs competitively on all evaluated tasks and, in particular, achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9% higher than the state-of-the-art method.
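
    The spatiotemporal grounding step above aggregates semantically similar video tokens and aligns the resulting groups with noun phrases from the caption. The sketch below shows one plausible form of that local region-noun alignment, using soft grouping onto learnable centers and a symmetric InfoNCE loss; the shapes, the number of groups, and the temperature are illustrative assumptions rather than G-ViLM's exact implementation.

```python
# Sketch of local region-noun alignment: pool video tokens into K groups via
# soft assignment to group centers, then contrastively align group features
# with matched noun-phrase embeddings.
import torch
import torch.nn.functional as F

def group_tokens(video_tokens, centers):
    """video_tokens: (N, D); centers: (K, D) -> (K, D) pooled group features."""
    assign = torch.softmax(video_tokens @ centers.t(), dim=0)  # (N, K) soft assignment
    return assign.t() @ video_tokens                           # weighted pooling

def region_noun_contrastive(group_feats, noun_embs, temperature=0.07):
    """Symmetric InfoNCE between K grouped regions and their K noun phrases."""
    g = F.normalize(group_feats, dim=-1)
    n = F.normalize(noun_embs, dim=-1)
    logits = g @ n.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

tokens = torch.randn(196, 256)   # patch tokens from one clip (assumed shape)
centers = torch.randn(8, 256)    # learnable group centers (assumed)
nouns = torch.randn(8, 256)      # noun-phrase embeddings from the caption
print(region_noun_contrastive(group_tokens(tokens, centers), nouns).item())
```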

    Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

    This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring the object-specific embedding is faithfully reflected in the generation process while preserving control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
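
    The regularized joint training scheme pairs the usual text-to-image objective with an object identity preservation loss. A minimal sketch of how such a combined objective could look is given below; the cosine-based identity term and the weighting factor are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a joint objective: standard diffusion denoising loss plus an
# identity term that keeps the object embedding recoverable from the
# model's prediction. All tensors here are toy stand-ins.
import torch
import torch.nn.functional as F

def joint_loss(pred_noise, true_noise, obj_emb_ref, obj_emb_pred, lam=0.1):
    denoise = F.mse_loss(pred_noise, true_noise)  # usual diffusion training loss
    identity = 1.0 - F.cosine_similarity(obj_emb_ref, obj_emb_pred, dim=-1).mean()
    return denoise + lam * identity               # lam trades fidelity vs. flexibility

pred, true = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
emb_ref, emb_pred = torch.randn(4, 768), torch.randn(4, 768)
print(joint_loss(pred, true, emb_ref, emb_pred).item())
```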

    Twisted van der Waals Quantum Materials: Fundamentals, Tunability and Applications

    Twisted van der Waals (vdW) quantum materials have emerged as a rapidly developing field of 2D semiconductors. These materials establish a new central research area and provide a promising platform for studying quantum phenomena and for engineering novel optoelectronic properties such as single-photon emission, non-linear optical response, magnon physics, and topological superconductivity. These captivating electronic and optical properties result from, and can be tailored by, the interlayer coupling using moiré patterns formed by vertically stacking atomic layers with controlled angle misorientation or lattice mismatch. Their outstanding properties and high degree of tunability position them as compelling building blocks for both compact quantum-enabled devices and classical optoelectronics. This article offers a comprehensive review of recent advancements in the understanding and manipulation of twisted van der Waals structures and presents a survey of the state-of-the-art research on moiré superlattices, encompassing interdisciplinary interests. It delves into fundamental theories, synthesis and fabrication, visualization techniques, and the wide range of novel physical phenomena exhibited by these structures, with a focus on their potential for practical device integration in applications ranging from quantum information to biosensors, including classical optoelectronics such as modulators, light-emitting diodes (LEDs), lasers, and photodetectors. It highlights the unique ability of moiré superlattices to connect multiple disciplines, covering chemistry, electronics, optics, photonics, magnetism, and topological and quantum physics. This comprehensive review provides a valuable resource for researchers interested in moiré superlattices, shedding light on their fundamental characteristics and their potential for transformative applications in various fields.
    Comment: 179 pages, 42 figures, Chemical Review
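
    For a concrete sense of the length scales set by the moiré pattern: when two identical lattices with lattice constant a are twisted by a small angle θ, the moiré period is λ = a / (2 sin(θ/2)). The short calculation below applies this textbook relation to twisted bilayer graphene near its ~1.1° magic angle; the specific numbers are a familiar example, not taken from this review.

```python
# Moiré superlattice period for two identical lattices twisted by a small angle:
#   lambda = a / (2 * sin(theta / 2))
import math

def moire_period(lattice_constant_nm: float, twist_deg: float) -> float:
    theta = math.radians(twist_deg)
    return lattice_constant_nm / (2.0 * math.sin(theta / 2.0))

# Graphene: a ~= 0.246 nm, magic angle ~= 1.1 degrees -> ~13 nm moiré period.
print(f"{moire_period(0.246, 1.1):.1f} nm")
```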

    Performance over Random

    This paper proposes a new evaluation approach for video summarization algorithms. We start by studying the currently established evaluation protocol; this protocol, defined over the ground-truth annotations of the SumMe and TVSum datasets, quantifies the agreement between the user-defined and the automatically-created summaries with F-Score, and reports the average performance on a few different training/testing splits of the used dataset. We evaluate five publicly-available summarization algorithms under a large-scale experimental setting with 50 randomly-created data splits. We show that the results reported in the papers are not always congruent with their performance in the large-scale experiment, and that the F-Score cannot be used for comparing algorithms evaluated on different splits. We also show that these shortcomings of the established evaluation protocol are due to the significantly varying levels of difficulty among the utilized splits, which affect the outcomes of the evaluations. Further analysis of these findings indicates a noticeable correlation between the performance of all algorithms and that of a random summarizer. To mitigate these shortcomings, we propose an evaluation protocol that estimates the difficulty of each used data split and utilizes this information during the evaluation process. Experiments involving different evaluation settings demonstrate the increased representativeness of performance results when using the proposed evaluation approach, and the increased reliability of comparisons when the examined methods have been evaluated on different data splits.
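
    The key idea of the proposed protocol is to gauge each split's difficulty with a random summarizer and judge methods relative to that baseline. The sketch below illustrates a simple "performance over random" normalization in that spirit; the baseline values and the ratio-based aggregation are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative "performance over random" evaluation: a method's F-score on
# each split is reported relative to the F-score a random summarizer obtains
# on the same split, then averaged across splits.
import random
import statistics

def random_summarizer_fscore(split_seed: int) -> float:
    """Stand-in for evaluating a random summarizer on one data split."""
    random.seed(split_seed)
    return random.uniform(0.35, 0.45)  # hypothetical baseline F-scores

def performance_over_random(method_fscores):
    ratios = [f / random_summarizer_fscore(seed)
              for seed, f in enumerate(method_fscores)]
    return statistics.mean(ratios)

# Hypothetical F-scores of one method on 5 of the 50 random splits.
print(round(performance_over_random([0.46, 0.51, 0.44, 0.48, 0.50]), 3))
```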

    Adversarial Bipartite Graph Learning for Video Domain Adaptation

    Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area due to the significant spatial and temporal shifts across the source (i.e., training) and target (i.e., test) domains. As such, recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations and strengthen feature transferability, are not highly effective on videos. To overcome this limitation, in this paper we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework that directly models the source-target interactions with the network topology of a bipartite graph. Specifically, the source and target frames are sampled as heterogeneous vertices, while the edges connecting the two types of nodes measure the affinity between them. Through message passing, each vertex aggregates the features from its heterogeneous neighbors, forcing features coming from the same class to be mixed evenly. Explicitly exposing the video classifier to such cross-domain representations at the training and test stages makes our model less biased toward the labeled source data, which in turn results in better generalization on the target domain. To further enhance the model capacity and verify the robustness of the proposed architecture on difficult transfer tasks, we extend our model to work in a semi-supervised setting using an additional video-level bipartite graph. Extensive experiments conducted on four benchmarks demonstrate the effectiveness of the proposed approach over state-of-the-art methods on the task of video recognition.
    Comment: Proceedings of the 28th ACM International Conference on Multimedia (MM '20)
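
    The core of the ABG framework is message passing over a bipartite graph whose two vertex sets are sampled source and target frames. Below is a minimal sketch of one round of such message passing; the dot-product affinity, softmax normalization, and feature sizes are illustrative assumptions rather than the paper's exact architecture.

```python
# One round of bipartite message passing: edge weights are normalized
# affinities, and each vertex aggregates features from the other domain.
import torch

def bipartite_message_passing(src_feats, tgt_feats):
    """src_feats: (Ns, D) source-frame features; tgt_feats: (Nt, D) target-frame features."""
    affinity = src_feats @ tgt_feats.t()                           # (Ns, Nt) edge scores
    src_updated = torch.softmax(affinity, dim=1) @ tgt_feats       # source gathers from target
    tgt_updated = torch.softmax(affinity.t(), dim=1) @ src_feats   # target gathers from source
    return src_updated, tgt_updated

src = torch.randn(16, 128)   # sampled source frames
tgt = torch.randn(16, 128)   # sampled target frames
new_src, new_tgt = bipartite_message_passing(src, tgt)
print(new_src.shape, new_tgt.shape)
```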

    VideoPrism: A Foundational Visual Encoder for Video Understanding

    We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding with global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.
    Comment: Accepted to ICML 2024. v2: added retrieval results on MSRVTT (1K-A), more data analyses, and ablation studies
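
    The abstract mentions global-local distillation of semantic video embeddings on top of masked autoencoding. One plausible reading of the global-local part is sketched below, where a student's clip-level and per-token embeddings are both regressed onto a teacher's; the cosine losses and equal weighting are assumptions for illustration, not VideoPrism's actual recipe.

```python
# Sketch of a global-local distillation objective: match the student's
# mean-pooled (global) embedding and its per-token (local) embeddings to a
# teacher's corresponding embeddings.
import torch
import torch.nn.functional as F

def global_local_distill(student_tokens, teacher_tokens):
    """student_tokens, teacher_tokens: (B, N, D) video token embeddings."""
    g_loss = 1.0 - F.cosine_similarity(student_tokens.mean(dim=1),
                                       teacher_tokens.mean(dim=1), dim=-1).mean()
    l_loss = 1.0 - F.cosine_similarity(student_tokens, teacher_tokens, dim=-1).mean()
    return g_loss + l_loss

student = torch.randn(2, 1568, 768)  # e.g. 8x14x14 spatiotemporal tokens (assumed)
teacher = torch.randn(2, 1568, 768)
print(global_local_distill(student, teacher).item())
```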

    VideoGLUE: Video General Understanding Evaluation of Foundation Models

    We evaluate the video understanding capabilities of existing foundation models using a carefully designed experimental protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods for tailoring a foundation model (FM) to a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding videos with more than one action. Third, video-native FMs can perform well on video tasks under light adaptation to downstream tasks (e.g., freezing the FM backbone), while image-native FMs win with full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities for research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when evaluating FMs.
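
    The VGS is described as a single scalar capturing both efficacy and efficiency of adapting an FM. Purely as an illustration of that kind of composite (and not the paper's actual VGS definition), the sketch below averages task scores across adaptation methods while discounting costlier adaptations by the fraction of parameters they tune.

```python
# Illustrative efficacy-vs-efficiency composite: average task performance
# across (task, adaptation method) pairs, discounted by adaptation cost.
# This is a hypothetical stand-in, not the VideoGLUE score formula.
def scalar_fm_score(results):
    """results: list of (task_score, tuned_param_fraction) pairs."""
    weighted = [score * (1.0 - 0.5 * cost) for score, cost in results]
    return sum(weighted) / len(weighted)

# Hypothetical numbers: frozen backbone (cheap) vs. full finetuning (costly).
print(round(scalar_fm_score([(0.62, 0.01), (0.71, 1.00)]), 3))
```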