
    Multimodal knowledge integration for object detection and visual reasoning

    We humans still perceive and reason differently from artificial intelligence models. We watch, we listen, we touch: we understand the world through multi-modal sensing, while machine models rely on only one or a few modalities and ignore abundant information. In this thesis, we explore techniques for reducing the perception gap between machines and humans, focusing on two families of tasks: reasoning and detection. First, we incorporate information from text, audio, motion, and external knowledge bases to train computer vision models. We find that inputs from these broader channels provide complementary information that improves the models. Second, we study how multimodal inputs can be fully utilized. We argue that most existing deep learning methods tend to pay too much attention to shallow patterns in the input features, which biases the resulting models; we propose robust training to overcome this issue. Third, we extend the benefits of multi-modal information from the inputs to the supervision signals, learning a weakly supervised detection model from the natural supervision of textual captions or audio narrations. With the help of NLP constituency parsing, structural knowledge can be extracted from the captions and narrations to determine the entities and relations of visual objects.
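    The caption-parsing idea in the last step can be sketched in plain Python, assuming a bracketed constituency parse of the caption is already available (in practice it would come from an NLP constituency parser such as the one in NLTK); `parse_sexpr` and `noun_phrases` are hypothetical helper names for illustration, not functions from the thesis.

```python
# Minimal sketch: extract candidate entity phrases (NP subtrees) from a
# bracketed constituency parse of a caption. The parse string is assumed
# to be produced upstream by a constituency parser.

def parse_sexpr(s):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def walk():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(walk())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                      # consume ")"
        return (label, children)
    return walk()

def leaves(node):
    """Return the word sequence under a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for c in children for w in leaves(c)]

def noun_phrases(node, out=None):
    """Collect the word span of every NP subtree (candidate entities)."""
    if out is None:
        out = []
    if isinstance(node, tuple):
        label, children = node
        if label == "NP":
            out.append(" ".join(leaves(node)))
        for c in children:
            noun_phrases(c, out)
    return out

tree = parse_sexpr(
    "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT a) (NN ball))))")
print(noun_phrases(tree))  # → ['the dog', 'a ball']
```

    The verb linking two NP spans (here "chases") would similarly yield a candidate relation between the two visual objects.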

    Automatic Understanding of Image and Video Advertisements

    There is more to images than their objective physical content: for example, advertisements are created to persuade a viewer to take a certain action. We propose the novel problem of automatic advertisement understanding. To enable research on this problem, we create two datasets: an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. Our data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it?"), and symbolic references ads make (e.g. a dove symbolizes peace). We also analyze the most common persuasive strategies ads use, and the capabilities that computer vision systems should have to understand these strategies. We present baseline classification results for several prediction tasks, including automatically answering questions about the messages of the ads.
    Comment: To appear in CVPR 2017; data available on http://cs.pitt.edu/~kovashka/ad

    What Explains Natives and Sojourners Preventive Health Behavior in a Pandemic: Role of Media and Scientific Self-Efficacy

    The COVID-19 pandemic triggered a severe global public health emergency. The current research investigated and compared natives' and sojourners' health-protective behavior in mainland China during the pandemic. We adopted a unified view to propose our theoretical model by adapting the Health Belief Model (HBM) and Institutional Theory (IT). Data obtained through an online survey questionnaire from 435 respondents during the second and third quarters were analyzed. Structural equation modeling (SEM) was used to empirically analyze the proposed model. Media self-efficacy (MSE), scientific self-efficacy (SSE), perceived health risks (PHRs), and the perceived benefits of being protected all have positive and significant effects on the formation of health-protective behavioral intentions among natives and sojourners in mainland China. Media self-efficacy and SSE can play a strategic role in shaping public health-protective behavior. The current research recommends effective communication with sojourners during a crisis (e.g., an infectious disease outbreak) so that they become part of the national crisis management plan.

    VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

    Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
    Comment: CVPR 2023, https://github.com/google-research/google-research/tree/master/vil
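    The contrastive half of the pretraining objective described above can be sketched as a CLIP-style symmetric InfoNCE loss over a batch of matched image-comment pairs. This is a generic sketch of the technique, not the actual VILA implementation; the random embeddings stand in for the outputs of the paper's image and text encoders, and the temperature value is an assumption.

```python
# Sketch of a symmetric contrastive (InfoNCE) loss aligning image and
# comment embeddings; each image should score highest against its own
# comment and vice versa.
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # pairwise cosine similarities
    diag = np.arange(len(img))               # i-th image matches i-th text

    def xent(l):
        # cross-entropy with the matching pair as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
txt = img + 0.01 * rng.normal(size=(4, 16))  # nearly matched embeddings
print(float(info_nce(img, txt)))             # small loss for matched pairs
```

    The generative objective (comment captioning) and the rank-based adapter would sit on top of the representations this loss shapes.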

    Recommending Themes for Ad Creative Design via Visual-Linguistic Representations

    There is a perennial need in the online advertising industry to refresh ad creatives, i.e., the images and text used for enticing online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users, and to incorporate insights from other successful campaigns in related product categories. For a given brand, coming up with themes for a new ad is a painstaking and time-consuming process for creative strategists. Strategists typically draw inspiration from the images and text used in past ad campaigns, as well as world knowledge about the brands. To automatically infer ad themes from such multimodal sources of information in past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. The theme recommender is based on aggregating results from a visual question answering (VQA) task, which ingests the following: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. We leverage transformer-based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations for the VQA task along the lines of classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics. Cross-modal representations show better performance compared to separate image and text representations. In addition, the use of multimodal information shows a significant lift over using only textual or visual information.
    Comment: 7 pages, 8 figures, 2 tables, accepted by The Web Conference 202
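    The aggregation step described above, per-ad VQA keyphrase scores pooled across a brand's past ads into a ranked theme recommendation, can be sketched as follows. The function name, pooling rule (mean score), and the scores themselves are illustrative assumptions, not the paper's exact method.

```python
# Sketch: aggregate per-ad VQA keyphrase scores for one brand and
# recommend the top-k themes by average score.
from collections import defaultdict

def recommend_themes(per_ad_scores, k=2):
    """per_ad_scores: one {keyphrase: score} dict per past ad of the brand."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in per_ad_scores:
        for phrase, s in scores.items():
            totals[phrase] += s
            counts[phrase] += 1
    avg = {p: totals[p] / counts[p] for p in totals}
    return sorted(avg, key=avg.get, reverse=True)[:k]

# Made-up VQA outputs for two past ads of the same brand.
ads = [
    {"family": 0.9, "adventure": 0.4, "luxury": 0.2},
    {"family": 0.8, "luxury": 0.5},
]
print(recommend_themes(ads))  # → ['family', 'adventure']
```

    A ranking formulation of the VQA task would feed this step directly; a classification formulation would first convert class probabilities into per-keyphrase scores.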

    Precursors and Pathways Leading to Enhanced Secondary Organic Aerosol Formation during Severe Haze Episodes

    Publisher Copyright: © 2021 American Chemical Society
    Molecular analyses help to investigate the key precursors and chemical processes of secondary organic aerosol (SOA) formation. We obtained the sources and molecular compositions of organic aerosol in PM2.5 in winter in Beijing by online and offline mass spectrometer measurements. Photochemical and aqueous processing were both involved in producing SOA during the haze events. Aromatics, isoprene, long-chain alkanes or alkenes, and carbonyls such as glyoxal and methylglyoxal were all important precursors. The enhanced SOA formation during the severe haze event was predominantly contributed by aqueous processing promoted by elevated amounts of aerosol water; multifunctional organic nitrates contributed the most, followed by organic compounds having four oxygen atoms in their formulae. The latter included dicarboxylic acids and various oxidation products from isoprene and aromatics, as well as products or oligomers from methylglyoxal aqueous uptake. Nitrated phenols, organosulfates, and methanesulfonic acid were also important SOA products, but their contributions to the elevated SOA mass during the severe haze event were minor. Our results highlight the importance of reducing nitrogen oxides and nitrate for future SOA control. Additionally, the formation of highly oxygenated long-chain molecules with a low degree of unsaturation in polluted urban environments requires further research.
    Peer reviewed

    Electronic properties of guanine-based nanowires

    We present a first-principles study of the electronic and conduction properties of a few classes of nanowires constituted of guanine (G) molecules, self-assembled in different geometries. We first analyze the effect of the vertical π-π interaction in model G-stack columns. Then, we exploit the results obtained from those models to interpret the features of realistic stacked and hydrogen-bonded structures, namely the guanine quadruple helices and the planar ribbons. With respect to natural DNA, the different structures, as well as the inclusion of metal cations, drastically affect the bonding pattern among the bases, introducing novel features in the electronic properties of the systems. These supramolecular G-aggregates, alternative to DNA, are expected to show interesting properties for molecular electronics applications.
    Comment: 30 pages (preprint format), 8 figures. To appear in Solid State Communications - Special Issue on "New advances on collective phenomena in one-dimensional systems"