
    What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

    With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) the encoder should be able to extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing-modality data augmentation techniques to better adapt to situations with missing modalities. In addition, UME-MMA, built on a late-fusion learning framework, allows for the plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
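
    The late-fusion design and the missing-modality augmentation described above are simple to prototype. Below is a minimal sketch, not the authors' released code: the encoder objects, feature dimension, and drop probability are illustrative assumptions. The two encoders are meant to be initialized from uni-modal pre-trained weights, and training occasionally blanks one modality so the fused predictor adapts to missing inputs.

```python
# Illustrative sketch of UME-MMA-style late fusion; not the official implementation.
# `audio_encoder` / `visual_encoder` are assumed to output `feat_dim`-dim features.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_encoder, visual_encoder, feat_dim, num_classes):
        super().__init__()
        self.audio_encoder = audio_encoder    # plug-and-play uni-modal encoders,
        self.visual_encoder = visual_encoder  # e.g. loaded from uni-modal pre-training
        self.audio_head = nn.Linear(feat_dim, num_classes)
        self.visual_head = nn.Linear(feat_dim, num_classes)

    def forward(self, audio, visual, p_drop=0.3):
        fa = self.audio_encoder(audio)
        fv = self.visual_encoder(visual)
        if self.training:
            # Missing-modality data augmentation: occasionally blank one modality
            # so the fused predictor does not over-rely on either stream.
            r = torch.rand(2)
            if r[0] < p_drop:
                fa = torch.zeros_like(fa)
            elif r[1] < p_drop:
                fv = torch.zeros_like(fv)
        # Late fusion as a sum of per-modality logits.
        return self.audio_head(fa) + self.visual_head(fv)
```

    At test time, a genuinely missing modality can be handled by feeding the same zero placeholder used during augmentation.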

    Global burden of colistin-resistant bacteria: mobilized colistin resistance genes study (1980-2018)

    Colistin is considered to be an antimicrobial of last resort for the treatment of multidrug-resistant Gram-negative bacterial infections. The recent global dissemination of mobilized colistin resistance (mcr) genes is an urgent public health threat. An accurate estimate of the global prevalence of mcr genes, their reservoirs and the potential pathways for human transmission is required to implement control and prevention strategies, yet such data are lacking. Publications from four English-language (PubMed, Scopus, the Cochrane Database of Systematic Reviews and Web of Science) and two Chinese-language (CNKI and WANFANG) databases published between 18 November 2015 and 30 December 2018 were identified. In this systematic review and meta-analysis, the prevalence of mcr genes in bacteria isolated from humans, animals, the environment and food products was investigated. A total of 974 publications were identified; 202 observational studies were included in the systematic review and 71 in the meta-analysis. mcr genes were reported from 47 countries across six continents, and the overall average prevalence was 4.7% (0.1-9.3%). China reported the highest number of mcr-positive strains. Pathogenic Escherichia coli (54%), isolated from animals (52%) and harboring an IncI2 plasmid (34%), were the bacteria with the highest prevalence of mcr genes. The estimated prevalence of mcr-1-positive pathogenic E. coli was higher in food animals than in humans and food products, which suggests a role for foodborne transmission. This study provides a comprehensive assessment of the prevalence of mcr genes by source, organism, genotype and type of plasmid.
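
    As a rough illustration of the meta-analytic arithmetic behind a pooled prevalence figure such as the 4.7% reported above, here is a simple inverse-variance pooling of study-level proportions. This is only a sketch under simplifying assumptions (a fixed-effect model on raw proportions with invented example counts); the study itself would use more careful methods, typically random-effects models on transformed proportions.

```python
# Sketch of inverse-variance pooled prevalence; not the paper's statistical code.
def pooled_prevalence(studies):
    """studies: list of (mcr_positive_isolates, isolates_screened) tuples."""
    total_w, total_wp = 0.0, 0.0
    for k, n in studies:
        # Small continuity correction keeps the variance finite when k is 0 or n.
        p = (k + 0.5) / (n + 1.0)
        var = p * (1.0 - p) / n
        w = 1.0 / var                 # inverse-variance weight
        total_w += w
        total_wp += w * p
    return total_wp / total_w

# Hypothetical study counts, purely for illustration.
print(round(pooled_prevalence([(12, 300), (3, 150), (40, 520)]), 3))
```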

    All are Worth Words: A ViT Backbone for Diffusion Models

    Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT on unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable, if not superior, to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256 and 5.48 in text-to-image generation on MS-COCO, among methods that do not access large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets. (Comment: Accepted to CVPR 2023.)
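
    To make the two architectural points concrete (everything becomes a token, and shallow layers feed deep layers through long skips), here is a toy sketch rather than the official U-ViT code; the layer sizes, the patch-token interface, and the skip projection are all illustrative assumptions.

```python
# Toy U-ViT-style backbone: time, condition, and noisy patches all become tokens,
# and shallow-block activations reach mirrored deep blocks via long skip connections.
# Illustrative sketch only, not the released U-ViT implementation.
import torch
import torch.nn as nn

class TinyUViT(nn.Module):
    def __init__(self, dim=256, depth=6, num_classes=10):
        super().__init__()
        self.time_embed = nn.Linear(1, dim)
        self.cond_embed = nn.Embedding(num_classes, dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(depth)
        ])
        # One projection per deep block to merge the mirrored shallow activation.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim)
                                        for _ in range(depth // 2)])

    def forward(self, patch_tokens, t, cond):
        # patch_tokens: (B, N, dim) noisy image patches already embedded to `dim`.
        t_tok = self.time_embed(t.float().view(-1, 1, 1))   # (B, 1, dim)
        c_tok = self.cond_embed(cond).unsqueeze(1)          # (B, 1, dim)
        x = torch.cat([t_tok, c_tok, patch_tokens], dim=1)  # all inputs as tokens

        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i >= half:  # deep half: concatenate the mirrored shallow output
                x = self.skip_proj[i - half](torch.cat([x, skips.pop()], dim=-1))
            x = blk(x)
            if i < half:   # shallow half: stash activations for the long skips
                skips.append(x)
        return x[:, 2:]    # predicted noise for the patch tokens only
```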

    2,3,4-Trihydroxybenzoic acid 0.25-hydrate

    The asymmetric unit of the title compound, C7H6O5·0.25H2O, contains two molecules of 2,3,4-trihydroxybenzoic acid, with similar conformations, and one water molecule which lies on a twofold rotation axis. Both acid molecules are essentially planar [maximum r.m.s. deviations = 0.0324 (2) and 0.0542 (3) Å for the two acid molecules]. The molecular conformations are stabilized by intramolecular O(phenol)—H⋯O(carboxyl/phenol) interactions. A cyclic intermolecular association is formed between the two acid molecules and the water molecule [graph set R₃³(12)], involving O—H⋯O hydrogen bonds. The two acid molecules are further linked through a cyclic R₂²(8) carboxylic acid hydrogen-bonding association, which, together with intermolecular O—H⋯O hydrogen-bonding interactions involving the phenol groups and the water molecule, and weak π–π interactions [minimum ring centroid separation = 3.731 (3) Å], gives a three-dimensional network.

    One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

    This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e., timesteps) can differ across modalities. Inspired by this unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, takes individual timesteps for the different modalities as input, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models so that it can handle inputs of different modality types. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps, without additional overhead. In particular, UniDiffuser produces perceptually realistic samples in all tasks, and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) on representative tasks (e.g., text-to-image generation). (Comment: Accepted to ICML 2023.)
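
    A minimal sketch of the unified training objective follows, under my own naming assumptions (the `add_noise` helper and the model's call signature are hypothetical, not the released UniDiffuser API): each modality draws its own timestep, and a single network predicts the noise for both modalities at once.

```python
# Sketch of the unified noise-prediction objective; not the released UniDiffuser code.
import torch

def unified_diffusion_loss(model, add_noise, x_img, x_txt, T=1000):
    # Independent timesteps per modality. At sampling time, fixing one modality's
    # timestep to 0 (clean data) yields conditional generation, while using T for
    # a modality effectively ignores it, recovering the other modality's marginal.
    t_img = torch.randint(0, T, (x_img.size(0),))
    t_txt = torch.randint(0, T, (x_txt.size(0),))
    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    z_img = add_noise(x_img, eps_img, t_img)  # hypothetical forward-process sampler
    z_txt = add_noise(x_txt, eps_txt, t_txt)
    pred_img, pred_txt = model(z_img, z_txt, t_img, t_txt)
    # One objective: predict the noise of all modalities at their own noise levels.
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()
```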

    On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

    We abstract the features (i.e., learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. Multi-modal models are expected to benefit from cross-modal interactions on the basis of ensuring uni-modal feature learning. However, recent supervised multi-modal late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality. We prove that this phenomenon indeed hurts the model's generalization ability. To this end, we propose choosing a targeted late-fusion learning method for a given supervised multi-modal task from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve results comparable to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
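
    As a concrete (and assumed, not the authors' code) illustration of the Uni-Modal Ensemble option: each modality's model is trained on its own data only, so uni-modal feature learning cannot be degraded by joint-training dynamics, and predictions are combined only at inference. Uni-Modal Teacher would instead distill such uni-modal features into a jointly trained multi-modal model.

```python
# Minimal Uni-Modal Ensemble (UME) inference sketch; illustrative, not the paper's code.
import torch

@torch.no_grad()
def ume_predict(audio_model, visual_model, audio, visual):
    # Each model was trained separately on a single modality; late fusion here is
    # simply a sum of the uni-modal logits.
    logits = audio_model(audio) + visual_model(visual)
    return logits.argmax(dim=-1)
```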

    LSDP5 Enhances Triglyceride Storage in Hepatocytes by Influencing Lipolysis and Fatty Acid β-Oxidation of Lipid Droplets

    Lipid storage droplet protein 5 (LSDP5) is a lipid droplet-associated protein of the PAT (perilipin, adipophilin, and TIP47) family that is expressed in the liver in a peroxisome proliferator-activated receptor alpha (PPARα)-dependent manner; however, its exact function has not been elucidated. We observed that LSDP5 was localized to the surface of lipid droplets in hepatocytes. Overexpression of LSDP5 enhanced lipid accumulation in the hepatic cell line AML12 and in primary hepatocytes. Knockdown of LSDP5 significantly decreased the triglyceride content of lipid droplets, stimulated lipolysis, and modestly increased the mitochondrial content and the level of fatty acid β-oxidation in the mitochondria. PPARα expression was increased in LSDP5-deficient cells and was required for the elevated fatty acid β-oxidation in these cells. Using serial deletions of LSDP5, we determined that the lipid droplet-targeting domain and the domain directing lipid droplet clustering overlapped and were localized to the 188 amino acid residues at the N-terminus of LSDP5. Our findings suggest that LSDP5, a novel lipid droplet protein, may contribute to triglyceride accumulation by negatively regulating lipolysis and fatty acid oxidation in hepatocytes.