Towards Open-Ended Visual Recognition with Large Language Model
Localizing and recognizing objects in the open-ended physical world poses a
long-standing challenge within the domain of machine perception. Recent methods
have endeavored to address the issue by employing a class-agnostic mask (or
box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP)
using pre-extracted text embeddings. However, it is worth noting that these
open-vocabulary recognition models still exhibit limitations in practical
applications. On one hand, they rely on class names being provided at test
time, so recognition performance heavily depends on the set of semantic
classes predefined by users. On the other hand, when training with multiple
datasets, human intervention is required to resolve label definition
conflicts between them. In this paper, we introduce the OmniScient
Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a
straightforward and effective solution to the aforementioned challenges.
Specifically, OSM predicts class labels in a generative manner, thus removing
the need to supply class names during both training and testing. It also enables
cross-dataset training without any human interference, exhibiting robust
generalization capabilities due to the world knowledge acquired from the LLM.
By combining OSM with an off-the-shelf mask proposal model, we present
promising results on various benchmarks, and demonstrate its effectiveness in
handling novel concepts. Code and model are available at
https://github.com/bytedance/OmniScient-Model
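The core idea of the abstract, predicting a label by generating tokens rather than scoring a fixed class list, can be sketched as follows. This is a minimal illustration, not the actual OSM architecture: the `decoder_step` interface and the toy token vocabulary are hypothetical stand-ins.

```python
def generate_label(mask_embedding, decoder_step, eos=0, max_tokens=8):
    """Greedily decode a class name, one token at a time.

    decoder_step(mask_embedding, tokens) -> list of logits over a token
    vocabulary. No class list is supplied at inference: the label is
    whatever token sequence the model emits, so novel concepts need no
    predefined vocabulary entry. (Hypothetical interface for illustration.)
    """
    tokens = []
    for _ in range(max_tokens):
        logits = decoder_step(mask_embedding, tokens)
        nxt = int(max(range(len(logits)), key=logits.__getitem__))
        if nxt == eos:  # end-of-sequence token terminates the label
            break
        tokens.append(nxt)
    return tokens
```

The contrast with the open-vocabulary classifiers criticized above is that nothing in this loop depends on a user-supplied set of class names.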
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Open-vocabulary segmentation is a challenging task requiring segmenting and
recognizing objects from an open set of categories. One way to address this
challenge is to leverage multi-modal models, such as CLIP, to provide image and
text features in a shared embedding space, which bridges the gap between
closed-vocabulary and open-vocabulary recognition. Hence, existing methods
often adopt a two-stage framework to tackle the problem, where the inputs first
go through a mask generator and then through the CLIP model along with the
predicted masks. This process involves extracting features from images multiple
times, which can be ineffective and inefficient. By contrast, we propose to
build everything into a single-stage framework using a shared Frozen
Convolutional CLIP backbone, which not only significantly simplifies the
current two-stage pipeline, but also remarkably yields a better accuracy-cost
trade-off. The proposed FC-CLIP benefits from the following observations: the
frozen CLIP backbone maintains the ability of open-vocabulary classification
and can also serve as a strong mask generator, and the convolutional CLIP
generalizes well to a larger input resolution than the one used during
contrastive image-text pretraining. When training on COCO panoptic data only
and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1
mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2
mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU
on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes,
respectively. Additionally, the training and testing of FC-CLIP are 7.5x and
6.6x faster than the same prior art, while using 5.9x fewer
parameters. FC-CLIP also sets a new state-of-the-art performance across various
open-vocabulary semantic segmentation datasets. Code and model are available at
https://github.com/bytedance/fc-clip
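The open-vocabulary classification step that approaches like FC-CLIP inherit from CLIP can be sketched as matching pooled per-mask features against pre-extracted text embeddings by cosine similarity. A minimal numpy sketch with hypothetical shapes; the actual model adds learned components around this:

```python
import numpy as np

def open_vocab_classify(mask_feats, text_embeds):
    """Assign each mask to the nearest class text embedding.

    mask_feats:  (num_masks, dim) pooled image features, one per mask
    text_embeds: (num_classes, dim) pre-extracted CLIP text embeddings
    Returns the argmax class index per mask.
    """
    # L2-normalize both sides so the dot product is cosine similarity
    m = mask_feats / np.linalg.norm(mask_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = m @ t.T  # (num_masks, num_classes)
    return logits.argmax(axis=1)
```

Because the text side is just a matrix of embeddings, swapping in a different vocabulary at test time only changes `text_embeds`, which is what makes the classifier "open-vocabulary".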
T Oligo-Primed Polymerase Chain Reaction (TOP-PCR): A Robust Method for the Amplification of Minute DNA Fragments in Body Fluids.
Body fluid DNA sequencing is a powerful noninvasive approach for the diagnosis of genetic defects, infectious agents, and diseases. Its success relies on the quantity and quality of the DNA samples. However, numerous clinical samples are of low quantity or poor quality for various reasons. To overcome these problems, we have developed T oligo-primed polymerase chain reaction (TOP-PCR) for full-length, nonselective amplification of minute quantities of DNA fragments. TOP-PCR adopts a homogeneous "half adaptor" (HA), generated by annealing the P oligo (carrying a phosphate group at the 5' end) and the T oligo (carrying a T-tail at the 3' end), for efficient ligation to target DNA and subsequent PCR amplification primed by the T oligo alone. Using DNA samples from body fluids, we demonstrate that TOP-PCR recovers minute DNA fragments and maintains the DNA size profile while enhancing the major molecular populations. Our results also show that TOP-PCR is superior for detecting apoptosis and outperforms the method adopted by Illumina for DNA amplification.
EvIcon: Designing High-Usability Icon with Human-in-the-loop Exploration and IconCLIP
Interface icons are prevalent in various digital applications. Due to limited
time and budgets, many designers rely on informal evaluation, which often
results in icons with poor usability. In this paper, we propose a unique
human-in-the-loop framework that allows our target users, i.e., novice and
professional UI designers, to improve the usability of interface icons
efficiently. We formulate several usability criteria into a perceptual
usability function and enable users to iteratively revise an icon set with an
interactive design tool, EvIcon. We take a large-scale pre-trained joint
image-text embedding (CLIP) and fine-tune it to embed icon visuals with icon
tags in the same embedding space (IconCLIP). During the revision process, our
design tool provides two types of instant perceptual usability feedback. First,
we provide perceptual usability feedback modeled by deep learning models
trained on IconCLIP embeddings and crowdsourced perceptual ratings. Second, we
use the embedding space of IconCLIP to assist users in improving icons' visual
distinguishability among icons within the user-prepared icon set. To provide
the perceptual prediction, we compiled IconCEPT10K, the first large-scale
dataset of perceptual usability ratings over interface icons, by
conducting a crowdsourcing study. We demonstrated that our framework could
benefit the interface icon revision process for UI designers with a wide range
of professional experience. Moreover, the interface icons designed using our
framework achieved better semantic distance and familiarity, verified by an
additional online user study.
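The distinguishability feedback described above amounts to flagging icon pairs whose embeddings sit too close together in the joint space. A minimal sketch, assuming unit-normalizable embedding vectors; the threshold value is an arbitrary illustration, not a figure from the paper:

```python
import numpy as np

def confusable_pairs(icon_embeds, threshold=0.9):
    """Return icon index pairs whose cosine similarity exceeds a threshold.

    icon_embeds: (num_icons, dim) embedding matrix (e.g. from an
    image-text model); highly similar pairs are candidates for revision
    because users may confuse them visually.
    """
    z = icon_embeds / np.linalg.norm(icon_embeds, axis=1, keepdims=True)
    sim = z @ z.T  # pairwise cosine similarities
    n = len(z)
    return [(i, j, float(sim[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if sim[i, j] > threshold]
```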
MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation
Video panoptic segmentation requires consistently segmenting (for both
`thing' and `stuff' classes) and tracking objects in a video over time. In this
work, we present MaXTron, a general framework that exploits Mask XFormer with
Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask
transformer by leveraging trajectory attention. The deployed mask transformer
takes as input a short clip consisting of only a few frames and predicts the
clip-level segmentation. To enhance the temporal consistency, MaXTron employs
within-clip and cross-clip tracking modules, efficiently utilizing trajectory
attention. Originally designed for video classification, trajectory attention
learns to model the temporal correspondences between neighboring frames and
aggregates information along the estimated motion paths. However, it is
nontrivial to directly extend trajectory attention to the per-pixel dense
prediction tasks due to its quadratic dependency on input size. To alleviate
the issue, we propose to adapt the trajectory attention for both the dense
pixel features and object queries, aiming to improve the short-term and
long-term tracking results, respectively. Particularly, in our within-clip
tracking module, we propose axial-trajectory attention that effectively
computes the trajectory attention for tracking dense pixels sequentially along
the height- and width-axes. The axial decomposition significantly reduces the
computational complexity for dense pixel features. In our cross-clip tracking
module, since the object queries in mask transformer are learned to encode the
object information, we are able to capture the long-term temporal connections
by applying trajectory attention to object queries, which learns to track each
object across different clips. Without bells and whistles, MaXTron demonstrates
state-of-the-art performance on video segmentation benchmarks.
Comment: Code at https://github.com/TACJu/MaXTro
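The axial decomposition described above can be illustrated with a cost argument: full dense attention over an H×W feature map costs O((HW)²), while attending along the height axis and then the width axis costs O(HW·(H+W)). A minimal numpy sketch of single-head self-attention restricted to one spatial axis; queries, keys, and values are the raw features here for brevity, whereas the real model uses learned projections and trajectory attention across frames:

```python
import numpy as np

def axial_attention(x, axis):
    """Single-head self-attention along one spatial axis of x.

    x: (H, W, C) feature map. Applying this along axis 0 and then axis 1
    approximates dense spatial attention at O(H*W*(H+W)) cost instead of
    O((H*W)**2). Q, K, V are the raw features (illustration only).
    """
    xm = np.moveaxis(x, axis, -2)  # bring the chosen axis to position -2
    scores = xm @ np.swapaxes(xm, -1, -2) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over axis length
    return np.moveaxis(attn @ xm, -2, axis)
```

Chaining the two axes (`axial_attention(axial_attention(x, 0), 1)`) gives every pixel an indirect path to every other pixel, which is the efficiency argument behind the within-clip tracking module.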
Single nucleotide polymorphisms of one-carbon metabolism and cancers of the esophagus, stomach, and liver in a Chinese population.
One-carbon metabolism (folate metabolism) is considered important in carcinogenesis because of its involvement in DNA synthesis and biological methylation reactions. We investigated the associations of single nucleotide polymorphisms (SNPs) in the folate metabolic pathway with the risk of three GI cancers in a population-based case-control study in Taixing City, China, with 218 esophageal cancer cases, 206 stomach cancer cases, 204 liver cancer cases, and 415 healthy population controls. Study participants were interviewed with a standardized questionnaire, and blood samples were collected after the interviews. We genotyped SNPs of the MTHFR, MTR, MTRR, DNMT1, and ALDH2 genes using PCR-RFLP, SNPlex, or TaqMan assays. To account for multiple comparisons and reduce the chance of false reports, we employed semi-Bayes (SB) shrinkage analysis. After shrinkage and adjustment for potential confounding factors, we found positive associations between MTHFR rs1801133 and stomach cancer (any T versus C/C, SB odds ratio [SBOR]: 1.79, 95% posterior limits: 1.18, 2.71) and liver cancer (SBOR: 1.51, 95% posterior limits: 0.98, 2.32). There was an inverse association between DNMT1 rs2228612 and esophageal cancer (any G versus A/A, SBOR: 0.60, 95% posterior limits: 0.39, 0.94). In addition, we detected potential heterogeneity across alcohol drinking status for ORs relating MTRR rs1801394 to esophageal (posterior homogeneity P = 0.005) and stomach cancer (posterior homogeneity P = 0.004), and for ORs relating MTR rs1805087 to liver cancer (posterior homogeneity P = 0.021). Among non-drinkers, the variant allele (allele G) of these two SNPs was inversely associated with the risk of these cancers, while a positive association was observed among ever-drinkers. Our results suggest that genetic polymorphisms related to one-carbon metabolism may be associated with cancers of the esophagus, stomach, and liver.
Heterogeneity across alcohol consumption status of the associations between MTR/MTRR polymorphisms and these cancers indicates potential interactions between alcohol drinking and the one-carbon metabolic pathway.
Anti-IL-17A antibody-associated de novo vitiligo: Case report and review of literature
Interleukin (IL)-17 inhibitors are biological therapies approved for moderate to severe psoriasis and psoriatic arthritis. Common adverse events of IL-17 inhibitors include injection site reactions, infections, nasopharyngitis, and headache. However, vitiligo associated with the use of IL-17 inhibitors has rarely been reported in the literature. Here we describe a woman who developed de novo vitiligo after 4 months of IL-17A inhibitor treatment for psoriasis and psoriatic arthritis. Upon discontinuation of the IL-17A inhibitor and a shift to a broader T cell inhibitor, cyclosporine, our patient achieved control of both psoriasis and vitiligo and 75% repigmentation after 3 months of oral cyclosporine without phototherapy. Given the increasing use of anti-IL-17 biologics in psoriasis patients, clinicians should ask about a history of vitiligo before treatment and inform patients of this possible adverse effect.
Endothelial FGF signaling is protective in hypoxia-induced pulmonary hypertension
Hypoxia-induced pulmonary hypertension (PH) is one of the most common and deadliest forms of PH. Fibroblast growth factor receptors 1 and 2 (FGFR1/2) are elevated in patients with PH and in mice exposed to chronic hypoxia. Endothelial FGFR1/2 signaling is important for the adaptive response to several injury types and we hypothesized that endothelial FGFR1/2 signaling would protect against hypoxia-induced PH. Mice lacking endothelial FGFR1/2, mice with activated endothelial FGFR signaling, and human pulmonary artery endothelial cells (HPAECs) were challenged with hypoxia. We assessed the effect of FGFR activation and inhibition on right ventricular pressure, vascular remodeling, and endothelial-mesenchymal transition (EndMT), a known pathologic change seen in patients with PH. Hypoxia-exposed mice lacking endothelial FGFRs developed increased PH, while mice overexpressing a constitutively active FGFR in endothelial cells did not develop PH. Mechanistically, lack of endothelial FGFRs or inhibition of FGFRs in HPAECs led to increased TGF-β signaling and increased EndMT in response to hypoxia. These phenotypes were reversed in mice with activated endothelial FGFR signaling, suggesting that FGFR signaling inhibits TGF-β pathway-mediated EndMT during chronic hypoxia. Consistent with these observations, lung tissue from patients with PH showed activation of FGFR and TGF-β signaling. Collectively, these data suggest that activation of endothelial FGFR signaling could be therapeutic for hypoxia-induced PH
Deploying Image Deblurring across Mobile Devices: A Perspective of Quality and Latency
Recently, image enhancement and restoration have become important
applications on mobile devices, such as super-resolution and image deblurring.
However, most state-of-the-art networks exhibit extremely high computational
complexity, which makes them difficult to deploy on mobile devices with
acceptable latency. Moreover, when deploying to different mobile devices, there
is a large latency variation due to the difference and limitation of deep
learning accelerators on mobile devices. In this paper, we conduct a search of
portable network architectures for better quality-latency trade-off across
mobile devices. We further present the effectiveness of widely used network
optimizations for the image deblurring task. This paper provides comprehensive
experiments and comparisons, offering an in-depth analysis of both latency
and image quality. Through this work, we demonstrate the successful
deployment of image deblurring application on mobile devices with the
acceleration of deep learning accelerators. To the best of our knowledge, this
is the first paper that addresses all the deployment issues of the image
deblurring task across mobile devices. This paper provides practical
deployment guidelines and was adopted by the championship-winning team in the
NTIRE 2020 Image Deblurring Challenge, Smartphone Track.
Comment: CVPR 2020 Workshop on New Trends in Image Restoration and Enhancement
(NTIRE)
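A quality-latency study of the kind described requires repeatable on-device timing. A generic sketch of a latency harness with warm-up iterations; this is an illustration of the measurement methodology, not the paper's actual benchmarking code:

```python
import time

def measure_latency_ms(run_once, warmup=3, iters=10):
    """Average wall-clock latency of run_once() in milliseconds.

    Warm-up iterations let caches, JIT compilers, and mobile DSP/NPU
    runtimes reach steady state before timing begins, which reduces the
    latency variation noted across different accelerators.
    """
    for _ in range(warmup):
        run_once()
    t0 = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - t0) / iters * 1000.0
```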