VITON: An Image-based Virtual Try-on Network
We present an image-based Virtual Try-On Network (VITON) without using 3D
information in any form, which seamlessly transfers a desired clothing item
onto the corresponding region of a person using a coarse-to-fine strategy.
Conditioned upon a new clothing-agnostic yet descriptive person representation,
our framework first generates a coarse synthesized image with the target
clothing item overlaid on that same person in the same pose. We further enhance
the initial blurry clothing area with a refinement network. The network is
trained to learn how much detail to utilize from the target clothing item, and
where to apply to the person in order to synthesize a photo-realistic image in
which the target item deforms naturally with clear visual patterns. Experiments
on our newly collected Zalando dataset demonstrate its promise in the
image-based virtual try-on task over state-of-the-art generative models.
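The coarse-to-fine strategy described above can be illustrated as an alpha-composition step: a refinement network predicts a per-pixel composition mask that decides how much detail to take from the (warped) target clothing versus the coarse synthesis. The sketch below is a minimal numpy illustration of that blending; the function name and toy inputs are ours, not from the paper's code.

```python
import numpy as np

def refine_composite(coarse_img, warped_cloth, comp_mask):
    """Blend the warped clothing item into the coarse synthesis.

    comp_mask is a per-pixel weight in [0, 1] (predicted by a
    refinement network in the actual system): it decides how much
    detail to take from the warped clothing versus the coarse
    result at each location.
    """
    comp_mask = np.clip(comp_mask, 0.0, 1.0)[..., None]  # broadcast over RGB
    return comp_mask * warped_cloth + (1.0 - comp_mask) * coarse_img

# Toy example: 2x2 RGB images, black coarse result, white clothing.
coarse = np.zeros((2, 2, 3))
cloth = np.ones((2, 2, 3))
mask = np.array([[1.0, 0.0], [0.5, 0.25]])
out = refine_composite(coarse, cloth, mask)
```

Where the mask is 1 the output copies the clothing texture; where it is 0 the coarse synthesis is kept, so the network can learn to preserve sharp visual patterns only where the garment actually lies.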
ReMotENet: Efficient Relevant Motion Event Detection for Large-scale Home Surveillance Videos
This paper addresses the problem of detecting relevant motion caused by
objects of interest (e.g., person and vehicles) in large scale home
surveillance videos. The traditional method usually consists of two separate
steps, i.e., detecting moving objects with background subtraction running on
the camera, and filtering out nuisance motion events (e.g., trees, cloud,
shadow, rain/snow, flag) with deep learning based object detection and tracking
running on the cloud. This two-step method is extremely slow and therefore not
cost-effective, and a pre-trained off-the-shelf object detector does not fully
leverage the spatial-temporal redundancies in the video. To dramatically speed up relevant
motion event detection and improve its performance, we propose a novel network
for relevant motion event detection, ReMotENet, which is a unified, end-to-end
data-driven method using spatial-temporal attention-based 3D ConvNets to
jointly model the appearance and motion of objects-of-interest in a video.
ReMotENet parses an entire video clip in one forward pass of a neural network
to achieve significant speedup. Meanwhile, it exploits the properties of home
surveillance videos, e.g., relevant motion is sparse both spatially and
temporally, and enhances 3D ConvNets with a spatial-temporal attention model
and reference-frame subtraction to encourage the network to focus on the
relevant moving objects. Experiments demonstrate that our method can achieve
comparable or even better performance than the object detection based method
but with three to four orders of magnitude speedup (up to 20k times) on GPU
devices. Our network is efficient, compact and light-weight. It can detect
relevant motion on a 15s surveillance video clip within 4-8 milliseconds on a
GPU and a fraction of a second (0.17-0.39s) on a CPU, with a model size of less
than 1 MB. Comment: WACV1
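Two of the mechanisms the abstract names, reference-frame subtraction and spatial-temporal attention, can be sketched numerically. Below is a toy numpy version of both (our own simplified stand-ins, not ReMotENet's actual layers): the clip is differenced against a reference frame so static background vanishes, and a soft attention map then emphasizes locations with motion energy.

```python
import numpy as np

def reference_subtract(clip):
    """Subtract a reference frame (here: the first frame) from every
    frame, so downstream layers see motion rather than static background."""
    return clip - clip[:1]

def spatial_temporal_attention(motion):
    """Toy soft attention: weight each spatio-temporal location by its
    normalized motion magnitude, suppressing static regions."""
    mag = np.abs(motion).sum(axis=-1, keepdims=True)  # per-pixel motion energy
    denom = mag.max()
    return mag / denom if denom > 0 else mag

clip = np.zeros((4, 8, 8, 3))   # T x H x W x C, static background
clip[2, 3:5, 3:5] = 1.0         # a small moving object appears in frame 2
motion = reference_subtract(clip)
att = spatial_temporal_attention(motion)
```

In the real network the attention is learned and applied to 3D ConvNet feature maps, but the effect is the same: because relevant motion is sparse in space and time, almost all of the attention mass concentrates on the few locations where an object of interest moves.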
Improving Efficiency and Generalization of Visual Recognition
Deep Neural Networks (DNNs) are heavy in terms of their number of parameters and computational cost. This leads to two major challenges: first, training and deployment of deep networks are expensive; second, without tremendous annotated training data, which are very costly to obtain, DNNs easily suffer over-fitting and have poor generalization.
We propose approaches to these two challenges in the context of specific computer vision problems to improve their efficiency and generalization.
First, we study network pruning using neuron importance score propagation. To reduce the significant redundancy in DNNs, we formulate network pruning as a binary integer optimization problem which minimizes the reconstruction errors on the final responses produced by the network, and derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of final responses to every neuron in the network, then prune neurons in the entire networks jointly.
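The propagation rule at the heart of NISP can be shown compactly: the importance score of a neuron in layer k is accumulated from the scores of the neurons it feeds, weighted by the absolute connection strengths. The following is a minimal sketch for plain fully-connected layers (function names and the toy weights are illustrative; the published algorithm handles general network topologies).

```python
import numpy as np

def propagate_importance(weights, final_scores):
    """Propagate importance scores from the final responses back to
    every earlier layer via s_k = |W_{k+1}|^T s_{k+1}.

    weights[i] maps layer i to layer i+1 and has shape
    (n_{i+1}, n_i); final_scores are the importance scores of the
    last layer's responses.
    """
    scores = [final_scores]
    for W in reversed(weights):
        scores.append(np.abs(W).T @ scores[-1])
    return list(reversed(scores))  # scores[k] holds layer k's scores

def prune_mask(score, keep_ratio):
    """Keep the neurons with the highest importance scores."""
    k = max(1, int(len(score) * keep_ratio))
    thresh = np.sort(score)[-k]
    return score >= thresh

# Layer 0 has 3 neurons; the third has no path to the output,
# so its propagated importance is zero and it gets pruned.
W1 = np.array([[1.0, 0.0, 0.0],
               [0.0, 2.0, 0.0]])
scores = propagate_importance([W1], np.array([1.0, 1.0]))
mask = prune_mask(scores[0], keep_ratio=2 / 3)
```

The point of the closed-form solution is exactly this: neurons that contribute little to the final responses receive near-zero propagated scores, so all layers can be pruned jointly in one backward sweep instead of retraining after each layer.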
Second, we study visual relationship detection (VRD) with linguistic knowledge distillation. Since the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances, detecting visual relationships from images is a challenging problem. To improve the predictive capability, especially generalization on unseen relationships, we utilize knowledge of linguistic statistics obtained from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge) to regularize visual model learning.
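One common way to realize this kind of linguistic regularization, and a reasonable reading of the paragraph above, is to add a divergence penalty that pulls the visual model's predicate distribution toward the distribution estimated from text statistics. The sketch below is a hypothetical combined objective, not the dissertation's exact loss; the weighting `lam` and function names are our own.

```python
import numpy as np

def kl_to_prior(pred_probs, prior_probs, eps=1e-12):
    """KL(prior || pred): penalizes visual predictions that stray from
    the relationship distribution estimated from text statistics."""
    p = np.clip(prior_probs, eps, 1.0)
    q = np.clip(pred_probs, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def distilled_loss(ce_loss, pred_probs, prior_probs, lam=0.5):
    """Hypothetical combined objective: task loss plus a distillation
    term that injects linguistic knowledge as a regularizer."""
    return ce_loss + lam * kl_to_prior(pred_probs, prior_probs)

# e.g. P(predicate | subject, object) estimated from text corpora
prior = np.array([0.7, 0.2, 0.1])
match = distilled_loss(1.0, prior, prior)                    # agrees with prior
drift = distilled_loss(1.0, np.array([0.1, 0.2, 0.7]), prior)  # contradicts it
```

Because the prior is estimated from text rather than images, it covers long-tail and even unseen relationships, which is where the generalization benefit comes from.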
Third, we study the role of context selection in object detection. We investigate the reasons why context in object detection has limited utility by isolating and evaluating the predictive power of different context cues under ideal conditions in which context is provided by an oracle. Based on this study, we propose a region-based context re-scoring method with dynamic context selection to remove noise and emphasize informative context.
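The re-scoring idea above can be illustrated with a deliberately simple stand-in: select only the context cues that look reliable, then shift the detection score toward them. The threshold, the linear mixing, and all names below are hypothetical simplifications, not the dissertation's actual method.

```python
import numpy as np

def rescore(det_score, context_scores, select_thresh=0.5, alpha=0.3):
    """Toy region-based re-scoring with dynamic context selection:
    keep only context cues whose own confidence passes a threshold,
    then move the detection score toward their mean."""
    kept = [c for c in context_scores if c >= select_thresh]
    if not kept:                      # no trustworthy context: leave score alone
        return det_score
    return (1 - alpha) * det_score + alpha * float(np.mean(kept))

s = rescore(0.6, [0.9, 0.1, 0.8])    # the noisy 0.1 cue is dropped
```

The selection step is the important part: without it, uninformative context regions drag the score in arbitrary directions, which matches the finding that unselected context has limited utility.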
Fourth, we study efficient relevant motion event detection for large-scale home surveillance videos. To detect motion events of objects-of-interest from large scale home surveillance videos, traditional methods based on object detection and tracking are extremely slow and require expensive GPU devices. To dramatically speed up relevant motion event detection and improve its performance, we propose a novel network for relevant motion event detection, ReMotENet, which is a unified, end-to-end data-driven method using spatial-temporal attention-based 3D ConvNets to jointly model the appearance and motion of objects-of-interest in a video.
In the last part, we address the recognition of agent-in-place actions, which are associated with agents who perform them and places where they occur, in the context of outdoor home surveillance. We introduce a representation of the geometry and topology of scene layouts so that a network can generalize from the layouts observed in the training set to unseen layouts in the test set. This Layout-Induced Video Representation (LIVR) abstracts away low-level appearance variance and encodes geometric and topological relationships of places in a specific scene layout. LIVR partitions the semantic features of a video clip into different places to force the network to learn place-based feature descriptions; to predict the confidence of each action, LIVR aggregates features from the place associated with an action and its adjacent places on the scene layout. We introduce the Agent-in-Place Action dataset to show that our method allows neural network models to generalize significantly better to unseen scenes.
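The place-based partition-and-aggregate step described for LIVR can be sketched numerically: pool features separately within each place mask, then score an action from its associated place together with adjacent places. This is a minimal illustrative stand-in (our own place names, masks, and linear scorer), not the actual LIVR architecture.

```python
import numpy as np

def place_partition(features, place_masks):
    """Pool an H x W x C feature map separately within each place mask,
    forcing place-based feature descriptions."""
    out = {}
    for name, mask in place_masks.items():
        m = mask.astype(bool)
        out[name] = (features[m].mean(axis=0) if m.any()
                     else np.zeros(features.shape[-1]))
    return out

def action_score(place_feats, place, adjacency, w):
    """Score an action from its associated place plus adjacent places
    on the scene layout (here: mean-pooled, linearly scored)."""
    pooled = np.mean([place_feats[p] for p in [place] + adjacency[place]], axis=0)
    return float(pooled @ w)

feats = np.zeros((4, 4, 2))
feats[:2, :, 0] = 1.0                       # activity in the top half of the frame
masks = {"street": np.zeros((4, 4)), "lawn": np.zeros((4, 4))}
masks["street"][:2] = 1                     # hypothetical scene layout
masks["lawn"][2:] = 1
adjacency = {"street": ["lawn"], "lawn": ["street"]}
pf = place_partition(feats, masks)
score = action_score(pf, "street", adjacency, w=np.array([1.0, 0.0]))
```

Because the representation is indexed by places and their adjacency rather than by absolute pixel positions, the same model transfers to a new camera whose "street" and "lawn" sit elsewhere in the frame.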
ARHGEF12 regulates erythropoiesis and is involved in erythroid regeneration after chemotherapy in acute lymphoblastic leukemia patients
Hematopoiesis is a finely regulated process in vertebrates under both homeostatic and stress conditions. By whole exome sequencing, we studied the genomics of acute lymphoblastic leukemia (ALL) patients who needed multiple red blood cell (RBC) transfusions after intensive chemotherapy treatment. ARHGEF12, encoding a RhoA guanine nucleotide exchange factor, was found to be associated with chemotherapy-induced anemia by genome-wide association study analyses. A single nucleotide polymorphism (SNP) of ARHGEF12 located in an intron predicted to be a GATA1 binding site, rs10892563, is significantly associated with patients who need RBC transfusion (P=3.469E-03, odds ratio 5.864). A luciferase reporter assay revealed that this SNP impairs GATA1-mediated trans-regulation of ARHGEF12, and quantitative polymerase chain reaction studies confirmed that the homozygous status is associated with an approximately 61% reduction in ARHGEF12 expression (P=0.0088). Consequently, erythropoiesis was affected at the pro-erythroblast phases. The role of ARHGEF12 and its homologs in erythroid differentiation was confirmed in human K562 cells, mouse 32D cells and primary murine bone marrow cells. We further demonstrated in zebrafish by morpholino-mediated knockdown and CRISPR/Cas9-mediated knockout of arhgef12 that its reduction resulted in erythropoiesis defects. The p38 kinase pathway was affected by the ARHGEF12-RhoA signaling in K562 cells, and consistently, the Arhgef12-RhoA-p38 pathway was also shown to be important for erythroid differentiation in zebrafish, as active RhoA or p38 readily rescued the impaired erythropoiesis caused by arhgef12 knockdown. Finally, ARHGEF12-mediated p38 activity also appeared to be involved in phenotypes of patients with the rs10892563 homozygous genotype. Our findings present a novel SNP of ARHGEF12 that may involve ARHGEF12-RhoA-p38 signaling in erythroid regeneration in ALL patients after chemotherapy.
R4D: Utilizing Reference Objects for Long-Range Distance Estimation
Estimating the distance of objects is a safety-critical task for autonomous
driving. Focusing on short-range objects, existing methods and datasets neglect
the equally important long-range objects. In this paper, we introduce a
challenging and under-explored task, which we refer to as Long-Range Distance
Estimation, as well as two datasets to validate new methods developed for this
task. We then propose R4D, the first framework to accurately estimate the
distance of long-range objects by using references with known distances in the
scene. Drawing inspiration from human perception, R4D builds a graph by
connecting a target object to all references. An edge in the graph encodes the
relative distance information between a pair of target and reference objects.
An attention module is then used to weigh the importance of reference objects
and combine them into one target object distance prediction. Experiments on the
two proposed datasets demonstrate the effectiveness and robustness of R4D by
showing significant improvements compared to existing baselines. We are looking
to make the proposed dataset, Waymo Open Dataset - Long-Range Labels, available
publicly at waymo.com/open/download. Comment: ICLR 202
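The graph-plus-attention fusion described in the abstract can be sketched as follows: each reference contributes one distance hypothesis (its known distance plus a predicted target-reference offset, i.e., one graph edge), and attention weights decide how much to trust each reference. The toy numbers and function names below are ours; in R4D the offsets and attention logits are predicted by the network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def r4d_combine(ref_dists, rel_offsets, attn_logits):
    """Attention-weighted fusion of per-reference distance estimates.

    Each reference with a known distance yields one hypothesis for the
    target's distance via a predicted relative offset; an attention
    module weighs the hypotheses and combines them into one prediction.
    """
    per_ref = np.asarray(ref_dists) + np.asarray(rel_offsets)
    w = softmax(np.asarray(attn_logits))
    return float(w @ per_ref)

# Two references at known distances 100 m and 150 m; the target is
# predicted to lie 50 m beyond each. Equal attention logits.
dist = r4d_combine([100.0, 150.0], [50.0, 50.0], [0.0, 0.0])
```

With learned (unequal) attention logits, the model can down-weight references whose relative-distance estimates are unreliable, e.g., references far from the target or partially occluded, which is where the robustness gains come from.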