Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state-of-the-art. Comment: To appear in CVPR 2017 as an oral paper.
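The core idea of exploiting statistical dependencies between objects and their relationships can be illustrated with a toy example. This is not the paper's DR-Net; all scores and the co-occurrence prior below are invented for illustration. Combining independent appearance scores with a triplet prior favours statistically plausible relationships such as person-ride-horse over visually similar but unlikely ones.

```python
import numpy as np

# Toy sketch (not the paper's DR-Net): combine independent appearance
# scores with a co-occurrence prior so that statistically likely
# triplets such as (person, ride, horse) are favoured.
objects = ["person", "horse", "car"]
predicates = ["ride", "drive"]

# Hypothetical appearance scores from separate recognizers.
subj_score = np.array([0.7, 0.2, 0.1])   # P(subject class)
obj_score = np.array([0.1, 0.6, 0.3])    # P(object class)
pred_score = np.array([0.5, 0.5])        # P(predicate)

# Hypothetical prior[s, p, o]: how often the triplet occurs in data.
prior = np.ones((3, 2, 3))
prior[0, 0, 1] = 5.0   # person-ride-horse is common
prior[0, 1, 2] = 3.0   # person-drive-car is common

# Joint score = product of the unary scores times the dependency prior.
joint = (subj_score[:, None, None]
         * pred_score[None, :, None]
         * obj_score[None, None, :]
         * prior)
s, p, o = np.unravel_index(joint.argmax(), joint.shape)
print(objects[s], predicates[p], objects[o])   # → person ride horse
```

Without the prior, the independently most likely pieces would be combined regardless of whether the resulting triplet ever occurs; the prior is what encodes the statistical dependency.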
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
Masked Image Modeling (MIM) has achieved promising progress with the advent
of Masked Autoencoders (MAE) and BEiT. However, subsequent works have
complicated the framework with new auxiliary tasks or extra pre-trained models,
inevitably increasing computational overhead. This paper undertakes a
fundamental analysis of MIM from the perspective of pixel reconstruction, which
examines the input image patches and reconstruction target, and highlights two
critical but previously overlooked bottlenecks. Based on this analysis, we
propose a remarkably simple and effective method, PixMIM, that entails
two strategies: 1) filtering the high-frequency components from the
reconstruction target to de-emphasize the network's focus on texture-rich
details and 2) adopting a conservative data transform strategy to alleviate the
problem of missing foreground in MIM training. PixMIM can be easily
integrated into most existing pixel-based MIM approaches (i.e., using raw images
as reconstruction target) with negligible additional computation. Without bells
and whistles, our method consistently improves three MIM approaches, MAE,
ConvMAE, and LSMAE, across various downstream tasks. We believe this effective
plug-and-play method will serve as a strong baseline for self-supervised
learning and provide insights for future improvements of the MIM framework.
Code and models are available at
\url{https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim}. Comment: Updated the code link and added additional results.
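The first strategy, filtering high-frequency components from the reconstruction target, can be sketched as a Fourier-domain low-pass filter. This is a hypothetical stand-in; PixMIM's exact filter design may differ:

```python
import numpy as np

def lowpass_target(img: np.ndarray, radius: int) -> np.ndarray:
    """Keep only the low-frequency components of a 2-D image.

    A minimal sketch of the first strategy: zero out all frequencies
    outside a centred circle of the given radius in the Fourier domain,
    so texture-rich high-frequency detail is removed from the target.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

img = np.random.default_rng(0).normal(size=(32, 32))
smooth = lowpass_target(img, radius=4)
# The filtered target has much lower variance: the high-frequency
# texture the network would otherwise overfit to is gone.
print(img.var() > smooth.var())   # → True
```

Reconstructing `smooth` instead of `img` de-emphasizes texture details in the loss while leaving the low-frequency structure, which carries most of the semantic content, intact.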
Proteus: Simulating the Performance of Distributed DNN Training
DNN models are becoming increasingly larger to achieve unprecedented
accuracy, and the accompanying increased computation and memory requirements
necessitate the employment of massive clusters and elaborate parallelization
strategies to accelerate DNN training. In order to better optimize the
performance and analyze the cost, it is indispensable to model the training
throughput of distributed DNN training. However, complex parallelization
strategies and the resulting complex runtime behaviors make it challenging to
construct an accurate performance model. In this paper, we present Proteus, the
first standalone simulator to model the performance of complex parallelization
strategies through simulation execution. Proteus first models complex
parallelization strategies with a unified representation named Strategy Tree.
Then, it compiles the strategy tree into a distributed execution graph and
simulates the complex runtime behaviors, namely computation-communication overlap and bandwidth
sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate
Proteus across a wide variety of DNNs on three hardware configurations.
Experimental results show that Proteus achieves a low average prediction
error and preserves the throughput ordering of various parallelization
strategies. Compared to state-of-the-art approaches, Proteus substantially
reduces prediction error.
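Why modelling computation-communication overlap matters for throughput prediction can be seen in a toy step-time calculation. This is not Proteus itself; the layer times below are invented for the example:

```python
def simulate_step(comp_ms, comm_ms, overlap=True):
    """Return the time of one training step over a list of layers.

    With overlap, each layer's communication (e.g. a gradient
    all-reduce) starts as soon as that layer's computation finishes
    and the communication channel is free, so it hides behind the
    computation of later layers; without overlap, times simply add up.
    """
    if not overlap:
        return sum(comp_ms) + sum(comm_ms)
    t = 0.0         # compute timeline
    comm_end = 0.0  # communication timeline (serialized channel)
    for c, m in zip(comp_ms, comm_ms):
        t += c                            # compute this layer
        comm_end = max(comm_end, t) + m   # comm starts when both are ready
    return max(t, comm_end)

comp = [10.0, 10.0, 10.0]   # per-layer compute times (ms, made up)
comm = [4.0, 4.0, 4.0]      # per-layer gradient comm times (ms, made up)
print(simulate_step(comp, comm, overlap=False))   # → 42.0
print(simulate_step(comp, comm, overlap=True))    # → 34.0
```

A simple additive cost model would predict 42 ms per step, while a simulator that accounts for overlap predicts 34 ms; ignoring such runtime behaviors is exactly what makes naive performance models inaccurate.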
Force-Aware Interface via Electromyography for Natural VR/AR Interaction
While tremendous advances in visual and auditory realism have been made for
virtual and augmented reality (VR/AR), introducing a plausible sense of
physicality into the virtual world remains challenging. Closing the gap between
real-world physicality and immersive virtual experience requires a closed
interaction loop: applying user-exerted physical forces to the virtual
environment and generating haptic sensations back to the users. However,
existing VR/AR solutions either completely ignore the force inputs from the
users or rely on obtrusive sensing devices that compromise user experience.
By identifying users' muscle activation patterns while engaging in VR/AR, we
design a learning-based neural interface for natural and intuitive force
inputs. Specifically, we show that lightweight electromyography sensors,
resting non-invasively on users' forearm skin, inform and establish a robust
understanding of their complex hand activities. Fuelled by a
neural-network-based model, our interface can decode finger-wise forces in
real-time with 3.3% mean error, and generalize to new users with little
calibration. Through an interactive psychophysical study, we show that human
perception of virtual objects' physical properties, such as stiffness, can be
significantly enhanced by our interface. We further demonstrate that our
interface enables ubiquitous control via finger tapping. Ultimately, we
envision our findings to push forward research towards more realistic
physicality in future VR/AR. Comment: ACM Transactions on Graphics (SIGGRAPH Asia 2022).
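The decoding pipeline, from raw EMG windows to features to per-finger forces, can be sketched under strong simplifying assumptions. The paper uses a neural-network model on real sensor data; here a linear decoder on RMS amplitude features and fully synthetic signals stands in only to show the shape of the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_features(emg: np.ndarray, win: int) -> np.ndarray:
    """Root-mean-square amplitude per channel over non-overlapping windows."""
    n_win = emg.shape[0] // win
    trimmed = emg[: n_win * win].reshape(n_win, win, emg.shape[1])
    return np.sqrt((trimmed ** 2).mean(axis=1))

# Synthetic signals: 8 channels whose amplitude scales linearly with a
# hidden per-finger force (5 fingers); a sine carrier plays the EMG role.
W_true = rng.uniform(0.5, 1.5, size=(8, 5))
force = rng.uniform(0.0, 1.0, size=(200, 5))      # one force vector per window
amplitude = force @ W_true.T                       # (200, 8) channel amplitudes
carrier = np.sin(2 * np.pi * np.arange(50) / 10)   # zero-mean test tone
emg = np.repeat(amplitude, 50, axis=0) * np.tile(carrier, 200)[:, None]

X = rms_features(emg, win=50)                      # (200, 8) features
W_hat, *_ = np.linalg.lstsq(X, force, rcond=None)  # fit the linear decoder
err = np.abs(X @ W_hat - force).mean()
print(err < 1e-6)   # → True (the synthetic mapping is exactly linear)
```

On real forearm EMG the amplitude-to-force relationship is noisy and nonlinear, which is why the paper's learned model and per-user calibration are needed; the sketch only shows the window-feature-decode structure.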
Optimizing Video Object Detection via a Scale-Time Lattice
High-performance object detection relies on expensive convolutional networks
to compute features, often leading to significant challenges in applications,
e.g. those that require detecting objects from video streams in real time. The
key to this problem is to trade accuracy for efficiency in an effective way,
i.e. reducing the computing cost while maintaining competitive performance. To
seek a good balance, previous efforts usually focus on optimizing the model
architectures. This paper explores an alternative approach, that is, to
reallocate the computation over a scale-time space. The basic idea is to
perform expensive detection sparsely and propagate the results across both
scales and time with substantially cheaper networks, by exploiting the strong
correlations among them. Specifically, we present a unified framework that
integrates detection, temporal propagation, and across-scale refinement on a
Scale-Time Lattice. On this framework, one can explore various strategies to
balance performance and cost. Taking advantage of this flexibility, we further
develop an adaptive scheme with the detector invoked on demand, thus obtaining
an improved tradeoff. On the ImageNet VID dataset, the proposed method achieves
a competitive mAP of 79.6% at 20 fps, or 79.0% at 62 fps as a performance/speed
tradeoff. Comment: Accepted to CVPR 2018. Project page:
http://mmlab.ie.cuhk.edu.hk/projects/ST-Lattice
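The detect-sparsely-then-propagate idea can be illustrated with a minimal stand-in. The paper learns propagation and refinement networks on the Scale-Time Lattice; plain linear interpolation of keyframe boxes is used here only to show the structure:

```python
def propagate_boxes(key_dets, n_frames):
    """key_dets: {frame_idx: (x, y, w, h)} from an expensive detector
    run only on sparse keyframes. Returns one box per frame by linearly
    interpolating between the surrounding keyframes (a cheap stand-in
    for the paper's learned temporal propagation)."""
    keys = sorted(key_dets)
    out = []
    for f in range(n_frames):
        lo = max((k for k in keys if k <= f), default=keys[0])
        hi = min((k for k in keys if k >= f), default=keys[-1])
        if lo == hi:
            out.append(key_dets[lo])            # keyframe: use detection
        else:
            a = (f - lo) / (hi - lo)            # interpolation weight
            out.append(tuple((1 - a) * u + a * v
                             for u, v in zip(key_dets[lo], key_dets[hi])))
    return out

# Hypothetical example: the expensive detector runs only on frames 0 and 4.
dets = {0: (0.0, 0.0, 10.0, 10.0), 4: (8.0, 4.0, 10.0, 10.0)}
boxes = propagate_boxes(dets, 5)
print(boxes[2])   # → (4.0, 2.0, 10.0, 10.0), the midway box
```

The expensive network thus runs on 2 of 5 frames while every frame still gets a box; replacing the interpolation with learned propagation and across-scale refinement is what lets the actual method keep accuracy at high frame rates.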