148 research outputs found
Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationships or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over state-of-the-art.Comment: To be appeared in CVPR 2017 as an oral pape
PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling
Masked Image Modeling (MIM) has achieved promising progress with the advent
of Masked Autoencoders (MAE) and BEiT. However, subsequent works have
complicated the framework with new auxiliary tasks or extra pre-trained models,
inevitably increasing computational overhead. This paper undertakes a
fundamental analysis of MIM from the perspective of pixel reconstruction, which
examines the input image patches and reconstruction target, and highlights two
critical but previously overlooked bottlenecks. Based on this analysis, we
propose a remarkably simple and effective method, {\ourmethod}, that entails
two strategies: 1) filtering the high-frequency components from the
reconstruction target to de-emphasize the network's focus on texture-rich
details and 2) adopting a conservative data transform strategy to alleviate the
problem of missing foreground in MIM training. {\ourmethod} can be easily
integrated into most existing pixel-based MIM approaches (\ie, using raw images
as reconstruction target) with negligible additional computation. Without bells
and whistles, our method consistently improves three MIM approaches, MAE,
ConvMAE, and LSMAE, across various downstream tasks. We believe this effective
plug-and-play method will serve as a strong baseline for self-supervised
learning and provide insights for future improvements of the MIM framework.
Code and models are available at
\url{https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/configs/selfsup/pixmim}.Comment: Update code link and add additional result
Scaling Laws of RoPE-based Extrapolation
The extrapolation capability of Large Language Models (LLMs) based on Rotary
Position Embedding is currently a topic of considerable interest. The
mainstream approach to addressing extrapolation with LLMs involves modifying
RoPE by replacing 10000, the rotary base of in the
original RoPE, with a larger value and providing longer fine-tuning text. In
this work, we first observe that fine-tuning a RoPE-based LLM with either a
smaller or larger base in pre-training context length could significantly
enhance its extrapolation performance. After that, we propose
\textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework
from the periodic perspective, to describe the relationship between the
extrapolation performance and base value as well as tuning context length. In
this process, we also explain the origin of the RoPE-based extrapolation issue
by \textbf{\textit{critical dimension for extrapolation}}. Besides these
observations and analyses, we achieve extrapolation up to 1 million context
length within only 16K training length on LLaMA2 7B and 13B.Comment: 26 pages, 12 figures, Accepted by ICLR 202
Proteus: Simulating the Performance of Distributed DNN Training
DNN models are becoming increasingly larger to achieve unprecedented
accuracy, and the accompanying increased computation and memory requirements
necessitate the employment of massive clusters and elaborate parallelization
strategies to accelerate DNN training. In order to better optimize the
performance and analyze the cost, it is indispensable to model the training
throughput of distributed DNN training. However, complex parallelization
strategies and the resulting complex runtime behaviors make it challenging to
construct an accurate performance model. In this paper, we present Proteus, the
first standalone simulator to model the performance of complex parallelization
strategies through simulation execution. Proteus first models complex
parallelization strategies with a unified representation named Strategy Tree.
Then, it compiles the strategy tree into a distributed execution graph and
simulates the complex runtime behaviors, comp-comm overlap and bandwidth
sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate
Proteus across a wide variety of DNNs on three hardware configurations.
Experimental results show that Proteus achieves average prediction
error and preserves order for training throughput of various parallelization
strategies. Compared to state-of-the-art approaches, Proteus reduces prediction
error by up to
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Hallucination, posed as a pervasive challenge of multi-modal large language
models (MLLMs), has significantly impeded their real-world usage that demands
precise judgment. Existing methods mitigate this issue with either training
with specific designed data or inferencing with external knowledge from other
sources, incurring inevitable additional costs. In this paper, we present
OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a
Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate
the hallucination issue without additional data, knowledge, or training. Our
approach begins with an interesting observation that, most hallucinations are
closely tied to the knowledge aggregation patterns manifested in the
self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a
few summary tokens, but not all the previous tokens. Such partial over-trust
inclination results in the neglecting of image tokens and describes the image
content with hallucination. Based on the observation, OPERA introduces a
penalty term on the model logits during the beam-search decoding to mitigate
the over-trust issue, along with a rollback strategy that retrospects the
presence of summary tokens in the previously generated tokens, and re-allocate
the token selection if necessary. With extensive experiments, OPERA shows
significant hallucination-mitigating performance on different MLLMs and
metrics, proving its effectiveness and generality. Our code is available at:
https://github.com/shikiw/OPERA.Comment: CVPR 2024, code is available at https://github.com/shikiw/OPER
- …