Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
The ability to actively ground task instructions from an egocentric view is
crucial for AI agents to accomplish tasks or assist humans virtually. One
important step towards this goal is to localize and track key active objects
that undergo major state changes as a consequence of human actions/interactions
with the environment, without being told exactly what/where to ground (e.g.,
localizing and tracking the `sponge` in the video from the instruction "Dip the
`sponge` into the bucket."). While existing works approach this problem from a
purely visual perspective, we investigate to what extent the textual modality
(i.e., task instructions) and its interaction with the visual modality can be
beneficial. Specifically, we propose to improve phrase grounding models'
ability to localize the active objects by: (1) learning the role of `objects
undergoing change` and extracting them accurately from the instructions, (2)
leveraging pre- and post-conditions of the objects during actions, and (3)
recognizing the objects more robustly with descriptive knowledge. We leverage
large language models (LLMs) to extract the aforementioned action-object
knowledge, and design a per-object aggregation masking technique to effectively
perform joint inference on object phrases and symbolic knowledge. We evaluate
our framework on the Ego4D and Epic-Kitchens datasets. Extensive experiments
demonstrate the effectiveness of our proposed framework, which yields >54%
improvements in all standard metrics on the TREK-150-OPE-Det localization +
tracking task, >7% improvements in all standard metrics on the TREK-150-OPE
tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD
task.

Comment: In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
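
The per-object aggregation masking idea above can be made concrete with a small sketch. Assuming a phrase grounding model that outputs token-level box-text similarity logits, and 0/1 masks marking which prompt tokens belong to the object phrase versus its symbolic-knowledge phrases (pre/post-conditions, descriptions), the helper below pools token scores into one score per candidate box. All names, shapes, and the mean-pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of per-object aggregation masking. Assumed setup:
# a phrase grounding model scores every candidate box against every
# prompt token; masks mark which tokens belong to the object phrase or
# to one of its symbolic-knowledge phrases.
import torch

def aggregate_object_scores(token_logits, phrase_masks):
    """token_logits: (num_boxes, num_tokens) box-token similarity.
    phrase_masks: (num_phrases, num_tokens) 0/1 masks, one row for the
    object phrase and one per knowledge phrase (pre/post-condition,
    description). Returns (num_boxes,) aggregated per-object scores."""
    # Mean-pool token scores within each phrase...
    masked = token_logits.unsqueeze(0) * phrase_masks.unsqueeze(1)   # (P, B, T)
    phrase_scores = masked.sum(-1) / phrase_masks.sum(-1, keepdim=True).clamp(min=1)
    # ...then average across the object phrase and its knowledge phrases,
    # so every phrase votes on the same set of candidate boxes.
    return phrase_scores.mean(0)                                     # (B,)

# Toy usage: 3 candidate boxes, 8 prompt tokens, 2 phrases
# (e.g. "sponge" plus one descriptive phrase).
logits = torch.randn(3, 8)
masks = torch.tensor([[1, 1, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 1, 1, 1, 0, 0]], dtype=torch.float)
print(aggregate_object_scores(logits, masks).argmax().item())  # predicted box
```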
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt
methoD clAssification. The dataset is collected in a fully automated distant
supervision manner, where the labels are obtained from an existing curated
database, and the actual contents are extracted from papers associated with
each of the records in the database. We benchmark various state-of-the-art NLP
and computer vision models, including unimodal models which only take either
caption texts or images as inputs, and multimodal models. Extensive experiments
and analysis show that multimodal models, despite outperforming unimodal ones,
still leave room for improvement, especially in less-supervised grounding of
visual concepts with language and in transferability to low-resource domains. We
release our dataset and the benchmarks to facilitate future research in
multimodal learning, especially to motivate targeted improvements for
applications in scientific domains.

Comment: In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021
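
A minimal sketch of the distant-supervision collection scheme described above: labels come from records in a curated database, while the model inputs (a figure image and its caption) are extracted from the paper linked to each record. The CSV schema and the fetch helper below are hypothetical stand-ins, not MELINDA's actual pipeline.

```python
# Hypothetical sketch of fully automated distant-supervision collection:
# method labels come from a curated database; the inputs are pulled from
# the paper associated with each record. Field names are assumptions.
import csv

def build_examples(db_path, fetch_figure_and_caption):
    """db_path: CSV with assumed columns `paper_id`, `figure_id`,
    `method_label`. fetch_figure_and_caption: callable resolving
    (paper_id, figure_id) to (image_path, caption_text) from the
    paper's source files."""
    examples = []
    with open(db_path, newline="") as f:
        for record in csv.DictReader(f):
            image_path, caption = fetch_figure_and_caption(
                record["paper_id"], record["figure_id"])
            examples.append({
                "image": image_path,               # visual input
                "caption": caption,                # textual input
                "label": record["method_label"],   # distant label from the DB
            })
    return examples
```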
InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation
Indoor monocular depth estimation has attracted increasing research interest.
Most previous works have focused on methodology, primarily experimenting
with the NYU-Depth-V2 (NYUv2) dataset, and have concentrated only on overall
performance over the test set. However, little is known about robustness
and generalization when it comes to applying monocular depth estimation methods
to real-world scenarios where highly varying and diverse functional
\textit{space types} are present, such as libraries or kitchens. A breakdown of
performance by space type is essential to understand a pretrained model's
performance variance. To facilitate our investigation of robustness and to
address the limitations of previous works, we collect InSpaceType, a
high-quality and high-resolution RGBD dataset for general indoor environments.
We benchmark 11 recent methods on InSpaceType and find that they severely
suffer from performance imbalance across space types, which reveals their
underlying bias. We extend our analysis to 4 other datasets, 3 mitigation
approaches, and generalization to unseen space types. Our work marks the first
in-depth investigation of performance imbalance across space types for indoor
monocular depth estimation, drawing attention to potential safety concerns when
models are deployed without considering space types, and further shedding light
on potential ways to improve robustness. See
\url{https://depthcomputation.github.io/DepthPublic} for data.
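
The space-type performance breakdown advocated above amounts to grouping a standard depth metric by scene label rather than averaging over the whole test set. The sketch below uses absolute relative error and an assumed `space_type` field per sample; both the metric choice and the data layout are illustrative assumptions, not InSpaceType's evaluation code.

```python
# Hypothetical sketch of a per-space-type performance breakdown for
# monocular depth estimation: compute a standard metric separately for
# each functional space type to expose imbalance hidden by the overall
# test-set mean. Data layout and metric choice are assumptions.
from collections import defaultdict
import numpy as np

def abs_rel(pred, gt):
    mask = gt > 0  # ignore invalid ground-truth depth pixels
    return np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])

def breakdown_by_space_type(samples):
    """samples: iterable of dicts with keys `pred` (HxW depth map),
    `gt` (HxW depth map), and `space_type` (e.g. 'library', 'kitchen')."""
    per_type = defaultdict(list)
    for s in samples:
        per_type[s["space_type"]].append(abs_rel(s["pred"], s["gt"]))
    # Averaging within each type reveals which space types a pretrained
    # model handles well and which it handles poorly.
    return {t: float(np.mean(v)) for t, v in per_type.items()}
```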