Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a
sentence describing an activity, our task is to retrieve matching clips from an
untrimmed video. To capture the inherent structures present in both text and
video, we introduce a multilevel model that integrates vision and language
features earlier and more tightly than prior work. First, we inject text
features early on when generating clip proposals, to help eliminate unlikely
clips and thus speed up processing and boost performance. Second, to learn a
fine-grained similarity metric for retrieval, we use visual features to
modulate the processing of query sentences at the word level in a recurrent
neural network. A multi-task loss is also employed by adding query
re-generation as an auxiliary task. Our approach significantly outperforms
prior work on two challenging benchmarks: Charades-STA and ActivityNet
Captions.
Comment: AAAI 201
FaceAtt: Enhancing Image Captioning with Facial Attributes for Portrait Images
Automated image caption generation is a critical area of research that
enhances accessibility and understanding of visual content for diverse
audiences. In this study, we propose the FaceAtt model, a novel approach to
attribute-focused image captioning that emphasizes the accurate depiction of
facial attributes within images. FaceAtt automatically detects and describes a
wide range of attributes, including emotions, expressions, pointed noses, fair
skin tones, hair textures, attractiveness, and approximate age ranges.
Leveraging deep learning techniques, we explore the impact of different image
feature extraction methods on caption quality and evaluate our model's
performance using metrics such as BLEU and METEOR. FaceAtt uses annotated
portrait attributes as supplementary prior knowledge before captioning. This
innovative addition yields a subtle yet
discernible enhancement in the resulting scores, exemplifying the potency of
incorporating additional attribute vectors during training. Furthermore, our
research contributes to the broader discourse on ethical considerations in
automated captioning. This study sets the stage for future research in refining
attribute-focused captioning techniques, with a focus on enhancing linguistic
coherence, addressing biases, and accommodating diverse user needs.
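
As a rough illustration of the BLEU metric named in the abstract above, the sketch below computes a simplified unigram-only score (often called BLEU-1) against a single reference caption. The function name `bleu1` and the example captions are hypothetical. This is not the FaceAtt evaluation code, and published evaluations typically combine precisions up to 4-grams and may use several references.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty.

    Illustrative only: one reference, whitespace tokenization, no smoothing.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    # Clip each candidate word count by its count in the reference, so that
    # repeating a correct word does not inflate the precision.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))

    return penalty * precision


if __name__ == "__main__":
    # Hypothetical captions, not taken from any dataset.
    generated = "a young woman with wavy hair smiling"
    ground_truth = "a smiling young woman with wavy brown hair"
    print(f"BLEU-1: {bleu1(generated, ground_truth):.3f}")
```

Clipping and the brevity penalty pull in opposite directions: the first guards against over-generating matching words, the second against captions that score well only because they are very short.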