CuisineNet: Food Attributes Classification using Multi-scale Convolution Network
The diversity of food and its attributes reflects the culinary habits of
people from different countries. This paper therefore addresses the problem of
identifying the food culture of people around the world, along with its flavor, by
classifying two main food attributes, cuisine and flavor. A deep learning model
based on multi-scale convolutional networks is proposed for extracting more
accurate features from input images. Multi-scale convolution layers with
different kernel sizes are aggregated to weight the features
obtained at different scales. In addition, a joint loss function based on
Negative Log Likelihood (NLL) is used to fit the model probabilities to the
multi-labeled classes of the multi-modal classification task. Furthermore, this work
provides a new dataset for food attributes, called Yummly48K, extracted from
the popular food website Yummly. Our model is assessed on the constructed
Yummly48K dataset. The experimental results show that the proposed method
yields average F1 scores of 65% and 62% on the validation and test sets, respectively,
outperforming the state-of-the-art models.
Comment: 8 pages, Submitted in CCIA 201
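The joint NLL loss mentioned in the abstract could be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formula, so the sketch assumes the joint loss is a weighted sum of one NLL term per attribute (cuisine and flavor). The function names and the weights `w_cuisine` and `w_flavor` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def nll(logits, target):
    """Negative log likelihood of the target class index."""
    return -log_softmax(logits)[target]


def joint_nll(cuisine_logits, cuisine_target,
              flavor_logits, flavor_target,
              w_cuisine=1.0, w_flavor=1.0):
    """Assumed joint loss: weighted sum of the two per-attribute NLL terms."""
    return (w_cuisine * nll(cuisine_logits, cuisine_target)
            + w_flavor * nll(flavor_logits, flavor_target))


# One image with three hypothetical cuisine classes and two flavor classes.
loss = joint_nll([2.0, 0.5, -1.0], 0, [0.1, 1.2], 1)
```

Under this assumption, both classification heads share the same image features, and a single backward pass through the summed loss would update them together.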
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
In this paper, we re-examine the task of cross-modal clip-sentence retrieval,
where the clip is part of a longer untrimmed video. When the clip is short or
visually ambiguous, knowledge of its local temporal context (i.e. surrounding
video segments) can be used to improve the retrieval performance. We propose
Context Transformer (ConTra), an encoder architecture that models the
interaction between a video clip and its local temporal context in order to
enhance its embedded representations. Importantly, we supervise the context
transformer using contrastive losses in the cross-modal embedding space. We
explore context transformers for video and text modalities. Results
consistently demonstrate improved performance on three datasets: YouCook2,
EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive
ablation studies and context analysis show the efficacy of the proposed method.
Comment: Accepted in ACCV 202
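A contrastive loss in a cross-modal embedding space, as mentioned in the abstract, could look like the sketch below. The abstract does not state the exact formulation, so this assumes a symmetric InfoNCE-style loss over a batch of paired clip and sentence embeddings, where row i of each list forms the matching pair. The function names and the `temperature` value are illustrative assumptions, not taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of picking key i for query i among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(clips, sentences, temperature=0.07):
    """Average of clip-to-sentence and sentence-to-clip losses (assumed form)."""
    return 0.5 * (info_nce(clips, sentences, temperature)
                  + info_nce(sentences, clips, temperature))
```

In this sketch the other items in the batch act as negatives, so the loss is small when each clip embedding lies closest to its own sentence embedding and large when the pairs are mismatched.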