CuisineNet: Food Attributes Classification using Multi-scale Convolution Network

Authors: Banu S.F., Jabreel M., Moreno A., Puig D., Radeva P., Rashwan H.A., Sarker M.M.K., Singh V.K.
Publication date: 01/01/2018

The diversity of food and its attributes reflects the culinary habits of people from different countries. This paper therefore addresses the problem of identifying the food culture of people around the world and its flavor by classifying two main food attributes, cuisine and flavor. A deep learning model based on multi-scale convolutional networks is proposed for extracting more accurate features from input images. Multi-scale convolution layers with different kernel sizes are aggregated to weight the features produced at different scales. In addition, a joint loss function based on Negative Log Likelihood (NLL) is used to fit the model probability to multi-labeled classes for the multi-modal classification task. Furthermore, this work provides a new dataset for food attributes, called Yummly48K, extracted from the popular food website Yummly. Our model is assessed on the constructed Yummly48K dataset. The experimental results show that our proposed method yields 65% and 62% average F1 score on the validation and test sets, respectively, outperforming the state-of-the-art models.

Comment: 8 pages, Submitted in CCIA 201
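
The abstract does not spell out the form of the joint NLL loss, so the following is only a minimal sketch under a stated assumption: the per-attribute NLL terms for cuisine and flavor are added without weights. The function names (`log_softmax`, `nll`, `joint_nll`) are hypothetical and do not come from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll(logits, target):
    """Negative log likelihood of the target class index."""
    return -log_softmax(logits)[target]

def joint_nll(cuisine_logits, cuisine_target, flavor_logits, flavor_target):
    """Assumed joint loss: unweighted sum of the two per-attribute NLL terms."""
    return nll(cuisine_logits, cuisine_target) + nll(flavor_logits, flavor_target)
```

With uniform logits over four cuisines and two flavors, each term reduces to the log of the class count, so the joint loss equals log 4 + log 2. A weighted sum would be an equally plausible reading of the abstract.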

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval

Authors: Damen Dima, Fragomeni Adriano, Wray Michael
Publication date: 09/10/2022

In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.

Comment: Accepted in ACCV 202
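
The abstract mentions contrastive losses in the cross-modal embedding space without naming a specific one. As an illustration only, the sketch below assumes an InfoNCE-style clip-to-sentence loss over cosine similarities, where `clips[i]` is paired with `sentences[i]`. The function names and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(clips, sentences, temperature=0.1):
    """Assumed clip-to-sentence contrastive loss, averaged over the batch."""
    total = 0.0
    for i, clip in enumerate(clips):
        # Similarity of this clip to every sentence in the batch.
        scores = [cosine(clip, s) / temperature for s in sentences]
        # Stable log-sum-exp for the softmax denominator.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in scores))
        # Cross-entropy with the matching sentence as the positive.
        total += log_denominator - scores[i]
    return total / len(clips)
```

The loss is small when each clip embedding lies closest to its own sentence and grows when the pairing is mismatched. A symmetric version would add the sentence-to-clip direction in the same way.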

arXiv.org e-Print Archive

Explore Bristol Research