
    Egocentric vision-based passive dietary intake monitoring

    Egocentric (first-person) perception captures and reveals how people perceive their surroundings. This unique perceptual view enables passive and objective monitoring of human-centric activities and behaviours. Egocentric visual data are captured with wearable cameras. Recent advances in wearable technologies have made wearable cameras lightweight and accurate, with long battery life, making long-term passive monitoring a promising solution for healthcare and human behaviour understanding. In addition, recent progress in deep learning has provided an opportunity to accelerate the development of passive methods that enable pervasive and accurate monitoring, as well as comprehensive modelling of human-centric behaviours. This thesis investigates and proposes innovative egocentric technologies for passive dietary intake monitoring and human behaviour analysis. Conventional dietary assessment methods in nutritional epidemiology, such as 24-hour dietary recall (24HR) and food frequency questionnaires (FFQs), rely heavily on subjects’ memory to recall their dietary intake and on trained dietitians to collect, interpret, and analyse the dietary data; passive dietary intake monitoring can ease this burden and provide a more accurate and objective assessment of dietary intake. Egocentric vision-based passive monitoring uses wearable cameras to continuously record human-centric activities with a close-up view. This passive way of monitoring does not require active participation from the subject and records rich spatiotemporal detail for fine-grained analysis. Based on egocentric vision and passive dietary intake monitoring, this thesis proposes: 1) a novel network structure called PAR-Net that achieves accurate food recognition by mining discriminative food regions; PAR-Net has been evaluated on food intake images captured by wearable cameras as well as on non-egocentric food images to validate its effectiveness for food recognition; 2) a deep learning-based solution that recognises consumed food items and counts the number of bites taken by the subject from egocentric videos in an end-to-end manner; 3) in light of privacy concerns around egocentric data, a privacy-preserving solution for passive dietary intake monitoring that uses image captioning to summarise the image content and then combines image captioning with 3D container reconstruction to report the actual food volume consumed. Furthermore, a novel framework that integrates food recognition, hand tracking, and face recognition has been developed to tackle the challenge of assessing individual dietary intake in food sharing scenarios with the use of a panoramic camera. Extensive experiments have been conducted. Tested with both laboratory data (captured in London) and field study data (captured in Africa), the proposed solutions have demonstrated the feasibility and accuracy of using egocentric camera technologies with deep learning methods for individual dietary assessment and human behaviour analysis.
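
    The abstract does not spell out PAR-Net's architecture, so as a purely illustrative, hedged sketch of the general idea of mining discriminative regions, the PyTorch snippet below pools CNN backbone features with a learned spatial attention map. The class name, the 512-channel input, and the 101 food classes are placeholder assumptions, not details taken from the thesis.

```python
# Illustrative sketch only (not the thesis's PAR-Net): classify food by
# weighting spatial locations of a CNN feature map with a learned attention
# map, i.e. a simple form of discriminative-region mining.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionClassifier(nn.Module):
    def __init__(self, in_channels: int = 512, num_classes: int = 101):
        super().__init__()
        # A 1x1 conv scores how discriminative each spatial location is.
        self.attn = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) feature maps from any CNN backbone.
        b, c, h, w = feats.shape
        scores = self.attn(feats).reshape(b, 1, h * w)   # (B, 1, HW)
        weights = F.softmax(scores, dim=-1)              # attention over locations
        pooled = torch.bmm(weights, feats.reshape(b, c, h * w).transpose(1, 2))
        return self.fc(pooled.squeeze(1))                # (B, num_classes) logits

# Toy usage with random features standing in for a backbone's output.
logits = RegionAttentionClassifier()(torch.randn(2, 512, 7, 7))
print(logits.shape)  # torch.Size([2, 101])
```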

    Assessing individual dietary intake in food sharing scenarios with food and human pose detection

    Food sharing and communal eating are very common in some countries. To assess individual dietary intake in food sharing scenarios, this work proposes a vision-based approach that first captures the food sharing scenario with a 360-degree camera and then uses a neural network to infer different eating states of each individual based on their body pose and their positions relative to the dishes. The number of bites each individual has taken of each dish is then deduced by analyzing the inferred eating states. A new dataset with 14 panoramic food sharing videos was constructed to validate our approach. The results show that our approach is able to reliably predict different eating states as well as each individual's bite count with respect to each dish in food sharing scenarios.
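
    As a hedged sketch of the last step described above (deducing bite counts from the inferred eating states), the snippet below counts one bite each time a diner's predicted state transitions into an assumed EATING state; the state names and the edge-counting rule are illustrative assumptions rather than the paper's actual post-processing.

```python
# Hedged sketch (not the paper's model): given a per-frame sequence of
# predicted states for one diner with respect to one dish, count a bite
# each time the diner enters the EATING state from a different state.
from enum import Enum

class State(Enum):
    IDLE = 0
    REACHING = 1
    EATING = 2

def count_bites(states: list) -> int:
    bites, prev = 0, None
    for s in states:
        if s is State.EATING and prev is not State.EATING:
            bites += 1          # rising edge into EATING = one bite
        prev = s
    return bites

# Example: two separate eating bouts -> 2 bites for this dish.
seq = [State.IDLE, State.REACHING, State.EATING, State.EATING,
       State.IDLE, State.EATING]
print(count_bites(seq))  # 2
```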

    Eating Speed Measurement Using Wrist-Worn IMU Sensors in Free-Living Environments

    Eating speed is an important indicator that has been widely scrutinized in nutritional studies. The relationship between eating speed and several intake-related problems such as obesity, diabetes, and oral health has received increased attention from researchers. However, existing studies mainly use self-reported questionnaires to obtain participants' eating speed, in which they choose among slow, medium, and fast. Such a non-quantitative method is highly subjective and coarse at the individual level. In this study, we propose a novel approach to measure eating speed in free-living environments automatically and objectively using wrist-worn inertial measurement unit (IMU) sensors. Specifically, a temporal convolutional network combined with a multi-head attention module (TCN-MHA) is developed to detect bites (including eating and drinking gestures) from free-living IMU data. The predicted bite sequences are then clustered into eating episodes. Eating speed is calculated by dividing the number of bites by the time taken to finish the eating episode. To validate the proposed approach to eating speed measurement, a 7-fold cross validation is applied to the self-collected, finely annotated full-day-I (FD-I) dataset, and a hold-out experiment is conducted on the full-day-II (FD-II) dataset. The two datasets, which are publicly available, were collected from 61 participants in free-living environments and have a total duration of 513 h. Experimental results show that the proposed approach achieves a mean absolute percentage error (MAPE) of 0.110 and 0.146 on the FD-I and FD-II datasets, respectively, showcasing the feasibility of automated eating speed measurement. To the best of our knowledge, this is the first study investigating automated eating speed measurement.
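
    The eating-speed arithmetic described above can be sketched directly. The Python snippet below (which does not reproduce the TCN-MHA detector) clusters detected bite timestamps into eating episodes with an assumed maximum-gap threshold, reports eating speed in bites per minute, and computes MAPE; the 300-second gap and the toy timestamps are assumptions made for illustration.

```python
# Hedged sketch of the post-processing arithmetic only: cluster detected
# bite timestamps into eating episodes by a maximum time gap, then report
# eating speed as bites per minute, plus the MAPE metric.
def cluster_episodes(bite_times_s, max_gap_s=300.0):
    """Group bite timestamps (seconds) into episodes; a new episode starts
    whenever the gap to the previous bite exceeds max_gap_s."""
    episodes, current = [], []
    for t in sorted(bite_times_s):
        if current and t - current[-1] > max_gap_s:
            episodes.append(current)
            current = []
        current.append(t)
    if current:
        episodes.append(current)
    return episodes

def eating_speed_bpm(episode):
    duration_min = (episode[-1] - episode[0]) / 60.0
    return len(episode) / duration_min if duration_min > 0 else float("nan")

def mape(y_true, y_pred):
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

bites = [10, 70, 130, 200, 3600, 3660, 3725]      # toy timestamps (s)
speeds = [eating_speed_bpm(ep) for ep in cluster_episodes(bites)]
print(speeds)                                      # two episodes detected
print(mape([2.0, 3.4], speeds))                    # error vs. hypothetical ground truth
```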

    Using Deep Learning and 360 Video to Detect Eating Behavior for User Assistance Systems

    The rising prevalence of non-communicable diseases calls for more sophisticated approaches to support individuals in engaging in healthy lifestyle behaviors, particularly in terms of their dietary intake. Building on recent advances in information technology, user assistance systems hold the potential to combine active and passive data collection methods to monitor dietary intake and, subsequently, to support individuals in making better decisions about their diet. In this paper, we review the state of the art in active and passive dietary monitoring along with the issues these methods face. Building on this groundwork, we propose a research framework for user assistance systems that combine active and passive methods with three distinct levels of assistance. Finally, we outline a proof-of-concept study that uses video from a 360-degree camera to automatically detect eating behavior as a source of passive dietary monitoring for decision support.

    Optimized Matrix Feature Analysis – Convolutional Neural Network (OMFA-CNN) Model for Potato Leaf Diseases Detection System

    The potato is one of the most widely grown crops. As a staple food, potatoes are prioritised for cultivation worldwide, and because they are such a rich source of vitamins and minerals, they can underpin a robust food security system. However, a number of diseases hamper cultivation and harm potato output, so early disease identification can offer a better path to effective crop production. The aim of this research work is to classify and detect potato leaf (PL) diseases using the OMFA-CNN deep learning model, an optimized matrix feature analysis-convolutional neural network for PL disease detection. In the first phase, PL features are extracted from potato leaf images using the K-means clustering image segmentation method. In the last phase, the new OMFA-CNN model is used to classify viral and bacterial diseases of PLs. The PL disease dataset consists of 2351 images gathered in real time and from the Kaggle (PlantVillage) dataset. The implemented OMFA-CNN model attained 99.3% precision and 99% recall on potato disease detection. The implemented method was also compared with Mask R-CNN, SVM, and other models, and attained significantly higher precision and recall.
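
    As a hedged illustration of the segmentation phase described above (not the actual OMFA-CNN pipeline), the sketch below applies K-means colour clustering with scikit-learn to isolate one pixel cluster as a candidate leaf/lesion region before classification; the random stand-in image, the choice of three clusters, and treating cluster 0 as the region of interest are all placeholder assumptions.

```python
# Hedged sketch: K-means colour clustering as a simple leaf/lesion
# segmentation step of the kind the abstract describes, applied to a
# random image so the snippet stays self-contained.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                 # stand-in for an RGB leaf photo
pixels = image.reshape(-1, 3)                   # one row per pixel

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pixels)
segments = labels.reshape(64, 64)               # cluster id per pixel

# Keep only the pixels of one cluster (assumed here to be the lesion)
# as the masked input that a downstream CNN classifier would receive.
mask = (segments == 0)
masked = image * mask[..., None]
print(masked.shape, mask.mean())                # (64, 64, 3) and cluster-0 fraction
```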

    Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. Published at the European Conference on Computer Vision (ECCV) 2018. Dataset and project page: http://epic-kitchens.github.io

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech; the database is made publicly available for research purposes. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification using both brute-forced low-level acoustic features and higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.
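
    The evaluation protocol above, an SVM in a leave-one-speaker-out framework scored by unweighted average recall, can be sketched with scikit-learn as follows; the synthetic features, labels, and speaker IDs are stand-ins, not the iHEARu-EAT data or the paper's feature sets.

```python
# Hedged sketch of the evaluation protocol only: a linear SVM scored with
# leave-one-speaker-out cross-validation, reporting unweighted average
# recall over the held-out speakers.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                  # 300 utterances, 20 features
y = rng.integers(0, 2, size=300)                # eating vs. not eating
speakers = rng.integers(0, 30, size=300)        # 30 speakers, as in iHEARu-EAT

recalls = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    clf = SVC(kernel="linear", C=1.0).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    # average="macro" gives the unweighted average recall over classes.
    recalls.append(recall_score(y[test_idx], pred, average="macro"))

print(f"mean UAR over held-out speakers: {np.mean(recalls):.3f}")
```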

    Video Storytelling: Textual Summaries for Events

    Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this work, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a Residual Bidirectional Recurrent Neural Network to leverage contextual information from past and future. Second, we propose a Narrator model to discover the underlying storyline. The Narrator is formulated as a reinforcement learning agent which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines, and show that our method achieves better performance, in terms of quantitative measures and user study. Published in IEEE Transactions on Multimedia.
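
    As a hedged sketch of the Residual Bidirectional Recurrent Neural Network idea mentioned above (not the paper's exact design), the PyTorch block below runs per-clip video features through a bidirectional GRU and adds the projected output back to the input, so each clip embedding is refined by context from past and future; the feature dimensions are illustrative assumptions.

```python
# Hedged sketch: a residual bidirectional RNN block in the spirit of the
# abstract, where a BiGRU's output is projected back to the input size and
# added to the input sequence as a residual.
import torch
import torch.nn as nn

class ResidualBiGRU(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)   # map BiGRU output back to dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) sequence of per-clip video features.
        ctx, _ = self.rnn(x)                     # (batch, time, 2*hidden)
        return x + self.proj(ctx)                # residual connection

# Toy usage: 2 videos, 12 clips each, 256-d features per clip.
out = ResidualBiGRU()(torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 12, 256])
```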