DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition
Domain alignment in convolutional networks aims to learn the degree of
layer-specific feature alignment beneficial to the joint learning of source and
target datasets. While domain alignment is increasingly popular in convolutional
networks, there have been no previous attempts to achieve it in recurrent
networks. Similar to spatial features, both source and target domains are
likely to exhibit temporal dependencies that can be jointly learnt and aligned.
In this paper we introduce Dual-Domain LSTM (DDLSTM), an architecture that is
able to learn temporal dependencies from two domains concurrently. It performs
cross-contaminated batch normalisation on both input-to-hidden and
hidden-to-hidden weights, and learns the parameters for cross-contamination,
for both single-layer and multi-layer LSTM architectures. We evaluate DDLSTM on
frame-level action recognition using three datasets, taking a pair at a time,
and report an average increase in accuracy of 3.5%. The proposed DDLSTM
architecture outperforms standard, fine-tuned, and batch-normalised LSTMs.
Comment: To appear in CVPR 2019
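To make the cross-contamination idea concrete, the following is a minimal PyTorch sketch, assuming a single-layer dual-domain cell with per-domain batch normalisation of the input-to-hidden and hidden-to-hidden pre-activations and a learned 2x2 mixing matrix (`alpha`); it illustrates the mechanism only and is not the authors' implementation.

```python
# Minimal sketch of a dual-domain LSTM cell with cross-contaminated batch
# normalisation. The class name, the mixing matrix `alpha`, and the softmax
# mixing scheme are assumptions made for this example.
import torch
import torch.nn as nn

class DualDomainLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.w_ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.w_hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        # Separate batch-norm statistics per domain for both projections.
        self.bn_ih = nn.ModuleList([nn.BatchNorm1d(4 * hidden_size) for _ in range(2)])
        self.bn_hh = nn.ModuleList([nn.BatchNorm1d(4 * hidden_size) for _ in range(2)])
        # Learned cross-contamination weights: row d mixes the two domains'
        # normalised pre-activations to produce domain d's input.
        self.alpha = nn.Parameter(torch.eye(2))
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def _mix(self, bns, z):
        # z: per-domain pre-activations; normalise each, then cross-contaminate.
        normed = [bn(zi) for bn, zi in zip(bns, z)]
        w = torch.softmax(self.alpha, dim=1)
        return [w[d, 0] * normed[0] + w[d, 1] * normed[1] for d in range(2)]

    def forward(self, x, state):
        # x, h, c: lists of length 2, one (batch, feat) tensor per domain.
        h, c = state
        ih = self._mix(self.bn_ih, [self.w_ih(xd) for xd in x])
        hh = self._mix(self.bn_hh, [self.w_hh(hd) for hd in h])
        new_h, new_c = [], []
        for d in range(2):
            gates = ih[d] + hh[d] + self.bias
            i, f, g, o = gates.chunk(4, dim=1)
            cd = torch.sigmoid(f) * c[d] + torch.sigmoid(i) * torch.tanh(g)
            new_c.append(cd)
            new_h.append(torch.sigmoid(o) * torch.tanh(cd))
        return new_h, new_c

# Toy usage: one timestep with a batch from each domain.
cell = DualDomainLSTMCell(input_size=16, hidden_size=32)
x = [torch.randn(8, 16), torch.randn(8, 16)]
h = [torch.zeros(8, 32), torch.zeros(8, 32)]
c = [torch.zeros(8, 32), torch.zeros(8, 32)]
h, c = cell(x, (h, c))
```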
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Previous one-stage action detection approaches have modelled temporal
dependencies using only the visual modality. In this paper, we explore
different strategies to incorporate the audio modality, using multi-scale
cross-attention to fuse the two modalities. We also demonstrate the correlation
between the distance from the timestep to the action centre and the accuracy of
the predicted boundaries. Thus, we propose a novel network head to estimate the
closeness of timesteps to the action centre, which we call the centricity
score. This leads to increased confidence for proposals that exhibit more
precise boundaries. Our method can be integrated with other one-stage
anchor-free architectures and we demonstrate this on three recent baselines on
the EPIC-Kitchens-100 action detection benchmark where we achieve
state-of-the-art performance. Detailed ablation studies showcase the benefits
of fusing audio and our proposed centricity scores. Code and models for our
proposed method are publicly available at
https://github.com/hanielwang/Audio-Visual-TAD.git
Comment: Accepted to VUA workshop at BMVC 2023
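As an illustration of how a centricity score can be used, here is a small PyTorch sketch, assuming fused audio-visual temporal features of shape (B, C, T) and a two-layer convolutional head; both are assumptions for the example, and the released code at the repository above is the authoritative implementation.

```python
# Illustrative centricity head for a one-stage, anchor-free temporal detector.
import torch
import torch.nn as nn

class CentricityHead(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        # One score in [0, 1] per timestep: high when the timestep lies close
        # to the centre of an action instance.
        return torch.sigmoid(self.net(feats)).squeeze(1)  # (B, T)

# At inference, per-timestep classification scores are scaled by centricity so
# that proposals anchored near action centres (which tend to have more precise
# boundaries) receive higher confidence.
B, C, T, num_classes = 2, 256, 128, 97
feats = torch.randn(B, C, T)                          # fused audio-visual features
cls_scores = torch.rand(B, num_classes, T)            # from the classification head
centricity = CentricityHead(C)(feats)                 # (B, T)
final_scores = cls_scores * centricity.unsqueeze(1)   # (B, num_classes, T)
```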
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
We propose and address a new generalisation problem: can a model trained for
action recognition successfully classify actions when they are performed within
a previously unseen scenario and in a previously unseen location? To answer
this question, we introduce the Action Recognition Generalisation Over
scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from
the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We
demonstrate recognition models struggle to generalise over 10 proposed test
splits, each of an unseen scenario in an unseen location. We thus propose CIR,
a method to represent each video as a Cross-Instance Reconstruction of videos
from other domains. Reconstructions are paired with text narrations to guide
the learning of a domain generalisable representation. We provide extensive
analysis and ablations on ARGO1M that show CIR outperforms prior domain
generalisation works on all test splits. Code and data:
https://chiaraplizz.github.io/what-can-a-cook/.
Comment: Accepted at ICCV 2023. Project page: https://chiaraplizz.github.io/what-can-a-cook
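A minimal sketch of the cross-instance reconstruction idea follows, assuming single-head dot-product attention, a fixed temperature, and a mask built from scenario/location labels; these are illustrative choices rather than the paper's exact formulation, and the narration-guided training objective is omitted.

```python
# Each video feature is re-expressed as an attention-weighted combination of
# features from *other* domains only.
import torch
import torch.nn.functional as F

def cross_instance_reconstruction(feats, domain_ids, temperature=0.1):
    """feats: (N, D) video features; domain_ids: (N,) scenario/location labels."""
    sim = feats @ feats.t() / temperature                  # (N, N) similarities
    same_domain = domain_ids.unsqueeze(0) == domain_ids.unsqueeze(1)
    sim = sim.masked_fill(same_domain, float('-inf'))      # keep other domains only
    weights = F.softmax(sim, dim=1)                        # reconstruction weights
    return weights @ feats                                 # (N, D) reconstructions

# Toy usage; in the paper the reconstructions are additionally paired with text
# narrations to guide a domain-generalisable representation.
feats = F.normalize(torch.randn(6, 64), dim=1)
domain_ids = torch.tensor([0, 0, 1, 1, 2, 2])
recon = cross_instance_reconstruction(feats, domain_ids)
```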
Use Your Head: Improving Long-Tail Video Recognition
This paper presents an investigation into long-tail video recognition. We
demonstrate that, unlike naturally-collected video datasets and existing
long-tail image benchmarks, current video benchmarks fall short on multiple
long-tailed properties. Most critically, they lack few-shot classes in their
tails. In response, we propose new video benchmarks that better assess
long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT.
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces
overfitting to instances from few-shot classes by reconstructing them as
weighted combinations of samples from head classes. LMR then employs label
mixing to learn robust decision boundaries. It achieves state-of-the-art
average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and
VideoLT-LT. Benchmarks and code at: tobyperrett.github.io/lmr
Comment: CVPR 2023
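To illustrate the reconstruction-and-label-mixing step, here is a simplified sketch, assuming similarity-weighted reconstruction from head-class samples within a batch and a fixed mixing coefficient (`mix`); both are assumptions for the example rather than the method's exact details.

```python
# Tail (few-shot) samples in a batch are reconstructed as similarity-weighted
# combinations of head-class samples, and their labels are mixed accordingly.
import torch
import torch.nn.functional as F

def lmr_reconstruct(feats, labels_onehot, is_tail, mix=0.5, temperature=0.1):
    """feats: (N, D); labels_onehot: (N, C) float one-hot; is_tail: (N,) bool."""
    head_feats = feats[~is_tail]                       # (H, D) head-class samples
    head_labels = labels_onehot[~is_tail].float()      # (H, C)
    sim = feats[is_tail] @ head_feats.t() / temperature
    w = F.softmax(sim, dim=1)                          # (T, H) reconstruction weights
    recon_feats = w @ head_feats                       # reconstructed tail features
    recon_labels = w @ head_labels                     # mixed soft labels
    out_feats, out_labels = feats.clone(), labels_onehot.clone().float()
    out_feats[is_tail] = mix * feats[is_tail] + (1 - mix) * recon_feats
    out_labels[is_tail] = mix * labels_onehot[is_tail].float() + (1 - mix) * recon_labels
    return out_feats, out_labels

# Toy usage: the last three samples belong to few-shot (tail) classes.
feats = torch.randn(8, 32)
labels = F.one_hot(torch.tensor([0, 1, 2, 0, 1, 5, 6, 7]), num_classes=10).float()
is_tail = torch.tensor([False, False, False, False, False, True, True, True])
mixed_feats, mixed_labels = lmr_reconstruct(feats, labels, is_tail)
```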
Person Re-ID by Fusion of Video Silhouettes and Wearable Signals for Home Monitoring Applications
Visual sensors for monitoring people in their living environments enable more accurate health measurements, but their use is undermined by privacy concerns. Silhouettes, generated from RGB video, can considerably alleviate these concerns. However, silhouettes make it difficult to discriminate between different subjects, preventing subject-tailored analysis of the data within a free-living, multi-occupancy home. This limitation can be overcome by a strategic fusion of sensors: wearable accelerometer devices, used in conjunction with the silhouette video data, allow video clips to be matched to the specific patient being monitored. The proposed method simultaneously solves the problem of Person Re-ID using silhouettes and enables home monitoring systems to employ sensor fusion techniques for data analysis. We develop a multimodal deep-learning detection framework that maps short video clips and acceleration signals into a latent space, where the Euclidean distance between embeddings is used to match video and acceleration streams. We train our method on the SPHERE Calorie Dataset, for which we show an average area under the ROC curve of 76.3% and an assignment accuracy of 77.4%. In addition, we propose a novel triplet loss, for which we demonstrate improved performance and convergence speed.
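A compact sketch of the kind of two-stream embedding model described above, assuming a 3D-convolutional silhouette-clip encoder, a 1D-convolutional accelerometer encoder, and the standard triplet margin loss; the paper's actual architecture and its proposed triplet loss variant differ.

```python
# Two encoders map silhouette clips and acceleration windows into a shared
# latent space where Euclidean distance matches the two streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipEncoder(nn.Module):
    """Encodes a short silhouette clip (B, 1, T, H, W) into an embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(16, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

class AccelEncoder(nn.Module):
    """Encodes a tri-axial acceleration window (B, 3, T) into an embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(16, dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=1)

# Triplet objective: a clip embedding (anchor) should lie closer to the
# matching wearer's acceleration embedding than to another occupant's.
clip_enc, acc_enc = ClipEncoder(), AccelEncoder()
anchor = clip_enc(torch.randn(4, 1, 8, 32, 32))
positive = acc_enc(torch.randn(4, 3, 100))
negative = acc_enc(torch.randn(4, 3, 100))
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.3)
```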
Meta-Learning with Context-Agnostic Initialisations
Meta-learning approaches have addressed few-shot problems by finding
initialisations suited for fine-tuning to target tasks. Often there are
additional properties within training data (which we refer to as context), not
relevant to the target task, which act as a distractor to meta-learning,
particularly when the target task contains examples from a novel context not
seen during training. We address this oversight by incorporating a
context-adversarial component into the meta-learning process. This produces an
initialisation for fine-tuning to target which is both context-agnostic and
task-generalised. We evaluate our approach on three commonly used meta-learning
algorithms and two problems. We demonstrate our context-agnostic meta-learning
improves results in each case. First, we report on Omniglot few-shot character
classification, using alphabets as context. An average improvement of 4.3% is
observed across methods and tasks when classifying characters from an unseen
alphabet. Second, we evaluate on a dataset for personalised energy expenditure
predictions from video, using participant knowledge as context. We demonstrate
that context-agnostic meta-learning decreases the average mean square error by
30%.
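To illustrate the context-adversarial component, here is a small sketch using a gradient-reversal layer and a context classifier head; the integration with a specific meta-learner (e.g. MAML inner/outer loops) is omitted, and all module names and sizes are illustrative assumptions.

```python
# Reversed gradients push the shared feature extractor to remove context
# information (e.g. alphabet or participant identity) while the task head
# still learns the target task, yielding context-agnostic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Negate (and scale) the gradient flowing back into the features.
        return -ctx.lam * grad_out, None

features = nn.Sequential(nn.Linear(64, 128), nn.ReLU())
task_head = nn.Linear(128, 5)        # e.g. 5-way character classification
context_head = nn.Linear(128, 20)    # e.g. 20 training contexts (alphabets)

x = torch.randn(32, 64)
y_task = torch.randint(0, 5, (32,))
y_context = torch.randint(0, 20, (32,))

z = features(x)
task_loss = F.cross_entropy(task_head(z), y_task)
# The context head tries to predict context; the reversal layer makes the
# feature extractor adversarial to that prediction.
ctx_loss = F.cross_entropy(context_head(GradReverse.apply(z, 1.0)), y_context)
(task_loss + ctx_loss).backward()
```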