4,763 research outputs found
Relaxed Spatio-Temporal Deep Feature Aggregation for Real-Fake Expression Prediction
Frame-level visual features are generally aggregated in time with the
techniques such as LSTM, Fisher Vectors, NetVLAD etc. to produce a robust
video-level representation. We here introduce a learnable aggregation technique
whose primary objective is to retain short-time temporal structure between
frame-level features and their spatial interdependencies in the representation.
Also, it can be easily adapted to the cases where there have very scarce
training samples. We evaluate the method on a real-fake expression prediction
dataset to demonstrate its superiority. Our method obtains 65% score on the
test dataset in the official MAP evaluation and there is only one misclassified
decision with the best reported result in the Chalearn Challenge (i.e. 66:7%) .
Lastly, we believe that this method can be extended to different problems such
as action/event recognition in future.Comment: Submitted to International Conference on Computer Vision Workshop
Gestures Everywhere: A Multimodal Sensor Fusion and Analysis Framework for Pervasive Displays
Gestures Everywhere is a dynamic framework for multimodal sensor fusion, pervasive analytics and gesture recognition. Our framework aggregates the real-time data from approximately 100 sensors that include RFID readers, depth cameras and RGB cameras distributed across 30 interactive displays that are located in key public areas of the MIT Media Lab. Gestures Everywhere fuses the multimodal sensor data using radial basis function particle filters and performs real-time analysis on the aggregated data. This includes key spatio-temporal properties such as presence, location and identity; in addition to higher-level analysis including social clustering and gesture recognition. We describe the algorithms and architecture of our system and discuss the lessons learned from the systems deployment
- …