6 research outputs found
Bypass Enhancement RGB Stream Model for Pedestrian Action Recognition of Autonomous Vehicles
Pedestrian action recognition and intention prediction are among the core
issues in autonomous driving, and action recognition is one of the field's key
technologies. Many researchers have worked to improve the accuracy of
algorithms for this task. However, relatively few studies address the
computational complexity of the algorithms or the real-time performance of the
system. In autonomous driving, real-time performance and ultra-low latency are
extremely important evaluation indicators, directly related to the
availability and safety of the autonomous driving system. To this end, we
construct a bypass enhanced RGB flow model, which builds on the previous
two-branch algorithm that extracts RGB feature information and optical flow
feature information separately. In the training phase, the two branches are
merged by a distillation method, and bypass enhancement is applied in the
inference phase to ensure accuracy. The real-time performance of the action
recognition algorithm is significantly improved without a decrease in
accuracy. Experiments confirm the superiority and effectiveness of our
algorithm.
Comment: Accepted to ACPR 2019 - Workshop on Computer Vision for Modern
Vehicle
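The abstract does not state which distillation objective is used to merge the two branches. As a rough sketch only, the following assumes a standard soft-target distillation loss in which the two-branch (RGB plus optical flow) model acts as the teacher and an RGB-only model as the student. The function names, the temperature, and the weighting factor alpha are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical loss: soft teacher targets blended with the hard label.

    teacher_logits: assumed output of the two-branch (RGB + flow) model.
    student_logits: assumed output of the RGB-only model being trained.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    # Ordinary cross-entropy against the ground-truth action label.
    hard_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft_term + (1.0 - alpha) * hard_term
```

Under these assumptions only the student is evaluated at inference time, so the optical flow computation is needed during training but not at deployment. The bypass enhancement step described in the abstract is not modeled here.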
End-to-end Lip-reading: A Preliminary Study
Deep lip-reading combines the domains of computer vision and natural language processing. It uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) an analysis of 9 different end-to-end deep lip-reading systems, ii) creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into a word-level dataset, iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept, iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
End-to-End Deep Lip-reading: A Preliminary Study
Deep lip-reading is the use of deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-stage training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have so far failed to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures (Martinez et al., 2020; Martinez et al., 2021). This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. We make four main contributions: i) an analysis of 9 different end-to-end deep lip-reading systems, ii) creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into a word-level dataset, iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept, iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.
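The abstract does not describe how the released pipeline converts sentence-level clips into word-level samples. As a minimal sketch of one plausible step, the following assumes that each sentence clip comes with per-word start and end times in seconds and that the video runs at 25 frames per second. The `WordSegment` type, the `words_to_segments` function, and the input format are hypothetical and are not taken from the authors' pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class WordSegment:
    word: str
    start_frame: int  # inclusive
    end_frame: int    # exclusive


def words_to_segments(alignment, fps=25.0, min_frames=1):
    """Map (word, start_sec, end_sec) tuples to frame ranges.

    alignment: assumed list of per-word timings for one sentence clip.
    fps: assumed video frame rate.
    min_frames: words spanning fewer frames than this are dropped.
    """
    segments = []
    for word, start_sec, end_sec in alignment:
        # Round outward so that no frame overlapping the word is lost.
        start = math.floor(start_sec * fps)
        end = math.ceil(end_sec * fps)
        if end - start >= min_frames:
            segments.append(WordSegment(word.upper(), start, end))
    return segments
```

Each resulting frame range could then be cropped from the sentence video and stored with its word label. A real pipeline would likely also pad or trim segments to a fixed length, which this sketch leaves out.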