Multimodal active speaker detection and virtual cinematography for video conferencing
Active speaker detection (ASD) and virtual cinematography (VC) can
significantly improve the remote user experience of a video conference by
automatically panning, tilting, and zooming a video conferencing camera:
users subjectively rate an expert video cinematographer's video significantly
higher than unedited video. We describe a new automated ASD and VC system that
performs within 0.3 MOS of an expert cinematographer, based on subjective
ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth
camera, and a microphone array; it extracts features from each modality and
trains an ASD classifier using AdaBoost, which is very efficient and runs in
real time. A VC is similarly trained using machine learning to optimize the
subjective quality of the overall experience. To avoid distracting the room
participants and to reduce switching latency, the system has no moving parts:
the VC works by cropping and zooming the 4K wide-FOV video stream. The system
was tuned and evaluated using extensive crowdsourcing techniques on a dataset
of N=100 meetings, each 2-5 minutes in length.
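
A minimal sketch of the ASD training step, using scikit-learn's AdaBoost in place of the paper's unspecified implementation; the feature names, dimensions, and synthetic data are illustrative assumptions, not the authors' actual pipeline.

```python
# Hedged sketch: AdaBoost over early-fused multimodal features for ASD.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-frame features from the three modalities described above:
# 4K wide-FOV video (e.g. mouth motion), depth camera (e.g. head pose),
# and microphone array (e.g. sound-source localization angle).
n_frames = 5000
video_feats = rng.normal(size=(n_frames, 8))
depth_feats = rng.normal(size=(n_frames, 4))
audio_feats = rng.normal(size=(n_frames, 6))
X = np.hstack([video_feats, depth_feats, audio_feats])  # feature-level fusion
y = rng.integers(0, 2, size=n_frames)  # 1 = frame shows the active speaker

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default shallow decision-stump base learners keep inference cheap,
# in line with the paper's emphasis on efficient, real-time classification.
asd = AdaBoostClassifier(n_estimators=100, random_state=0)
asd.fit(X_train, y_train)
print("held-out accuracy:", asd.score(X_test, y_test))
```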
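The moving-parts-free VC amounts to a digital pan/tilt/zoom: crop a region of interest from the 4K frame and resize it. A minimal sketch follows; the ROI-selection policy (the learned VC model) is abstracted away as an assumed input, and the function name and signature are hypothetical.

```python
# Hedged sketch: virtual pan/tilt/zoom by cropping the 4K wide-FOV stream.
import numpy as np
import cv2

def virtual_camera(frame_4k: np.ndarray, cx: int, cy: int, zoom: float,
                   out_size=(1280, 720)) -> np.ndarray:
    """Crop a zoom-scaled window centered on (cx, cy) and resize it."""
    h, w = frame_4k.shape[:2]
    crop_w, crop_h = int(w / zoom), int(h / zoom)
    # Clamp the window so it stays inside the source frame (assumes zoom >= 1).
    x0 = min(max(cx - crop_w // 2, 0), w - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), h - crop_h)
    roi = frame_4k[y0:y0 + crop_h, x0:x0 + crop_w]
    return cv2.resize(roi, out_size, interpolation=cv2.INTER_LINEAR)

# Example: "pan" toward a speaker detected at (2800, 900) with 2x zoom.
frame = np.zeros((2160, 3840, 3), dtype=np.uint8)  # stand-in 4K frame
shot = virtual_camera(frame, cx=2800, cy=900, zoom=2.0)
print(shot.shape)  # (720, 1280, 3)
```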
Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view
Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, in which depth video captures either a frontal or a profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. The classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors compared with systems that ignore it. © 2016 IEEE
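
A minimal sketch of the SVM-based speech activity classification described above: per-subject feature streams from audio, planar video, and depth are fused by concatenation and classified frame by frame. The feature dimensions, the concatenation fusion scheme, and the synthetic data are assumptions for illustration only.

```python
# Hedged sketch: per-subject speech/non-speech SVM over fused modality streams.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical per-frame features for one visually detected subject.
n_frames = 2000
audio = rng.normal(size=(n_frames, 13))  # e.g. MFCCs
video = rng.normal(size=(n_frames, 10))  # e.g. mouth-region appearance
depth = rng.normal(size=(n_frames, 6))   # e.g. lip/jaw motion from depth
X = np.hstack([audio, video, depth])     # feature-level fusion
y = rng.integers(0, 2, size=n_frames)    # 1 = subject is speaking

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:1500], y[:1500])
speech_activity = clf.predict(X[1500:])  # temporal speech-activity labels

# Combining such per-subject label streams (one per detected subject) yields
# the speaker diarization output: who speaks when, including overlap.
print(speech_activity[:20])
```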