Real-time stereo matching is a cornerstone algorithm for many Extended
Reality (XR) applications, such as indoor 3D understanding, video pass-through,
and mixed-reality games. Despite significant advancements in deep stereo
methods, achieving real-time depth inference with high accuracy on a low-power
device remains a major challenge. One of the major difficulties is the lack of
high-quality indoor video stereo training datasets captured by head-mounted
VR/AR glasses. To address this issue, we introduce a novel video stereo
synthetic dataset that comprises photorealistic renderings of various indoor
scenes and realistic camera motion captured by a 6-DoF moving VR/AR
head-mounted display (HMD). This facilitates the evaluation of existing
approaches and promotes further research on indoor augmented reality scenarios.
As another contribution, our dataset enables us to develop a novel framework
for continuous video-rate stereo matching: a video-based approach tailored for
XR applications that achieves real-time inference at 134 fps on a standard
desktop computer, or 30 fps on a
battery-powered HMD. Our key insight is that disparity and contextual
information are highly correlated and redundant between consecutive stereo
frames. By unrolling an iterative cost aggregation in time (i.e., in the
temporal dimension), we can distribute and reuse the aggregated
features over time. This approach leads to a substantial reduction in
computation without sacrificing accuracy. Extensive evaluations and
comparisons demonstrate that our method outperforms the current state of the
art, making it a strong contender for real-time stereo matching in VR/AR
applications.
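
To make the temporal unrolling concrete, the sketch below shows one way the idea could look in code: a GRU-style update operator whose hidden state (the aggregated cost features) is carried from frame to frame, so each new frame runs only a couple of refinement iterations instead of restarting aggregation from scratch. This is a minimal illustration under assumed shapes and names, not the paper's implementation; the module, its dimensions, and the toy inputs are all hypothetical.

```python
import torch
import torch.nn as nn

class TemporalDisparityUpdater(nn.Module):
    """Hypothetical sketch of temporally unrolled cost aggregation: the
    GRU hidden state (the aggregated features) persists across frames,
    so a video stream needs only a few update iterations per frame
    rather than a full aggregation loop for every frame."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, feat_dim)
        self.delta_head = nn.Linear(feat_dim, 1)  # predicts a disparity residual

    def forward(self, cost_feat, hidden, disparity, iters=2):
        # cost_feat: (N, feat_dim) features sampled from the cost volume
        # hidden:    (N, feat_dim) state carried over from the previous
        #            frame, or None for the very first frame
        # disparity: (N, 1) warm-started from the previous frame's output
        if hidden is None:
            hidden = torch.zeros_like(cost_feat)
        for _ in range(iters):                    # few iterations per frame;
            hidden = self.gru(cost_feat, hidden)  # the rest are amortized
            disparity = disparity + self.delta_head(hidden)  # over time
        return disparity, hidden

# Toy usage: random tensors stand in for real cost-volume samples.
updater = TemporalDisparityUpdater(feat_dim=64)
disparity = torch.zeros(1024, 1)   # per-pixel disparity, flattened
hidden = None                      # no aggregated state before frame 0
for frame in range(5):             # simulate a short stereo video
    cost_feat = torch.randn(1024, 64)
    disparity, hidden = updater(cost_feat, hidden, disparity, iters=2)
```

Amortizing the iterations this way is what would reconcile video-rate throughput with iterative refinement: the per-frame cost stays small while the effective number of aggregation steps grows over the sequence.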