Event cameras, or dynamic vision sensors, have recently achieved success in areas ranging from fundamental vision tasks to high-level vision research. Owing to their ability to asynchronously capture light-intensity changes, event cameras have an inherent advantage in capturing moving objects in challenging scenarios, including low light, high dynamic range, and fast motion. Event cameras are therefore a natural fit for visual object tracking. However, current event-based trackers derived from RGB trackers simply replace the input images with event frames and still follow the conventional tracking pipeline, which relies mainly on object texture to distinguish the target. As a result, these trackers may not be robust in challenging scenarios such as camera motion and cluttered foregrounds. In this paper, we propose a distractor-aware event-based tracker that introduces transformer modules into a Siamese network architecture (named
DANet). Specifically, our model consists mainly of a motion-aware network and a target-aware network, which jointly exploit motion cues and object contours from event data to discover moving objects and identify the target by removing dynamic distractors. Our DANet can be trained in
an end-to-end manner without any post-processing and can run at over 80 FPS on
a single V100 GPU. We conduct comprehensive experiments on two large-scale event tracking datasets to validate the proposed model. The results demonstrate that our tracker outperforms state-of-the-art trackers in terms of both accuracy and efficiency.
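Since the abstract describes the architecture only at a high level, the following is a minimal, hedged sketch of how a two-branch head (motion-aware and target-aware) built from transformer modules could sit on top of a Siamese event-feature backbone. All module names, tensor shapes, and the fusion scheme (EventBackbone, MotionAwareBranch, TargetAwareBranch, DANetSketch) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (NOT the authors' implementation) of a distractor-aware
# Siamese tracker on event frames: a shared backbone, a motion-aware
# self-attention branch, and a target-aware cross-attention branch.
import torch
import torch.nn as nn


class EventBackbone(nn.Module):
    """Shared (Siamese) CNN encoder for stacked event frames (assumed 2 polarity channels)."""
    def __init__(self, in_ch=2, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # (B, dim, H/4, W/4)


class MotionAwareBranch(nn.Module):
    """Self-attention over search-region tokens to emphasize moving objects."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feat):                      # feat: (B, dim, H, W)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, HW, dim)
        return self.encoder(tokens)


class TargetAwareBranch(nn.Module):
    """Cross-attention from search tokens to template tokens to single out the target."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, search_tokens, template_tokens):
        out, _ = self.cross(search_tokens, template_tokens, template_tokens)
        return out


class DANetSketch(nn.Module):
    """Two-branch head on a Siamese backbone producing a per-location target score map."""
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = EventBackbone(dim=dim)
        self.motion = MotionAwareBranch(dim)
        self.target = TargetAwareBranch(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, template, search):
        zf = self.backbone(template)                        # template features
        xf = self.backbone(search)                          # search-region features
        z_tokens = zf.flatten(2).transpose(1, 2)            # (B, Nz, dim)
        motion_tokens = self.motion(xf)                     # motion-aware search tokens
        target_tokens = self.target(motion_tokens, z_tokens)
        b, _, h, w = xf.shape
        return self.head(target_tokens).view(b, 1, h, w)    # response map


if __name__ == "__main__":
    model = DANetSketch()
    template = torch.randn(1, 2, 64, 64)    # event template patch
    search = torch.randn(1, 2, 128, 128)    # event search region
    print(model(template, search).shape)    # torch.Size([1, 1, 32, 32])
```

In this sketch the motion-aware branch attends over the whole search region so that all moving regions (target and distractors) become salient, and the target-aware cross-attention against the template is what suppresses the dynamic distractors; this is one plausible reading of the abstract, not a description of the published model.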