In this paper, we propose PhantomSound, a query-efficient black-box attack
against voice assistants. Existing black-box adversarial attacks on voice
assistants either apply substitute models or leverage the intermediate model
output to estimate the gradients for crafting adversarial audio samples.
However, these approaches require a large number of queries and a
lengthy training stage. PhantomSound leverages a decision-based attack to
produce effective adversarial audio samples and reduces the number of queries by
optimizing the gradient estimation. In the experiments, we perform our attack
against 4 different speech-to-text APIs under 3 real-world scenarios to
demonstrate the real-time attack impact. The results show that PhantomSound is
practical and robust in attacking 5 popular commercial voice-controllable
devices over the air and can bypass 3 liveness detection mechanisms
with a >95% success rate. The benchmark results show that PhantomSound can
generate adversarial examples and launch the attack in a few minutes. We
significantly improve query efficiency and reduce the cost of successful
untargeted and targeted adversarial attacks by 93.1% and 65.5% compared with the
state-of-the-art black-box attacks, using merely ~300 queries (~5 minutes) and
~1,500 queries (~25 minutes), respectively.

Comment: RAID 202
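
To illustrate what "decision-based" gradient estimation means in the abstract above, the following is a minimal sketch of a generic hard-label Monte Carlo estimator. It is not PhantomSound's algorithm, which the abstract does not specify. The names `estimate_direction`, `toy_oracle`, and `w`, the linear toy decision boundary, and all parameter values are assumptions made for illustration. A real attack would replace `toy_oracle` with a query to a speech-to-text API that reports only whether the transcription changed.

```python
import math
import random


def estimate_direction(is_adversarial, x, n_queries=200, delta=0.05, seed=0):
    """Estimate a unit direction pointing toward the adversarial region.

    Uses only yes/no (hard-label) answers from `is_adversarial`; each call
    to it counts as one query against the target model.
    """
    rng = random.Random(seed)
    dim = len(x)
    votes = []
    for _ in range(n_queries):
        # Draw a random unit vector as the probe direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
        probe = [xi + delta * ui for xi, ui in zip(x, u)]
        sign = 1.0 if is_adversarial(probe) else -1.0  # one hard-label query
        votes.append((sign, u))

    # Subtracting the mean sign is a simple variance-reduction baseline.
    mean_sign = sum(s for s, _ in votes) / n_queries
    g = [sum((s - mean_sign) * u[i] for s, u in votes) / n_queries
         for i in range(dim)]
    g_norm = math.sqrt(sum(c * c for c in g))
    return [c / g_norm for c in g] if g_norm > 0 else g


# Hypothetical stand-in for the black-box model: a linear decision boundary.
w = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75]


def toy_oracle(z):
    return sum(wi * zi for wi, zi in zip(w, z)) > 0.0


x = [0.0] * len(w)  # a point lying exactly on the toy boundary
direction = estimate_direction(toy_oracle, x)

# Cosine similarity between the estimate and the true boundary normal.
w_norm = math.sqrt(sum(c * c for c in w))
cosine = sum(d * wi for d, wi in zip(direction, w)) / w_norm
```

In this toy setting, `cosine` measures how well the estimate aligns with the true boundary normal. Spending fewer probes per estimate lowers the query count but makes the estimate noisier, which is the trade-off a query-efficient attack has to manage.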