Voice user interfaces and digital assistants are rapidly entering our lives
and becoming singular touch points spanning our devices. These always-on
services capture and transmit our audio data to powerful cloud services for
further processing and subsequent actions. Our voices and raw audio signals
collected through these devices contain a host of sensitive paralinguistic
information that is transmitted to service providers regardless of deliberate
or false triggers. As our emotional patterns and sensitive attributes like our
identity, gender, mental well-being, are easily inferred using deep acoustic
models, we encounter a new generation of privacy risks by using these services.
One approach to mitigate the risk of paralinguistic-based privacy breaches is
to exploit a combination of cloud-based processing with privacy-preserving,
on-device paralinguistic information learning and filtering before transmitting
voice data. In this paper we introduce EDGY, a configurable, lightweight,
disentangled representation learning framework that transforms and filters
high-dimensional voice data to identify and contain sensitive attributes at the
edge prior to offloading to the cloud. We evaluate EDGY's on-device performance
and explore optimization techniques, including model quantization and knowledge
distillation, to enable private, accurate and efficient representation learning
on resource-constrained devices. Our results show that EDGY runs in tens of
milliseconds with 0.2% relative improvement in ABX score or minimal performance
penalties in learning linguistic representations from raw voice signals, using
a CPU and a single-core ARM processor without specialized hardware.Comment: 14 pages, 7 figures. arXiv admin note: text overlap with
arXiv:2007.1506