Source separation and other audio applications have traditionally relied on
short-time Fourier transforms as a front-end frequency-domain representation.
The lack of a neural network equivalent to the forward and inverse transforms
hinders the implementation of end-to-end learning systems for these
applications. We present an auto-encoder neural
network that can act as an equivalent to short-time front-end transforms. We
demonstrate the ability of the network to learn optimal, real-valued basis
functions directly from the raw waveform of a signal and further show how it
can be used as an adaptive front-end for supervised source separation. In terms
of separation performance, these transforms significantly outperform their
Fourier counterparts. Finally, we propose a novel cost function for end-to-end
source separation based on the source-to-distortion ratio.

Comment: 4 figures, 4 pages
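
To make the idea of a source-to-distortion ratio (SDR) based cost function concrete, here is a minimal sketch in plain Python. It is not the formulation from the paper, which the abstract does not spell out. It assumes a common projection-based definition of SDR: the estimate is projected onto the reference waveform, the residual is treated as distortion, and the negative SDR in dB is returned so that minimizing the loss maximizes SDR. The function name `neg_sdr_loss` and the stabilizer `eps` are hypothetical choices made for illustration.

```python
import math


def neg_sdr_loss(estimate, target, eps=1e-12):
    """Negative SDR (in dB) between an estimated and a reference waveform.

    Assumed definition (not taken from the paper): the estimate is split
    into its projection onto the target and a residual, and SDR is the
    energy ratio of the two.
    """
    # Inner product <estimate, target> and reference energy ||target||^2.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)

    # Component of the estimate that lies along the reference signal.
    scale = dot / (target_energy + eps)
    projection = [scale * t for t in target]

    # Everything that is not explained by the reference counts as distortion.
    residual = [e - p for e, p in zip(estimate, projection)]

    signal_energy = sum(p * p for p in projection)
    distortion_energy = sum(r * r for r in residual)

    sdr_db = 10.0 * math.log10((signal_energy + eps) / (distortion_energy + eps))
    return -sdr_db  # lower loss means higher SDR


# Usage: a distorted estimate yields a higher loss than a clean one.
reference = [1.0, 0.0, -1.0, 0.0]
clean_estimate = [1.0, 0.0, -1.0, 0.0]
noisy_estimate = [1.0, 0.5, -1.0, 0.5]
clean_loss = neg_sdr_loss(clean_estimate, reference)
noisy_loss = neg_sdr_loss(noisy_estimate, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, this particular variant is insensitive to the overall gain of the estimate. In an end-to-end system the same expression would be written with a differentiable tensor library so that gradients can flow back through the learned front-end.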