Deep Scattering Spectrum with deep neural networks
Abstract
State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features and standard CNN modeling techniques can directly be used with these features. However the higher order scatter, which preserves the higher resolution information, presents new challenges in modelling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the effective normalization, neural network topology and regularization techniques to effectively model higher order scatter. The use of these higher order scatter features, in conjunction with CNNs, results in relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task. © 2014 IEEE.