Unfolded recurrent neural networks for speech recognition
Abstract
We introduce recurrent neural networks (RNNs) for acoustic modeling which are unfolded in time for a fixed number of time steps. The proposed models are feedforward networks in which the unfolded layers corresponding to the recurrent layer have time-shifted inputs and tied weight matrices. Besides the temporal depth due to unfolding, hierarchical processing depth is added by inserting several non-recurrent hidden layers between the unfolded layers and the output layer. The training of these models: (a) has a complexity comparable to that of deep neural networks (DNNs) with the same number of layers; (b) can be done on frame-randomized minibatches; (c) can be implemented efficiently through matrix-matrix operations on GPU architectures, which makes it scalable to large tasks. Experimental results on the 300-hour Switchboard English conversational telephony task show a 5% relative improvement in word error rate over state-of-the-art DNNs trained on FMLLR features with i-vector speaker adaptation and Hessian-free sequence discriminative training.
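The forward pass of such an unfolded network can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid activations, the zero initial hidden state, the softmax output, the layer sizes, and all names (`unfolded_rnn_forward`, `W_in`, `W_rec`, `hidden_layers`, and so on) are assumptions made for illustration. The sketch only shows the properties stated above: the same input and recurrent matrices are reused at every unfolded step, each step receives a time-shifted input frame, non-recurrent layers follow the unfolded part, and everything reduces to matrix-matrix products over a minibatch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def unfolded_rnn_forward(frames, W_in, W_rec, b_rec, hidden_layers, W_out, b_out):
    """Forward pass of an RNN unfolded for a fixed number of steps.

    frames: array of shape (steps, batch, feat_dim) holding the time-shifted
            input frames, oldest first (layout is an assumption of this sketch).
    hidden_layers: list of (W, b) pairs for the non-recurrent layers.
    """
    steps, batch, _ = frames.shape
    # Assumption: the hidden state before the first unfolded step is zero.
    h = np.zeros((batch, W_rec.shape[0]))
    # Unfolded layers: W_in, W_rec and b_rec are tied across all steps.
    for t in range(steps):
        h = sigmoid(frames[t] @ W_in + h @ W_rec + b_rec)
    # Non-recurrent hidden layers between the unfolded part and the output.
    for W, b in hidden_layers:
        h = sigmoid(h @ W + b)
    return softmax(h @ W_out + b_out)


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
steps, batch, feat_dim, rec_dim, hid_dim, n_out = 6, 4, 40, 32, 32, 10
frames = rng.standard_normal((steps, batch, feat_dim))
W_in = 0.1 * rng.standard_normal((feat_dim, rec_dim))
W_rec = 0.1 * rng.standard_normal((rec_dim, rec_dim))
b_rec = np.zeros(rec_dim)
hidden_layers = [
    (0.1 * rng.standard_normal((rec_dim, hid_dim)), np.zeros(hid_dim)),
    (0.1 * rng.standard_normal((hid_dim, hid_dim)), np.zeros(hid_dim)),
]
W_out = 0.1 * rng.standard_normal((hid_dim, n_out))
b_out = np.zeros(n_out)

posteriors = unfolded_rnn_forward(
    frames, W_in, W_rec, b_rec, hidden_layers, W_out, b_out
)
print(posteriors.shape)  # (4, 10)
```

Because each minibatch row carries its own fixed-length window of frames, rows are independent of one another, which is what allows frame-randomized minibatches in a setup like this one.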