Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization
Abstract
Training neural network acoustic models with sequence-discriminative criteria, such as state-level minimum Bayes risk (sMBR), has been shown to produce large improvements in performance over cross-entropy. However, because they entail the processing of lattices, sequence criteria are much more computationally intensive than cross-entropy. We describe a distributed neural network training algorithm, based on Hessian-free optimization, that scales to deep networks and large data sets. For the sMBR criterion, this training algorithm is faster than stochastic gradient descent by a factor of 5.5 and yields a 4.4% relative improvement in word error rate on a 50-hour broadcast news task. Distributed Hessian-free sMBR training yields relative reductions in word error rate of 7-13% over cross-entropy training with stochastic gradient descent on two larger tasks: Switchboard and DARPA RATS noisy Levantine Arabic. Our best Switchboard DBN achieves a word error rate of 16.4% on rt03-FSH.
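As a rough illustration of the idea behind Hessian-free optimization, the sketch below computes one update by solving a linear system with conjugate gradient, using only curvature-vector products and never forming the curvature matrix explicitly. It is a minimal single-machine example on a toy quadratic loss, not the distributed sMBR training algorithm described here. The matrix `A`, the vector `c`, and the function names `conjugate_gradient`, `curvature_vector_product`, and `gradient` are all assumptions made for illustration.

```python
def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Approximately solve A x = b using only products A v (matvec)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        if rs_old < tol:             # residual is small enough: stop
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Toy loss (assumption): 0.5 * w^T A w - c^T w, with A symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
c = [1.0, 2.0]


def curvature_vector_product(v):
    """Return A v; in a real network this would be computed without forming A."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]


def gradient(w):
    """Gradient of the toy loss at w, namely A w - c."""
    Aw = curvature_vector_product(w)
    return [awi - ci for awi, ci in zip(Aw, c)]


# One Hessian-free style update: solve A d = -g for the step d, then apply it.
w = [0.0, 0.0]
g = gradient(w)
step = conjugate_gradient(curvature_vector_product, [-gi for gi in g])
w = [wi + si for wi, si in zip(w, step)]
```

Because the toy loss is exactly quadratic, a single update reaches its minimizer. For a neural network the quadratic model is only a local approximation, so the solve-and-update cycle would be repeated over many outer iterations.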