Fast incremental adaptation using maximum likelihood regression and stochastic gradient descent
Abstract
Adaptation to a new speaker or environment is becoming increasingly important as speech recognition systems are deployed in unpredictable real-world situations. Constrained, or feature-space, Maximum Likelihood Linear Regression (fMLLR) [1] has proved especially effective for this purpose, particularly when used for incremental unsupervised adaptation [2]. Unfortunately, the standard implementation described in [1], and used by most authors since, collects statistics that cost O(n³) operations per frame and O(n³) space to store, and estimating the feature-transform matrix from these statistics requires O(n⁴) operations. This cost is unacceptable for most embedded speech recognition systems. In this paper we show that the fMLLR objective function can be optimized by stochastic gradient descent in a way that achieves almost the same results as the standard implementation, using an algorithm that needs only O(n²) operations per frame and O(n²) storage. This order-of-magnitude saving makes continuous adaptation feasible in most resource-constrained embedded speech recognition applications.
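To make the complexity claim concrete, a sketch of the standard per-frame fMLLR auxiliary function from [1] and its gradient follows; the notation here is assumed rather than taken from this paper: W = [A b] is the n×(n+1) feature transform applied to the extended observation ζ_t = [x_tᵀ 1]ᵀ, and (μ_t, Σ_t) is the Gaussian aligned to frame t.

\[
Q_t(W) \;=\; \log\lvert\det A\rvert
\;-\; \tfrac{1}{2}\,(W\zeta_t - \mu_t)^{\top}\,\Sigma_t^{-1}\,(W\zeta_t - \mu_t) \;+\; \text{const},
\]
\[
\frac{\partial Q_t}{\partial W} \;=\; \big[\,A^{-\top}\;\;\mathbf{0}\,\big]
\;-\; \Sigma_t^{-1}\,(W\zeta_t - \mu_t)\,\zeta_t^{\top}.
\]

With diagonal covariances, the data-dependent gradient term costs O(n²) per frame, so a stochastic gradient step W ← W + η ∂Q_t/∂W needs only O(n²) work and O(n²) storage, provided the log-determinant term (whose gradient is A⁻ᵀ) is handled without a full O(n³) inversion at every frame, for example by refreshing an estimate of A⁻ᵀ only periodically. That last provision is an assumption of this sketch, not a detail stated in the abstract.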