Maximum likelihood nonlinear transformations based on deep neural networks
Abstract
Feature transformations are commonly used in speech recognition to account for distribution mismatches between the source and target domains (also referred to as covariate shift). Linear (affine) or piecewise linear transformations are typically considered. In this paper, we present deep neural network (DNN) based nonlinear feature transformations estimated under the maximum likelihood criterion. Speech feature sequences are modeled with a hidden Markov model (HMM), and the features in each HMM state are assumed to follow a Gaussian mixture model (GMM) distribution. The network is pre-trained to be close to a linear transformation and is then fine-tuned using gradient descent. Because of the nonlinearity, the gradients and the partition functions of the GMM-HMM state distributions are evaluated using a Monte Carlo (MC) method based on importance sampling. In addition, a deep stacked architecture is proposed that builds the DNN hierarchically as a series of sub-networks, each itself representing a nonlinear transformation, and that can be learned using a block-wise learning strategy. Applications of the proposed nonlinear transformations to speaker/environment adaptation and acoustic modeling in large vocabulary continuous speech recognition tasks show their superior performance over the widely used constrained maximum likelihood linear regression (CMLLR).
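As a rough illustration of the importance-sampling idea mentioned above, the following minimal sketch estimates the log of a partition function Z, the integral of an unnormalized density, by averaging importance weights under a Gaussian proposal. It is a generic one-dimensional example, not the estimator used in this paper: the function names, the scalar feature, the Gaussian proposal, and the toy warp y = x + 0.5 tanh(x) are all assumptions made for illustration.

```python
import numpy as np


def log_partition_is(log_unnorm, mean, std, n_samples=20000, seed=0):
    """Importance-sampling estimate of log Z, Z = integral of exp(log_unnorm(x)) dx.

    Illustrative only: scalar x and a Gaussian proposal N(mean, std^2).
    """
    rng = np.random.default_rng(seed)
    # Draw samples from the proposal distribution q.
    x = rng.normal(mean, std, size=n_samples)
    # Log-density of the Gaussian proposal at each sample.
    log_q = -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))
    # Log importance weights: log(unnormalized density / proposal density).
    log_w = log_unnorm(x) - log_q
    # Z is approximated by the mean weight; the max shift keeps exp() stable.
    shift = np.max(log_w)
    return shift + np.log(np.mean(np.exp(log_w - shift)))


def warped_log_density(x):
    # Hypothetical stand-in for a state density evaluated on nonlinearly
    # transformed features: a Gaussian log-kernel applied to y = x + 0.5*tanh(x).
    y = x + 0.5 * np.tanh(x)
    return -0.5 * y ** 2


# Estimate the log-partition function of the warped toy density.
log_z = log_partition_is(warped_log_density, mean=0.0, std=2.0)
```

A proposal that is wider than the target, as here, keeps the importance weights bounded; a proposal that is too narrow would give a high-variance estimate. For an unwarped Gaussian kernel the estimate can be checked against the closed form 0.5 log(2π).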