Conversational-side-specific inter-session variability compensation
Abstract
General techniques for inter-session variability compensation may not capture session and channel information specific to a given conversational side. This paper investigates three methods for estimating a conversational-side-specific projection or affine transform to compensate for session and channel effects. In the first, we estimate the projection from an estimate of the within-class covariance matrix computed on a conversational-side-specific subset of the development data. In the second, we estimate the projection parameters with a discriminative objective function and present an iterative algorithm, similar to the expectation maximization (EM) algorithm, that finds the parameters maximizing this objective. In the third, we estimate an affine transform of the observation vectors of each conversational side by maximum likelihood, with the maximum likelihood objective function computed on a selected subset of the development data. We present several experiments comparing these three techniques with our baseline system on the interview tasks of the NIST 2008 and NIST 2010 speaker recognition evaluations. The best of these techniques improves performance by up to 20% relative to the baseline system. Copyright © 2011 ISCA.
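The first method (a projection derived from a side-specific within-class covariance estimate) can be illustrated with a minimal, dependency-free Python sketch. It rests on assumptions the abstract does not settle. The cosine-similarity selection rule in `select_side_subset` is a hypothetical stand-in for however the paper actually chooses the subset. The projection P = I - U U^T, where U holds the leading eigenvectors of the within-class covariance, is a nuisance-attribute-projection-style construction chosen here for illustration, not a confirmed reproduction of the paper's estimator. All function names and the toy data are illustrative.

```python
import math
import random
from collections import defaultdict


def select_side_subset(dev, labels, side_vec, n):
    """HYPOTHETICAL selection rule (not from the paper): keep the n development
    vectors with the highest cosine similarity to the conversational side."""
    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    ranked = sorted(range(len(dev)), key=lambda i: cosine(dev[i], side_vec), reverse=True)
    keep = ranked[:n]
    return [dev[i] for i in keep], [labels[i] for i in keep]


def within_class_covariance(vectors, labels):
    """Average outer product of each vector's deviation from its speaker mean."""
    dim = len(vectors[0])
    groups = defaultdict(list)
    for v, lab in zip(vectors, labels):
        groups[lab].append(v)
    cov = [[0.0] * dim for _ in range(dim)]
    for members in groups.values():
        mean = [sum(col) / len(members) for col in zip(*members)]
        for v in members:
            d = [a - b for a, b in zip(v, mean)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += d[i] * d[j]
    return [[c / len(vectors) for c in row] for row in cov]


def top_eigenvectors(mat, k, iters=500, seed=0):
    """Leading k eigenvectors of a symmetric PSD matrix via power iteration with
    deflation. If k exceeds the matrix rank, extra directions are arbitrary."""
    rng = random.Random(seed)
    dim = len(mat)
    a = [row[:] for row in mat]
    found = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        s = math.sqrt(sum(x * x for x in v))
        v = [x / s for x in v]  # start from a unit vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # remaining matrix is (numerically) zero
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(dim) for j in range(dim))
        found.append(v)
        # Deflate so the next pass converges to the next-largest eigenvalue.
        for i in range(dim):
            for j in range(dim):
                a[i][j] -= lam * v[i] * v[j]
    return found


def nuisance_projection(directions, dim):
    """P = I - U U^T: removes the span of the orthonormal nuisance directions U."""
    return [[(1.0 if i == j else 0.0) - sum(u[i] * u[j] for u in directions)
             for j in range(dim)] for i in range(dim)]


def project(p, x):
    """Matrix-vector product P x."""
    return [sum(p_ij * x_j for p_ij, x_j in zip(row, x)) for row in p]


# Toy development set: speaker means differ on axis 0; session variation lies
# on axis 1 for speakers "A" and "B" and on axis 2 for speaker "C".
dev = [[5.0, 2.0, 0.0], [5.0, -2.0, 0.0], [-5.0, 3.0, 0.0], [-5.0, -3.0, 0.0],
       [0.0, 0.0, 4.0], [0.0, 0.0, -4.0]]
labels = ["A", "A", "B", "B", "C", "C"]
side = [5.0, 1.0, 0.0]  # observation vector of the conversational side

sub, sub_labels = select_side_subset(dev, labels, side, n=2)
u_side = top_eigenvectors(within_class_covariance(sub, sub_labels), k=1)
u_global = top_eigenvectors(within_class_covariance(dev, labels), k=1)
projected = project(nuisance_projection(u_side, 3), side)
```

In this toy data the side-specific subset contains only speaker A's vectors, so the estimated nuisance direction is axis 1 and the projected side vector is approximately [5, 0, 0]. The covariance pooled over the whole development set would instead make axis 2 the leading nuisance direction, which is the kind of mismatch a side-specific estimate is meant to avoid. A production system would use a linear-algebra library for the eigendecomposition rather than hand-rolled power iteration.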