Unnormalized exponential and neural network language models
Abstract
Model M, an exponential class-based language model, and neural network language models (NNLM's) have outperformed word n-gram language models over a wide range of tasks. However, these gains come at the cost of vastly increased computation when calculating word probabilities. For both models, the bulk of this computation involves evaluating the softmax function over a large word or class vocabulary to ensure that probabilities sum to 1. In this paper, we study unnormalized variants of Model M and NNLM's, whereby the softmax function is simply omitted. Accordingly, model training must be modified to encourage scores to sum close to 1. In this paper, we demonstrate up to a factor of 35 faster n-gram lookups with unnormalized models over their normalized counterparts, while still yielding state-of-the-art performance in WER (10.2 on the English broadcast news rt04 set).