Performance prediction for exponential language models

Stanley F. Chen

doi:10.3115/1620754.1620820

Publication

NAACL-HLT 2009

Conference paper

Abstract

We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we find a simple relationship that predicts test set performance with a correlation of 0.9997. We analyze why this relationship holds and show that it holds for other exponential language models as well, including class-based models and minimum discrimination information models. Finally, we discuss how this relationship can be applied to improve language model performance. © 2009 Association for Computational Linguistics.
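The regression setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the abstract does not state the formula, so the form assumed here (test cross-entropy ≈ training cross-entropy + slope × a single model statistic) is an assumption, the name `model_stat` is hypothetical, and all numbers are made-up toy values rather than results from the paper.

```python
# Illustrative sketch only: toy values, not data or results from the paper.
from math import sqrt


def fit_through_origin(xs, ys):
    """Least-squares slope for y ≈ slope * x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def pearson(us, vs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    var_u = sum((u - mean_u) ** 2 for u in us)
    var_v = sum((v - mean_v) ** 2 for v in vs)
    return cov / sqrt(var_u * var_v)


# Hypothetical per-model measurements (one entry per trained model).
train_ce = [2.0, 2.5, 3.0, 3.5]    # training set cross-entropy
model_stat = [1.0, 0.8, 0.5, 0.2]  # assumed model statistic (hypothetical)
test_ce = [2.9, 3.3, 3.4, 3.7]     # test set cross-entropy

# Regress the train/test gap on the model statistic.
gaps = [te - tr for te, tr in zip(test_ce, train_ce)]
slope = fit_through_origin(model_stat, gaps)

# Predict test cross-entropy and check how well predictions track the truth.
predicted = [tr + slope * s for tr, s in zip(train_ce, model_stat)]
r = pearson(predicted, test_ce)
print(f"slope={slope:.3f}, correlation={r:.4f}")
```

Fitting the gap rather than the raw test cross-entropy keeps the sketch to a single coefficient; a fuller regression over several candidate statistics would need a multivariate least-squares solver.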

Date

31 May 2009

Authors

Stanley F. Chen

IBM-affiliated at time of publication
