A study of unsupervised clustering techniques for language modeling
Abstract
There has been recent interest in clustering text data to build topic-specific language models for large-vocabulary speech recognition. In this paper, we studied various unsupervised clustering algorithms on several corpora. First, we compared the clustering methods using quality metrics such as entropy and purity. Of the techniques studied, two-phase bisecting K-means achieved good performance while remaining relatively fast. We then performed speech recognition experiments on English and Arabic systems using the automatically derived topic-based language models, obtaining modest word error rate improvements comparable to previously published studies. Finally, we present a careful analysis of the correlation between word error rate and the distribution of misrecognized words, including an information-gain metric. Copyright © 2008 ISCA.
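The abstract names bisecting K-means as the best-performing clustering technique. As background, the core idea of bisecting K-means is to repeatedly split the largest cluster in two with plain 2-means until the desired number of clusters is reached. The sketch below is a minimal illustration of that generic algorithm, not the paper's actual two-phase implementation; the distance measure (squared Euclidean) and the choice to always bisect the largest cluster are simplifying assumptions.

```python
import random


def kmeans2(points, iters=20, seed=0):
    """Plain 2-means: split a list of numeric tuples into two clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)  # two distinct points as initial centroids
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        # assign each point to its nearest centroid (squared Euclidean distance)
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # recompute each centroid as the mean of its assigned points
        for i in (0, 1):
            if clusters[i]:
                n = len(clusters[i])
                centroids[i] = tuple(sum(x) / n for x in zip(*clusters[i]))
    return clusters


def bisecting_kmeans(points, k, iters=20, seed=0):
    """Repeatedly bisect the largest cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        # always bisecting the largest cluster is one simple split criterion;
        # real systems may instead split the cluster with highest distortion
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(c for c in kmeans2(largest, iters, seed) if c)
    return clusters
```

In a text-clustering setting the tuples would be document feature vectors (e.g. tf-idf weights) and the distance would typically be cosine-based, but the control flow is the same.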