On classification and segmentation of massive audio data streams
Abstract
In recent years, the proliferation of VoIP data has created a number of applications in which it is desirable to perform quick online classification and recognition of massive voice streams. Such applications are typically encountered in real-time intelligence and surveillance. In many cases, the data streams arrive in compressed format, and processing rates can often reach gigabits per second. All known techniques for speaker voice analysis require an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition is Gaussian mixture modeling (GMM), which requires an iterative expectation-maximization procedure for training and therefore cannot be implemented in real time. In many real applications (such as surveillance), it is desirable to perform the recognition process online, so that the system can be quickly adapted to new segments of the data. In many cases, it may also be desirable to quickly create databases of training profiles for speakers of interest. In this paper, we discuss the details of such an online voice recognition system. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. One of the surprising and insightful observations from our experience with such a system is that, while it was originally designed only for efficiency, it also turned out to be more accurate than the widely used GMM. This is because the conciseness of the micro-cluster model makes it less prone to overtraining. It is thus often possible to get the best of both worlds and outperform complex models from both an efficiency and an accuracy perspective. We present experimental results illustrating the effectiveness and efficiency of the method.
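The contrast drawn above is between GMM, which needs iterative expectation-maximization over a stored training set, and a micro-cluster signature that can be updated in a single pass over the stream. The following is a minimal sketch of that idea, assuming the signature keeps only additive first- and second-order statistics of the feature vectors (as is common in the stream-clustering literature); the class name, the fixed dimensionality of 13 (typical of MFCC features), and the distance-based scoring rule are illustrative assumptions, not the paper's exact construction.

import numpy as np

class MicroCluster:
    """Hypothetical one-pass speaker signature built from additive statistics."""
    def __init__(self, dim):
        self.n = 0                   # number of frames absorbed so far
        self.ls = np.zeros(dim)      # linear sum of feature vectors
        self.ss = np.zeros(dim)      # sum of squared feature values

    def add(self, x):
        """Absorb one feature vector in O(d) time, with no iterative retraining."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / max(self.n, 1)

    def variance(self):
        c = self.centroid()
        return self.ss / max(self.n, 1) - c * c

# Toy usage: build a signature from streaming frames, then score a new frame
# by a diagonal, variance-normalized squared distance to the signature.
rng = np.random.default_rng(0)
sig = MicroCluster(dim=13)
for _ in range(1000):
    sig.add(rng.normal(size=13))     # stand-in for incoming MFCC frames

frame = rng.normal(size=13)
score = np.sum((frame - sig.centroid()) ** 2 / (sig.variance() + 1e-8))
print(score)

Because the signature is just a few sums, it can be merged, stored in a small profile database, and updated at stream speed, which is the efficiency argument the abstract makes; its small parameter count is also what the abstract credits for reduced overtraining relative to a full GMM.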