Pattern discovery on character sets and real-valued data: Linear bound on irredundant motifs and an efficient polynomial time algorithm
Abstract
Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with `dont care' characters. The data could be a sequence of characters or sets of characters or even real values. In the first two cases, the number of motifs could potentially be exponential in the size of the input sequence and in the third case there could be infinite number of motifs (assuming two real numbers are equal if they are within some specified δ>0 of each other). However, it has been shown in some motif based applications, such as multiple sequence alignment [Par98, PFR99], that a small subset of these motifs capture all the relevant information. Also, from an information-theoretic viewpoint, the special motifs represent the sets with low co-relation, hence of higher informational value. In this paper, we show, for any sequence with n characters, there exists only a linear (or no more than 3 n) number of these special motifs and every other motif can be constructed from this set. We term these special motifs as irredundant motifs. We also present an efficient polynomial time algorithm to generate these motifs. This bound on the number of useful motifs gives validation to motif-based approaches, since the total number of irredundant motifs does not explode. This result is potentially of significance to most applications that use pattern discovery as the basic engine such as data mining, clustering, matching and others.