On nested palindromes in clickstream data
Abstract
In this paper we discuss an interesting and useful property of clickstream data. A visit often includes repeated views of the same page. We show that in three real datasets, sampled from the websites of technology and consulting groups and a news broadcaster, the majority of page repetitions occur in a very specific structure, namely nested palindromes. This can be explained by the widespread use of two features available in any web browser: the "refresh" and "back" buttons. Many of the pattern types that can be mined from sequence data either stumble when symbol repetitions are involved or fail to capture interesting aspects of those repetitions. In an attempt to remedy this, we characterize the palindromic structures and discuss possible ways of making use of them. One way is to pre-process the sequence data by explicitly inserting these structures, in order to obtain richer output from conventional mining algorithms. Another is to use the information directly to analyze certain aspects of the website under study. We also provide the simple linear-time algorithm that we developed to identify and extract the structures from our data. © 2012 ACM.
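To illustrate how "back" and "refresh" clicks can give rise to nested palindromes, the following is a minimal Python sketch. It is not the algorithm provided in the paper; it is a simple stack-based pass written under the assumption that a page equal to the current one signals a refresh and a page equal to the one visited just before the current one signals a back step. The function name `collapse_back_and_refresh` and the event labels are hypothetical.

```python
def collapse_back_and_refresh(visit):
    """Single left-to-right pass over one visit (a list of page ids).

    Returns (path, events):
      path   -- the pages remaining after undoing every detected back step
      events -- (position, kind, page) tuples, kind is "refresh" or "back"

    Assumption (not from the paper): a repeat of the current page is a
    refresh, and a repeat of the page just below it on the stack is a
    back step that abandons the current page.
    """
    stack = []   # current navigation path, most recent page last
    events = []
    for i, page in enumerate(visit):
        if stack and stack[-1] == page:
            # Same page viewed twice in a row: palindrome "A A".
            events.append((i, "refresh", page))
        elif len(stack) >= 2 and stack[-2] == page:
            # Pattern "A B A": the user left B and returned to A.
            abandoned = stack.pop()
            events.append((i, "back", abandoned))
        else:
            stack.append(page)
    return stack, events


if __name__ == "__main__":
    # "A B C B A" is a nested palindrome: C B ... B nested inside A ... A.
    path, events = collapse_back_and_refresh(["A", "B", "C", "B", "A", "D"])
    print(path)
    print(events)
```

Each page is pushed and popped at most once, so the pass runs in time linear in the length of the visit. Consecutive back events correspond to one palindrome nested inside another.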