Characterization of a large web site population with implications for content delivery
Abstract
This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate the potential performance benefits of content delivery networks (CDNs) for these Web sites, and validate our analysis for several sites by replaying our trace through a real cache. We make several observations about the current usage of Web technologies and Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher proportion of uncacheable responses and of responses that require mandatory cache validation. A significant reason for this is that cookie use is prevalent among our population, especially among the more popular sites. We found indications of widespread indiscriminate use of cookies, which unnecessarily impedes many content delivery optimizations. We also found that most Web sites do not use the cache-control features of the HTTP/1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is strongly constrained by the maximum values allowed in the Squid proxy. Thus, supplying explicit expiration information would significantly improve Web sites' cacheability. Finally, our simulation results indicate that while most Web sites benefit from the use of a CDN, the size of the benefit varies widely among the sites, which underscores the need for workload analysis tools. © Springer Science + Business Media, LLC 2006.
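
To illustrate the point about implicit versus explicit expiration, the following is a minimal sketch of how a cache might compute a response's freshness lifetime. It is not the procedure used in the study. The function name `freshness_lifetime`, the 20% last-modified factor, and the three-day cap are assumptions chosen to resemble a commonly shipped Squid catch-all rule (`refresh_pattern . 0 20% 4320`); the configuration actually used in the paper is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_lifetime(
    date: datetime,
    last_modified: datetime,
    max_age: Optional[int] = None,
    lm_factor: float = 0.2,                       # assumed heuristic factor (20%)
    max_implicit: timedelta = timedelta(days=3),  # assumed cap on implicit lifetime
) -> timedelta:
    """Return how long a cache may reuse a response without revalidating it."""
    # Explicit expiration (e.g. Cache-Control: max-age) takes precedence.
    if max_age is not None:
        return timedelta(seconds=max_age)

    # Implicit (heuristic) expiration: a fraction of the time since the
    # resource was last modified, clamped to the range [0, max_implicit].
    heuristic = (date - last_modified) * lm_factor
    return min(max(heuristic, timedelta(0)), max_implicit)


if __name__ == "__main__":
    now = datetime(2006, 1, 1, tzinfo=timezone.utc)
    year_old = now - timedelta(days=365)

    # A resource unchanged for a year is still limited by the implicit cap.
    print("implicit:", freshness_lifetime(now, year_old))
    # The same resource with an explicit 30-day max-age.
    print("explicit:", freshness_lifetime(now, year_old, max_age=30 * 24 * 3600))
```

Under these assumptions, a resource that has not changed for a year receives the capped implicit lifetime, whereas an explicit `max-age` lets the site choose a much longer one, which is the sense in which explicit expiration information improves cacheability.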