Publication
VLDB 2005
Conference paper

QoS-based data access and placement for federated systems

Abstract

A wide variety of applications require access to multiple heterogeneous, distributed data sources. By transparently integrating such diverse data sources, underlying differences in DBMSs, languages, and data models can be hidden, and users can use a single data model and a single high-level query language to access the unified data through a global schema.

To address the needs of such federated information systems, IBM has developed the DB2 Information Integrator (II) to provide relational access to both relational DBMSs and non-relational sources, such as file systems and web services. These data sources are registered at II as nicknames and thereafter can be accessed via wrappers. Statistics about the remote databases are collected and maintained at II for later use by the optimizer when costing query plans.

DB2 Information Integrator deploys cost-based query optimization to select a low-cost global query plan to execute. Thus, the cost functions used by II heavily influence which remote servers (i.e., equivalent data sources) are accessed and how federated queries are processed. Cost estimation is usually based on database statistics, query statements, and the local and remote system configuration, such as CPU power and I/O device characteristics. DB2 allows the system administrator to specify the expected network latency between II and the remote servers. However, existing cost functions do not consider (1) the load on the remote servers, (2) the dynamic nature of network latency between the remote servers and II, and (3) the availability of the remote sources. As a result, federated information systems cannot dynamically adapt to changes in the runtime environment, such as network congestion or load spikes at the remote sources. Also, since query plans are generated by a decision-making process based on cost alone, there are currently no mechanisms to avoid fast but unreliable sources.

Furthermore, II optimizes user queries individually rather than optimizing the workload as a whole, and it does not consider QoS goals. In some scenarios, queries must be distributed among servers to balance load and to differentiate QoS requirements for response time.
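The three runtime factors that the abstract says existing cost functions ignore (remote load, dynamic network latency, and availability) can be illustrated with a small sketch. This is not the paper's method or any DB2 Information Integrator API: the `RemoteSource` fields, the `adjusted_cost` formula, and the `cost_per_ms` parameter are all hypothetical, chosen only to show how a static cost estimate might be adjusted so that a fast but unreliable source stops being the default choice.

```python
from dataclasses import dataclass


@dataclass
class RemoteSource:
    """Hypothetical runtime view of one remote server (every field is an assumption)."""
    name: str
    static_cost: float    # optimizer estimate from statistics, arbitrary cost units
    load: float           # observed utilization, in [0, 1)
    latency_ms: float     # recently measured round-trip latency
    availability: float   # observed fraction of successful requests, in (0, 1]


def adjusted_cost(src: RemoteSource, cost_per_ms: float = 1.0) -> float:
    """Inflate the static estimate by load, add latency, and penalize unreliability."""
    if not 0.0 <= src.load < 1.0:
        raise ValueError("load must be in [0, 1)")
    if not 0.0 < src.availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    processing = src.static_cost / (1.0 - src.load)  # busier servers respond more slowly
    network = src.latency_ms * cost_per_ms           # convert latency into cost units
    # Dividing by availability approximates the expected cost when failures force retries.
    return (processing + network) / src.availability


def pick_source(candidates: list[RemoteSource]) -> RemoteSource:
    """Choose the equivalent source with the lowest adjusted cost."""
    return min(candidates, key=adjusted_cost)


# Illustrative values only: the statically cheaper source fails half of the time.
fast = RemoteSource("fast_but_flaky", static_cost=100.0, load=0.1,
                    latency_ms=5.0, availability=0.5)
steady = RemoteSource("steady", static_cost=120.0, load=0.2,
                      latency_ms=10.0, availability=0.99)
print(pick_source([fast, steady]).name)
```

With these made-up numbers the static estimate alone would favor the first source, while the adjusted cost favors the second. Any real system would need measured inputs and a calibrated formula rather than this simple one.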
