QoS-based data access and placement for federated systems
Abstract
A wide variety of applications require access to multiple heterogeneous, distributed data sources. By transparently integrating such diverse data sources, underlying differences in DBMSs, languages, and data models can be hidden, and users can use a single data model and a single high-level query language to access the unified data through a global schema.
To address the needs of such federated information systems, IBM has developed the DB2 Information Integrator (II) to provide relational access to both relational DBMSs and non-relational sources, such as file systems and web services. These data sources are registered at II as nicknames and can thereafter be accessed via wrappers. Statistics about the remote databases are collected and maintained at II for later use by the optimizer in costing query plans.
DB2 Information Integrator employs cost-based query optimization to select a low-cost global query plan to execute. Thus, the cost functions used by II heavily influence which remote servers (i.e., equivalent data sources) are accessed and how federated queries are processed. Cost estimation is usually based on database statistics, query statements, and the local and remote system configuration, such as CPU power and I/O device characteristics. DB2 allows the system administrator to specify the expected network latency between II and the remote servers. However, existing cost functions do not consider (1) the load on the remote servers, (2) the dynamic nature of network latency between the remote servers and II, and (3) the availability of the remote sources. As a result, federated information systems cannot dynamically adapt to changes in the runtime environment, such as network congestion or load spikes at the remote sources. Also, because query plans are generated by a purely cost-based decision process, there is currently no mechanism to avoid fast but unreliable sources.
Furthermore, II optimizes user queries individually rather than optimizing a workload as a whole, and it does not consider QoS goals. In some scenarios, queries must be distributed among servers to balance load and to differentiate QoS requirements for response time.
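To make the three missing factors concrete, the following is a minimal sketch of how a static cost estimate could be adjusted by server load, observed network latency, and availability. It is not II's cost model: the names `SourceState`, `adjusted_cost`, and `pick_source`, the use of milliseconds as a common cost unit, and the formula itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SourceState:
    """Hypothetical runtime observations for one remote source."""
    name: str
    static_cost_ms: float  # optimizer estimate from statistics and configuration
    load_factor: float     # 1.0 means unloaded; larger values mean a busier server
    latency_ms: float      # currently observed network latency to the source
    availability: float    # fraction of recent requests that succeeded, in (0, 1]


def adjusted_cost(s: SourceState, round_trips: int = 1) -> float:
    """Scale the static estimate by load, add latency, and penalize unreliability."""
    if not 0.0 < s.availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    runtime_cost = s.static_cost_ms * s.load_factor + s.latency_ms * round_trips
    # Assumption: a failed attempt costs as much as a successful one and is
    # retried, so the expected number of attempts is 1 / availability.
    return runtime_cost / s.availability


def pick_source(candidates: Iterable[SourceState]) -> SourceState:
    """Choose the equivalent source with the lowest adjusted cost."""
    return min(candidates, key=adjusted_cost)


fast_unreliable = SourceState("fast_unreliable", 100.0, 1.0, 10.0, 0.5)
slow_reliable = SourceState("slow_reliable", 150.0, 1.0, 20.0, 1.0)
chosen = pick_source([fast_unreliable, slow_reliable])
```

Under these assumed numbers, the source that looks cheaper on static cost alone loses once its 50% availability is taken into account, which is the "fast but unreliable" case described above.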