Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
Abstract
In TD-learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted ℓ_p-norm for each p in [1, ∞), with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g., Q^π(λ), Tree-Backup(λ), Retrace(λ), and Q-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first-known finite-sample guarantees for Q^π(λ), Tree-Backup(λ), and Retrace(λ), and improve the best-known bounds for Q-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
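To make the variance issue concrete, the following minimal sketch computes the per-step trace coefficients that replace the raw importance sampling ratio in several of the algorithms named above. It assumes the coefficient definitions commonly used in the literature (λ for Q^π(λ), λ·π(a|s) for Tree-Backup(λ), and λ·min(1, π(a|s)/μ(a|s)) for Retrace(λ)); the function name `trace_coefficients`, its arguments, and the method labels are hypothetical and not taken from the paper, whose notation may differ. Q-trace is omitted because its truncation levels are not specified in this section.

```python
import math


def trace_coefficients(pi, mu, lam, method):
    """Return the per-step trace coefficients along one trajectory.

    pi, mu : sequences of target- and behavior-policy probabilities of the
             actions actually taken (hypothetical inputs for illustration).
    lam    : the trace parameter lambda in [0, 1].
    method : one of the illustrative labels handled below.
    """
    coefficients = []
    for p, m in zip(pi, mu):
        ratio = p / m  # plain importance sampling ratio pi(a|s) / mu(a|s)
        if method == "importance_sampling":
            c = ratio                      # unmodified ratio, unbounded
        elif method == "q_pi_lambda":
            c = lam                        # assumed form: ignores the ratio
        elif method == "tree_backup":
            c = lam * p                    # assumed form: bounded by lam
        elif method == "retrace":
            c = lam * min(1.0, ratio)      # assumed form: ratio truncated at 1
        else:
            raise ValueError(f"unknown method: {method}")
        coefficients.append(c)
    return coefficients


if __name__ == "__main__":
    # Three steps where the target policy favors the taken action far more
    # than the behavior policy does.
    pi = [0.9, 0.9, 0.9]
    mu = [0.3, 0.3, 0.3]
    for name in ("importance_sampling", "q_pi_lambda", "tree_backup", "retrace"):
        product = math.prod(trace_coefficients(pi, mu, 1.0, name))
        print(name, product)
```

Under these assumed definitions, the product of plain ratios grows geometrically with the trajectory length (each ratio here is 3), whereas the modified coefficients keep the product bounded by 1. The price of this control is that the algorithm no longer targets the same fixed point in general, which is the bias-variance trade-off the abstract refers to.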