Practical efficiency of asynchronous stochastic gradient descent
Abstract
Stochastic gradient descent (SGD) and its distributed variants are essential to leverage modern computing resources for large-scale machine learning tasks. ASGD [1] is one of the most popular asynchronous distributed variant of SGD. Recent mathematical analyses have shown that with certain assumptions on the learning task (and ignoring communication cost), ASGD exhibits linear speed-up asymptotically. However, as practically observed, ASGD does not lead linear speed-up as we increase the number of learners. Motivated by this, we investigate finite time convergence properties of ASGD. We observe that the learning rate used by mathematical analyses to guarantee linear speed-up can be very small (and practically sub-optimal with respect to convergence speed) as opposed to practically chosen learning rates (for quick convergence) which exhibit sub-linear speed-up. We show that such an observation can in fact be supported by mathematical analysis, i.e., in the finite time regime, better convergence rate guarantees can be proven for ASGD with small number of learners, thus indicating lack of linear speed up as we increase the number of learners. Thus we conclude that even with ignoring communication cost, there is an inherent inefficiency in ASGD with respect to increasing the number of learners.