Conference paper
More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation