DeepActsNet: A deep ensemble framework combining features from face, hands, and body for action recognition
Abstract
Human action recognition from videos has gained substantial attention due to its wide applications in video understanding. Most existing approaches extract human skeleton data from videos to encode actions, because skeleton information is invariant to lighting conditions and background changes. Despite their success in achieving high recognition accuracy, methods based on a limited set of body joints fail to capture subtle body parts that are highly relevant for discriminating similar actions. In this paper, we overcome this limitation by presenting a holistic framework that combines spatial and motion features from the body, face, and hands into a novel data representation, termed “Deep Action Stamps (DeepActs)”, for video-based action recognition. Compared to skeleton sequences based on limited body joints, DeepActs encode more effective spatio-temporal features that are robust against pose estimation noise and improve action recognition accuracy. We also present “DeepActsNet”, a deep learning-based ensemble model which learns convolutional and structural features from Deep Action Stamps for highly accurate action recognition. Experiments on three challenging action recognition datasets (NTU60, NTU120, and SYSU) show that the proposed model achieves significant improvements in action recognition accuracy at lower computational cost than state-of-the-art methods.