Deep Ensemble Machine for Video Classification
Video classification has been extensively researched in computer vision due to its widespread applications. However, it remains a challenging task because of the difficulties in effective spatio-temporal feature extraction and efficient classification with high-dimensional video representations. To address these challenges, in this paper, we propose an end-to-end learning framework called deep ensemble machine (DEM) for video classification. Specifically, to establish effective spatio-temporal features, we propose using two deep convolutional neural networks (CNNs), i.e., Visual Geometry Group (VGG) and C3D, to extract heterogeneous spatial and temporal features for complementary representations. To achieve efficient classification, we propose ensemble learning based on random projections, which transforms high-dimensional features into a set of compact, lower-dimensional features in subspaces; an ensemble of classifiers is trained on these subspaces and combined with a weighting layer during backpropagation. To further enhance performance, we introduce rectified linear encoding (RLE), inspired by error-correcting output coding, to encode the initial outputs of the classifiers, followed by a softmax layer that produces the final classification results. DEM combines the strengths of deep CNNs and ensemble learning, establishing a new end-to-end learning architecture for more accurate and efficient video classification. We demonstrate the effectiveness of DEM through extensive experiments on four data sets covering diverse video classification tasks, including action recognition and dynamic scene classification. Results show that DEM achieves high performance on all tasks, with an improvement of up to 13% over the baseline model on the CIFAR-10 data set.
Action recognition, deep learning, dynamic scene classification, ensemble learning, random projection, rectified linear encoding (RLE), video classification
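The random-projection step described in the abstract can be sketched as follows. This is a minimal illustration, assuming Gaussian projection matrices and illustrative dimensions (4096-D input features, five 128-D subspaces); the paper's exact configuration may differ, and the function name `random_projections` is a hypothetical helper, not part of the proposed framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projections(features, n_members=5, subspace_dim=128):
    """Project (n_samples, d) features into n_members lower-dimensional subspaces.

    Each ensemble member gets its own random subspace; a separate classifier
    would then be trained on each projected view.
    """
    d = features.shape[1]
    views = []
    for _ in range(n_members):
        # Gaussian random matrix scaled by 1/sqrt(subspace_dim) so that
        # pairwise distances are roughly preserved (Johnson-Lindenstrauss style).
        R = rng.normal(0.0, 1.0 / np.sqrt(subspace_dim), size=(d, subspace_dim))
        views.append(features @ R)
    return views

# Example: 4096-D CNN features for 10 clips -> five 128-D compact views.
feats = rng.normal(size=(10, 4096))
views = random_projections(feats)
print(len(views), views[0].shape)
```

Each of the resulting compact views would feed one classifier of the ensemble, whose outputs are then combined by the weighting layer.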