Fine-grained Scheduling in FPGA-Based Convolutional Neural Networks
FPGAs have been considered a promising solution for accelerating Convolutional Neural Networks (CNNs) owing to their excellent energy efficiency and programmability. However, prior designs typically target inference only, since pre-trained models can be mapped to the hardware very efficiently; such approaches may not be suitable for training CNN models. In this paper, we propose FConv, in which the CPU and FPGA work together in a fine-grained manner. The FPGA accelerator in FConv uses a single Winograd-based convolver, which reduces design complexity and improves performance. We apply double buffering to the output routine to effectively overlap computation and data transfer, and we integrate multiple PEs to increase data parallelism. We also propose an analytical model for performance prediction, use it to guide task scheduling, and derive from it the performance upper bound of the current design. We evaluate our design on VGG-16 and DenseNet-40 using ImageNet and CIFAR-10. We achieve 262.43 GOP/s on VGG-16, which is 2.13× the performance of an FFT-based implementation on the same platform, and up to a 4× improvement over MKL running 20 threads on a 10-core Intel processor.
CNN, Fine-grained, FPGA, accelerator
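To illustrate the Winograd-based convolution mentioned in the abstract, the following is a minimal sketch of the classic F(2,3) transform in one dimension: it produces 2 outputs of a 3-tap filter using 4 multiplications instead of 6. The tile size and the exact transform the paper's convolver implements are assumptions, not taken from the text.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices (1-D case).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap sliding-window convolution over a 4-element tile.

    U = G g can be precomputed once per filter; only the elementwise
    product U * V costs multiplications per tile (4 instead of 6).
    """
    U = G @ g           # filter transform
    V = BT @ d          # input-tile transform
    return AT @ (U * V) # elementwise product, then inverse transform

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
print(winograd_f23(d, g))  # matches the direct result [6. 9.]
```

In a hardware convolver this saving compounds in 2-D (e.g. F(2x2, 3x3) needs 16 multiplications where the direct method needs 36), which is why a single Winograd convolver can keep design complexity low while sustaining high throughput.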