Fast GPU Convolution for CP-Decomposed Tensorial Neural Networks

Year
2020
Author(s)
A. Reustle and T. Rabbani and F. Huang
Source
Proceedings of the Intelligent Systems Conference (IntelliSys), 2020.

We present a new GPU operation for performing convolution with decomposed tensor products. We experimentally find execution times up to 4.85x faster than NVIDIA's cuDNN for some tensors. This is achieved by extending recent advances in the compression of CNNs via tensor decomposition of the filter parameter tensors. Progress in this direction had previously been limited by the lack of fast operations for computing the decomposed variants of critical functions such as 2D convolution. We interpret this and other operations as a network of compound convolutions and tensor contractions on the decomposed factors (i.e., generalized tensor operations). The prior approach evaluates such a network in a pairwise manner, composing functions from existing libraries such as cuDNN until the output has been recovered. The computational cost of this evaluation depends on the order in which the index sums are carried out, and varies between networks. The sequence of pairwise generalized tensor operations that minimizes the number of computations often produces large intermediate products, incurring performance bottlenecks when these are staged through the scarce global memory of modern GPUs. Our solution is a GPU-parallel algorithm that performs 2D convolution using filter tensors obtained through CP-decomposition, with minimal memory overhead. We benchmark the run-time performance of our algorithm for common neural-network filter sizes at multiple decomposition ranks. Compared against cuDNN on the corresponding uncompressed filters, our implementation is superior at lower ranks. We also propose a method for determining optimal sequences of pairwise tensor operations, achieving a minimal number of operations under memory constraints.
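To make the pairwise baseline concrete, the sketch below performs 2D convolution with a rank-R CP-decomposed filter as a chain of four small convolutions, following the standard CP-convolution scheme of Lebedev et al. This is the composed-library approach the abstract describes, not the paper's fused kernel; the function name `cp_conv2d` and the factor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cp_conv2d(x, a_t, a_s, a_h, a_w, padding=(1, 1)):
    """2D convolution with a rank-R CP-decomposed filter (illustrative sketch).

    The full filter K[t, s, h, w] is assumed to factor as
        K[t, s, h, w] = sum_r a_t[t, r] * a_s[s, r] * a_h[h, r] * a_w[w, r],
    so the convolution splits into four small pairwise steps:
    a 1x1 conv, two depthwise 1D convs, and a final 1x1 conv.

    x:   input of shape (N, S, H_in, W_in)
    a_t: (T, R), a_s: (S, R), a_h: (H, R), a_w: (W, R)
    """
    R = a_s.shape[1]
    # 1) 1x1 conv: contract the input-channel mode S down to rank R.
    x = F.conv2d(x, a_s.t().reshape(R, -1, 1, 1))
    # 2) Depthwise conv along the height axis, one 1D filter per rank component.
    x = F.conv2d(x, a_h.t().reshape(R, 1, -1, 1), padding=(padding[0], 0), groups=R)
    # 3) Depthwise conv along the width axis.
    x = F.conv2d(x, a_w.t().reshape(R, 1, 1, -1), padding=(0, padding[1]), groups=R)
    # 4) 1x1 conv: expand rank R back to the T output channels.
    return F.conv2d(x, a_t.reshape(-1, R, 1, 1))

# Example: a rank-8 stand-in for a 64x64x3x3 filter (shapes are hypothetical).
T, S, H, W, R = 64, 64, 3, 3, 8
factors = [torch.randn(d, R) for d in (T, S, H, W)]
y = cp_conv2d(torch.randn(1, S, 32, 32), *factors)
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Note that each intermediate activation of shape (N, R, H, W) must round-trip through GPU global memory between the four kernel launches; this is the intermediate-product bottleneck the abstract identifies and that the paper's single fused kernel is designed to avoid.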

The code is here.