Tuformer: Data-driven Design of Transformers for Improved Generalization or Efficiency

Year
2022
Type(s)
Conference paper
Author(s)
Xiaoyu Liu, Jiahao Su, Furong Huang
Source
The Tenth International Conference on Learning Representations (ICLR), 2022.
Url
https://openreview.net/forum?id=V0A5g83gdQ_

Transformers are neural network architectures that achieve remarkable performance in many areas. However, the core component of Transformers, multi-head self-attention (MHSA), is mainly derived from heuristics, and the interactions across its components are not well understood. To address this problem, we first introduce a mathematically rigorous yet intuitive tensor diagram representation of MHSA. Guided by tensor diagram representations, we propose a novel design, Tunable Transformers (Tuformers), which allows learnable weights across heads, whereas MHSA adopts pre-defined, fixed weights across heads, as explained in the paper. Tuformers naturally reveal a flexible design space in which a user, depending on their needs, can choose a structure with either improved performance (lower generalization error) or higher model efficiency. Any pre-trained Transformer can serve as an initialization of the corresponding Tuformer, with a trainable number of heads, for efficient training and fine-tuning. Tuformers universally outperform Transformers on various tasks across multiple domains under a wide range of model sizes, and their efficient variant in the design space maintains 93% of the performance using only 1.5% of the parameters.
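
To make the "learnable weights across heads" idea concrete, the sketch below recombines the outputs of the attention heads through a trainable mixing matrix instead of the fixed concatenation used by standard MHSA. It is only an illustrative PyTorch reading under stated assumptions: the class name `TunableMHSA`, the parameter `head_mix`, and the `rank` argument are hypothetical and do not reproduce the paper's exact tensor-diagram-derived parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TunableMHSA(nn.Module):
    """Minimal sketch of self-attention with learnable weights across heads
    (one possible reading of the Tuformer idea; names and shapes are assumptions).

    Standard MHSA concatenates its H head outputs with fixed, implicit one-hot
    weights; here the head outputs are recombined through a trainable H x R
    matrix, so the number of effective heads R becomes a tunable design knob.
    With R = H and an identity `head_mix`, the module reduces to ordinary MHSA,
    which is why a pre-trained Transformer can serve as an initialization.
    """

    def __init__(self, d_model: int, num_heads: int, rank: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads                 # input heads H
        self.r = rank                      # effective heads R (assumed name)
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable cross-head weights; plain MHSA fixes this to the identity.
        self.head_mix = nn.Parameter(torch.eye(num_heads, rank))
        self.out_proj = nn.Linear(rank * self.d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, N, D) -> (B, H, N, d_head)
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # (B, H, N, d_head)
        # Recombine information across heads with learnable weights.
        mixed = torch.einsum("bhnd,hr->brnd", heads, self.head_mix)
        mixed = mixed.transpose(1, 2).reshape(b, n, self.r * self.d_head)
        return self.out_proj(mixed)


# Usage: with rank == num_heads and the identity initialization above, this
# behaves like standard 8-head attention; shrinking `rank` trades capacity
# for parameter efficiency, mirroring the efficiency end of the design space.
x = torch.randn(2, 16, 64)
print(TunableMHSA(d_model=64, num_heads=8, rank=8)(x).shape)  # torch.Size([2, 16, 64])
```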