TY - GEN
T1 - Catformer: Designing Stable Transformers via Sensitivity Analysis
T2 - 38th International Conference on Machine Learning, ICML 2021
AU - Davis, Jared Quincy
AU - Gu, Albert
AU - Choromanski, Krzysztof
AU - Dao, Tri
AU - Ré, Christopher
AU - Finn, Chelsea
AU - Liang, Percy
N1 - Publisher Copyright:
Copyright © 2021 by the author(s)
PY - 2021
Y1 - 2021
AB - Transformer architectures are widely used, but training them is non-trivial, requiring custom learning rate schedules, scaling terms, residual connections, careful placement of submodules such as normalization, and so on. In this paper, we improve upon recent analysis of Transformers and formalize a notion of sensitivity to capture the difficulty of training. Sensitivity characterizes how the variance of activation and gradient norms changes in expectation when parameters are randomly perturbed. We analyze the sensitivity of previous Transformer architectures and design a new architecture, the Catformer, which replaces residual connections or RNN-based gating mechanisms with concatenation. We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab-30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other Transformers, including Gated Transformer-XL, the state-of-the-art architecture designed to address stability, by 13%.
UR - http://www.scopus.com/inward/record.url?scp=85161317351&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161317351&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85161317351
T3 - Proceedings of Machine Learning Research
SP - 2489
EP - 2499
BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021
PB - ML Research Press
Y2 - 18 July 2021 through 24 July 2021
ER -