TY - JOUR
T1 - One-Layer Transformer Provably Learns One-Nearest Neighbor In Context
AU - Li, Zihao
AU - Cao, Yuan
AU - Gao, Cheng
AU - He, Yihan
AU - Liu, Han
AU - Klusowski, Jason M.
AU - Fan, Jianqing
AU - Wang, Mengdi
N1 - Publisher Copyright:
© 2024 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2024
Y1 - 2024
AB - Transformers have achieved great success in recent years. Interestingly, they have shown particularly strong in-context learning capability: even without fine-tuning, they can solve unseen tasks well purely from task-specific prompts. In this paper, we study the capability of one-layer transformers to learn one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex, a single softmax attention layer trained with gradient descent can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms and sheds light on the role of softmax attention in transformer models.
UR - http://www.scopus.com/inward/record.url?scp=105000520198&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000520198&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105000520198
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Y2 - 9 December 2024 through 15 December 2024
ER -