TY - JOUR
T1 - A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
AU - Chu, Yanyi
AU - Yu, Dan
AU - Li, Yupeng
AU - Huang, Kaixuan
AU - Shen, Yue
AU - Cong, Le
AU - Zhang, Jason
AU - Wang, Mengdi
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature Limited 2024.
PY - 2024/4
Y1 - 2024/4
AB - The 5′ untranslated region (UTR), a regulatory region at the beginning of a messenger RNA (mRNA) molecule, plays a crucial role in regulating translation and affects the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduce a language model for 5′ UTRs, which we refer to as the UTR-LM. The UTR-LM is pretrained on endogenous 5′ UTRs from multiple species and is further augmented with supervised information, including secondary structure and minimum free energy. We fine-tuned the UTR-LM on a variety of downstream tasks. The model outperformed the best-known benchmark by up to 5% for predicting the mean ribosome loading, and by up to 8% for predicting the translation efficiency and the mRNA expression level. The model was also applied to identifying unannotated internal ribosome entry sites within the untranslated region and improved the area under the precision–recall curve from 0.37 to 0.52 compared with the best baseline. Further, we designed a library of 211 new 5′ UTRs with high predicted values of translation efficiency and evaluated them via a wet-laboratory assay. Experimental results confirmed that our top designs achieved a 32.5% increase in protein production level relative to well-established 5′ UTRs optimized for therapeutics.
UR - http://www.scopus.com/inward/record.url?scp=85189476531&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189476531&partnerID=8YFLogxK
DO - 10.1038/s42256-024-00823-9
M3 - Article
C2 - 38855263
AN - SCOPUS:85189476531
SN - 2522-5839
VL - 6
SP - 449
EP - 460
JO - Nature Machine Intelligence
JF - Nature Machine Intelligence
IS - 4
ER -