TY - GEN
T1 - Post-OCR Correction with OpenAI’s GPT Models on Challenging English Prosody Texts
AU - Zhang, James
AU - Haverals, Wouter
AU - Naydan, Mary
AU - Kernighan, Brian W.
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/8/20
Y1 - 2024/8/20
N2 - The digitization of historical documents faces challenges with the accuracy of Optical Character Recognition (OCR). Noting the success of large language models (LLMs) on many text-based tasks, this paper explores the potential of OpenAI’s GPT models (3.5-turbo, 4, 4-turbo) on the post-OCR correction task using works from the Princeton Prosody Archive (PPA), a full-text searchable database containing English texts published between 1559 and 1928 on versification and pronunciation. We conduct a comparative analysis across different model configurations and prompt strategies. Our results indicate that tailoring prompts with work metadata is less effective than anticipated, though adjusting the temperature parameter can be beneficial. The models tend to overcorrect works with already good OCR quality but perform well overall, with the best model setup improving the Character Error Rate (CER) by a mean of 18.92%. Additionally, after introducing a preliminary quality estimation step to process texts differently based on their original OCR quality, the best mean improvement increases to 38.83%.
KW - Error Correction
KW - GPT
KW - Historical Documents
KW - Large Language Models
KW - LLMs
KW - Optical Character Recognition
KW - Post-OCR Correction
UR - http://www.scopus.com/inward/record.url?scp=85206088338&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85206088338&partnerID=8YFLogxK
U2 - 10.1145/3685650.3685669
DO - 10.1145/3685650.3685669
M3 - Conference contribution
AN - SCOPUS:85206088338
T3 - DocEng 2024 - Proceedings of the 2024 ACM Symposium on Document Engineering
BT - DocEng 2024 - Proceedings of the 2024 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
T2 - 2024 ACM Symposium on Document Engineering, DocEng 2024
Y2 - 20 August 2024 through 23 August 2024
ER -