Post-OCR Correction with OpenAI’s GPT Models on Challenging English Prosody Texts

James Zhang, Wouter Haverals, Mary Naydan, Brian W. Kernighan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The digitization of historical documents faces challenges with the accuracy of Optical Character Recognition (OCR). Noting the success of large language models (LLMs) on many text-based tasks, this paper explores the potential of OpenAI’s GPT models (3.5-turbo, 4, 4-turbo) on the post-OCR correction task using works from the Princeton Prosody Archive (PPA), a full-text searchable database containing English texts published between 1559 and 1928 on versification and pronunciation. We conduct a comparative analysis across different model configurations and prompt strategies. Our results indicate that tailoring prompts with work metadata is less effective than anticipated, though adjusting the temperature parameter can be beneficial. The models tend to overcorrect works with already good OCR quality but perform well overall, with the best model setup improving the Character Error Rate (CER) by a mean of 18.92%. Additionally, after introducing a preliminary quality estimation step to process texts differently based on their original OCR quality, the best mean improvement increases to 38.83%.
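The CER figures above can be made concrete with a small sketch. The abstract does not give the exact formulas, so two things here are assumptions: CER is taken as the Levenshtein edit distance divided by the length of the ground-truth text, and "improvement" is taken as the relative reduction in CER from the original OCR to the model output. The function names and the sample strings are hypothetical and are not drawn from the paper or the PPA.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_improvement(cer_before: float, cer_after: float) -> float:
    """Assumed definition: percentage reduction in CER.

    A negative value means the 'correction' made the text worse,
    which is how overcorrection of already clean OCR would show up.
    """
    if cer_before == 0:
        raise ValueError("undefined when the original OCR has no errors")
    return (cer_before - cer_after) / cer_before * 100


# Hypothetical example strings.
reference = "the quick brown fox"
ocr_text = "tbe qu1ck brovvn fox"
corrected = "the quick brovvn fox"

before = cer(ocr_text, reference)
after = cer(corrected, reference)
print(f"CER before: {before:.4f}, after: {after:.4f}")
print(f"Relative improvement: {relative_improvement(before, after):.2f}%")
```

Under these assumptions, a mean improvement such as the reported 18.92% would be the average of this per-text percentage over the evaluated works. If the paper instead averages absolute CER differences, the numbers would not be comparable to this sketch.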

Original language: English (US)
Title of host publication: DocEng 2024 - Proceedings of the 2024 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400711695
State: Published - Aug 20 2024
Event: 2024 ACM Symposium on Document Engineering, DocEng 2024 - San Jose, United States
Duration: Aug 20 2024 to Aug 23 2024

Publication series

Name: DocEng 2024 - Proceedings of the 2024 ACM Symposium on Document Engineering

Conference

Conference: 2024 ACM Symposium on Document Engineering, DocEng 2024
Country/Territory: United States
City: San Jose
Period: 8/20/24 to 8/23/24

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems
  • Software

Keywords

  • Error Correction
  • GPT
  • Historical Documents
  • Large Language Models
  • LLMs
  • Optical Character Recognition
  • Post-OCR Correction
