The Old Bailey and OCR: Benchmarking AWS, Azure, and GCP with 180,000 Page Images

William Ughetta, Brian W. Kernighan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The Proceedings of the Old Bailey is a corpus of over 180,000 page images of court records printed from April 1674 to April 1913 and presents a comprehensive challenge for Optical Character Recognition (OCR) services. The Old Bailey is an ideal benchmark for historical document OCR, representing more than two centuries of variations in documents, including spellings, formats, and printing and preservation qualities. In addition to its historical and sociological significance, the Old Bailey is filled with imperfections that reflect the reality of coping with large-scale historical data. Most importantly, the Old Bailey contains human transcriptions for each page, which can be used to help measure OCR accuracy. Since humans do make mistakes in transcriptions, the relative performance of OCR services will be more informative than their absolute performance. This paper compares three leading commercial OCR cloud services: Amazon Web Services's Textract (AWS); Microsoft Azure's Cognitive Services (Azure); and Google Cloud Platform's Vision (GCP). Benchmarking involved downloading over 180,000 images, executing the OCR, and measuring the error rate of the OCR text against the human transcriptions. Our results found that AWS had the lowest median error rate, Azure had the lowest median round trip time, and GCP had the best combination of a low error rate and a low duration.
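The abstract does not say which error metric was used, so the following is only a minimal sketch of how an error rate against human transcriptions and a per-service median might be computed. It assumes a character-level Levenshtein distance normalized by the length of the human transcription. The function names (`edit_distance`, `error_rate`, `median_error_rate`) are hypothetical and not taken from the paper.

```python
from statistics import median


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ocr_text: str, reference: str) -> float:
    """Assumed metric: edit distance divided by the reference length."""
    if not reference:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the OCR output is also empty.
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, reference) / len(reference)


def median_error_rate(pairs) -> float:
    """Median page-level error rate over (ocr_text, reference) pairs."""
    return median(error_rate(ocr, ref) for ocr, ref in pairs)
```

Because the human transcriptions contain mistakes of their own, a number computed this way is best read as a relative score for comparing services on the same pages, not as an absolute accuracy.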

Original language: English (US)
Title of host publication: Proceedings of the ACM Symposium on Document Engineering, DocEng 2020
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450380003
State: Published - Sep 29 2020
Externally published: Yes
Event: 20th ACM Symposium on Document Engineering, DocEng 2020 - Virtual, Online, United States
Duration: Sep 29 2020 to Oct 1 2020

Publication series

Name: Proceedings of the ACM Symposium on Document Engineering, DocEng 2020

Conference

Conference: 20th ACM Symposium on Document Engineering, DocEng 2020
Country/Territory: United States
City: Virtual, Online
Period: 9/29/20 to 10/1/20

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Keywords

  • Amazon Web Services
  • Google Cloud Platform
  • Historical Documents
  • Microsoft Azure
  • Old Bailey
  • Optical Character Recognition
