This essay discusses an ongoing project to train an optical character recognition (OCR) system on medieval manuscripts: specifically, the OCR engine Kraken, which we trained to transcribe early-fifteenth-century Middle English manuscripts. Our current model, trained on Scribe D's handwriting, has a 97 percent training accuracy rate and transcribes unseen manuscripts with accuracy rates ranging from 27 to 86 percent. Our project adds to the growing number of successful experiments in training OCR on medieval manuscripts, a technology that could have an immense impact on medieval studies.
The primary concern of this essay is not our specific results but the challenges we faced when preparing our training data and the decisions we made accordingly. In particular, we compare the diplomatic transcriptions required by our software to the semi-diplomatic transcriptions that medievalists usually create. We argue that technical constraints such as the use of diplomatic transcriptions in OCR might encourage medievalists to evaluate how we typically remediate manuscripts (that is, transfer them from one medium to another). Considering the potential scope and scalability of this technology, we argue that it is important to consider our training data (human-made transcriptions) carefully, as is the case for any machine-learning project. But we also argue that machine learning offers a useful framework for understanding how we manipulate manuscript data in any kind of remediation.