Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books
- portal: Libraries and the Academy
- Johns Hopkins University Press
- Volume 15, Number 1, January 2015
- pp. 59-91
- DOI: 10.1353/pla.2015.0005
The electronic conversion of scanned image files to readable text using optical character recognition (OCR) software and the subsequent migration of raw OCR text to e-book text file formats are key remediation or media conversion technologies used in digital repository e-book production. Despite real progress, the OCR problem of reliability and accuracy in OCR-derived e-book text and metadata persists. This paper examines a selection of digitized e-books in several prominent digital repositories and discusses the impact of OCR technology on e-book text file formats, metadata, and the online reading experience.