Workshop recap: How Does HKBU Library Preserve Vintage Documents Using OCR?

04 Sunday Mar 2018

Tags

data collection, Digitalization, library, OCR, Scan

Technology has changed the way we research and read since the Internet became the main platform for releasing news and information. Documents and publications from the pre-digital era remain invaluable, especially for reference and for learning history. Yet these resources exist only on paper, which does not fit today’s mode of information processing. To digitalise such old documents, four students from Baptist University (BU) learned about the techniques and software involved at an Optical Character Recognition (OCR) workshop.

Using an OCR machine to preserve information from old documents.

1. Scanning

Hard copies of old documents and publications first have to be scanned into a computer before recognition can proceed. A BU staff member introduced an unusually large scanner, a Zeutschel OS 12000. With its wide platen, items to be scanned are no longer limited to A3 size. Besides loose paper, the machine can scan paintings, magazines and even thick books.

Zeutschel, a German-brand scanner.

The wide platen of the scanner, where items to be scanned are placed.

The system identifies the scanned items and displays them on a monitor, where the user can change the image format and save immediately. The staff advised saving the file in TIFF format at a resolution of at least 400 DPI to keep the image clear.

Save the scanned document in TIFF format at a resolution of 400 DPI.
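To get a feel for what the 400 DPI advice means in practice, here is a minimal sketch that converts a page size into pixel dimensions and an uncompressed file size. The function names are made up for illustration, and the 3 bytes per pixel (24-bit colour) is an assumption, not a setting mentioned in the workshop.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm, height_mm, dpi):
    """Return (width, height) in pixels for a page scanned at the given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def uncompressed_bytes(width_px, height_px, bytes_per_pixel=3):
    """Rough size of an uncompressed image; 3 bytes per pixel assumes 24-bit colour."""
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # An A3 sheet is 297 mm x 420 mm.
    w, h = pixel_dimensions(297, 420, 400)
    size_mib = uncompressed_bytes(w, h) / 2**20
    print(f"A3 at 400 DPI: {w} x {h} px, roughly {size_mib:.0f} MiB uncompressed")
```

A larger item or a higher resolution grows the file quickly, which is worth keeping in mind when planning storage for a whole volume.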

Scanning has to be done with care since old materials are fragile. Before scanning old items, the staff take measures to sanitise them: to get rid of any remains of small insects, they put the items into a freezer for a period of time. The staff also gave advice on handling brittle paper.

The handle that controls the glass covering the items on the platen.

“You try not to flip each page more than one time since they may scatter into pieces,” the instructor said.

2. Processing documents for preservation (Optical Character Recognition)

To best suit the library’s needs, the staff use a software package named ABBYY FineReader to process the scanned files, turning PDFs and images into searchable, editable files and text.

2.1 Open and Save

a. Open the PDF or image you scanned before (an uncompressed PDF works best).

b. Before proofreading, save the FineReader Document into the folder where it should be.

c. After proofreading, export TXT per page (create a separate file for each page) and PDF/A per volume (create a single file for all pages).

Open and save
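The "one text file per page" convention in step c can be sketched in a few lines. This is only an illustration of the file layout: the function name and the file-naming pattern are hypothetical, and the per-volume PDF/A is produced by FineReader itself, so it is not covered here.

```python
from pathlib import Path


def export_pages_as_text(page_texts, out_dir, volume_name):
    """Write one UTF-8 text file per page and return the list of paths.

    The naming pattern <volume>_p0001.txt is an assumption for illustration.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(page_texts, start=1):
        path = out / f"{volume_name}_p{number:04d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    pages = ["Text of the first page", "Text of the second page"]
    for p in export_pages_as_text(pages, "export_demo", "vol1"):
        print(p)
```

Zero-padded page numbers keep the files in page order when a folder is sorted by name.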

2.2 Processing

a. Set the default languages.

b. Set the area properties (order and direction of text, etc.).

2.3 Draw the areas

a. Select the text areas.

b. Delete all selected picture areas (the red boxes).

c. Use table areas for tables and complicated structures.

d. Check the order of the areas and make sure it is consistent with the text.

Draw the areas
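Checking the order of areas (step d) amounts to asking whether the boxes are numbered the way a reader would read them. The sketch below is not how FineReader works internally; it is a simplified illustration with a hypothetical Area type, assuming horizontal text reads top to bottom then left to right, and vertical Chinese text reads in columns from right to left. Real multi-column layouts need more care than this.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """A rectangular text area on the page, in pixel coordinates (hypothetical type)."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def reading_order(areas, vertical_rtl=False):
    """Return areas in a simple expected reading order.

    vertical_rtl=True: columns from right to left, then top to bottom.
    Otherwise: top to bottom, then left to right.
    """
    if vertical_rtl:
        return sorted(areas, key=lambda a: (-a.right, a.top))
    return sorted(areas, key=lambda a: (a.top, a.left))


if __name__ == "__main__":
    page = [
        Area("body", 0, 100, 400, 900),
        Area("title", 0, 0, 900, 80),
        Area("caption", 500, 100, 900, 300),
    ]
    print([a.name for a in reading_order(page)])
```

Comparing this expected order with the order assigned in the software is one quick way to spot areas that would be exported out of sequence.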

2.4 Image editors

a. Use the preprocessing tools such as Deskew and Straighten Text.

2.5 Style editor

a. Chinese font: SimHei

b. Size: 8-12

c. Merge default style

2.6 Content (check these things)

a. Name and title

b. Table of contents

c. Date

2.7 Export

a. Export three files (text, PDF and the ABBYY FineReader format).

3. Work done so far

The HKBU Library has conducted various projects such as Preservation for the Documentation of Chinese Christianity, which aims to preserve and make accessible books, periodicals, reports and archival materials that document Chinese Christianity through digitization and microfilming. Typical display methods for these digital publications include searchable PDF, which can be produced by OCR, and flipbook-like display through software such as FlippingBook. Some digital archives are also stored as databases that users can search by different conditions.

Preservation for the Documentation of Chinese Christianity, one of the document-preservation schemes launched by HKBU.

OCR scanners can turn scanned documents into a searchable PDF format.

Considering different user groups and the nature of different projects, readers can choose the best way to display the content.

Remarks from lecturer

It is our pleasure to learn that the HKBU Library’s work overlaps a lot with our data journalism process. Data collection is the very first step in data journalism production. It largely determines what news one can or cannot find, especially once everyone in the industry is equipped with data analysis and data visualisation skills. We often find government records in the form of scanned PDFs, or even as printed documents in their libraries. Scanning those documents and applying OCR can help journalists turn the records into a digital, searchable format. Alternatives for the scanning step are easy to find: most labs have desktop scanners, or one can take photos with a cell phone. Evernote can use computer graphics algorithms to polish photos so that they look as if they came from a scanner. The OCR step is more heavy duty and requires professional software, e.g. ABBYY FineReader, which the HKBU Library has adopted. Tech-savvy users can check out tesseract-ocr, the widely used open-source library and command-line tool, which has 16K+ stars on GitHub as of this writing.

— Pili Hu (Mar 4, 2018)
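For readers who want to try the tesseract-ocr route mentioned above, here is a minimal sketch that builds a command line and runs it only if the tool is installed. The flags shown (-l for languages, --dpi, and the txt and pdf output configs) are as I understand them for Tesseract 4 and later, so please verify against tesseract --help on your machine. The file names and the chi_tra+eng language choice are placeholders, and the language data must be installed separately.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, languages=("eng",), dpi=400, make_pdf=True):
    """Build a tesseract command line as a list of arguments.

    Produces <output_base>.txt and, if make_pdf is True, a searchable <output_base>.pdf.
    """
    cmd = [
        "tesseract", str(image_path), str(output_base),
        "-l", "+".join(languages),   # e.g. chi_tra+eng
        "--dpi", str(dpi),           # assumed scan resolution
        "txt",
    ]
    if make_pdf:
        cmd.append("pdf")
    return cmd


if __name__ == "__main__":
    image = Path("page_0001.tif")  # placeholder file name
    cmd = tesseract_command(image, "page_0001", languages=("chi_tra", "eng"))
    print(" ".join(cmd))
    # Only run when both the tool and the input file are actually present.
    if shutil.which("tesseract") and image.exists():
        subprocess.run(cmd, check=True)
```

Building the command separately from running it makes it easy to review the settings for a batch of pages before any OCR starts.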

Points to consider during scanning:

Lighting
Folding
Resolution (400dpi for OCR, 600dpi for artworks)

Points to consider during OCR:

Orientation of texts
Handling of different typefaces, especially for Chinese characters
Whether to recognise characters in pictures (e.g. a photo of a banner in a news story)
Check and manually fix wrong recognitions (accuracy)
- Title, name and author are high-priority information
Label, break or merge areas of texts
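One way to put a number on the accuracy point above is the character error rate: the edit distance between the OCR output and a manually proofread version, divided by the length of the proofread text. The sketch below is a plain-Python illustration with made-up sample strings; it was not part of the workshop.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the proofread reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    proofread = "浸會大學"   # made-up sample: the manually corrected text
    recognised = "浸会大學"  # made-up sample: OCR output with one wrong character
    print(character_error_rate(proofread, recognised))
```

Measuring a few proofread sample pages this way can help decide how much manual checking the rest of a volume needs.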

Text / Celia Lai, Maggie Liu, Ivy Wang

Photos / Maggie Liu, Ivy Wang, Erin Chan

Editor / Pili Hu

