[: en] Creating the corpus: from image to text 1 [:]

[: pl]

In a series of short entries, we would like to present the technical side of preparing our corpus. For us, the information gathered here will serve as project documentation, but we hope it will also be useful to beginners and to somewhat more experienced users, and that it will encourage discussion. We invite anyone interested to get in touch.

The starting point for creating a corpus is, of course, careful planning. Once we have decided which texts to include, we can turn the scanned sources into text that serves as the basis for further processing.

Where do we get the images of the sources?

In our project, we obtain images mainly from Polish digital libraries. When a source we need is not available there, we scan it ourselves.

The easiest way to search Polish digital libraries is with the FBC search engine, while Europeana is also useful for Polish sources held in foreign libraries. We use archive.org and Google Books less often: the files available there are not always suitable for later processing. And no, this is not only about the famous artifacts.

What do we do with the scans next?

1. From PDF / DJVU to TIFF

Conversion to *.tiff image files is handled by Unix tools:

  • extract page images from PDF files with pdfimages
  • convert DJVU files to images with the ddjvu command
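
The two conversions listed above can be sketched in Python as thin wrappers around the command-line tools. This is only a minimal sketch: it assumes that pdfimages (from Poppler) and ddjvu (from DjVuLibre) are installed, and the function names and file paths are hypothetical. The exact options used in the project may differ.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdf_to_tiff_cmd(pdf_path: str, out_prefix: str) -> list[str]:
    """Build a pdfimages call that saves the images embedded in a PDF as TIFF."""
    # -tiff switches the output format from the default PBM/PPM to TIFF;
    # pdfimages numbers the files itself, starting from out_prefix.
    return ["pdfimages", "-tiff", pdf_path, out_prefix]


def djvu_to_tiff_cmd(djvu_path: str, out_path: str) -> list[str]:
    """Build a ddjvu call that renders a DjVu document to a TIFF file."""
    return ["ddjvu", "-format=tiff", djvu_path, out_path]


def conversion_cmd(source: str, out_dir: str) -> list[str]:
    """Pick the right tool based on the file extension (hypothetical helper)."""
    src = Path(source)
    out = Path(out_dir)
    suffix = src.suffix.lower()
    if suffix == ".pdf":
        return pdf_to_tiff_cmd(str(src), str(out / src.stem))
    if suffix in (".djvu", ".djv"):
        return djvu_to_tiff_cmd(str(src), str(out / f"{src.stem}.tiff"))
    raise ValueError(f"unsupported source format: {src.suffix!r}")


def convert(source: str, out_dir: str) -> None:
    """Run the conversion; raises CalledProcessError if the tool fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(conversion_cmd(source, out_dir), check=True)
```

Calling `convert("scans/book.djvu", "tiff")` would, under these assumptions, write the rendered pages to `tiff/book.tiff`. Building the command separately from running it makes it easy to inspect before anything is executed.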

2. Image optimization

Although OCR programs often include image optimization tools, and there are also useful scripts for this purpose, in our work we use ScanTailor. We launch the program from the command line, but it has a fairly intuitive interface, so even less experienced users should have no trouble with it.

ScanTailor window: splitting pages
ScanTailor: Correct orientation
ScanTailor: Margin Selection

3. From image to text

In previous years, we used the most popular commercial program, ABBYY FineReader, to recognize the text. In the new edition, we decided to use only free software, which matches paid solutions in quality and often surpasses them. Our choice (there will be an opportunity to write about the reasons for it) fell on Tesseract, which is backed by Google and uses deep learning algorithms. You can, of course, train Tesseract yourself, but ready-to-use data for over 130 languages and 35 scripts can be downloaded from Linux repositories and from the project's website.

For us, as you might guess, what matters most is language support.

Most importantly, Tesseract copes well with multilingual text, and our sources have no shortage of it:

Pommersches Urkundenbuch vol. II 1: the regest (summary) is written in German in Fraktur, the Latin text is set in Antiqua
Akta grodzkie i ziemskie, vol. 9: the regest is written in Polish, the text in Latin
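
A multilingual run of the kind described above could be launched as in the sketch below. It assumes that the tesseract executable is installed together with trained data for the chosen languages. The language codes (`lat`, `deu`, `pol`), the function names, and the file names are illustrative assumptions, not our exact configuration. Languages are combined with `+` after the `-l` option, and the trailing `txt` and `hocr` arguments request plain-text and hOCR output.

```python
from __future__ import annotations

import subprocess


def tesseract_cmd(
    image: str,
    out_base: str,
    languages: list[str],
    outputs: tuple[str, ...] = ("txt", "hocr"),
) -> list[str]:
    """Build a tesseract call for one page image (hypothetical helper).

    languages: trained-data codes, e.g. ["lat", "deu"]; joined with "+".
    outputs:   output configs; "txt" and "hocr" yield out_base.txt / out_base.hocr.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image, out_base, "-l", "+".join(languages), *outputs]


def recognize(image: str, out_base: str, languages: list[str]) -> None:
    """Run tesseract; raises CalledProcessError if recognition fails."""
    subprocess.run(tesseract_cmd(image, out_base, languages), check=True)
```

For a page with a German regest and a Latin text, `recognize("page_0001.tif", "page_0001", ["lat", "deu"])` would be one plausible call. Which trained model works best for Fraktur print is a separate question that this sketch does not address.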

We save the recognized text in two formats: the familiar TXT and hOCR, which also stores information about the text blocks recognized by the OCR program, the position of the text on the page, and so on. Why is this information important to us? More on that ...

... in the next episode of our series

  • hOCR, PAGE XML and other creatures
  • what is PoCoTo and what is it for?
  • how to transcribe with Transkribus

 [:]

Krzysztof Nowak