Blog post written by Mikel Iruskieta (member of the CLARIN Knowledge Sharing Infrastructure Committee) describing a sample research case that was successfully conducted thanks to the services offered by the IMPACT Centre of Competence, a CLARIN K-centre in digitisation.
IMPACT Centre of Competence
Note: The work described in this blog post was undertaken thanks to the collaboration with the IMPACT Centre of Competence (www.digitisation.eu).
Digitization of archival and historical material can be problematic for researchers due to a number of issues. One such issue is the presence of gaps and empty spaces around and in between text. This became apparent in my recent analysis of the most frequent words in Pulgarcito, a Cuban illustrated literary journal for children written between 1919 and 1920. Pulgarcito was digitized and is available online: http://imagenes.sld.cu/download/pulgarcito/volumen-2.pdf.
The journal contains rich material for children: drawings, photographs, fairy tales, comic strips, legends, poems, fables, anecdotes and paintings. The material is particularly interesting because the text throughout the publication is typed, handwritten and drawn, which makes the digitized PDF quite challenging to process.
The aim of the task was to apply text analysis tools to an image-based digitized book whose texts also include handwritten passages.
I first tried a commercial OCR product, but the results were very poor, so I decided to approach the IMPACT Centre of Competence (www.digitisation.eu) for support. I needed an OCR system that would deliver good quality text recognition results in a machine-readable format (e.g. XML, TXT). They promptly answered my enquiry, and just over a week later I received the journal in plain text format and in XML (both with some OCR errors).
The example below illustrates the conversion of a handwritten and typed text from the PDF into plain text format. Have a look at how many different “A”s can be found on this page:
CUANDO UN N1NO
Most of the sentences were recognized with some errors. For example, the word “NIÑO” (child) was identified as “N1NO”, and the words “UN RETRATO” (a picture) were not split and resulted in “UMBETRAfO”. Finally, the last line was not detected at all.
As expected, better results were given where the text was typed. The example below illustrates it well.
The results above were obtained by IMPACT using the following methods:
- All images were extracted from the PDF using the pdfimages tool on Linux.
- The digitization was done with the FineReader 11 SDK version.
- The FineReader 11 SDK OCR was configured for Spanish and for different types of letters, with normal and handprinted recognition, and produced output in ALTO XML and Text Unicode Defaults.
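The image extraction step can be sketched as follows. This is only a minimal illustration, assuming that pdfimages from poppler-utils is installed; the post does not say which options IMPACT used, so the PNG output flag, the `page` file prefix and both function names are assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_prefix, image_format="png"):
    """Return the argument list for a pdfimages call.

    The output format flag is an assumption; the options actually
    used by IMPACT are not documented in the post.
    """
    return ["pdfimages", f"-{image_format}", str(pdf_path), str(out_prefix)]


def extract_images(pdf_path, out_dir):
    """Extract every embedded image of the PDF into out_dir."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) is not installed")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output files <prefix>-000.png, <prefix>-001.png, ...
    command = build_pdfimages_command(pdf_path, out_dir / "page")
    subprocess.run(command, check=True)
    return sorted(out_dir.glob("page-*"))
```

Calling `extract_images("volumen-2.pdf", "pulgarcito_images")` on a local copy of the journal would write one image file per embedded image, ready to be passed to an OCR engine.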
Once we had the digitized publication in TXT format, we used ANALHITZA (Otegi et al. 2017), a tool created in collaboration with the Spanish CLARIN K-centre that extracts words and their frequencies, identifies proper nouns (NERC) and extracts word sequences (n-grams), among other things.
The text analysis results were as follows:
Freq. | Nouns | Freq. | Adjectives
255 | niño | 160 | bueno
194 | año | 124 | gran
159 | hombre | 99 | grande
154 | día | 75 | nuevo
148 | padre | 62 | viejo
148 | rey | 57 | blanco
134 | hijo | 51 | pobre
131 | vez | 48 | mayor
114 | libro | 48 | largo
106 | casa | 45 | azul
103 | tiempo | 44 | mejor
A sample of the NERC output (LOC means “location”, PER stands for “person”):
Freq. | W1 | Type
8 | alemania | LOC
2 | dinamarca | LOC
2 | alejandro | PER
1 | 16 de mayo de 1703 | DATE
1 | cataluña | LOC
We then extracted the most frequent bigrams (“P” for preposition, “D” determiner, “C” connector, “V” verb, “N” noun):
Freq. | w1 | Cat | w2 | Cat
846 | de | P | el | D
692 | en | P | el | D
565 | a | P | el | D
388 | y | C | el | D
245 | de | P | su | D
229 | por | P | el | D
226 | el | D | que | Q
224 | todo | D | el | D
206 | con | P | el | D
204 | a | P | su | D
202 | que | C | el | D
201 | de | P | uno | D
165 | ser | V | el | D
151 | el | D | niño | N
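For readers who want to reproduce counts of this kind without ANALHITZA, word and bigram frequencies can be approximated with a few lines of Python. This is a rough sketch, not ANALHITZA's method: the tables above appear to list lemmas with part-of-speech categories, whereas this example counts surface forms only and assigns no categories. The function names and the tokenization pattern are assumptions made for illustration.

```python
import re
from collections import Counter

# Runs of letters only, including Spanish accented vowels, ü and ñ.
TOKEN_PATTERN = re.compile(r"[a-záéíóúüñ]+")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def most_frequent_words(text, n=10):
    """Return the n most frequent word forms with their counts."""
    return Counter(tokenize(text)).most_common(n)


def most_frequent_bigrams(text, n=10):
    """Return the n most frequent pairs of adjacent words.

    Pairs are built over the whole token stream, so they may
    span sentence boundaries.
    """
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:])).most_common(n)
```

Because OCR errors such as “N1NO” break a word into different tokens, counts obtained this way on uncorrected text will underestimate the true frequencies.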
We then used Voyant Tools (Sinclair and Rockwell, 2016) to obtain a more user-friendly visual representation of the data. The result was a word cloud of the entire book:
A further analysis of the word "niña" in context (Key Word in Context, or KWIC), extracted with Voyant Tools, can be used to show how girls were characterized in 1920 or to study the gender agreement (feminine) between the article and the noun:
Left | Term | Right
tenía, a su vez, una | niña | , que era dulce y bon
las excelentes cualidades de aquella | niña | . La encomendó las tareas más
pies a cabeza. La pobre | niña | todo lo sufría con paciencia
g n ■w- canzaria. La | niña | perdió uno de sus zapatos
meses regalaremos al niño o | niña | que mayor número de ellas
ha pensado mucho en la | niña | ! El dice que siempre que
y escribe mejor- Y la | niña | se va, se va despacio
tropieza con todo! Pero la | niña | no se ha des- pertado
de olor: y es una | niña | de sombrero colorado, que trae
hoy en casa por mi | niña | ”, le dijo su padre, “y
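The idea behind a KWIC listing like the one above can be sketched in Python. This is a simplified illustration and not how Voyant Tools is implemented; the `kwic` function, its `width` parameter and the tokenization rule are assumptions made for this example.

```python
import re


def kwic(text, term, width=5):
    """Return (left, term, right) tuples with `width` tokens of context.

    Words and punctuation marks are treated as separate tokens.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows
```

For instance, `kwic("La pobre niña todo lo sufría con paciencia.", "niña", width=2)` returns one row with “La pobre” on the left and “todo lo” on the right.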
The analysis described above shows that the extracted text still contains many errors, so it should be carefully checked and corrected to obtain more reliable data. The overall task was nevertheless very fast and efficient, and it made it possible to ask interesting research questions. The next step is to follow the Programming Historian lessons and see whether the text can be cleaned of OCR errors using regular expressions (Turner-O'Hara 2013):
https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
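As a first idea of what such a cleaning step might look like, the sketch below targets two error types visible in the samples above: words split by a hyphen and a space (“des- pertado”) and the digit 1 read in place of a capital I (“N1NO”). These two rules are assumptions chosen for illustration; they are not taken from the Programming Historian lesson, and they would need to be checked against the whole text before being applied.

```python
import re

# A word split by a line-end hyphen: "des- pertado" -> "despertado".
# Only lowercase letters are allowed on both sides, so a dash that
# precedes a capital letter ("mejor- Y") is left untouched.
SPLIT_WORD = re.compile(r"([a-záéíóúüñ]+)- ([a-záéíóúüñ]+)")

# The digit 1 between two capital letters, as in "N1NO".
DIGIT_ONE = re.compile(r"(?<=[A-ZÁÉÍÓÚÜÑ])1(?=[A-ZÁÉÍÓÚÜÑ])")


def clean_ocr(text):
    """Apply both heuristic corrections and return the cleaned text."""
    text = SPLIT_WORD.sub(r"\1\2", text)
    text = DIGIT_ONE.sub("I", text)
    return text
```

Note that “N1NO” becomes “NINO” rather than “NIÑO”: the missing tilde cannot be recovered by a pattern like this one and would still require a dictionary lookup or manual correction.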
Otegi, A., Imaz, O., Díaz de Ilarraza, A., Iruskieta, M., Uria, L. 2017. ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research. Procesamiento del Lenguaje Natural 58: 77-84.
Pulgarcito Volumen No 2 - No 1 – 1920. URL: http://iiif.sld.cu/coleccion/07/06/2017/pulgarcito-volumen-no-2-no-1-1920 [January 10, 2019]
Sinclair, S., Rockwell, G. 2016. Voyant Tools. URL: http://voyant-tools.org/ [September 5, 2016]
Turner-O'Hara, L. 2013. Cleaning OCR’d text with Regular Expressions. URL: https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions [January 10, 2019]