Extracting structured information from TXT files

TXT_replaceWORDwithREGEX.py

  • adding delimiters to text based on regular expressions
  • preparing text for splitting into individual sections
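
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the delimiter string, the pattern for the start of an entry (a surname in capitals followed by a comma at the beginning of a line) and the function names are all assumptions made for this example.

```python
import re

# Hypothetical delimiter; the marker used by the actual script is not documented here.
DELIMITER = "|||"

# Assumed pattern: a new section starts at a line that begins with
# at least two capital letters followed by a comma (e.g. "MEYER,").
ENTRY_START = re.compile(r"^(?=[A-ZÄÖÜ]{2,},)", re.MULTILINE)


def add_delimiters(text: str) -> str:
    """Insert the delimiter in front of every line that looks like an entry start."""
    return ENTRY_START.sub(DELIMITER, text)


def split_sections(text: str) -> list[str]:
    """Split delimited text into individual, whitespace-trimmed sections."""
    return [part.strip() for part in add_delimiters(text).split(DELIMITER) if part.strip()]


# Invented sample text, for illustration only.
sample = "MEYER, Johann, Prof. 1785\nfurther notes\nSCHMIDT, Karl, Dr. 1790\n"
sections = split_sections(sample)
```

Inserting an explicit delimiter first, rather than splitting directly on the pattern, keeps the intermediate text file readable, so it can be checked by hand before splitting.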

TXT_splitUPPERCASE.py

  • identifying person entries based on uppercase characters in names
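
A possible way to detect such names is shown below. It is a sketch under stated assumptions: it presumes that surnames are typed entirely in capitals, have at least two letters and may be hyphenated. The pattern and the function name are illustrative and are not taken from the script.

```python
import re

# Assumption: a surname is a run of two or more capital letters (including
# German umlauts), optionally joined to further capitalised parts by hyphens.
UPPERCASE_NAME = re.compile(r"\b[A-ZÄÖÜ]{2,}(?:-[A-ZÄÖÜ]{2,})*\b")


def find_uppercase_names(text: str) -> list[str]:
    """Return all tokens that look like surnames written in capitals."""
    return UPPERCASE_NAME.findall(text)


# Invented sample text, for illustration only.
sample = "MUSTERMANN, Johann, Dr. phil.; see also BEISPIEL-MÜLLER, Karl"
names = find_uppercase_names(sample)
```

Note that a pattern like this also matches other capitalised tokens such as Roman numerals ("II") or abbreviations, so the hits would need filtering on real data.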

UniversityRecordsMainz_identifyPLACEofORIGIN.py

TransferPROFData

  • Transfer of semi-structured text data extracted via OCR from the Mainz university registers (originally written with a typewriter) to Excel
  • Splitting the information into “name”, “information” and “source citation” columns
  • Further refinement of the entries by matching the “information” with ontology lists
  • Identification of event names, titles, functions, place names and dates

The following blog post describes the application of some of the above-mentioned scripts in the “Kurmainz” work package:
Monika Barget, “Disambiguating people and places in dirty historical data,” in INSULAE, last updated 26/10/2021