2) Extracting structured information from TXT files
a) TXT_replaceWORDwithREGEX.py
- adding delimiters to text based on regular expressions
- preparing text for splitting into individual sections
- identifying person entries based on uppercase characters in names
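
A minimal sketch of the delimiter step, not the script itself: it inserts a delimiter before every line that starts with an uppercase name and then splits the text on that delimiter. The delimiter string, the regular expression, the function names and the sample text are all assumptions made for illustration; the actual script may use different rules.

```python
import re
from typing import List

# Hypothetical delimiter; any string that never occurs in the source text works.
DELIMITER = "@@ENTRY@@"

# Assumed pattern: an entry begins at the start of a line with a surname
# written in at least two uppercase letters, e.g. "MUELLER, Johann".
ENTRY_START = re.compile(r"^(?=[A-ZÄÖÜ]{2,}\b)", flags=re.MULTILINE)


def add_delimiters(text: str) -> str:
    """Insert DELIMITER in front of every line that starts with an uppercase name."""
    return ENTRY_START.sub(DELIMITER, text)


def split_entries(text: str) -> List[str]:
    """Split the delimited text into individual entries, dropping empty pieces."""
    parts = add_delimiters(text).split(DELIMITER)
    return [part.strip() for part in parts if part.strip()]


# Invented sample text, only for demonstration.
sample = (
    "MUELLER, Johann, aus Mainz, 1750 immatrikuliert.\n"
    "Fortsetzung des Eintrags.\n"
    "SCHMIDT, Peter, aus Worms.\n"
)
entries = split_entries(sample)
```

With this sample, the continuation line stays attached to the first entry because it does not start with an uppercase name.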
c) UniversityRecordsMainz_identifyPLACEofORIGIN.py
- identifying places of origin according to token position in text
- sample output of script for the Mainz university records: UniversityRecordsMainz_output_place-names.txt
- Transfer of semi-structured text data extracted via OCR from the Mainz university registers (originally written with a typewriter) to Excel
- Splitting the information into “name”, “information” and “source citation” columns
- Further refinement of the entries by matching the “information” with ontology lists
- Identification of event names, titles, functions, place names and dates
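
The column split and the ontology matching described above could look roughly like the following sketch. It assumes a hypothetical entry layout of `name: information (source citation)`, and the ontology lists, the year pattern, the function names and the sample entry are invented for illustration; they are not taken from the Mainz data or the project's scripts. Writing the resulting rows to Excel is left out.

```python
import re
from typing import Dict, List

# Hypothetical ontology lists; real lists would be longer and curated.
ONTOLOGY = {
    "title": ["Dr. theol.", "Magister", "Lic. iur."],
    "function": ["Professor", "Rektor", "Dekan"],
    "place": ["Mainz", "Worms", "Bingen"],
}

# Assumed entry layout: "<name>: <information> (<source citation>)".
ENTRY_PATTERN = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<information>.*?)\s*\((?P<source>[^()]*)\)\s*$",
    flags=re.DOTALL,
)

# Assumed date form: a four-digit year between 1000 and 1999.
YEAR_PATTERN = re.compile(r"\b1\d{3}\b")


def split_entry(entry: str) -> Dict[str, str]:
    """Split one entry into name, information and source citation columns."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        # Keep unparsed entries visible instead of silently dropping them.
        return {"name": "", "information": entry.strip(), "source": ""}
    return {key: value.strip() for key, value in match.groupdict().items()}


def annotate(information: str) -> Dict[str, List[str]]:
    """Collect ontology terms and years that occur in the information text."""
    found = {
        category: [term for term in terms if term in information]
        for category, terms in ONTOLOGY.items()
    }
    found["date"] = YEAR_PATTERN.findall(information)
    return found


# Invented sample entry, only for demonstration.
row = split_entry("MUELLER, Johann: Dr. theol., 1750 Professor in Mainz (Matrikel I, 12)")
tags = annotate(row["information"])
```

The matching here is plain substring search, which is easy to follow but will also hit terms inside longer words; a real ontology match would likely need word boundaries or normalised spellings.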