Extracting structured information from TXT files
TXT_replaceWORDwithREGEX.py
- adding delimiters to text based on regular expressions
- preparing text for splitting into individual sections
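As an illustration of the delimiter step, here is a minimal sketch and not the code of TXT_replaceWORDwithREGEX.py itself. The pattern (a line starting with an all-caps surname followed by a comma), the `#####` marker and the function names are assumptions made for this example.

```python
import re

# Assumption: a new section starts at a line beginning with an
# all-caps surname followed by a comma (hypothetical pattern).
ENTRY_START = re.compile(r"(?m)^[A-ZÄÖÜ][A-ZÄÖÜ\-]+,")
DELIMITER = "#####"  # assumed marker, chosen not to occur in the text


def add_delimiters(text):
    """Insert the delimiter on its own line before every matching line."""
    return ENTRY_START.sub(lambda m: DELIMITER + "\n" + m.group(0), text)


def split_sections(text):
    """Split delimited text into stripped, non-empty sections."""
    parts = add_delimiters(text).split(DELIMITER)
    return [part.strip() for part in parts if part.strip()]


sample = (
    "MUELLER, Johann, 1750.\n"
    "continued line\n"
    "SCHMIDT, Peter, 1761.\n"
)
sections = split_sections(sample)
```

Keeping the delimiter insertion separate from the splitting makes it possible to inspect the marked-up text before it is cut into sections.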
TXT_splitUPPERCASE.py
- identifying person entries based on uppercase characters in names
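The uppercase-based splitting could look roughly like the sketch below. It is not taken from TXT_splitUPPERCASE.py: the rule that a name is a run of at least two capital letters (optionally hyphenated) and the function name `split_on_uppercase_names` are assumptions for illustration.

```python
import re

# Assumption: a person entry begins with a surname written entirely in
# capitals, at least two letters long, optionally hyphenated.
UPPERCASE_NAME = re.compile(r"\b[A-ZÄÖÜ]{2,}(?:-[A-ZÄÖÜ]{2,})*\b")


def split_on_uppercase_names(text):
    """Cut the text at the start of every all-caps name token."""
    starts = [m.start() for m in UPPERCASE_NAME.finditer(text)]
    if not starts:
        return []
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


sample = "BECKER Johann, Prof. theol. 1712 MÜLLER-THURGAU Karl, Dr. med. 1730"
entries = split_on_uppercase_names(sample)
```

A rule this simple would also fire on other all-caps tokens such as Roman numerals or abbreviations, so real data would likely need an exclusion list.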
UniversityRecordsMainz_identifyPLACEofORIGIN.py
- identifying places of origin according to token position in text
- sample output of script for the Mainz university records: UniversityRecordsMainz_output_place-names.txt
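Position-based place detection might be sketched as follows. The actual rule used in UniversityRecordsMainz_identifyPLACEofORIGIN.py is not described here, so this example assumes a hypothetical one: the place name is the token directly after the marker word "aus". The marker set, the `offset` parameter and the function name are all illustrative.

```python
# Assumption: the place of origin follows a marker token such as "aus".
MARKERS = {"aus"}
PUNCTUATION = ".,;:"


def place_of_origin(entry, markers=MARKERS, offset=1):
    """Return the token `offset` positions after the first marker, or None."""
    tokens = entry.split()
    for i, token in enumerate(tokens):
        is_marker = token.lower().strip(PUNCTUATION) in markers
        if is_marker and i + offset < len(tokens):
            return tokens[i + offset].strip(PUNCTUATION)
    return None


place = place_of_origin("BECKER Johann aus Bingen, 1712")
missing = place_of_origin("BECKER Johann, 1712")
```

Multi-word place names would need a wider window than a single token; the sketch deliberately ignores that case.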
TransferPROFData
- Transferring semi-structured text data, extracted via OCR from the typewritten Mainz university registers, to Excel
- Splitting the information into “name”, “information” and “source citation” columns
- Further refinement of the entries by matching the “information” with ontology lists
- Identification of event names, titles, functions, place names and dates
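The column split and the ontology matching can be sketched as below. This is a hedged illustration rather than the TransferPROFData code: the entry layout `name, information (source citation)`, the tiny `ONTOLOGY` lists, the year pattern and the sample line are all invented for the example.

```python
import re

# Assumed entry layout: "<name>, <information> (<source citation>)"
ENTRY = re.compile(
    r"^(?P<name>[^,]+),\s*(?P<information>.*?)\s*\((?P<source>[^()]*)\)\s*$"
)

# Hypothetical ontology lists; real lists would be far longer.
ONTOLOGY = {
    "title": ["Dr. med.", "Prof."],
    "function": ["Rektor", "Dekan"],
    "place": ["Mainz", "Bingen"],
}
YEAR = re.compile(r"\b1[4-9]\d{2}\b")  # assumed date form: four-digit years


def parse_entry(line):
    """Split one entry into columns and tag ontology terms found in it."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    info = row["information"]
    for category, terms in ONTOLOGY.items():
        row[category] = "; ".join(t for t in terms if t in info)
    row["dates"] = "; ".join(YEAR.findall(info))
    return row


row = parse_entry("BECKER Johann, Prof. und Dekan in Mainz 1712 (source p. 12)")
```

Each returned dictionary corresponds to one spreadsheet row; the rows could then be written out with, for example, `csv.DictWriter` and opened in Excel.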
The following blog post describes the application of some of the above-mentioned scripts in the “Kurmainz” work package:
Monika Barget, “Disambiguating people and places in dirty historical data,” in INSULAE, last updated 26/10/2021