
Code for data collection, data cleaning and data analysis

The pages linked below contain brief descriptions of scripts for data collection, data cleaning, and data analysis in the DigiKAR project. Some of the Jupyter notebooks (files ending in .ipynb) were initially created for Google Colab and need to be adjusted before they will run in other environments. The DigiKAR project used Google Colab because we did not have access to an institutional research software infrastructure. Ideally, code should be hosted in non-commercial environments, such as university-hosted computing infrastructures for data science.

*Screenshot: a notebook opened in Google Colab*

To make the Colab notebooks work for you, please carry out the following steps:

  1. Put the notebook on your own Google Drive, ideally in a folder whose name contains "Colab" so that you can easily identify it later.
  2. Open the notebook and adjust the directory path to match your own file location. You may also change the paths of the input and output data in the script, depending on your preferred folder structure. Make sure that all folders named in the script exist on Google Drive before you execute the script.
  3. Select "open with" and connect to the Google Colab app. If you have not used Google Colab before, select the "connect more apps" option and find Colab there.
  4. Make sure to give Colab all the necessary permissions to run the script and read / write files. If you do not want Colab to access a private Google Drive, you may want to create a new Google account exclusively for research purposes.
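The path adjustment described in step 2 can be sketched as follows. The folder names `Colab`, `DigiKAR`, `input`, `output` and the file name `factoids.csv` are placeholders for illustration, not actual project paths; replace them with your own Drive layout.

```python
from pathlib import PurePosixPath

# In a Colab notebook, Google Drive is typically mounted with:
#   from google.colab import drive
#   drive.mount("/content/drive")
# after which your Drive contents appear under /content/drive/MyDrive.
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive")

# Hypothetical folder and file names -- adjust them to your own
# Drive structure, and create the folders on Drive before running.
PROJECT_DIR = DRIVE_ROOT / "Colab" / "DigiKAR"
INPUT_FILE = PROJECT_DIR / "input" / "factoids.csv"
OUTPUT_DIR = PROJECT_DIR / "output"

print(INPUT_FILE)  # /content/drive/MyDrive/Colab/DigiKAR/input/factoids.csv
```

Defining all paths relative to a single `PROJECT_DIR` constant means only one line has to change when you move the notebook to a different folder or environment.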

INFO

This part of the documentation needs to be updated. The links below do not need to be provided in multiple languages anymore. We can instead link to specific pages in docs for every language.

1) Data retrieval from several CSV/EXCEL tables
2) Extracting structured information from TXT files
3) Reading data from HTML and XML (via API)
4) Preparing spatial data for the World Historical Gazetteer
5) Data replacement based on ontology lists
6) Consolidation of the factoid tables in the Mainz work package
7) Geocoding historical location data with Google and Geonames