
Code for data collection, data cleaning and data analysis

The pages linked below contain brief descriptions of scripts for data collection, data cleaning, and data analysis in the DigiKAR project. Some of the Jupyter notebooks (files ending in .ipynb) were initially created for Google Colab and need to be adjusted before they will run in other environments. The DigiKAR project used Google Colab because we did not have access to an institutional research software infrastructure. Ideally, code should be hosted in non-commercial environments, such as university-hosted computing infrastructures for data science.

*Screenshot: a notebook opened in Google Colab*

To make the Colab notebooks work for you, please carry out the following steps:

  1. Put the notebook on your own Google Drive, ideally in a folder whose name contains "Colab" so that you can easily identify it later.
  2. Open the notebook and adjust the directory path to match your own file location. You may also change the paths of the input and output data in the script, depending on your preferred folder structure. Make sure that all folders named in the script exist on Google Drive before you execute the script.
  3. Select "open with" and connect to the Google Colab app. If you have not used Google Colab before, select the "connect more apps" option and find Colab there.
  4. Make sure to give Colab all the necessary permissions to run the script and read / write files. If you do not want Colab to access a private Google Drive, you may want to create a new Google account exclusively for research purposes.
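The path adjustment described in step 2 can be sketched as follows. The folder names `Colab`, `DigiKAR`, `input`, `output` and the file name `factoids.csv` are placeholders for illustration, not actual project paths; replace them with your own Drive layout.

```python
from pathlib import PurePosixPath

# In a Colab notebook, Google Drive is typically mounted with:
#   from google.colab import drive
#   drive.mount("/content/drive")
# after which your Drive contents appear under /content/drive/MyDrive.
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive")

# Hypothetical folder and file names -- adjust them to your own
# Drive structure, and create the folders on Drive before running.
PROJECT_DIR = DRIVE_ROOT / "Colab" / "DigiKAR"
INPUT_FILE = PROJECT_DIR / "input" / "factoids.csv"
OUTPUT_DIR = PROJECT_DIR / "output"

print(INPUT_FILE)  # /content/drive/MyDrive/Colab/DigiKAR/input/factoids.csv
```

Defining all paths relative to a single `PROJECT_DIR` constant means only one line has to change when you move the notebook to a different folder or environment.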

INFO

This part of the documentation needs to be updated. The links below do not need to be provided in multiple languages anymore. We can instead link to specific pages in docs for every language.

1) Data retrieval from several CSV/EXCEL tables
2) Extracting structured information from TXT files
3) Reading data from HTML and XML (via API)
4) Preparing spatial data for the World Historical Gazetteer
5) Data replacement based on ontology lists
6) Consolidation of the factoid tables in the Mainz work package
7) Geocoding historical location data with Google and Geonames