Module 2: Python Crash Course 2 • Introduction to NLP with spaCy

Text access and preprocessing form the foundation of every NLP pipeline. In this module, we'll explore how to leverage Colab Notebooks for data access and learn essential preprocessing techniques to prepare raw text for in-depth analysis. These techniques include tokenization (splitting text into individual words), stopword removal, and lemmatization (reducing words to their base forms). Our Python Crash Course 2 will delve into text processing and NLP using the powerful spaCy library, equipping you with practical skills to handle real-world language data effectively.

Preparation for Module 2:

Literature:

Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4

Notebooks we will use in class:

Workload (after class):

Finish the exercises of the Python Crash Course (if not completed in class).

Create your first Jupyter notebook, clone our course repository, and import kölnische_Zeitung_erdbeben_artikel.xlsx.

Complete the following tasks:

  1. Clean, tokenize, and lemmatize the corpus
  2. Find the most frequent verbs (use the NLTK package for this task)
  3. Visualize the most frequent verbs with a visualization of your choice

Save your notebook in your GitHub repository.

Date and Time:

November 8, 2024 (10:00 AM to 11:30 AM)