Text access and preprocessing form the foundation of NLP pipelines. In this module, we'll explore how to use Colab notebooks to access data and learn the essential preprocessing techniques that prepare raw text for analysis: tokenization (splitting text into individual tokens, such as words), stopword removal (filtering out very frequent words that carry little content), and lemmatization (reducing words to their base forms). Python Crash Course 2 covers text processing and NLP with the spaCy library, giving you practical skills for handling real-world language data.
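To make the three preprocessing steps concrete, here is a minimal sketch in plain Python. It is not the course's spaCy code: the stopword set and the lemma dictionary are tiny, hypothetical stand-ins invented for this illustration, and the function names are assumptions as well. In spaCy, the same information is exposed on each token through attributes such as `token.is_stop` and `token.lemma_`, backed by a trained language pipeline rather than hand-written tables.

```python
import re

# Toy resources for illustration only (assumptions, not real linguistic data).
# A real pipeline would use a library's stopword list and lemmatizer instead.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "was", "were", "is", "are"}
LEMMAS = {
    "damaged": "damage",
    "houses": "house",
    "villages": "village",
}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def remove_stopwords(tokens):
    """Keep only tokens that are not in the toy stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens):
    """Map each token to its base form if the toy dictionary knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Run tokenization, stopword removal, and lemmatization in order."""
    return lemmatize(remove_stopwords(tokenize(text)))


print(preprocess("The earthquake damaged houses in the villages."))
# → ['earthquake', 'damage', 'house', 'village']
```

The order of the steps matters in this sketch: stopwords are matched against the lowercase surface forms, so they are removed before lemmatization changes the tokens.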
Khurana, D., Koli, A., Khatter, K. et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 82, 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4
Finish the Python Crash Course exercises if you did not complete them in class.
Create your first Jupyter notebook, clone our course repository, and import kölnische_Zeitung_erdbeben_artikel.xlsx.
Complete the following tasks:
Save your notebook in your GitHub repository.
November 8, 2024 (10:00 AM to 11:30 AM)