NLP Course DMGK - Digital Humanities

About the Course

Author: Sarah Oberbichler, Leibniz Institute of European History (IEG)

This course offers an introduction to Natural Language Processing (NLP) and its application in the humanities and cultural studies. Participants work with digitized newspaper collections from the German Digital Library and examine the topic of "Natural and Environmental Disasters in Media". Both theoretical foundations and practical applications of NLP methods are taught.

Course Content and Methodology:

Practical Application: Students learn to apply NLP tools to specific research questions. The digitized newspaper collections of the German Digital Library serve as the data source, and various analysis methods are employed.
Thematic Focus: The course focuses on the examination of natural and environmental disasters in media. It analyzes how these events are presented and discussed in historical media reports.
Interdisciplinary Approaches: The course explores how NLP technologies can open up new perspectives on cultural, historical, and social issues. It also reflects on how these methods complement and extend traditional humanities approaches.

Learning Objectives:

Application of relevant Python packages for NLP tasks to one's own research data
Preparation and structuring of large datasets for analysis
Use of transformer models and large language models for NLP tasks with extensive data volumes
Critical reflection on various methods (methodology critique)
Writing a scientific paper on the research results

Course Schedule

Module 1: October 25, 2024 (10:00 AM to 11:30 AM)

Introduction to the topic, the course, and NLP

Introduction to Colab Notebooks

Python Crash Course 1

Module 2: November 8, 2024 (10:00 AM to 11:30 AM)

Python Crash Course 2

Introduction to NLP with SpaCy, NLTK, and SKLEARN

Module 3: November 15, 2024 (10:00 AM to 11:30 AM)

The German Newspaper Portal: Introduction and API Usage
(Guests: Lisa Landes, Michael Büchner, and Stephanie Nitsche from the German National Library)

Module 4: November 22, 2024 (10:00 AM to 11:30 AM)

Transformer Models for Semantic Search

Module 5: December 6, 2024 (10:00 AM to 11:30 AM)

Large Language Models for Article Extraction and Post-OCR Correction

Module 6: January 10, 2025 (10:00 AM to 11:30 AM)

Named Entity Recognition and Text Classification

Module 7: January 24, 2025 (10:00 AM to 11:30 AM)

Individual Consultation Appointments

Modules and Workloads

Module 1: Introduction to the topic, the course, and NLP • Introduction to Colab Notebooks • Python Crash Course 1

Module 1 introduces the main topic of the course, gives an overview of NLP, and provides a Python crash course using Colab Notebooks.

October 25, 2024

Module 2: Python Crash Course 2 • Introduction to NLP with SpaCy, NLTK, and SKLEARN

In this module, we'll explore how to leverage Colab Notebooks for data access and become data detectives using basic NLP tasks.
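
As a flavor of what "basic NLP tasks" can mean, the following is a minimal sketch of tokenization and word-frequency counting using only the Python standard library. It is not taken from the course notebooks: the sample sentence, the small stopword list, and the helper names `tokenize` and `word_frequencies` are assumptions made for illustration. In the course itself, libraries such as SpaCy or NLTK would typically handle these steps.

```python
import re
from collections import Counter

# Hypothetical sample sentence; not taken from the course data.
text = "The flood destroyed the bridge, and the flood waters rose overnight."

# A tiny illustrative stopword list; real projects would use a fuller one.
STOPWORDS = {"the", "and", "a", "of", "in"}


def tokenize(s):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zäöüß]+", s.lower())


def word_frequencies(s):
    """Count all tokens that are not in the stopword list."""
    return Counter(t for t in tokenize(s) if t not in STOPWORDS)


freqs = word_frequencies(text)
print(freqs.most_common(3))
```

Counting content words like this is often a first step toward spotting which terms dominate a set of newspaper articles.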

November 8, 2024

Module 3: The German Newspaper Portal: Overview, API Usage, Data Lab

This module provides background information on the German Newspaper Portal, introduces its API, and offers a look at the Data Lab.

November 15, 2024

Module 4: Transformer Models for Semantic Search

In Module 4, we investigate the variety of transformer models available for NLP tasks as well as the possibilities of semantic search for historical newspapers.
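
The ranking step behind semantic search can be sketched in a few lines: embed the query and each document as vectors, then sort the documents by cosine similarity to the query. The sketch below is an assumption-laden illustration, not the course's method. In particular, the `embed` function is a plain bag-of-words stand-in, not a transformer model, so it only matches shared words rather than meaning. The function names and sample headlines are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT a transformer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical headlines; not taken from the German Newspaper Portal.
docs = [
    "earthquake destroys village in the alps",
    "new theatre opens in berlin",
    "flood damages harvest along the rhine",
]

results = search("flood along the rhine", docs)
print(results[0][1])
```

With a transformer-based sentence embedding in place of `embed`, the same ranking logic could also surface articles that describe a flood without using the word itself.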

November 22, 2024

Module 5: Large Language Models for Article Extraction and Post-OCR Correction

In this module, we'll explore how open-access LLMs can be used for complex NLP tasks.

December 6, 2024

Module 6: Named Entity Recognition and Text Classification

In this module, we explore novel approaches to NER (using the Data Lab APIs) and text classification.

January 10, 2025

Literature

Dobson, J.E. (2023). On reading and interpreting black box deep neural networks. International Journal of Digital Humanities, 5, 431–449. https://doi.org/10.1007/s42803-023-00075-w
Khurana, D., Koli, A., Khatter, K. et al. (2023). Natural language processing: state of the art, current trends and challenges. Multimedia Tools and Applications, 82, 3713–3744. https://doi.org/10.1007/s11042-022-13428-4
König, M. (2024, August 19). ChatGPT und Co. in den Geschichtswissenschaften – Grundlagen, Prompts und Praxisbeispiele. Digital Humanities am DHIP. Retrieved December 2, 2024, from https://doi.org/10.58079/126eo
Navigli, R., Conia, S., & Ross, B. (2023). Biases in Large Language Models: Origins, Inventory, and Discussion. Journal of Data and Information Quality, 15(2), Article 10, 21 pages. https://doi.org/10.1145/3597307
Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927. https://doi.org/10.48550/arXiv.2402.07927