Module 5: Large Language Models for Article Extraction and Post-OCR Correction

Module 5 covers large language models, prompting techniques, and two specific NLP tasks: article extraction and OCR post-correction.

Large Language Models (LLMs) are artificial intelligence systems trained on massive text datasets to process and generate human language. They are built on the Transformer architecture introduced by Vaswani et al. in 2017. These models use neural networks to predict the likely next token in a sequence, which enables tasks like text completion, translation, and question answering. While research shows correlations between model size, training data, and performance, the specific capabilities and limitations of LLMs are still actively studied and debated in the research community. Fundamentally, they operate through statistical pattern matching rather than genuine understanding.
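To make next-token prediction concrete, here is a minimal sketch in Python. The scores in `toy_scores` are invented for illustration and do not come from any real model; a real LLM computes such scores with a neural network over a vocabulary of many thousands of tokens. The sketch only shows the last step: turning scores into probabilities with a softmax and picking the most likely token.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(score - peak) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Hypothetical scores for candidate next tokens after the context
# "The newspaper was printed in" (made-up numbers, not model output).
toy_scores = {"Berlin": 2.0, "1848": 1.0, "ink": 0.5}

probabilities = softmax(toy_scores)

# Greedy decoding: choose the single most probable token.
next_token = max(probabilities, key=probabilities.get)
print(next_token)
```

In practice, models often sample from the probability distribution instead of always taking the top token, which is why the same prompt can produce different outputs.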

Preparation for Module 5:

  1. If you have not already done so, watch this YouTube video on LLMs: 3Blue1Brown: Large Language Models
  2. Please read the blog post by Mareike König (referenced in the Literature section) and analyze the ethical implications of using LLMs. How do you incorporate data ethics into your own LLM usage?
  3. Create an NVIDIA token:
    1. Visit the NVIDIA AI Playground
    2. Click "Login" at the top right of the page
    3. Enter your university email address
    4. Copy the token

Literature:

Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., & Chadha, A. (2024). A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927. https://doi.org/10.48550/arXiv.2402.07927

König, M. (August 19, 2024). ChatGPT und Co. in den Geschichtswissenschaften – Grundlagen, Prompts und Praxisbeispiele. Digital Humanities am DHIP. Retrieved December 2, 2024, from https://doi.org/10.58079/126eo

Notebooks we will use in class:

Workload (after class):

Write a prompt for OCR post-correction and try it out in the notebook "Large Language Models and Article Separation/OCR Post-Correction".
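As a starting point, a prompt for this task could be assembled as in the sketch below. The function name `build_correction_prompt`, the wording of the instructions, and the sample sentence are all hypothetical illustrations, not the template used in the class notebook. The sketch only builds the prompt string; sending it to a model is left to the notebook.

```python
def build_correction_prompt(ocr_text: str) -> str:
    """Wrap noisy OCR text in instructions for an LLM (illustrative wording)."""
    return (
        "You are correcting OCR errors in a historical newspaper text.\n"
        "Fix misrecognized characters and broken words.\n"
        "Do not modernize spelling, and do not add or remove content.\n"
        "Return only the corrected text.\n\n"
        f"OCR text:\n{ocr_text}"
    )


# Made-up example of OCR noise (rn for m, cl for d, T read as Tb).
sample = "Tbe Parliarnent met yesterclay in Lonclon."

prompt = build_correction_prompt(sample)
print(prompt)
```

When you write your own prompt, it is worth experimenting with how explicit the constraints are, for example whether the model is told to preserve historical spelling, and comparing the outputs.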

Date and Time:

December 6, 2024 (10:00 AM to 11:30 AM)