Advanced Natural Language Processing
M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2024
This course delves into advanced concepts of Natural Language Processing (NLP) and Machine Learning (ML), with a strong focus on modern deep learning techniques. It covers foundational topics such as tokenization, text representation, and pipelines, as well as cutting-edge research on large language models (LLMs), transformers, and their applications. The course emphasizes both theoretical understanding and practical implementation, preparing students to tackle real-world NLP challenges, including security, privacy, and human-centered design. During the semester, students will also have the opportunity to test and train these architectures on real data using cloud-based services (Google Colab).
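As a first taste of the tooling used in the labs, the sketch below runs a Hugging Face pipeline end to end; it assumes the `transformers` package is installed (e.g. via pip in Google Colab) and uses the library's default sentiment model rather than anything course-specific.

```python
# Minimal sketch, assuming the Hugging Face `transformers` package
# is installed (pip install transformers).
from transformers import pipeline

# A pipeline bundles tokenization, the model forward pass, and
# post-processing into a single call.
classifier = pipeline("sentiment-analysis")  # downloads a default model
print(classifier("This course covers transformers in depth."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```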
======
Requirements
- Attendance: no more absences than the permitted maximum, and active participation in classes.
- Create a working application that solves a real problem, and present it in a video using the solutions and models learned in class.
- The project must be uploaded to GitHub and shared.
- The video should be 5-10 minutes long.
- In the video, each creator must present their own contribution (3-8 minutes).
- The application must be shown in action at the end of the video (1-2 minutes).
- Work in teams of 2-4 people or individually.
- If the creator(s) use a service based on a generative language model to complete the task, they must attach the prompt log to the completed project as supplementary material.
- Team members do not necessarily receive a uniform grade; each member is graded in proportion to their contribution to the project.
- Submission deadline: 2025.05.31
- Submission form
Lecture
- I. Tokenization
- I. Text representation I.
- I. Text representation II.
- II. Large language models I. fancy-rnn
- II. Large language models I. CNN-TreeRNN
- III. Large language models II. Basic
- IV. Large language models II. Transformer (see the attention sketch after this list)
- V. Pretrain
- V. Question Answering
- VI. Post-training
- VI. Prompting, RLHF
- VII. Life After DPO
- VII. Training
- VIII. Efficient Adaptation
- IX. Hardware-aware Algorithms for Sequence Modeling
- X. Evaluation
- X. Natural Language to Code Generation
- X. Security & Privacy of LLMs
- XI. Human-Centered NLP
- XI. Speech
- XII. Agents
- XII. Linguistics Philosophy
- XIII. Open problems and discussion
- XIV. State of the art
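The Transformer lectures build on scaled dot-product attention. Below is a minimal PyTorch sketch of that operation, with toy tensor shapes chosen for illustration; it is not the lecture's reference implementation.

```python
# Minimal sketch of scaled dot-product attention, assuming PyTorch
# is installed. Shapes and random inputs are illustrative only.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(1, 4, 8)  # (batch, seq_len, d_k)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 8])
```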
Lab
- N. Python Basics
- N. Numpy and Matplotlib
- N. Pandas Intro
- I. Problem Identification and Discovery
- II. Pipeline
- III. Vector Store
- IV. Filter and Cluster
- V. Agent
- VI. Tokenization
- VII. Embedding
- VIII. Recurrent Neural Network
- IX. BERT and LoRA (see the LoRA sketch after this list)
- X. Code Generation
- XI. Recommender System
- XII. Transformer
- XIII. Total Lab
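The BERT and LoRA lab combines a pre-trained encoder with low-rank adapters. The sketch below shows one way to attach LoRA to a BERT classifier using the `peft` library; the rank, alpha, and target modules are illustrative hyperparameters, not the lab's official settings.

```python
# Minimal sketch, assuming `transformers` and `peft` are installed.
# Hyperparameters (r, lora_alpha, target_modules) are illustrative.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,                                # low-rank dimension
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # BERT attention projections
    lora_dropout=0.1,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights train
```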
Submitted
Useful Links
Recommended Literature and Courses
- Jurafsky, Daniel, and James H. Martin. Speech and Language Processing (3rd ed. draft), 2018.
- Eisenstein, Jacob. Introduction to Natural Language Processing. MIT Press, 2019.
- Goldberg, Yoav. "A Primer on Neural Network Models for Natural Language Processing." Journal of Artificial Intelligence Research 57 (2016): 345-420.
- Chollet, François. Deep Learning with Python. Manning, 2017.
- Hugging Face NLP Course
- MIT Introduction to Deep Learning
- Visual Guide to Transformer Neural Networks - (Episode 1)
- Visual Guide to Transformer Neural Networks - (Episode 2)
- Visual Guide to Transformer Neural Networks - (Episode 3)
Key Words
- Tokenization
- Byte-Pair Encoding (BPE); see the tokenizer sketch after this list
- Byte-level BPE
- WordLevel
- WordPiece
- Unigram
- SentencePiece
- Embedding
- Skip-Gram
- CBOW
- GloVe
- Word2Vec
- Position Embedding
- (Multi-Head) Attention
- Neural Network (Feed-Forward Layer)
- Normalization
- Transformer
- Pre-Trained
- Large Language Model
- NLP Tasks
- Summarization
- Translation
- Generation
- Q&A
- Named Entity Recognition
- Sentiment analysis
- Multimodal architectures
- Hugging Face
- Keras
- TensorFlow
- PyTorch
- Python
- Pipeline
- Notebook
- Google Colab
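To make the tokenization keywords above concrete, here is a minimal sketch of training a BPE tokenizer with the Hugging Face `tokenizers` library; the two-line corpus and the vocabulary size are toy placeholders.

```python
# Minimal sketch, assuming the Hugging Face `tokenizers` package.
# The corpus and vocab_size are toy placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(
    ["tokenization splits text into subword units",
     "byte pair encoding merges frequent symbol pairs"],
    trainer,
)
print(tokenizer.encode("tokenization merges subword pairs").tokens)
```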
Useful Publications
[2] Radford, Alec, et al. "Improving Language Understanding by Generative Pre-Training." OpenAI technical report, 2018.
[3] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019.
[4] Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781, 2013.