Recurrent Neural Networks
M.Sc. course, University of Debrecen, Department of Data Science and Visualization, 2025
Colab
What is an RNN, specifically an LSTM?
- RNNs are a type of neural network designed to handle sequential data, like text or time series. They have a “memory” that allows them to retain information from previous steps in the sequence, which is crucial for understanding context.
- LSTMs (Long Short-Term Memory) are a special kind of RNN that address the “vanishing gradient” problem, which hindered the ability of traditional RNNs to learn long-range dependencies in sequences. LSTMs have a more complex internal structure with “gates” that control the flow of information, allowing them to better capture and retain important information over longer sequences.
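A minimal PyTorch sketch of this idea (the sizes below are illustrative, not the ones used in the notebook): an embedding layer turns token IDs into vectors, and an nn.LSTM reads them in order while carrying a hidden "memory" state from step to step.

```python
# Illustrative only: an LSTM reading a batch of token-ID sequences.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5000, embedding_dim=128)   # token IDs -> vectors
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)  # sequential "memory"

token_ids = torch.randint(0, 5000, (4, 32))       # 4 sequences, 32 tokens each
outputs, (h_n, c_n) = lstm(embedding(token_ids))  # h_n holds the last hidden state per sequence
print(outputs.shape, h_n.shape)                   # (4, 32, 256) and (1, 4, 256)
```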
Why are LSTMs important in NLP?
- **Contextual Understanding:** LSTMs excel at capturing the relationships between words in a sentence or document, enabling them to understand the context of words and phrases. This is essential for tasks like sentiment analysis, machine translation, and text generation.
- **Handling Long Sequences:** Their ability to learn long-range dependencies makes them well-suited for processing lengthy text documents, where the meaning of a word might depend on information presented much earlier in the text.
- **Sequence-to-Sequence Tasks:** LSTMs can be used to map one sequence to another, which is fundamental for tasks like machine translation (translating a sentence from one language to another) or text summarization (condensing a longer text into a shorter summary).
In simpler terms: Think of an LSTM as a powerful reader that can remember the important parts of a story as it reads, allowing it to understand the overall plot and meaning better than a simple reader who only focuses on the current sentence. This “memory” and contextual understanding make LSTMs a valuable tool for NLP tasks where grasping the meaning of text is crucial.
In the notebook:
1. Load Data
- Imports the pandas library for data manipulation.
- Defines a dictionary splits to hold the paths to the training, testing, and unsupervised data files in Parquet format.
- Loads the IMDB movie review dataset from the Hugging Face Hub using pd.read_parquet, placing the training and testing splits into separate Pandas DataFrames (df_imdb_train, df_imdb_test).
- Prints information about the training and testing DataFrames using df.info().
- Splits the testing data into validation and testing sets using train_test_split from sklearn.model_selection with a 20% test size and a random state of 42 for reproducibility.
- Prints the sizes of the validation and testing sets.
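A hedged sketch of this loading step (the exact Parquet paths on the Hugging Face Hub and the validation/test variable names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Paths of the IMDB Parquet files on the Hugging Face Hub (assumed layout).
splits = {
    "train": "plain_text/train-00000-of-00001.parquet",
    "test": "plain_text/test-00000-of-00001.parquet",
    "unsupervised": "plain_text/unsupervised-00000-of-00001.parquet",
}
df_imdb_train = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])
df_imdb_test = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["test"])

df_imdb_train.info()
df_imdb_test.info()

# Split the original test set 80/20 into validation and test parts.
df_imdb_val, df_imdb_test = train_test_split(df_imdb_test, test_size=0.2, random_state=42)
print(len(df_imdb_val), len(df_imdb_test))
```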
2. Tokenization (Create Dictionary)
- Imports the necessary classes from the tokenizers library: Tokenizer, WordPiece, WordPieceTrainer, and Whitespace.
- Creates a Tokenizer object using the WordPiece model and sets the unknown token to [UNK].
- Initializes a WordPieceTrainer to train the tokenizer with a vocabulary size of 5000, a minimum frequency of 2, and special tokens.
- Defines a data_generator function to iterate over the text data for training the tokenizer.
- Trains the tokenizer using the training data (df_imdb_train['text']) and the trainer.
- Saves the trained tokenizer to a file named “tokenizer-wp-imdb.json”.
- Encodes a sample text using the tokenizer and prints the resulting tokens and their IDs.
- Calculates the number of tokens for each text in the training, testing, and validation sets using the tokenizer and stores it in a new column named “count_of_tokens”.
- Displays descriptive statistics of the training DataFrame using df.describe().
- Creates a histogram of the “count_of_tokens” column in the training DataFrame.
- Filters the training, testing, and validation DataFrames to keep only texts with 512 tokens or fewer.
- Prints the lengths of the filtered DataFrames.
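A hedged sketch of the tokenization step, continuing from the DataFrames above; the vocabulary size, minimum frequency, file name, and 512-token filter follow the description, while the special-token list and the batching inside data_generator are assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# WordPiece tokenizer with whitespace pre-tokenization.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]"],  # assumed special-token list
)

def data_generator(texts, batch_size=1000):
    # Yield batches of raw review texts for tokenizer training.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size].tolist()

tokenizer.train_from_iterator(data_generator(df_imdb_train["text"]), trainer=trainer)
tokenizer.save("tokenizer-wp-imdb.json")

# Inspect a sample encoding.
sample = tokenizer.encode("This movie was surprisingly good.")
print(sample.tokens, sample.ids)

# Token counts per review, then keep only reviews with at most 512 tokens.
for df in (df_imdb_train, df_imdb_val, df_imdb_test):
    df["count_of_tokens"] = df["text"].apply(lambda t: len(tokenizer.encode(t).ids))

df_imdb_train = df_imdb_train[df_imdb_train["count_of_tokens"] <= 512]
df_imdb_val = df_imdb_val[df_imdb_val["count_of_tokens"] <= 512]
df_imdb_test = df_imdb_test[df_imdb_test["count_of_tokens"] <= 512]
print(len(df_imdb_train), len(df_imdb_val), len(df_imdb_test))
```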
3. Prepare to Train
- Enables padding and truncation for the tokenizer, ensuring all sequences have a length of 512.
- Defines a function encode_text to encode a batch of texts using the tokenizer.
- Creates a custom TextDataset class to handle the text data and labels.
- Creates DataLoader instances for the training, validation, and testing sets to handle batching and shuffling of data.
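A hedged sketch of this preparation step, continuing the variables from the previous sketches; the collate function, batch size, and [PAD] token are assumptions:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Pad/truncate every encoding to exactly 512 tokens.
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=512)
tokenizer.enable_truncation(max_length=512)

def encode_text(texts):
    # Encode a batch of texts into a LongTensor of token IDs, shape (batch, 512).
    return torch.tensor([enc.ids for enc in tokenizer.encode_batch(texts)], dtype=torch.long)

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = list(texts)
        self.labels = list(labels)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def collate_fn(batch):
    # Tokenize on the fly so each batch becomes (token IDs, labels) tensors.
    texts, labels = zip(*batch)
    return encode_text(list(texts)), torch.tensor(labels, dtype=torch.long)

train_loader = DataLoader(TextDataset(df_imdb_train["text"], df_imdb_train["label"]),
                          batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(TextDataset(df_imdb_val["text"], df_imdb_val["label"]),
                        batch_size=32, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(TextDataset(df_imdb_test["text"], df_imdb_test["label"]),
                         batch_size=32, shuffle=False, collate_fn=collate_fn)
```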
4. Define and Train Model
- Defines a TextClassifier model using PyTorch, consisting of an embedding layer, an LSTM layer, a fully connected layer, and dropout for regularization.
- Initializes the model, loss function (CrossEntropyLoss), and optimizer (Adam).
- Moves the model to the appropriate device (CUDA if available, otherwise CPU).
- Defines a function calculate_accuracy to evaluate the model’s accuracy on a given dataset.
- Trains the model for 5 epochs, iterating over the training data in batches and updating the model’s parameters using the optimizer.
- Calculates and prints the training loss and validation accuracy after each epoch.
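A hedged sketch of the model and training loop, continuing the loaders above; layer sizes and the learning rate are assumptions, while the overall structure (embedding + LSTM + dropout + linear head, CrossEntropyLoss, Adam, 5 epochs) follows the description:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, num_classes=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (1, batch, hidden_dim)
        return self.fc(self.dropout(h_n[-1]))  # logits: (batch, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def calculate_accuracy(loader):
    # Fraction of correctly classified examples in the given DataLoader.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            correct += (model(inputs).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

for epoch in range(5):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch + 1}: train loss {running_loss / len(train_loader):.4f}, "
          f"val accuracy {calculate_accuracy(val_loader):.4f}")
```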
5. Evaluate Model
- Selects 10 random samples from the test set for evaluation.
- Performs inference on the selected samples using the trained model and stores the predicted labels.
- Prints the tokens, true labels, and predicted labels for each sample.
- Calculates and prints the test accuracy of the model on the entire test set.
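A hedged sketch of the evaluation step, reusing encode_text, calculate_accuracy, and the trained model from the sketches above; the sampling seed and print format are assumptions:

```python
# Pick 10 random test reviews and compare predictions with true labels.
sample_df = df_imdb_test.sample(n=10, random_state=42)

model.eval()
with torch.no_grad():
    inputs = encode_text(sample_df["text"].tolist()).to(device)
    predictions = model(inputs).argmax(dim=1).cpu().tolist()

for text, true_label, pred_label in zip(sample_df["text"], sample_df["label"], predictions):
    tokens = tokenizer.encode(text).tokens
    print(tokens[:20], "... true:", true_label, "pred:", pred_label)

# Accuracy on the full (filtered) test set.
print("test accuracy:", calculate_accuracy(test_loader))
```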