Recurrent Neural Network

M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025

Colab

What is an RNN, specifically an LSTM?

  • RNNs are a type of neural network designed to handle sequential data, like text or time series. They have a “memory” that allows them to retain information from previous steps in the sequence, which is crucial for understanding context.
  • LSTMs (Long Short-Term Memory) are a special kind of RNN that address the “vanishing gradient” problem, which hindered the ability of traditional RNNs to learn long-range dependencies in sequences. LSTMs have a more complex internal structure with “gates” that control the flow of information, allowing them to better capture and retain important information over longer sequences (a toy example follows below).
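
To make the idea concrete, here is a small PyTorch sketch (not taken from the notebook): a batch of token IDs is embedded and fed through an LSTM, and the final hidden state acts as the network's “memory” of the whole sequence. The vocabulary size, dimensions, and batch shape are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5000, embedding_dim=64)
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

token_ids = torch.randint(0, 5000, (2, 10))   # 2 toy sequences, 10 token IDs each
embedded = embedding(token_ids)               # shape: (2, 10, 64)
outputs, (h_n, c_n) = lstm(embedded)          # outputs: (2, 10, 128)

# h_n holds the last hidden state of each sequence -- the LSTM's summary of
# everything it has read; a classifier head would operate on this vector.
print(outputs.shape, h_n.shape)               # (2, 10, 128) and (1, 2, 128)
```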

Why are LSTMs important in NLP?

  • **Contextual Understanding:** LSTMs excel at capturing the relationships between words in a sentence or document, enabling them to understand the context of words and phrases. This is essential for tasks like sentiment analysis, machine translation, and text generation.
  • **Handling Long Sequences:** Their ability to learn long-range dependencies makes them well-suited for processing lengthy text documents, where the meaning of a word might depend on information presented much earlier in the text.
  • **Sequence-to-Sequence Tasks:** LSTMs can be used to map one sequence to another, which is fundamental for tasks like machine translation (translating a sentence from one language to another) or text summarization (condensing a longer text into a shorter summary).

In simpler terms: Think of an LSTM as a powerful reader that can remember the important parts of a story as it reads, allowing it to understand the overall plot and meaning better than a simple reader who only focuses on the current sentence. This “memory” and contextual understanding make LSTMs a valuable tool for NLP tasks where grasping the meaning of text is crucial.

In the notebook:

1. Load Data

  • Imports the pandas library for data manipulation.
  • Defines a dictionary splits to hold the paths to the training, testing, and unsupervised data files in Parquet format.
  • Loads the IMDB movie review dataset from the Hugging Face Hub using pd.read_parquet, reading the training and testing splits into separate Pandas DataFrames (df_imdb_train, df_imdb_test).
  • Prints information about the training and testing DataFrames using df.info().
  • Splits the testing data into validation and testing sets using train_test_split from sklearn.model_selection with a 20% test size and a random state of 42 for reproducibility.
  • Prints the sizes of the validation and testing sets (a code sketch of these steps follows this list).
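
The loading step might look roughly like the following; the exact parquet file names under the stanfordnlp/imdb repository are an assumption based on the Hugging Face Hub layout, and reading hf:// paths with pandas requires the huggingface_hub package to be installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Parquet files of the IMDB dataset on the Hugging Face Hub
# (file names below are an assumption based on the hub layout).
splits = {
    "train": "plain_text/train-00000-of-00001.parquet",
    "test": "plain_text/test-00000-of-00001.parquet",
    "unsupervised": "plain_text/unsupervised-00000-of-00001.parquet",
}

df_imdb_train = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])
df_imdb_test = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["test"])

df_imdb_train.info()
df_imdb_test.info()

# Carve a validation set out of the test data (80/20 split, fixed seed);
# the notebook's exact assignment of the two resulting parts may differ.
df_imdb_val, df_imdb_test = train_test_split(df_imdb_test, test_size=0.2, random_state=42)
print(len(df_imdb_val), len(df_imdb_test))
```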

2. Tokenization (Create Dictionary)

  • Imports the classes needed for tokenization from the tokenizers library: Tokenizer, WordPiece, WordPieceTrainer, and Whitespace.
  • Creates a Tokenizer object using the WordPiece model and sets the unknown token to [UNK].
  • Initializes a WordPieceTrainer to train the tokenizer with a vocabulary size of 5000, a minimum frequency of 2, and special tokens.
  • Defines a data_generator function to iterate over the text data for training the tokenizer.
  • Trains the tokenizer using the training data (df_imdb_train['text']) and the trainer.
  • Saves the trained tokenizer to a file named “tokenizer-wp-imdb.json”.
  • Encodes a sample text using the tokenizer and prints the resulting tokens and their IDs.
  • Calculates the number of tokens for each text in the training, testing, and validation sets using the tokenizer and stores it in a new column named “count_of_tokens”.
  • Displays descriptive statistics of the training DataFrame using df.describe().
  • Creates a histogram of the “count_of_tokens” column in the training DataFrame.
  • Filters the training, testing, and validation DataFrames to keep only texts with 512 tokens or fewer.
  • Prints the lengths of the filtered DataFrames (see the sketch after this list).
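
A hedged sketch of the tokenizer step: the special-token list, the generator's chunk size, and the sample sentence are assumptions, while the vocabulary size, minimum frequency, output file name, and the 512-token filter follow the description above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],  # assumed list
)

def data_generator(texts, chunk_size=1000):
    # Yield the corpus in chunks so training does not need one huge list.
    for i in range(0, len(texts), chunk_size):
        yield texts[i : i + chunk_size].tolist()

tokenizer.train_from_iterator(data_generator(df_imdb_train["text"]), trainer=trainer)
tokenizer.save("tokenizer-wp-imdb.json")

# Quick sanity check on a sample review (sentence is illustrative).
encoded = tokenizer.encode("This movie was surprisingly good!")
print(encoded.tokens)
print(encoded.ids)

# Token counts per review, then keep only reviews of at most 512 tokens.
for df in (df_imdb_train, df_imdb_val, df_imdb_test):
    df["count_of_tokens"] = df["text"].apply(lambda t: len(tokenizer.encode(t).ids))

print(df_imdb_train.describe())
df_imdb_train["count_of_tokens"].hist(bins=50)

df_imdb_train = df_imdb_train[df_imdb_train["count_of_tokens"] <= 512]
df_imdb_val = df_imdb_val[df_imdb_val["count_of_tokens"] <= 512]
df_imdb_test = df_imdb_test[df_imdb_test["count_of_tokens"] <= 512]
print(len(df_imdb_train), len(df_imdb_val), len(df_imdb_test))
```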

3. Prepare to Train

  • Enables padding and truncation for the tokenizer, ensuring all sequences have a length of 512.
  • Defines a function encode_text to encode a batch of texts using the tokenizer.
  • Creates a custom TextDataset class to handle the text data and labels.
  • Creates DataLoader instances for the training, validation, and testing sets to handle batching and shuffling of data (a sketch follows this list).
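
A possible shape of this data pipeline: the batch size, the label column name ("label"), and the "[PAD]" token are assumptions consistent with the IMDB dataset, while the fixed length of 512 follows the description.

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Pad/truncate every review to exactly 512 tokens.
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]", length=512)
tokenizer.enable_truncation(max_length=512)

def encode_text(texts):
    # Encode a batch of strings into fixed-length token-ID sequences.
    return [enc.ids for enc in tokenizer.encode_batch(list(texts))]

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.input_ids = encode_text(texts)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.input_ids[idx], dtype=torch.long),
            torch.tensor(self.labels[idx], dtype=torch.long),
        )

train_loader = DataLoader(TextDataset(df_imdb_train["text"], df_imdb_train["label"]),
                          batch_size=32, shuffle=True)
val_loader = DataLoader(TextDataset(df_imdb_val["text"], df_imdb_val["label"]), batch_size=32)
test_loader = DataLoader(TextDataset(df_imdb_test["text"], df_imdb_test["label"]), batch_size=32)
```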

4. Define and Train Model

  • Defines a TextClassifier model using PyTorch, consisting of an embedding layer, an LSTM layer, a fully connected layer, and dropout for regularization.
  • Initializes the model, loss function (CrossEntropyLoss), and optimizer (Adam).
  • Moves the model to the appropriate device (CUDA if available, otherwise CPU).
  • Defines a function calculate_accuracy to evaluate the model’s accuracy on a given dataset.
  • Trains the model for 5 epochs, iterating over the training data in batches and updating the model’s parameters using the optimizer.
  • Calculates and prints the training loss and validation accuracy after each epoch (see the sketch below).
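
A sketch of a model and training loop matching the description; the embedding and hidden dimensions, dropout rate, and learning rate are illustrative assumptions, not the notebook's exact values.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256, num_classes=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)           # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (1, batch, hidden_dim)
        return self.fc(self.dropout(h_n[-1]))  # logits: (batch, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def calculate_accuracy(loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for ids, labels in loader:
            ids, labels = ids.to(device), labels.to(device)
            preds = model(ids).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

for epoch in range(5):
    model.train()
    total_loss = 0.0
    for ids, labels in train_loader:
        ids, labels = ids.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(ids), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch + 1}: loss={total_loss / len(train_loader):.4f}, "
          f"val_acc={calculate_accuracy(val_loader):.4f}")
```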

5. Evaluate Model

  • Selects 10 random samples from the test set for evaluation.
  • Performs inference on the selected samples using the trained model and stores the predicted labels.
  • Prints the tokens, true labels, and predicted labels for each sample.
  • Calculates and prints the test accuracy of the model on the entire test set (a sketch follows below).
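
The evaluation step could look like this; the random seed and the number of tokens printed per sample are assumptions, while the sample size of 10 and the accuracy over the whole test set follow the description.

```python
import torch

# 10 random test reviews (seed is an assumption).
samples = df_imdb_test.sample(n=10, random_state=42)

model.eval()
with torch.no_grad():
    for _, row in samples.iterrows():
        enc = tokenizer.encode(row["text"])                       # padded/truncated to 512
        ids = torch.tensor([enc.ids], dtype=torch.long).to(device)
        pred = model(ids).argmax(dim=1).item()
        print(enc.tokens[:20], "| true:", row["label"], "| predicted:", pred)

print("test accuracy:", calculate_accuracy(test_loader))
```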