Tokenization

M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025


Tokenization is the splitting of text (for example, sentences) into words, word parts (subwords), or characters.

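Types:

  • Character-based
  • Word-based
  • Subword-based

Purpose: To break the data set to be processed into tokens (words, subwords, or characters) in such a way that the machine learning model used for the analysis can look them up in its own vocabulary.

Trade-off: vocabulary size vs efficiency. Character-based tokenization needs only a small vocabulary but produces long token sequences. Word-based tokenization produces short sequences but needs a large vocabulary and struggles with words that are missing from it. Subword-based tokenization sits between the two.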
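
The three granularities can be illustrated with a minimal sketch in plain Python. The function names, the sample sentence, and the toy vocabulary `TOY_VOCAB` are assumptions made up for this illustration. The subword function is a simple greedy longest-match split, not the trained procedure of any particular library (such as BPE or WordPiece).

```python
import re

def char_tokenize(text):
    """Character-based: every character (including spaces) is a token."""
    return list(text)

def word_tokenize(text):
    """Word-based: runs of word characters, plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word against a fixed vocabulary.

    Pieces that are not in the vocabulary fall back to single characters.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical toy vocabulary, chosen by hand for this example.
TOY_VOCAB = {"token", "ization", "un", "break", "able"}

text = "Tokenization is unbreakable."
chars = char_tokenize(text)
words = word_tokenize(text)
subwords = [p for w in words for p in subword_tokenize(w.lower(), TOY_VOCAB)]

print(len(chars), chars[:5])   # 28 ['T', 'o', 'k', 'e', 'n']
print(len(words), words)       # 4 ['Tokenization', 'is', 'unbreakable', '.']
print(len(subwords), subwords) # 8 ['token', 'ization', 'i', 's', 'un', 'break', 'able', '.']
```

The same sentence becomes 28 character tokens, 8 subword tokens, or 4 word tokens. A character tokenizer only needs an entry for each distinct character, while a word tokenizer needs an entry for every word it should recognize, which is the size side of the trade-off.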