Tokenization

M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025

Tokenization splits sentences into smaller units (tokens): words, word parts (subwords), or characters.

Types:

Character-based
Word-based
Subword-based
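
The three granularities can be compared on one sentence. This is a minimal sketch, not a production tokenizer: the regex word split is deliberately naive, and TOY_VOCAB and subword_split are hypothetical names with a hand-written vocabulary, whereas real subword vocabularies are learned from data.

```python
import re

text = "Tokenization unlocks language models"

# Character-based: every character (including spaces) is a token.
char_tokens = list(text)

# Word-based: naive split on runs of word characters, lowercased.
word_tokens = re.findall(r"\w+", text.lower())

# Subword-based: greedy longest-match against a toy, hand-written
# vocabulary (an assumption for illustration only).
TOY_VOCAB = {"token", "ization", "un", "lock", "s", "language", "model"}


def subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


subword_tokens = [p for w in word_tokens for p in subword_split(w, TOY_VOCAB)]

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```

The same sentence yields 36 character tokens, 4 word tokens, and 8 subword tokens, which shows how the choice of unit changes the sequence length.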
Purpose: To split the text being processed into units (words, word parts, or characters) that the machine learning model used for the analysis can look up in its own dictionary (vocabulary).
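Trade-off: vocabulary size vs efficiency (sequence length). Character-based tokenization needs only a small vocabulary but produces long sequences; word-based tokenization produces short sequences but needs a large vocabulary and cannot represent unseen words; subword-based tokenization sits between the two.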
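
The dictionary lookup and the trade-off can be sketched with a tiny corpus. The names build_vocab and encode, the "<unk>" placeholder entry, and the corpus itself are assumptions made for illustration; they are not taken from any particular library.

```python
corpus = ["the cat sat on the mat", "the dog sat on the log"]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Look every token up in the vocabulary, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


word_vocab = build_vocab(s.split() for s in corpus)
char_vocab = build_vocab(list(s) for s in corpus)

# "goat" never occurs in the corpus, but all of its characters do.
sentence = "the goat sat"
word_ids = encode(sentence.split(), word_vocab)
char_ids = encode(list(sentence), char_vocab)

print(word_ids)       # short sequence, unseen word collapses to <unk>
print(len(char_ids))  # longer sequence, every character is known
```

At word level the sentence becomes 3 ids, but the unseen word is lost to the unknown entry. At character level it becomes 12 ids with no unknowns. The character vocabulary is bounded by the alphabet, while the word vocabulary keeps growing as the corpus grows.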