Tokenization
M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025
Splitting text into smaller units (tokens): words, word parts (subwords), or characters.
Types:
- Character-based
- Word-based
- Subword-based
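- Purpose: To split the data set to be processed into words, subwords, or characters so that the machine learning model used for the analysis can look each unit up in its own vocabulary.
- Trade-off: vocabulary size vs. efficiency. Character-based tokenization needs only a small vocabulary but produces long token sequences; word-based tokenization produces short sequences but needs a large vocabulary and cannot represent unseen words; subword-based tokenization sits between the two.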
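
The three types can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the helper names (`char_tokenize`, `word_tokenize`, `subword_tokenize`, `build_vocab`, `encode`), the `<unk>` entry for unknown tokens, the toy sentences, and the hand-picked subword pieces. The subword function uses a simple greedy longest-match lookup, not a trained method such as BPE.

```python
import re

def char_tokenize(text):
    """Character-based: every character (spaces included) is a token."""
    return list(text)

def word_tokenize(text):
    """Word-based: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def subword_tokenize(word, pieces):
    """Subword-based (greedy longest match against a given set of pieces).
    Falls back to a single character when no piece matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])
            i += 1
    return tokens

def build_vocab(tokens):
    """Map each distinct token to an integer id; id 0 is kept for unknown tokens."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Look each token up in the vocabulary, using id 0 for anything missing."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

# Toy "training" text (hypothetical) used to build both vocabularies.
text = "the cat sat on the mat."
chars, words = char_tokenize(text), word_tokenize(text)
char_vocab, word_vocab = build_vocab(chars), build_vocab(words)

# A new sentence containing a word ("hat") that the training text never saw.
new_text = "the hat sat."
word_ids = encode(word_tokenize(new_text), word_vocab)
char_ids = encode(char_tokenize(new_text), char_vocab)

# Hand-picked subword pieces, purely for illustration.
subwords = subword_tokenize("unhappiness", {"un", "happi", "ness"})

print(len(chars), len(words))  # sequence lengths: characters vs. words
print(word_ids)                # contains 0: "hat" is not in the word vocabulary
print(char_ids)                # no 0: every character was already seen
print(subwords)
```

On this toy text the character sequence is about three times longer than the word sequence, and the unseen word "hat" falls back to `<unk>` at the word level while the character level still encodes it fully. The vocabulary sizes of such a tiny example are not representative: on a realistic corpus the character vocabulary stays bounded by the alphabet, whereas the word vocabulary keeps growing.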