The difference between tokens and words
Large language models operate on tokens, not words. A list of words would not be practical: there is no definitive list of all English words, much less of all words in all languages. Tokens correspond roughly to words while being more flexible.
Words are typically turned into tokens using BPE (byte pair encoding). There are multiple implementations of this algorithm, giving different tokenizations. Here I use the tokenizer of the gpt-3.5-turbo model, the tokenizer used in GPT-3.5 and GPT-4.
Hello world!

If we look at the sentence "Hello world!" we see that it turns into three tokens: 9906, 1917, and 0. These correspond to "Hello", " world", and "!".
In this example, each token corresponds to a word or punctuation mark, but there's a little more going on. While it is true that 0 is simply the token for the exclamation mark (we'll explain why in a moment), it's not quite true to say 9906 is the token for "Hello" and 1917 is the token for "world".
Many to one

In fact 1917 is the token for " world". Note the leading space. The token 1917 represents the word "world", not capitalized and not at the beginning of a sentence. At the beginning of a sentence, "World" would be tokenized as 10343. So one word may correspond to several different tokens, depending on how the word is used.
One to many

It's also true that a word may be broken into several tokens. Consider the sentence "Chuck Mangione plays the flugelhorn." This sentence turns into 9 tokens, corresponding to

"Chuck", "Mang", "ione", " plays", " fl", "ug", "el", "horn", "."
So while there is a token for the common name "Chuck", there is no token for the less common name "Mangione". And while there is a single token for " trumpet", there is no token for the less common "flugelhorn".
Characters

The tokenizer will break words down as far as necessary to represent them, down to single letters if need be.
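BPE vocabularies are learned from data, but the fallback behavior can be illustrated with a toy greedy longest-match splitter over a small hypothetical vocabulary. This is only a sketch of the matching idea, not real BPE merging:

```python
# Toy illustration: split a word into the longest pieces found in a
# (hypothetical) vocabulary, falling back to single characters when
# nothing longer matches. Real BPE applies learned merge rules instead,
# but the effect is similar: rare words break into smaller pieces.
VOCAB = {"fl", "ug", "el", "horn", "flug"}

def tokenize(word):
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces

print(tokenize("flugelhorn"))  # ['flug', 'el', 'horn']
print(tokenize("xyz"))         # ['x', 'y', 'z']
```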
Each ASCII character can be represented as a token, as well as many Unicode characters. (There are 100,256 total tokens, but currently 154,998 Unicode characters, so not all Unicode characters can be represented as tokens.)
Update: The next post dives into the details of how Unicode characters are handled.
The first 32 ASCII characters (codes 0 through 31) are non-printable control characters, and ASCII character 32 is a space. So the exclamation point is the first printable, non-space character, with ASCII code 33. The rest of the printable ASCII characters are tokenized as their ASCII value minus 33. So, for example, the letter A, ASCII 65, is tokenized as 65 - 33 = 32.
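The minus-33 offset is easy to verify in plain Python, with no tokenizer needed:

```python
# Printable, non-space ASCII characters start at "!" (code 33), and their
# token ids are their ASCII codes shifted down by 33.
def ascii_token_id(ch):
    code = ord(ch)
    assert 33 <= code <= 126, "expects a printable, non-space ASCII character"
    return code - 33

print(ascii_token_id("!"))  # 0
print(ascii_token_id("A"))  # 32
```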
Tokenizing a dictionary

I ran every line of the american-english word list on my Linux box through the tokenizer, excluding possessives. There are 6,015 words that correspond to a single token, 37,012 that require two tokens, 26,283 that require three tokens, and so on. The maximum was a single word, netzahualcoyotl, that required 8 tokens.
The 6,015 words that correspond to a single token are the most common words in English, and so quite often a token does represent a word. (And maybe a little more, such as whether the word is capitalized.)
The post The difference between tokens and words first appeared on John D. Cook.