Tokenizer
A tokenizer is used to convert a string into a list of tokens. Tokens can either be of type `int` or `str`. To learn more about tokenization, you can read the Hugging Face article "Summary of the tokenizers".
`Tokenizer` is an abstract class. Inherit from this class and define the `_tokenize` method. The `_tokenize` method should take a string as input and output a list of tokens. To use a `Tokenizer`, call the class instance like a function with the input text as the argument.
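For example, here is a minimal sketch of a subclass that produces `str` tokens by splitting on whitespace (the `WhitespaceTokenizer` name is only illustrative and not part of Embedia):

```python
from embedia import Tokenizer


class WhitespaceTokenizer(Tokenizer):
    async def _tokenize(self, text):
        # Split on whitespace and return the pieces as str tokens
        return text.split()
```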
Methods
`_tokenize` (abstract): Implement this method with the tokenization logic. Do not call this method directly. Instead, use the `__call__` method.
- Input: `str`
- Output: `List[Any]`
`__call__`: Internally calls the `_tokenize` method. Use this method by calling the class instance like a function with the input text as the argument (see the example after this list).
- Input: `str`
- Output: `List[int]` or `List[str]`
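Continuing with the illustrative `WhitespaceTokenizer` sketched above, calling the instance like a function runs `_tokenize` internally. Because the call is a coroutine, it is awaited here with `asyncio.run`:

```python
import asyncio

tokenizer = WhitespaceTokenizer()
tokens = asyncio.run(tokenizer('Lorem ipsum dolor sit amet'))
print(tokens)  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']
```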
Usage
A `Tokenizer` instance will be used to count the number of tokens in the system prompt, the input prompt, and the output response of both `LLM` and `ChatLLM` instances.
Different LLMs use different tokenization methods or libraries. You'll have to find out which method or library your LLM uses; you can find that information in the LLM's documentation.
Let's connect the tokenizer that OpenAI uses for `gpt-3.5-turbo`, `text-davinci-003`, and `gpt-4` to Embedia. OpenAI recommends using the `tiktoken` library for tokenization.
Note that the way your tokenizer counts tokens might vary slightly from how a service provider (e.g. OpenAI) counts them; the provider might add a few tokens internally for the service to function properly.
```python
import asyncio

import tiktoken

from embedia import Tokenizer


class OpenAITokenizer(Tokenizer):
    def __init__(self):
        super().__init__()

    async def _tokenize(self, text):
        # Encode the text with the tokenizer used by gpt-3.5-turbo
        return tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text)


if __name__ == '__main__':
    tokenizer = OpenAITokenizer()
    tokens = asyncio.run(
        tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit.'))
    print(tokens)
```
Running the above code will print the following output:
[33883, 27439, 24578, 2503, 28311, 11, 36240, 59024, 31160, 13]
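The number of tokens is simply the length of this list. As a quick sanity check (reusing the `tokens` variable from the script above), `tiktoken` can also decode the tokens back into the original string:

```python
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(tokens))              # 10
print(encoding.decode(tokens))  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.
```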