Tokenizer
A tokenizer is used to convert a string into a list of tokens. Tokens can either be of type `int` or `str`. To learn more about tokenization, you can read the Hugging Face article "Summary of the tokenizers".
`Tokenizer` is an abstract class. Inherit from this class and define the `_tokenize` method. The `_tokenize` method should take a string as input and output a list of tokens. To use a `Tokenizer`, call the class instance like a function with the input text as the argument.
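For example, here is a minimal sketch of a subclass that produces `str` tokens by splitting on whitespace (the `WhitespaceTokenizer` name is only illustrative and not part of Embedia):

```python
from embedia import Tokenizer


class WhitespaceTokenizer(Tokenizer):
    async def _tokenize(self, text):
        # Split on whitespace and return the pieces as str tokens
        return text.split()
```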
Methods
`_tokenize` (abstract): Implement this method with the tokenization logic. Do not call this method directly. Instead, use the `__call__` method.
- Input: `str`
- Output: `List[Any]`
`__call__`: Internally calls the `_tokenize` method. Use this method by calling the class instance like a function with the input text as the argument (see the example after this list).
- Input: `str`
- Output: `List[int]` or `List[str]`
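Continuing with the illustrative `WhitespaceTokenizer` sketched above, calling the instance like a function runs `_tokenize` internally. Because the call is a coroutine, it is awaited here with `asyncio.run`:

```python
import asyncio

tokenizer = WhitespaceTokenizer()
tokens = asyncio.run(tokenizer('Lorem ipsum dolor sit amet'))
print(tokens)  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']
```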
Usage
A `Tokenizer` instance will be used to count the number of tokens in the system prompt, the input prompt, and the output response of both `LLM` and `ChatLLM` instances.
Different LLMs use different tokenization methods or libraries. You'll have to find out which method or library your LLM uses; you can find that information in the LLM's documentation.
Let's connect the tokenizer that OpenAI uses for `gpt-3.5-turbo`, `text-davinci-003`, and `gpt-4` to Embedia. OpenAI recommends using the `tiktoken` library for tokenization.
Note that the way your tokenizer counts tokens might vary slightly from how a service provider (e.g. OpenAI) counts them; the provider might add a few tokens internally for the service to function properly.
```python
import asyncio

import tiktoken

from embedia import Tokenizer


class OpenAITokenizer(Tokenizer):
    def __init__(self):
        super().__init__()

    async def _tokenize(self, text):
        # Encode the text with the tokenizer used by gpt-3.5-turbo
        return tiktoken.encoding_for_model("gpt-3.5-turbo").encode(text)


if __name__ == '__main__':
    tokenizer = OpenAITokenizer()
    tokens = asyncio.run(
        tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit.'))
    print(tokens)
```
Running the above code will print the following output:
[33883, 27439, 24578, 2503, 28311, 11, 36240, 59024, 31160, 13]
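The number of tokens is simply the length of this list. As a quick sanity check (reusing the `tokens` variable from the script above), `tiktoken` can also decode the tokens back into the original string:

```python
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(tokens))              # 10
print(encoding.decode(tokens))  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.
```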