EmbeddingModel
To convert text into an embedding, you need an embedding model. Embeddings represent text as a vector of numbers. You might be familiar with a 3-dimensional vector (across x, y, z), which is a list of 3 numbers. An embedding might have thousands of dimensions (in an abstract space), meaning it is a list of thousands of numbers.
This kind of multidimensional vector is how deep learning models see any information they process. It turns out that such vector representations are also very useful for tasks such as semantic search.
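To build intuition for why embeddings help with semantic search, here is a toy sketch (the vectors below are made up for illustration and do not come from any real embedding model): texts with similar meaning map to vectors pointing in similar directions, and cosine similarity measures that closeness.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for three pieces of text
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

print(cosine_similarity(cat, kitten))  # high: similar meaning
print(cosine_similarity(cat, car))     # lower: different meaning
```

Real embeddings work the same way, just with thousands of dimensions instead of three.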
`EmbeddingModel` is an abstract class. Inherit from this class and define the `_embed` method. The `_embed` method should take the input text as a string and return its embedding. To use an `EmbeddingModel`, call the class instance like a function with the input text as the argument.
Methods
`_embed` (abstract): Implement this method to convert a text input into an embedding. Do not call this method directly; instead, use the `__call__` method.
- Input: `Union[List[Any], str]`
- Output: `List[Any]`

`__call__`: Internally calls the `_embed` method. Use this method by calling the class instance like a function with the input text as the argument.
- Input: `Union[List[Any], str]`
- Output: `List[Any]`
- Publishes an `EmbeddingStart` event before calling the `_embed` method and an `EmbeddingEnd` event after the `_embed` method returns.
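The relationship between `__call__` and `_embed` can be sketched with a small stand-in class. This is an illustration of the pattern only, not Embedia's actual implementation (in particular, the real class publishes `EmbeddingStart`/`EmbeddingEnd` events, which are reduced to comments here, and `CharCountEmbedding` is a made-up toy model):

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, List, Union

class SketchEmbeddingModel(ABC):
    """Illustrative stand-in for Embedia's EmbeddingModel (not the real class)."""

    @abstractmethod
    async def _embed(self, input: Union[List[Any], str]) -> List[Any]:
        ...

    async def __call__(self, input: Union[List[Any], str]) -> List[Any]:
        # The real class publishes an EmbeddingStart event here...
        embedding = await self._embed(input)
        # ...and an EmbeddingEnd event here
        return embedding

class CharCountEmbedding(SketchEmbeddingModel):
    """Toy model: 'embeds' text as [length, vowel count]."""
    async def _embed(self, input: str) -> List[int]:
        return [len(input), sum(c in 'aeiou' for c in input.lower())]

print(asyncio.run(CharCountEmbedding()('hello')))  # [5, 2]
```

Note that user code calls the instance (`CharCountEmbedding()('hello')`), never `_embed` directly.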
Basic Usage
An `EmbeddingModel` can be used in conjunction with a `VectorDB` (Learn more about: VectorDB) to build a semantic search index in your application. A semantic search index combined with an LLM is the basis of a Retrieval Augmented Generation (RAG) framework.
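Conceptually, a semantic search index stores an embedding for each document and answers a query by finding the stored vector closest to the query's embedding. The sketch below shows that idea end to end with a made-up `embed` function (a character-frequency vector, standing in for a real embedding model) and brute-force search (standing in for a real `VectorDB`, which also handles storage and fast approximate lookup):

```python
import math

def embed(text: str) -> list:
    # Stand-in "embedding": character-frequency vector over a-z
    vec = [0.0] * 26
    for c in text.lower():
        if c.isalpha():
            vec[ord(c) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

docs = ["the cat sat on the mat",
        "stock prices fell sharply",
        "a kitten on a rug"]
index = [(doc, embed(doc)) for doc in docs]  # "indexing" step

query = "cat on a mat"
best = max(index, key=lambda pair: cosine(embed(query), pair[1]))
print(best[0])
```

In a RAG pipeline, the retrieved documents would then be passed to an LLM as context for answering the query.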
Let's connect OpenAI's text embedding model to Embedia.
import asyncio
import os

import openai
from embedia import EmbeddingModel
from tenacity import (
    retry,
    retry_if_not_exception_type,
    stop_after_attempt,
    wait_random_exponential,
)


class OpenAIEmbedding(EmbeddingModel):
    def __init__(self):
        super().__init__()
        openai.api_key = os.environ['OPENAI_API_KEY']

    @retry(wait=wait_random_exponential(min=1, max=20),
           stop=stop_after_attempt(6),
           retry=retry_if_not_exception_type(openai.InvalidRequestError))
    async def _embed(self, input: str):
        result = await openai.Embedding.acreate(input=input,
                                                model='text-embedding-ada-002')
        return result["data"][0]["embedding"]


if __name__ == '__main__':
    embedding_model = OpenAIEmbedding()
    embedding = asyncio.run(
        embedding_model(
            'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua'
        ))
    print(len(embedding))
Running the above code should print output similar to the following (timestamps and ids will differ):
[time: 2023-10-01T12:15:20.338184+00:00] [id: 139646081867760] [event: Embedding Start]
Input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore ...
[time: 2023-10-01T12:15:21.240092+00:00] [id: 139646081867760] [event: Embedding End]
Embedding:
[-0.007770706433802843, -0.017298607155680656, 0.006062322296202183, -0.02754240296781063, -0.020682834088802338]...
1536