VectorDB

A vector database stores vectors and lets you query them by similarity. Some vector databases can also create embeddings from text before storing them.

VectorDB is an abstract class. To use it, inherit from it and define the _insert and _get_similar methods. The _insert method takes a vector, its text, and metadata, and stores them in the database. The _get_similar method takes a vector or text and returns the most similar vectors or texts. Do not call these two methods directly: callers use the public insert and get_similar methods instead.

Methods

_insert (abstract): Implement this method to insert a vector, text, and metadata into the database. Do not call this method directly. Instead use the insert method.

Input: VectorDBInsert (Learn more about: VectorDBInsert)
Output: None

_get_similar (abstract): Implement this method to get the most similar vectors / texts to the input vector / text. Do not call this method directly. Instead use the get_similar method.

Input: VectorDBGetSimilar (Learn more about: VectorDBGetSimilar)
Output: List[Any]
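
The relationship between the public methods and the abstract hooks can be illustrated without any database at all. The following is a minimal, self-contained sketch of that pattern, not Embedia's actual implementation: InsertRequest, SimilarRequest, SketchVectorDB, and InMemoryDB are hypothetical names invented for this example, and the real VectorDB base class may do more inside insert and get_similar than simply delegating.

```python
import asyncio
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class InsertRequest:
    """Hypothetical stand-in for VectorDBInsert (not the Embedia class)."""
    id: str
    text: str
    embedding: List[float]
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SimilarRequest:
    """Hypothetical stand-in for VectorDBGetSimilar (not the Embedia class)."""
    embedding: List[float]
    n_results: int = 5


class SketchVectorDB(ABC):
    """Stand-in base class: public methods delegate to the abstract hooks."""

    async def insert(self, data: InsertRequest) -> None:
        await self._insert(data)

    async def get_similar(self, data: SimilarRequest) -> List[Any]:
        return await self._get_similar(data)

    @abstractmethod
    async def _insert(self, data: InsertRequest) -> None:
        ...

    @abstractmethod
    async def _get_similar(self, data: SimilarRequest) -> List[Any]:
        ...


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryDB(SketchVectorDB):
    """Keeps rows in a dict and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._rows: Dict[str, InsertRequest] = {}

    async def _insert(self, data: InsertRequest) -> None:
        self._rows[data.id] = data

    async def _get_similar(self, data: SimilarRequest) -> List[Any]:
        scored = [(cosine_similarity(data.embedding, row.embedding), row.text)
                  for row in self._rows.values()]
        # Highest similarity first, truncated to the requested count.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:data.n_results]


async def demo() -> List[Any]:
    db = InMemoryDB()
    await db.insert(InsertRequest(id="a", text="Hello World", embedding=[1.0, 0.0]))
    await db.insert(InsertRequest(id="b", text="Goodbye", embedding=[0.0, 1.0]))
    return await db.get_similar(SimilarRequest(embedding=[0.9, 0.1], n_results=1))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Callers only ever touch insert and get_similar, so a subclass can swap the storage backend without changing calling code.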

Basic Usage

A vector database is essential for any vector search engine. Let's connect a local (embedded) Weaviate instance to Embedia. The example returns each match as a tuple of a similarity score and a TextDoc (Learn more about: TextDoc).

import asyncio
import json
import shutil
 
import weaviate
from embedia import TextDoc, VectorDB, VectorDBGetSimilar, VectorDBInsert
from weaviate.embedded import EmbeddedOptions
 
 
class WeaviateDB(VectorDB):
 
    def __init__(self):
        super().__init__()
        self.client = weaviate.Client(embedded_options=EmbeddedOptions(
            persistence_data_path="./temp/weaviate"))
        # Create the "Document" class only if the schema has no classes yet.
        if not self.client.schema.get()["classes"]:
            self.client.schema.create_class({
                "class": "Document",
                "properties": [
                    {
                        "name": "contents",
                        "dataType": ["text"],
                    },
                    {
                        "name": "meta",
                        "dataType": ["text"],
                    },
                ],
            })
 
    async def _insert(self, data: VectorDBInsert):
        if not data.meta:
            data.meta = {}
        return self.client.data_object.create(
            data_object={
                "contents": data.text,
                "meta": json.dumps(data.meta),
            },
            class_name="Document",
            uuid=data.id,
            vector=data.embedding,
        )
 
    async def _get_similar(self, data: VectorDBGetSimilar):
        response = (
            self.client.query.get("Document", ["contents", "meta"])
            .with_near_vector({"vector": data.embedding})
            .with_limit(data.n_results)
            .with_additional(["distance", "id"])
            .do()
        )
        docs = response["data"]["Get"]["Document"]
 
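        result = []
        for doc in docs:
            # The metadata was stored as a JSON string, so decode it again.
            meta = json.loads(doc["meta"])
            result.append((
                # Convert the distance reported by Weaviate into a similarity.
                1 - doc["_additional"]["distance"],
                TextDoc(id=doc["_additional"]["id"],
                        contents=doc["contents"],
                        meta=meta),
            ))
        return result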
 
 
if __name__ == "__main__":
    shutil.rmtree("temp", ignore_errors=True)
    db = WeaviateDB()
    asyncio.run(
        db.insert(
            VectorDBInsert(id="dc11d2bb-2b46-4516-8388-62182bf62c77",
                           text="Hello World",
                           meta={},
                           embedding=[1, 2, 3])))
    docs = asyncio.run(
        db.get_similar(
            VectorDBGetSimilar(embedding=[1.1, 2.2, 3.3], n_results=5)))
    print('>>>>>>>>>', docs)

Running the above code produces the following output (the JSON log lines are generated by Weaviate and can be ignored):

Started /home/runner/.cache/weaviate-embedded: process ID 4369
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-10-01T13:05:24Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-10-01T13:05:24Z"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-10-01T13:05:24Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-10-01T13:05:24Z"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-10-01T13:05:24Z"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"document_Ypeu69WAFfFV","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-10-01T13:05:25Z","took":104986}
>>>>>>>>> [(1.00000011920929, TextDoc(contents='Hello World', meta={}, id='dc11d2bb-2b46-4516-8388-62182bf62c77', created_at='2023-10-01 13:05:25.340397+00:00'))]
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-10-01T13:05:25Z"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-10-01T13:05:25Z"}
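
The similarity score of roughly 1.0 in the output can be checked by hand. The sketch below assumes the Document class uses cosine distance, which is Weaviate's default when the schema sets no other metric, as is the case here. The query vector [1.1, 2.2, 3.3] is the stored vector [1, 2, 3] scaled by 1.1, so the two point in the same direction and their cosine similarity is 1. The helper cosine_similarity is written for this illustration and is not part of Embedia or Weaviate.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


stored = [1, 2, 3]          # embedding passed to insert
query = [1.1, 2.2, 3.3]     # embedding passed to get_similar

score = cosine_similarity(stored, query)
distance = 1 - score        # cosine distance; the example reports 1 - distance

# Parallel vectors: similarity is 1 and distance is 0, up to rounding error.
print(math.isclose(score, 1.0, rel_tol=1e-6), abs(distance) < 1e-6)
```

The logged value 1.00000011920929 exceeds 1 by about 2^-23, which is consistent with rounding in single-precision arithmetic rather than a genuinely larger similarity.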

