VectorDB
A vector database stores vectors and allows you to query them based on similarity. It might also be able to create embeddings from text before storing them.
VectorDB is an abstract class. Inherit from this class and define the _insert and _get_similar methods. The _insert method should take in the vector, text, and metadata and store them in the database. The _get_similar method should take in a vector or text and return the most similar vectors or texts. To call these functions, use the public insert and get_similar methods.
Methods
_insert (abstract): Implement this method to insert a vector, text, and metadata into the database. Do not call this method directly. Instead, use the insert method.
- Input: VectorDBInsert (Learn more about: VectorDBInsert)
- Output: None
_get_similar (abstract): Implement this method to get the most similar vectors / texts to the input vector / text. Do not call this method directly. Instead, use the get_similar method.
- Input: VectorDBGetSimilar (Learn more about: VectorDBGetSimilar)
- Output: List[Any]
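As a rough illustration of the pattern described above (public methods that delegate to the abstract _insert and _get_similar hooks), the sketch below uses stand-in classes rather than Embedia itself. The names SketchVectorDB, InsertRequest, SimilarRequest, and InMemoryDB are hypothetical, the request fields mirror only those used on this page, and the real VectorDB base class may do more in its public methods.

```python
import asyncio
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


# Hypothetical stand-ins for VectorDBInsert / VectorDBGetSimilar,
# limited to the fields used on this page.
@dataclass
class InsertRequest:
    id: str
    text: str
    embedding: List[float]
    meta: Optional[dict] = None


@dataclass
class SimilarRequest:
    embedding: List[float]
    n_results: int = 5


class SketchVectorDB(ABC):
    """Stand-in base class: public methods delegate to the abstract hooks."""

    @abstractmethod
    async def _insert(self, data: InsertRequest) -> None:
        ...

    @abstractmethod
    async def _get_similar(self, data: SimilarRequest) -> List[Any]:
        ...

    async def insert(self, data: InsertRequest) -> None:
        # Assumption: a real base class could validate or embed text here.
        await self._insert(data)

    async def get_similar(self, data: SimilarRequest) -> List[Any]:
        return await self._get_similar(data)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryDB(SketchVectorDB):
    """Brute-force store: keeps rows in a list and scans them on each query."""

    def __init__(self) -> None:
        self._rows: List[InsertRequest] = []

    async def _insert(self, data: InsertRequest) -> None:
        self._rows.append(data)

    async def _get_similar(self, data: SimilarRequest) -> List[Tuple[float, str]]:
        scored = [(cosine_similarity(data.embedding, row.embedding), row.text)
                  for row in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:data.n_results]


async def demo() -> List[Tuple[float, str]]:
    db = InMemoryDB()
    await db.insert(InsertRequest(id="a", text="Hello World", embedding=[1, 2, 3]))
    await db.insert(InsertRequest(id="b", text="Goodbye", embedding=[-3, 0, 1]))
    return await db.get_similar(
        SimilarRequest(embedding=[1.1, 2.2, 3.3], n_results=1))


results = asyncio.run(demo())
print(results)
```

Callers only ever touch insert and get_similar, so a subclass can swap the storage backend without changing any calling code.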
Basic Usage
A vector database is essential for any vector search engine. Let's connect
a local (embedded) Weaviate instance to Embedia. The similar vectors are
returned from the database as TextDoc objects, each paired with a similarity
score (Learn more about TextDoc).
import asyncio
import json
import shutil

import weaviate
from embedia import TextDoc, VectorDB, VectorDBGetSimilar, VectorDBInsert
from weaviate.embedded import EmbeddedOptions


class WeaviateDB(VectorDB):

    def __init__(self):
        super().__init__()
        # Start an embedded Weaviate instance that persists to ./temp/weaviate
        self.client = weaviate.Client(embedded_options=EmbeddedOptions(
            persistence_data_path="./temp/weaviate"))
        # Create the "Document" class only if the schema has no classes yet
        if not self.client.schema.get()["classes"]:
            self.client.schema.create_class({
                "class": "Document",
                "properties": [
                    {
                        "name": "contents",
                        "dataType": ["text"],
                    },
                    {
                        "name": "meta",
                        "dataType": ["text"],
                    },
                ],
            })

    async def _insert(self, data: VectorDBInsert):
        if not data.meta:
            data.meta = {}
        # Metadata is stored as a JSON string in the "meta" text property
        return self.client.data_object.create(
            data_object={
                "contents": data.text,
                "meta": json.dumps(data.meta),
            },
            class_name="Document",
            uuid=data.id,
            vector=data.embedding,
        )

    async def _get_similar(self, data: VectorDBGetSimilar):
        response = (
            self.client.query.get("Document", ["contents", "meta"])
            .with_near_vector({"vector": data.embedding})
            .with_limit(data.n_results)
            .with_additional(["distance", "id"])
            .do())
        docs = response["data"]["Get"]["Document"]
        result = []
        for doc in docs:
            meta = json.loads(doc["meta"])
            # Convert the returned distance into a similarity score
            result.append((
                1 - doc["_additional"]["distance"],
                TextDoc(id=doc["_additional"]["id"],
                        contents=doc["contents"],
                        meta=meta),
            ))
        return result


if __name__ == "__main__":
    # Remove data left over from previous runs
    shutil.rmtree("temp", ignore_errors=True)
    db = WeaviateDB()
    asyncio.run(
        db.insert(
            VectorDBInsert(id="dc11d2bb-2b46-4516-8388-62182bf62c77",
                           text="Hello World",
                           meta={},
                           embedding=[1, 2, 3])))
    docs = asyncio.run(
        db.get_similar(
            VectorDBGetSimilar(embedding=[1.1, 2.2, 3.3], n_results=5)))
    print('>>>>>>>>>', docs)
Running the above code prints the following (ignore the logs generated by Weaviate):
Started /home/runner/.cache/weaviate-embedded: process ID 4369
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-10-01T13:05:24Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-10-01T13:05:24Z"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2023-10-01T13:05:24Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2023-10-01T13:05:24Z"}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-10-01T13:05:24Z"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"document_Ypeu69WAFfFV","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-10-01T13:05:25Z","took":104986}
>>>>>>>>> [(1.00000011920929, TextDoc(contents='Hello World', meta={}, id='dc11d2bb-2b46-4516-8388-62182bf62c77', created_at='2023-10-01 13:05:25.340397+00:00'))]
{"action":"restapi_management","level":"info","msg":"Shutting down... ","time":"2023-10-01T13:05:25Z"}
{"action":"restapi_management","level":"info","msg":"Stopped serving weaviate at http://127.0.0.1:6666","time":"2023-10-01T13:05:25Z"}
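The score of roughly 1.0 in the output above can be sanity-checked by hand. The sketch below assumes the Document class uses cosine distance (Weaviate's default metric when none is configured) and recomputes 1 - distance in plain Python. The query [1.1, 2.2, 3.3] is the stored vector [1, 2, 3] scaled by 1.1, so the two point in the same direction and the similarity should be 1 up to rounding. The small overshoot past 1 in the logged value is consistent with lower-precision floating-point arithmetic inside the database, though that is an inference rather than something the output states.

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); depends only on direction, not magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


stored = [1, 2, 3]
query = [1.1, 2.2, 3.3]

# Cosine distance is defined as 1 - cosine similarity,
# so the example's "1 - distance" recovers the similarity.
distance = 1 - cosine_similarity(stored, query)
score = 1 - distance
print(score)
```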