basic_rag

A simple RAG database

A very simple RAG database that combines a faiss index and a SQLite database to store chunks of text with their embeddings. Embeddings are computed using the Mistral API.

It is designed for storing a relatively small number of chunks in an on-disk store.

class basic_rag.RAGDatabase(db_path: Path, index_path: Path, rate_limit: RateLimiter | float = 1.1, model='mistral-embed', max_n_tokens=16384)[source]

Simple RAG database, implemented as a SQLite database with a single table (columns: id, text_chunk, embedding, file_path, start_line, end_line, file_sha) together with a faiss index.
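The SQLite side of this layout can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the column names are taken from the description above, while the table name `chunks`, the column types, the float32 blob encoding of the embedding, and the helper names (`open_store`, `insert_chunk`, `pack_embedding`, `unpack_embedding`) are all assumptions.

```python
import sqlite3
from array import array

# Hypothetical schema: column names follow the description above;
# the table name and the column types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    text_chunk TEXT NOT NULL,
    embedding BLOB,
    file_path TEXT,
    start_line INTEGER,
    end_line INTEGER,
    file_sha TEXT
)
"""


def pack_embedding(values):
    """Serialize a list of floats as raw float32 bytes (assumed encoding)."""
    return array("f", values).tobytes()


def unpack_embedding(blob):
    """Inverse of pack_embedding."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def open_store(path=":memory:"):
    """Open (or create) the SQLite database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_chunk(conn, text, embedding, file_path, start_line, end_line, file_sha):
    """Insert one chunk row and return its id."""
    cur = conn.execute(
        "INSERT INTO chunks "
        "(text_chunk, embedding, file_path, start_line, end_line, file_sha) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (text, pack_embedding(embedding), file_path, start_line, end_line, file_sha),
    )
    return cur.lastrowid
```

In a design like this, the row id is a natural key for the vector index as well, so a nearest-neighbor hit can be mapped back to its text with a single lookup.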

I ended up re-coding this because I was not able to find a RAG database that was both simple enough (no server needed, no huge framework) and flexible enough.

insert_db(chunk: TextChunk, *, id=None, embedding, do_commit=True, add_to_index=False)[source]: Insert a text chunk into the SQLite database and, optionally (see add_to_index), into the index

static get_chunks(file, *, chunk_size=25, overlap=5, filename, hash=None)[source]

Cut a file into chunks

Parameters:

file – a Path or bytes object
chunk_size – the size of the chunks
overlap – the overlap between the chunks
filename – the filename
hash – the hash of the file (optional)
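To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of overlapping chunking. It assumes that both values are counted in lines (suggested by the start_line and end_line columns, but not stated in the documentation) and that line numbers are 1-based and inclusive. The function `chunk_lines` is hypothetical and is not the library's `get_chunks`.

```python
def chunk_lines(lines, chunk_size=25, overlap=5):
    """Split a list of lines into overlapping chunks.

    Returns a list of (start_line, end_line, text) tuples, where the line
    numbers are 1-based and inclusive (an assumption of this sketch).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each chunk starts `step` lines after the previous one
    chunks = []
    for start in range(0, len(lines), step):
        end = min(start + chunk_size, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last chunk already reaches the end of the file
    return chunks
```

With the default values, consecutive chunks start 20 lines apart and share 5 lines, so a passage that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.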

classmethod get_all_chunks(files: Sequence[Path | bytes], *, chunk_size=25, overlap=5, file_paths: Sequence[str] | None = None, file_shas_to_skip=None)[source]

Cut a list of files into chunks

Parameters:

files – the files
chunk_size – the size of the chunks
overlap – the overlap between the chunks
file_paths – the filenames (optional; if not provided and the files are Path objects, the paths are used as filenames)
file_shas_to_skip – the file hashes to skip

generate_index(files: Sequence[Path | bytes], *, api_key, chunk_size=25, overlap=5, file_paths: Sequence[str] | None = None, file_shas_to_skip=None)[source]: Generate the index from a list of files

commit()[source]

Write changes to disk

Does a database commit and writes the index to disk.

get_chunk_by_id(id)[source]: Get a chunk from database by its id

query(query, n_results=5, *, api_key)[source]: Do a k-NN (k-nearest-neighbors) search on the index
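For readers unfamiliar with k-NN search, the idea can be shown with an exhaustive search in pure Python. This is a stand-in for illustration only: the real class delegates the search to a faiss index, and neither the index type nor the distance metric is stated here, so the Euclidean distance and the function name `knn` are assumptions.

```python
import math


def knn(query_vec, vectors, n_results=5):
    """Return the n_results entries of `vectors` closest to `query_vec`.

    `vectors` maps a chunk id to its embedding. The result is a list of
    (chunk_id, distance) pairs sorted by increasing Euclidean distance.
    """
    scored = sorted(
        (math.dist(query_vec, vec), chunk_id) for chunk_id, vec in vectors.items()
    )
    return [(chunk_id, dist) for dist, chunk_id in scored[:n_results]]
```

In the actual workflow the query string would first be embedded with the same model as the stored chunks, and the returned ids would then be resolved to text, for example through get_chunk_by_id.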

update_index(files: Sequence[Path | bytes], *, api_key, chunk_size=25, overlap=5, file_paths: Sequence[str] | None = None)[source]

Update the index from a list of files.

Like generate_index, but first checks which files have changed and only updates those.
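The change check can be sketched by comparing content hashes against those already stored (the file_sha column). This is a hedged illustration under stated assumptions: the documentation does not say which hash function is used, so SHA-256 is an assumption, and `file_sha` and `split_changed` are hypothetical helpers rather than part of the module.

```python
import hashlib


def file_sha(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def split_changed(files, known_shas):
    """Separate files that need re-indexing from files that can be skipped.

    `files` maps a file path to its current content as bytes;
    `known_shas` maps a file path to the hash stored at the last indexing.
    New files (no stored hash) count as changed.
    """
    changed, unchanged = [], []
    for path, data in files.items():
        if known_shas.get(path) == file_sha(data):
            unchanged.append(path)
        else:
            changed.append(path)
    return changed, unchanged
```

Only the paths in `changed` would then need to be re-chunked and re-embedded, which keeps the number of embedding API calls proportional to what was actually modified.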

Modules

`basic_rag`	Basic RAG database
`utils`