Hey, It Works!

Tech Blog by Daniel Rosehill

Personal RAG Data Pipeline Implementation: Github To OpenWebUI (Chroma DB)

I've been working on a project recently that I wanted to share: a data pipeline that automatically syncs markdown files from a GitHub repo to OpenWebUI's Knowledge Store.

This is part of my effort to create a personal context data store for improving inference with cloud-based LLMs.

The idea came from the observation that many tools focus on extracting context and creating memory stores from ongoing conversations. This project is an experiment in the opposite direction: deliberately creating context data and injecting it into conversations using Retrieval-Augmented Generation (RAG).

The code for this, along with other related projects (including assistant configurations that "interview" users to generate context data!), is available on my GitHub profile.

Manually curating and annotating details can be time-consuming. This pipeline is designed as one component of a larger system, efficiently loading data into the vector store. The broader system includes intelligent agents configured to extract structured data, which leads to the next point...

Why Use a Pipeline? The Limitations of Manual Uploads

OpenWebUI's Knowledge Store has a user-friendly interface and all the basics needed for RAG. However, uploading markdown files one at a time through the web interface becomes impractical when dealing with a large or constantly growing number of files. It might work for quick tests, but it's a limitation when building a substantial knowledge base.

The pipeline provides a more resilient solution, allowing for incremental syncs. I can treat my context repository more like a database or code repository. This allows me to manage changes with Git, create reproducible builds, and automate tests. It also allows me to control my infrastructure and integrate directly with OpenWebUI.

The main benefit is the ability to manage your personal knowledge and apply version control principles to it, treating your knowledge as code.

OpenWebUI includes ChromaDB out-of-the-box, hosted locally. It can also connect to external vector stores like Qdrant and Milvus.

The Starting Point: OpenWebUI API

The first step was to examine the OpenWebUI API documentation and identify the correct methods for the Knowledge Store. I needed the specific endpoint, the expected data format, and other essential details for integration.

For initial testing, I created a dedicated knowledge store called "testing".

After creating the store, you'll need its UUID, which the upload script uses to specify the destination for the data.
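As a sketch of that lookup step: the snippet below fetches the list of knowledge stores and finds the UUID by name. It assumes a bearer-token API key and the `/api/v1/knowledge/list` endpoint based on my reading of the OpenWebUI API docs; `find_knowledge_id` and `match_store` are my own helper names, and the host and token in the comment are placeholders.

```python
import json
import urllib.request


def match_store(stores: list, name: str):
    """Return the id of the first knowledge store whose name matches, else None."""
    for store in stores:
        if store.get("name") == name:
            return store.get("id")
    return None


def find_knowledge_id(base_url: str, token: str, name: str):
    """Fetch all knowledge stores from OpenWebUI and look one up by name."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/knowledge/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        stores = json.load(resp)
    return match_store(stores, name)


# Example (hypothetical host and token):
# kb_id = find_knowledge_id("http://localhost:3000", "sk-...", "testing")
```

Once you have the UUID, it can live in an environment variable or config file so the upload script never hard-codes it.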

Context Repository: Simplicity First

My original goal involved building a "knowledge graph" for personal information, which required a data store.

Initially, the repository was a simple collection of markdown files organized organically into folders. The Python script I wrote uploads these files individually to the knowledge store, because the server-side implementation doesn't handle hierarchical structures. For now, keeping the implementation simple is key.
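A minimal sketch of that per-file upload, assuming the two-step flow I found in the OpenWebUI API docs (POST the file to `/api/v1/files/`, then attach it to the store via `/api/v1/knowledge/{id}/file/add`); `flatten_name` is my own helper for collapsing the folder hierarchy into unique filenames, and endpoint paths may differ between OpenWebUI versions:

```python
from pathlib import Path


def flatten_name(repo_root: Path, md_path: Path) -> str:
    """Flatten a nested repo path into one unique filename, since the
    knowledge store has no notion of folder hierarchy."""
    return md_path.relative_to(repo_root).as_posix().replace("/", "__")


def upload_file(base_url: str, token: str, knowledge_id: str,
                repo_root: Path, md_path: Path) -> None:
    """Upload one markdown file, then attach it to the knowledge store."""
    import requests  # third-party; deferred so the helper above stays stdlib-only

    headers = {"Authorization": f"Bearer {token}"}
    with md_path.open("rb") as fh:
        resp = requests.post(
            f"{base_url}/api/v1/files/",
            headers=headers,
            files={"file": (flatten_name(repo_root, md_path), fh, "text/markdown")},
            timeout=60,
        )
    resp.raise_for_status()
    file_id = resp.json()["id"]

    resp = requests.post(
        f"{base_url}/api/v1/knowledge/{knowledge_id}/file/add",
        headers=headers,
        json={"file_id": file_id},
        timeout=60,
    )
    resp.raise_for_status()
```

Flattening `notes/bio.md` to `notes__bio.md` keeps filenames unique even when the same basename appears in several folders.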

Building the Data Pipeline

To be frank, I used Anthropic's Claude Sonnet model, providing it with the OpenWebUI API documentation via Cline, to guide the process.

The pipeline performs two key actions:

  • Incremental Syncing: Uses a JSON file to track file changes.

  • Selective Uploads: Only uploads new or modified files on subsequent runs, which is more efficient than manual management.

This JSON-based change tracking, while simple, is important for scaling the amount of context data.
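The change-tracking logic can be sketched as follows: hash each markdown file, compare against a JSON manifest from the previous run, and return only the files that are new or modified. This is a minimal stand-in for the pipeline's actual tracking code; the manifest filename and `changed_files` helper are my own illustrative choices.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Content hash used to detect modifications between runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(repo_root: Path, manifest_path: Path) -> list:
    """Compare current hashes against the stored manifest; return only
    new or modified markdown files, and update the manifest in place."""
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {}

    to_upload = []
    for md in sorted(repo_root.rglob("*.md")):
        key = str(md.relative_to(repo_root))
        digest = file_sha256(md)
        if manifest.get(key) != digest:  # new file or content changed
            to_upload.append(md)
            manifest[key] = digest

    manifest_path.write_text(json.dumps(manifest, indent=2))
    return to_upload
```

On the first run everything is uploaded; on subsequent runs only the delta goes over the wire, which is what makes the repository practical to grow.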

Implementation and Verification

The implementation process went well. I added logging to the upload script to monitor its progress as it processed my context repository. Each file was uploaded to the "testing" knowledge store, and a quick check in the UI confirmed the successful transfer.

Testing the Knowledge: Was it Effective?

To verify the knowledge's usability, I created a test assistant designed to prioritize the use of ingested knowledge:

  • System prompt: "You are an assistant that tries strongly to use knowledge before answering a prompt. If suitable knowledge exists, use only this knowledge for answering the prompt."

  • Knowledge: the "testing" knowledge base created for this experiment

I then gave the agent a prompt that could only be answered using the uploaded context data.

The agent initiated a RAG request and provided the correct answer.
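The same test can be run against the API rather than the UI. The sketch below builds a chat completion request with the knowledge collection attached; the `files` field with `"type": "collection"` follows my reading of OpenWebUI's RAG API docs and may change between versions, and `rag_payload` / `ask_with_knowledge` are my own helper names.

```python
import json
import urllib.request


def rag_payload(model: str, knowledge_id: str, question: str) -> dict:
    """Chat completion request that attaches a knowledge collection
    so the server performs retrieval before answering."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "files": [{"type": "collection", "id": knowledge_id}],
    }


def ask_with_knowledge(base_url: str, token: str, payload: dict) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Asking a question answerable only from the uploaded context is a quick end-to-end check that retrieval, not parametric memory, produced the answer.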

Ideally, the model should also avoid using its "internal parametric knowledge" when relevant knowledge is available.

Separating Front End User Interfaces from Knowledge

RAG performance on OpenWebUI is currently variable but improving. There are trade-offs to consider.

The key benefit of this pipeline approach is the ability to decouple the context store from the front end. This enables the injection of data in a way that optimizes context reliability.

If the integration isn't satisfactory, components can be swapped out while maintaining the data injection pipeline, simply by developing new retrieval or orchestration logic.

Conclusion

This project demonstrates the feasibility of building an automated pipeline for injecting context data into a vector store. Doing this opens up possibilities for experimenting with RAG and building flexible knowledge management systems. This approach allows us to gain control over how injected context impacts cloud-based LLM inference.

Whether for personal notes or an organization's knowledge base, this architecture offers flexibility and choice.