Llama, Llama, Llama: 3 Simple Steps to Local RAG with Your Content

Get your own local RAG system up and running in an embarrassingly few lines of code thanks to these 3 Llamas.



Image by Author | Midjourney & Canva

 

Do you want local RAG with minimal trouble? Do you have a bunch of documents you want to treat as a knowledge base to augment a language model with? Want to build a chatbot that knows about what you want it to know about?

Well, here's arguably the easiest way.

It might not be the most optimized system for inference speed, vector precision, or storage, but it is super easy. Tweaks can be made if desired, but even without them, what we do in this short tutorial should get your local RAG system fully operational. And since we will be using Llama 3, we can also hope for some great results.

What are we using as our tools today? 3 llamas: Ollama for model management, Llama 3 as our language model, and LlamaIndex as our RAG framework. Llama, llama, llama.

Let's get started.

 

Step 1: Ollama, for Model Management

 

Ollama can be used to both manage and interact with language models. Today we will be using it both for model management and, since LlamaIndex is able to interact directly with Ollama-managed models, indirectly for interaction as well. This will make our overall process even easier.

We can install Ollama by following the system-specific directions on the application's GitHub repo.

Once installed, we can launch Ollama from the terminal and specify the model we wish to use.
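
If you want to confirm that Ollama is actually up and listening before moving on, a tiny Python check against its local endpoint will do. Treat this as a minimal sketch: the port below is Ollama's usual default and an assumption on my part, so adjust it if your configuration differs.

import urllib.request

# Ollama's usual default local endpoint (an assumption; adjust if yours differs)
OLLAMA_URL = "http://localhost:11434"

try:
    with urllib.request.urlopen(OLLAMA_URL, timeout=5) as response:
        # A healthy install typically responds with HTTP 200 and "Ollama is running"
        print(response.status, response.read().decode())
except Exception as error:
    print("Could not reach Ollama:", error)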

 

Step 2: Llama 3, the Language Model

 

Once Ollama is installed and operational, we can download any of the models listed on its GitHub repo, or create our own Ollama-compatible model from other existing language model implementations. The ollama run command will download the specified model if it is not already present on your system, so downloading Llama 3 8B can be accomplished with the following line:

ollama run llama3

 

Just make sure you have the local storage available to accommodate the 4.7 GB download.

Once the Ollama terminal application starts with the Llama 3 model as the backend, you can go ahead and minimize it. We'll be using LlamaIndex from our own script to interact.

 

Step 3: LlamaIndex, the RAG Framework

 

The last piece of this puzzle is LlamaIndex, our RAG framework. To use LlamaIndex, you will need to ensure that it is installed on your system. As the LlamaIndex packaging and namespace have undergone recent changes, it's best to check the official documentation to get LlamaIndex installed in your local environment.
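
Before wiring everything together, a quick optional sanity check can confirm that LlamaIndex is able to reach the Ollama-served model. This is just a minimal sketch, and it assumes the llama-index-llms-ollama integration package was installed alongside LlamaIndex per the official docs.

from llama_index.llms.ollama import Ollama

# Connect to the locally served Llama 3 model (assumes Ollama is still running)
llm = Ollama(model="llama3", request_timeout=360.0)

# Any sensible reply here means LlamaIndex can talk to Ollama
print(llm.complete("Reply with one short sentence confirming you are online."))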

Once LlamaIndex is up and running, and with Ollama serving the Llama 3 model, you can save the following to a file (adapted from here):

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# My local documents
documents = SimpleDirectoryReader("data").load_data()

# Embeddings model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Language model
Settings.llm = Ollama(model="llama3", request_timeout=360.0)

# Create index
index = VectorStoreIndex.from_documents(documents)

# Perform RAG query
query_engine = index.as_query_engine()
response = query_engine.query("What are the 5 stages of RAG?")
print(response)

 

This script is doing the following:

  • The documents are loaded from the "data" folder
  • The embedding model used to create your document embeddings is a BGE variant from Hugging Face
  • The language model is the aforementioned Llama 3, accessed via Ollama
  • The query asked of our data ("What are the 5 stages of RAG?") is fitting, as I dropped a number of RAG-related documents into the data folder

And the output of our query:

The five key stages within RAG are: Loading, Indexing, Storing, Querying, and Evaluation.
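
And since the goal from the start was a chatbot that knows about your content, the same index can also be wrapped in a chat engine and driven from a simple terminal loop. The following is a rough sketch rather than a polished implementation; the "condense_question" chat mode and the exit keywords are reasonable assumptions you can swap out.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Same setup as the script above
documents = SimpleDirectoryReader("data").load_data()
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=360.0)
index = VectorStoreIndex.from_documents(documents)

# Wrap the index in a chat engine and chat from the terminal
chat_engine = index.as_chat_engine(chat_mode="condense_question")

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    print("Bot:", chat_engine.chat(user_input))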

 

Note that we would likely want to optimize this script in a number of ways, such as facilitating faster search and maintaining some state (the embeddings, for instance), but I will leave most of that for the interested reader to explore.
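
One such optimization is persisting the index to disk so the embeddings are not recomputed on every run. Here is a minimal sketch of that idea, assuming a local "storage" directory of my own choosing and the same embedding and language model settings as the script above.

import os

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Same models as the script above
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=360.0)

PERSIST_DIR = "storage"  # hypothetical local directory for the saved index

if not os.path.exists(PERSIST_DIR):
    # First run: build the index from the documents and save it to disk
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: reload the saved index instead of re-embedding everything
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# The query engine works the same way as before
query_engine = index.as_query_engine()
print(query_engine.query("What are the 5 stages of RAG?"))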

 

Final Thoughts

 
Well, we did it. We managed to get a LlamaIndex-based RAG application, with Llama 3 served locally by Ollama, up and running in 3 fairly easy steps. There is a lot more you could do with this, including optimizing it, extending it, adding a UI, and so on, but the simple fact remains that we were able to build our baseline system with but a few lines of code, using a minimal set of supporting applications and libraries.

I hope you enjoyed the process.
 
 

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Managing Editor, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.