Basic RAG in 15 Minutes

1. Introduction

RAG (Retrieval-Augmented Generation) is a method that lets an LLM draw on relevant information from external sources when generating its responses. In this guide, we'll create a complete RAG pipeline using current information about recently released movies and series.

Our guide will include the following steps:

  1. Loading articles and chunking them
  2. Creating embeddings and loading into a vector database
  3. Implementing basic semantic search
  4. Connecting an LLM model
  5. Checking that the LLM lacks information
  6. Communicating with the LLM using information from external sources

2. Environment Setup

Let's install and import the necessary libraries:

#!pip install langchain
#!pip install langchain-openai
#!pip install langchain_community
#!pip install requests
#!pip install beautifulsoup4

#!pip install faiss-cpu   # if you're running on CPU
#!pip install faiss-gpu   # if you're running on GPU with CUDA support

import os
import requests
from bs4 import BeautifulSoup
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
os.environ["COMPRESSA_API_KEY"] = "your key"
# If you're running locally on a MacBook, also set the following environment variables:
# os.environ["TOKENIZERS_PARALLELISM"] = "false"
# os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

3. Loading Web Pages and Chunking

Let's load information about recent movies and series from Wikipedia, then split the article excerpts into chunks.

# For our example, we'll use the series Shogun and the film Master and Margarita
urls = [
    "https://en.wikipedia.org/wiki/The_Master_and_Margarita_(2024_film)",
    "https://en.wikipedia.org/wiki/Shogun_(2024_TV_series)"
]

def fetch_wikipedia_content(url):
    # Send GET request to the specified URL
    response = requests.get(url)

    # Create BeautifulSoup object for parsing HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the main container with article content
    content = soup.find(id="mw-content-text").find(class_="mw-parser-output")

    text = []
    # Iterate through all 'p', 'h2' and 'h3' elements in the main content
    for element in content.find_all(['p', 'h2', 'h3']):
        if element.name == 'p':
            # If it's a paragraph, add its text as is
            text.append(element.text)
        elif element.name in ['h2', 'h3']:
            # If it's a heading, format it with ## and add a line break
            text.append(f"\n## {element.text.strip()}\n")

    # Combine all text elements into one string, separating them with line breaks
    return '\n'.join(text)

documents = [Document(page_content=fetch_wikipedia_content(url)) for url in urls]

# Check that everything loaded correctly
print(documents)
# Split documents into chunks, specific settings are provided as an example

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100, add_start_index=True
)
chunks = text_splitter.split_documents(documents)

print(f"Total chunks created: {len(chunks)}")
print(f"Example chunk:\n{chunks[0].page_content[:1000]}...")

4. Creating Embeddings

We convert the resulting text chunks into embeddings and load them into a vector database for later search. For this example, we'll use FAISS, but you can choose any other vector database, e.g., ChromaDB or Qdrant (a short ChromaDB sketch follows the FAISS example below).

embeddings = OpenAIEmbeddings(
    api_key=os.getenv("COMPRESSA_API_KEY"),
    base_url="https://compressa-api.mil-team.ru/v1",
    model="Compressa-Embeddings",
)

# Create and populate vector store
vectorstore = FAISS.from_documents(chunks, embeddings)

print("Vector store successfully created")

5. Semantic Search

Let's set up a simple retriever for semantic search:

# Setup the retriever for semantic search
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Example usage of retriever
query = "Who directed the film 'Master and Margarita'?"
retrieved_docs = retriever.invoke(query)
print(f'Found {len(retrieved_docs)} relevant documents')
print(f'Example found document:\n{retrieved_docs[0].page_content[:200]}...')
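
To see why these chunks were chosen, you can also query the vector store directly and inspect the raw similarity scores; with the default FAISS index, the score is an L2 distance, so lower means a closer match. A minimal sketch:

# Inspect the similarity scores behind the retrieval
docs_with_scores = vectorstore.similarity_search_with_score(query, k=3)
for doc, score in docs_with_scores:
    print(f"score={score:.4f} | {doc.page_content[:80]}...")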

6. Connecting the LLM Model

Let's connect and configure the LLM model:

llm = ChatOpenAI(api_key=os.getenv("COMPRESSA_API_KEY"), base_url="https://compressa-api.mil-team.ru/v1")

# Create a prompt for processing queries

system_template = f"""You are a Q&A assistant. Use the following contextual information,
to answer the question. If there's no answer in the context, answer 'I don't know the answer to the question'.
Use a maximum of three sentences and be accurate but brief."""

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", """Contextual information:

{context}

Question: {input}
"""),
])

# Create a chain for answering questions
document_chain = create_stuff_documents_chain(llm, qa_prompt)
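
Before wiring in the retriever, you can test the document chain on its own by handing it documents manually; it "stuffs" them into the {context} placeholder of the prompt and returns the LLM's answer as a string. A minimal sketch reusing retrieved_docs from the retriever example above:

# Test the document chain with manually supplied context
answer = document_chain.invoke({
    "input": "Who directed the film 'Master and Margarita'?",
    "context": retrieved_docs,
})
print(answer)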

7. Checking the LLM's Lack of Information

Before using RAG, let's check that the LLM doesn't already have information about these recent releases:

def check_llm_knowledge(question):
    response = llm.invoke(question)
    return response.content

questions = [
    "Who directed the film 'Master and Margarita' from 2024?",
    "Who plays the lead role in the series 'Shogun' from 2024?"
]

print("Checking LLM knowledge without using RAG:")
for question in questions:
    print(f"\nQuestion: {question}")
    print(f"LLM Answer: {check_llm_knowledge(question)}")

8. Building the Complete RAG Pipeline

Now let's assemble all components into a single RAG pipeline:

rag_chain = create_retrieval_chain(retriever, document_chain)

def ask_question(question):
    response = rag_chain.invoke({"input": question})
    return response["answer"]

# Testing and demonstrating the RAG pipeline

questions = [
    "Who directed the film 'Master and Margarita' from 2024?",
    "Who plays the lead role in the series 'Shogun' from 2024?",
    "When was the series 'Shogun' from 2024 released?",
    "Which actors play the lead roles in the film 'Master and Margarita' from 2024?",
    "How many episodes are in the series 'Shogun' from 2024?",
]

print("\nUsing RAG pipeline:")
for question in questions:
    print(f"\nQuestion: {question}")
    print(f"RAG Answer: {ask_question(question)}")

9. Conclusion

In this guide, we built a basic RAG pipeline that already handles a simple task like this well. More complex scenarios will likely require further improvements.

For example, search accuracy can be improved with the CompressaRerank model, which re-ranks the retrieved passages by their relevance to the user's query. We've prepared a separate guide for this.

If you want to dive deeper into embeddings and understand how semantic search works under the hood, check out our other practical guide.

Finally, prompt design and LLM settings have a significant impact on answer quality and are worth tuning for your use case.
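
For instance, lowering the sampling temperature usually makes answers more deterministic, which suits factual Q&A over retrieved context. A hypothetical variation of the earlier LLM setup (the value 0 is just an illustration):

# A more deterministic LLM configuration for factual Q&A
llm = ChatOpenAI(
    api_key=os.getenv("COMPRESSA_API_KEY"),
    base_url="https://compressa-api.mil-team.ru/v1",
    temperature=0,
)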