In our previous post, we explored how to create a Retrieval-Augmented Generation (RAG) app that could answer questions about yourself by integrating a knowledge base with a language model. If you haven’t read it yet, you can find it at the link below.
Step-by-Step Guide: Building AI Chatbot Using Langchain and RAG Techniques
Today, we’ll take it a step further by evaluating our RAG app. We’ll use RAGAS (a tool to evaluate the performance of RAG systems) and LangChain. By the end of this post, you’ll know how to measure the performance of your RAG app.
A short recap: in our previous post, we built a RAG app where a personal knowledge base (about.md) is split into chunks and embedded into a vector store, the most relevant chunks are retrieved for each question, and a language model answers using only that retrieved context.
To ensure our RAG system is working effectively, we need to measure how well it finds and uses information to answer questions. Here are the key metrics we use to evaluate performance:
- Context Recall: did we retrieve all the relevant information?
- Context Precision: is what we retrieved actually relevant?
- Answer Relevancy: does the response address the question asked?
- Faithfulness: does the response stick to facts from the retrieved context?
Example Scenario:
Question: “What programming languages did you use in your fintech projects?”
Available Context Documents:
- Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"
- Project B: "Built fintech payment processing system using Java and Spring for online banking transactions"
- Project C: "Developed company website using React and Node.js"
- Education: "Graduated in Computer Science from XYZ University"
- A general technical-skills section
Here is how the four recall vs. precision combinations play out for this scenario:
1. High Recall, Low Precision retrieval:
Retrieved Documents: everything in the knowledge base (both fintech projects, the website project, the general skills section, and the education entry)
✅ Recall: High (got all programming-language information about the fintech projects)
❌ Precision: Low (we got the necessary information, but also included irrelevant website projects, general skills, and educational info)
2. High Precision, Low Recall retrieval:
Retrieved Documents:
- Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"
❌ Recall: Low (missed Java/Spring used in fintech payment processing system)
✅ Precision: High (only contains fintech-specific programming-language info). Even though it missed one piece of information, precision is higher than in the previous case because it didn’t include any irrelevant information.
3. Ideal (High Recall, High Precision) retrieval:
Retrieved Documents:
- Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"
- Project B: "Built fintech payment processing system using Java and Spring for online banking transactions"
✅ Recall: High (got all fintech-related technologies: Python, Java, and Spring)
✅ Precision: High (only retrieved fintech-specific projects, no irrelevant information)
4. Low Recall, Low Precision retrieval:
Retrieved Documents:
- Education: "Graduated in Computer Science from XYZ University"
- Project C: "Developed company website using React and Node.js"
❌ Recall: Low (missed both fintech projects with Python, Java, and Spring)
❌ Precision: Low (retrieved completely irrelevant information about education and website development)
In short, context recall measures how much of the relevant information we retrieved, while context precision measures how much of what we retrieved is actually relevant.
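To make the intuition concrete, here is a tiny illustrative sketch of my own (not part of the app, and not how RAGAS computes these scores internally; RAGAS uses an LLM judge rather than exact set matching) that treats recall and precision as simple set ratios over the documents from the scenarios above:
relevant = {"Project A", "Project B"}  # the fintech projects that mention programming languages

# Hypothetical retrieval results for the four scenarios above
retrievals = {
    "high_recall_low_precision": {"Project A", "Project B", "Project C", "Skills", "Education"},
    "high_precision_low_recall": {"Project A"},
    "ideal": {"Project A", "Project B"},
    "low_recall_low_precision": {"Project C", "Education"},
}

for name, retrieved in retrievals.items():
    hits = retrieved & relevant
    recall = len(hits) / len(relevant)      # how much of the relevant info was retrieved
    precision = len(hits) / len(retrieved)  # how much of what was retrieved is relevant
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")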
Answer Relevancy: Evaluates whether the response directly addresses the specific question asked, staying focused on the key points from the context and avoiding irrelevant or extraneous information. The answer should align with the user’s query and not drift into unrelated topics.
Faithfulness: Measures whether the answer uses only facts present in the retrieved context, without making up additional information (a hallucination check).
**Example:** Let’s say your knowledge base has this context:
"John worked at TechCorp as a Senior Developer from 2018-2022.
He led the cloud migration project, managing a team of 5 developers.
His team achieved 99.9% uptime and reduced costs by 40%."
Example Scenarios:
Question: “What leadership experience do they have?”
1. High Relevancy, High Faithfulness:
Answer: "They lead a cloud migration project at TechCorp, managing a team of 5 developers."
✅ Relevancy: Directly addresses leadership experience
✅ Faithfulness: Information matches the context exactly
2. Low Relevancy, High Faithfulness:
Answer: "They worked at TechCorp from 2018-2022 and achieved 99.9% uptime."
❌ Relevancy: Doesn’t address leadership question
✅ Faithfulness: Information is true as per context
3. High Relevancy, Low Faithfulness:
Answer: "They have extensive leadership experience, managing multiple teams across different companies and leading various successful projects."
✅ Relevancy: Addresses leadership question
❌ Faithfulness: Adds information not in context
4. Low Relevancy, Low Faithfulness:
Answer: "They are an excellent programmer with 10 years of experience in Java and Python."
❌ Relevancy: Doesn’t address leadership
❌ Faithfulness: Information not in context
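Conceptually, faithfulness is the fraction of claims in the answer that are supported by the retrieved context. RAGAS extracts and verifies those claims with an LLM; in the toy sketch below, the claims and their supported/unsupported labels are hard-coded by hand just to show the arithmetic for scenario 3 above:
# Claims from the scenario-3 answer, labelled by hand against the TechCorp context
claims = {
    "They have leadership experience": True,          # supported: led a project, managed a team of 5
    "They managed multiple teams": False,             # context mentions only one team
    "They worked across different companies": False,  # context mentions only TechCorp
}

faithfulness = sum(claims.values()) / len(claims)
print(f"faithfulness = {faithfulness:.2f}")  # 1 supported claim out of 3, roughly 0.33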
For evaluating these metrics, we use Ragas, one of the most popular open-source frameworks for evaluating LLM apps. The metrics we are evaluating are also common in other frameworks like Haystack, DeepEval, etc.
Initially, I planned to implement the testing in Node.js to maintain consistency with our previous backend. However, RAGAS is primarily available in Python. So, we re-implement our previously built RAG app in Python as a replica and test it with RAGAS. Hopefully, in a future post, we will use an evaluation framework that supports Node.js.
For Python, I first created a virtual environment:
python -m venv .venv
source .venv/Scripts/activate  # Windows (Git Bash); on Linux/macOS use: source .venv/bin/activate
I listed my dependencies in requirements.txt:
langchain
langchain-google-genai
ragas
fastapi[standard]
python-dotenv
langchain-community
pip install -r requirements.txt
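Before rebuilding the app, here is a minimal sketch of what scoring a single sample with Ragas looks like, using the leadership example from earlier. This assumes a recent ragas (0.2+); the import paths, the evaluator_llm name, and the choice to reuse Gemini as the judge are my own illustration, not part of the original app:
import asyncio

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

load_dotenv()  # expects GOOGLE_API_KEY in .env

# Wrap a LangChain chat model so Ragas can use it as the judge LLM
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-1.5-pro"))

sample = SingleTurnSample(
    user_input="What leadership experience do they have?",
    retrieved_contexts=[
        "John worked at TechCorp as a Senior Developer from 2018-2022. "
        "He led the cloud migration project, managing a team of 5 developers."
    ],
    response="They led a cloud migration project at TechCorp, managing a team of 5 developers.",
)

scorer = Faithfulness(llm=evaluator_llm)
print(asyncio.run(scorer.single_turn_ascore(sample)))  # returns a score between 0 and 1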
Here is the Python implementation, keeping the functionality the same as the Node.js backend. I have used FastAPI for API handling, and the GOOGLE_API_KEY is stored in .env.
from fastapi import FastAPI
from pydantic import BaseModel
import os
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain.text_splitter import MarkdownTextSplitter
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate, EvaluationDataset
from ragas.metrics import LLMContextRecall, Faithfulness, ResponseRelevancy, LLMContextPrecisionWithReference, LLMContextPrecisionWithoutReference
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
import time
load_dotenv()
model = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
embedding_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
store = InMemoryVectorStore(embedding=embedding_model)
prompt = ChatPromptTemplate(
    [
        (
            "system",
            """You are an AI assistant designed to answer user's question in a professional way based on the provided context. Respond only to questions relevant to the context. If a question falls outside this scope, respond with: 'I don't know.' Avoid guessing or providing unrelated information.
Context:
{context}
""",
        ),
        ("human", "{query}"),
    ]
)
chain = prompt | model | StrOutputParser()
text = ""
with open("about.md", "r", encoding="utf-8") as f:
    text = f.read()
text_splitter = MarkdownTextSplitter(chunk_size=500)
output = text_splitter.split_text(text)
docs = text_splitter.create_documents(output)
store.add_documents(docs)
app = FastAPI()
def format_docs(relevant_docs):
    return "\n".join(doc.page_content for doc in relevant_docs)
class Message(BaseModel):
    message: str
@app.post("/chat")
async def chat(body: Message):
    relevant_docs = store.similarity_search(body.message, k=2)
    response = chain.invoke({"query": body.message, "context": format_docs(relevant_docs)})
    print(response)
    return {
        "data": response,
    }
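With the server running (for example via fastapi dev main.py, assuming you saved the file as main.py and kept FastAPI's default port 8000), you can sanity-check the endpoint with a quick standard-library script; the question below is just an example:
import json
from urllib import request

# Quick local test of the /chat endpoint (assumes the app is running on localhost:8000)
payload = json.dumps({"message": "What backend technologies does Marish have experience with?"}).encode("utf-8")
req = request.Request(
    "http://127.0.0.1:8000/chat",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with request.urlopen(req) as resp:
    print(json.loads(resp.read())["data"])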
For the evaluation, we need questions and expected answers so we can test whether our app responds as expected. I have prepared 10 sets of questions, their expected answers, and the app’s responses. We need a dataset containing the query as user_input, the expected answer for that query as reference, the contexts retrieved with the similarity search as retrieved_contexts, and the LLM output as response.
We iterate through pairs of sample queries and their expected responses using zip. Inside the loop, we repeat the same steps of searching for relevant chunks and generating the response with the LLM, then append all of that information to the dataset. The time.sleep(30) call adds a 30-second delay between iterations because the Gemini free tier restricts me to only 2 requests per minute.

We then create an EvaluationDataset from that list, wrap our Gemini chat model and embeddings with Ragas’ LangChain wrappers so they can act as the evaluation judge, and evaluate on the metrics we previously discussed. Note that there are two variants of context precision: one checks the retrieved contexts against the generated response (LLMContextPrecisionWithoutReference), and the other checks them against the expected response (LLMContextPrecisionWithReference). Finally, we save the results to a CSV file.
sample_queries = [
    "What frontend technologies is Marish proficient in?",
    "What backend technologies does Marish have experience with?",
    ...
    "What is Marish's favorite color?",
    "Is Marish willing to relocate?"
]
expected_responses = [
    "Marish is proficient in HTML5, CSS3, JavaScript (ES6+), React.js, Next.js, TypeScript, Responsive Design, UI/UX Best Practices, and Testing (Jest, React Testing Library).",
    "Marish's backend experience includes Node.js, Express.js, Python, Django, RESTful APIs, GraphQL, and Database Design.",
    ...,
    "I don't know.",
    "I don't know."
]
reference_contexts = ["## 🛠️ Technical Skills\n\n### Front end Development\n\n- HTML5, CSS3, JavaScript (ES6+)\n- React.js, Next.js\n- TypeScript\n- ...", " ...", ...]
dataset = []
for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = store.similarity_search(query, k=2)  # same retrieval as the /chat endpoint
    response = chain.invoke({"query": query, "context": format_docs(relevant_docs)})
    dataset.append({
        "user_input": query,
        "retrieved_contexts": [rdoc.page_content for rdoc in relevant_docs],
        "response": response,
        "reference": reference,
    })
    time.sleep(30)  # Gemini free tier: stay under the requests-per-minute limit
# Wrap the chat model and embeddings so Ragas can use them as the evaluation judge
# (assumption: reusing the same Gemini chat model and embedding model defined above)
evaluation_llm = LangchainLLMWrapper(model)
evaluation_embeddings = LangchainEmbeddingsWrapper(embedding_model)

evaluation_dataset = EvaluationDataset.from_list(dataset)
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), ResponseRelevancy(), LLMContextPrecisionWithReference(), LLMContextPrecisionWithoutReference()],
    show_progress=True,
    llm=evaluation_llm,
    embeddings=evaluation_embeddings,
)
result.to_pandas().to_csv("results.csv")
print(result)
After running, we got the scores below:
{'context_recall': 0.82, 'faithfulness': 0.8889, 'answer_relevancy': 0.8350, 'llm_context_precision_with_reference': 0.2222, 'llm_context_precision_without_reference': 0.2083}
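The aggregate numbers don’t show which questions pull a metric down. Since the per-question scores were saved to results.csv, you can take a quick look with pandas (already installed as a ragas dependency); the column names below follow the metric keys printed above and may vary slightly between ragas versions:
import pandas as pd

# Per-sample scores written by result.to_pandas().to_csv("results.csv")
df = pd.read_csv("results.csv")

cols = ["user_input", "context_recall", "faithfulness", "llm_context_precision_with_reference"]
print(df[cols].sort_values("llm_context_precision_with_reference").head())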
Looking at the scores, recall, faithfulness, and answer relevancy are reasonably high, but context precision is low, so improving the relevancy of what we retrieve (and of the final responses) is an ongoing challenge. Tools like RAGAS provide a solid foundation for evaluating and enhancing answer quality. In our upcoming posts, we’ll explore advanced techniques to improve our RAG app. Stay tuned for insights on making your system more accurate, reliable, and efficient!
If there’s any topic you’d like me to cover in more detail in a practical way or if you have suggestions for future posts, feel free to share them in the comments. I’d love to hear your thoughts and make this space even more helpful for you.
Thanks for reading, and Happy Coding!