How to Evaluate Your LangChain RAG App Using RAGAS


Step-by-Step Guide: Evaluating Your LangChain RAG Chatbot with RAGAS

In our previous post, we explored how to create a Retrieval-Augmented Generation (RAG) app that could answer questions about yourself by integrating a knowledge base with a language model. If you haven’t read it yet, you can find it at the link below.

Step-by-Step Guide: Building AI Chatbot Using Langchain and RAG Techniques

Today, we’ll take it a step further by evaluating our RAG app. We’ll use RAGAS (a tool to evaluate the performance of RAG systems) together with LangChain. By the end of this post, you’ll know how to measure the performance of your RAG app.

Short recap: in our previous post, we built a RAG app where:

  1. We created a knowledge base (documents about you).
  2. We used embeddings to convert the information into vectors, stored them in a vector database, and ran similarity searches against it.
  3. We integrated it with a language model to provide answers based on that information.

To ensure our RAG system is working effectively, we need to measure how well it finds and uses information to answer questions. Here are the key metrics we use to evaluate performance:

Context Retrieval Metrics:

  • Context Recall: Measures whether all the information needed to answer the question was retrieved, regardless of whether extra information was also retrieved.
  • Context Precision: Measures how much of the retrieved information is relevant to the question, i.e., how efficiently the retriever captures what is needed without pulling in extra, unrelated content.

Example Scenario:

Question: “What programming languages did you use in your fintech projects?”

Available Context Documents:

  1. Education: “Graduated in Computer Science from XYZ University”
  2. Project A: “Led fintech project using Python for algorithmic trading in a major investment bank”
  3. Project B: “Built fintech payment processing system using Java and Spring for online banking transactions”
  4. Project C: “Developed company website using React and Node.js”
  5. Skills: “Expert in Python, Java, JavaScript, and SQL”

Context Recall vs Precision Example Scenario:

  1. High Recall, Low Precision retrieval:

    Retrieved Documents:

    • Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"
    • Project B: "Built fintech payment processing system using Java and Spring for online banking transactions"
    • Project C: "Developed company website using React and Node.js"
    • Skills: "Expert in Python, Java, JavaScript, and SQL"
    • Education: "Graduated in Computer Science from XYZ University"

✅ Recall: High (got all programming-language information about the fintech projects)
❌ Precision: Low (we got the necessary information, but also included the irrelevant website project, general skills, and education info)

2. High Precision, Low Recall retrieval:

Retrieved Documents:
- Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"

❌ Recall: Low (missed Java/Spring used in fintech payment processing system)
✅ Precision: High (only contains fintech-specific programming language info). Even though it missed one piece of information, precision is comparatively higher than the previous case as it didn’t include irrelevant information.

3. Ideal (High Recall, High Precision) retrieval:

Retrieved Documents:
- Project A: "Led fintech project using Python for algorithmic trading in a major investment bank"
- Project B: "Built fintech payment processing system using Java and Spring for online banking transactions"

✅ Recall: High (got all fintech-related languages and frameworks: Python, Java, and Spring)
✅ Precision: High (only retrieved fintech-specific projects, no irrelevant information)

4. Low Recall, Low Precision retrieval:

Retrieved Documents:
- Education: "Graduated in Computer Science from XYZ University"
- Project C: "Developed company website using React and Node.js"

❌ Recall: Low (missed both fintech projects with Python, Java, and Spring)
❌ Precision: Low (retrieved completely irrelevant information about education and website development)

In short,

  • High Recall, High Precision: The ideal case
  • High Recall, Low Precision: Gets everything but includes noise
  • Low Recall, High Precision: Gets some relevant info but misses key details
  • Low Recall, Low Precision: The worst case — misses relevant info and includes irrelevant content
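To make the intuition concrete, here is a rough, set-based sketch that scores the four retrieval scenarios above, assuming we already know that only Project A and Project B are relevant. RAGAS itself does not use hard-coded labels like this; it asks an LLM to judge the relevance of each retrieved chunk.

```python
# Illustrative only: naive set-based precision/recall over the example documents.
# RAGAS judges relevance with an LLM rather than with hard-coded labels.
relevant = {"Project A", "Project B"}  # the fintech docs from the example

def precision_recall(retrieved):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant)
    return precision, recall

scenarios = {
    "high recall, low precision": {"Project A", "Project B", "Project C", "Skills", "Education"},
    "high precision, low recall": {"Project A"},
    "high recall, high precision": {"Project A", "Project B"},
    "low recall, low precision": {"Education", "Project C"},
}

for name, retrieved in scenarios.items():
    p, r = precision_recall(retrieved)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```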

Generation Metrics:

Answer Relevancy: Evaluates whether the response directly addresses the specific question asked, stays focused on the key points from the context, and avoids irrelevant or extraneous information. The answer should align with the user’s query and not drift into unrelated topics.

Faithfulness: Measures whether the answer uses only facts present in the retrieved context, without making up additional information (a hallucination check).

**Example:** Let’s say your knowledge base has this context:

"John worked at TechCorp as a Senior Developer from 2018-2022. 
He led the cloud migration project, managing a team of 5 developers. 
His team achieved 99.9% uptime and reduced costs by 40%."

Example Scenarios:

Question: “What leadership experience do they have?”

  1. High Relevancy, High Faithfulness:

    Answer: "They lead a cloud migration project at TechCorp, managing a team of 5 developers."

✅ Relevancy: Directly addresses leadership experience
✅ Faithfulness: Information matches the context exactly

2. Low Relevancy, High Faithfulness:

Answer: "They worked at TechCorp from 2018-2022 and achieved 99.9% uptime."

❌ Relevancy: Doesn’t address leadership question
✅ Faithfulness: Information is true as per context

3. High Relevancy, Low Faithfulness:

Answer: "They have extensive leadership experience, managing multiple teams across different companies and leading various successful projects."

✅ Relevancy: Addresses leadership question
❌ Faithfulness: Adds information not in context

4. Low Relevancy, Low Faithfulness:

Answer: "They are an excellent programmer with 10 years of experience in Java and Python."

❌ Relevancy: Doesn’t address leadership
❌ Faithfulness: Information not in context
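Under the hood, faithfulness is computed by breaking the answer into individual statements and checking whether each one is supported by the retrieved context; the score is the fraction of supported statements. Here is a hand-rolled sketch of that idea. RAGAS performs both the statement extraction and the verification with an LLM; the keyword lookup below is just a hypothetical stand-in for that judgment.

```python
# Hand-rolled illustration of faithfulness: the fraction of answer statements
# that the retrieved context supports. RAGAS does both steps with an LLM; the
# keyword check here is a hypothetical stand-in for that judgment.
context = (
    "John worked at TechCorp as a Senior Developer from 2018-2022. "
    "He led the cloud migration project, managing a team of 5 developers."
)

# Statements drawn from the example answers above, mapped to a key phrase to look for
statements = {
    "They led a cloud migration project at TechCorp.": "cloud migration project",
    "They managed a team of 5 developers.": "team of 5 developers",
    "They managed multiple teams across different companies.": "multiple teams",
}

supported = sum(1 for phrase in statements.values() if phrase in context)
print(f"faithfulness ≈ {supported / len(statements):.2f}")  # 2 of 3 supported ≈ 0.67
```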

To evaluate these metrics, we use RAGAS, one of the most popular open-source frameworks for evaluating LLM apps. The metrics we are evaluating are also common to other frameworks such as Haystack and DeepEval.

🚀 Get Started - Ragas

Initially, I planned to implement the evaluation in Node.js to stay consistent with our previous backend. However, RAGAS, a framework for evaluating Retrieval-Augmented Generation (RAG) systems, is primarily available in Python. So we re-implement our previously built RAG app in Python as a replica and test it with RAGAS. Hopefully, in a future post, we will use an evaluation framework that supports Node.js.

For Python, I first created a virtual environment:

python -m venv .venv
source .venv/Scripts/activate  # on Linux/macOS: source .venv/bin/activate

I listed my dependencies in requirements.txt:

langchain
langchain-google-genai
ragas
fastapi[standard]
python-dotenv
langchain-community

pip install -r requirements.txt

Here is the Python implementation, keeping the functionality the same as the Node.js backend. I have used FastAPI for API handling, and GOOGLE_API_KEY is set in my .env file.

from fastapi import FastAPI
from pydantic import BaseModel
import os 
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain.text_splitter import MarkdownTextSplitter
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate, EvaluationDataset
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithReference,
    LLMContextPrecisionWithoutReference,
)
from langchain_core.output_parsers import StrOutputParser
from dotenv import load_dotenv
import time

load_dotenv()

model = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
embedding_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

store = InMemoryVectorStore(embedding=embedding_model)
prompt = ChatPromptTemplate(
    [
        (
            "system",
            """You are an AI assistant designed to answer user's question in a professional way based on the provided context. Respond only to questions relevant to the context. If a question falls outside this scope, respond with: 'I don't know.' Avoid guessing or providing unrelated information.
            
            Context:
            {context}
            """,
        ),
        ("human", "{query}"),
    ]
)
chain = prompt | model | StrOutputParser()
text = ""
with open("about.md", "r", encoding="utf-8") as f:
    text = f.read()


# Split the markdown knowledge base into ~500-character chunks
text_splitter = MarkdownTextSplitter(chunk_size=500)
docs = text_splitter.create_documents([text])
store.add_documents(docs)

app = FastAPI()

def format_docs(relevant_docs):
    return "\n".join(doc.page_content for doc in relevant_docs)

class Message(BaseModel):
    message: str

@app.post("/chat")
async def chat(body: Message):
    relevant_docs = store.similarity_search(body.message, k=2)
    response = chain.invoke({"query": body.message, "context": format_docs(relevant_docs)})
    print(response)
    return {
        "data": response,
    }
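Before wiring up the evaluation, it helps to sanity-check the endpoint. Here is a minimal sketch using FastAPI's test client; it assumes the code above lives in a file called main.py (adjust the import if yours is named differently). You could equally run the server with uvicorn and hit /chat with curl.

```python
# Quick sanity check of the /chat endpoint using FastAPI's test client.
# Assumes the app above is defined in main.py (hypothetical file name).
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)

resp = client.post(
    "/chat",
    json={"message": "What backend technologies does Marish have experience with?"},
)
print(resp.status_code)     # expect 200
print(resp.json()["data"])  # the model's answer grounded in about.md
```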

For the evaluation, we need question/answer pairs so we can test whether our app produces the expected answers. I have prepared 10 questions along with their expected answers. We need a dataset containing the user query as user_input, the expected answer for that query as reference, the contexts retrieved with the similarity search as retrieved_contexts, and the LLM output as response.

We iterate through pairs of sample queries and their expected responses using zip. Inside the loop we run the same steps of retrieving relevant documents and generating the response with the LLM, and append all of that information to the dataset.

time.sleep(30) adds a 30-second delay between iterations because the Gemini Pro free tier restricts me to 2 requests per minute. We then create an EvaluationDataset from the dataset we built and evaluate our model on the metrics we previously discussed. Here we have two variants of Context Precision: one checks the user query and retrieved contexts against the generated response (LLMContextPrecisionWithoutReference) and the other checks them against the expected response (LLMContextPrecisionWithReference). Finally, we save the results to a CSV file.

sample_queries = [
  "What frontend technologies is Marish proficient in?",
  "What backend technologies does Marish have experience with?",
  ...
  "What is Marish's favorite color?",
  "Is Marish willing to relocate?"
]
expected_responses = [
  "Marish is proficient in HTML5, CSS3, JavaScript (ES6+), React.js, Next.js, TypeScript, Responsive Design, UI/UX Best Practices, and Testing (Jest, React Testing Library).",
  "Marish's backend experience includes Node.js, Express.js, Python, Django, RESTful APIs, GraphQL, and Database Design.",
  ...,
  "I don't know.",
  "I don't know."
]
reference_contexts = ["## 🛠️ Technical Skills\n\n### Front end Development\n\n- HTML5, CSS3, JavaScript (ES6+)\n- React.js, Next.js\n- TypeScript\n- ...", " ...", ...]

dataset = []

for query, reference in zip(sample_queries, expected_responses):
    relevant_docs = store.similarity_search(query, k=2)  # same retrieval as the /chat endpoint
    response = chain.invoke({"query": query, "context": format_docs(relevant_docs)})
    
    dataset.append({
        "user_input": query,
        "retrieved_contexts": [rdoc.page_content for rdoc in relevant_docs],
        "response": response,
        "reference": reference
    })
    time.sleep(30)

evaluation_dataset = EvaluationDataset.from_list(dataset)

# Wrap the same Gemini chat model and embeddings for RAGAS to use as its judge
evaluation_llm = LangchainLLMWrapper(model)
evaluation_embeddings = LangchainEmbeddingsWrapper(embedding_model)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        LLMContextRecall(),
        Faithfulness(),
        ResponseRelevancy(),  # reported as answer_relevancy
        LLMContextPrecisionWithReference(),
        LLMContextPrecisionWithoutReference(),
    ],
    show_progress=True,
    llm=evaluation_llm,
    embeddings=evaluation_embeddings,  # ResponseRelevancy also needs embeddings
)

result.to_pandas().to_csv("results.csv")
print(result)

After running the evaluation, we got the scores below:

{'context_recall': 0.82, 'faithfulness': 0.8889, 'answer_relevancy': 0.8350, 'llm_context_precision_with_reference': 0.2222, 'llm_context_precision_without_reference': 0.2083}
  • LLM Context Precision (with and without reference): Both scores are significantly low (0.22 and 0.21). This indicates that a large share of what the retriever returns is not actually needed to answer the question, whether judged against the reference answer or the generated response. We need to tighten retrieval so it filters out irrelevant chunks and surfaces only what the response requires.
  • Context Recall (0.82): The retriever captures most of the necessary context but occasionally misses relevant details, so it should be improved to ensure no key information is dropped.
  • Answer Relevancy (0.8350): Responses mostly align with the query, but some answers are incomplete or contain redundant information. Adjusting the retrieval mechanism and making the intent clearer in the prompt will help here.
  • Faithfulness (0.8889): Responses stayed true to the provided context without making up information that was not present. Filtering out irrelevant context and tightening the prompt can push this even higher.
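The aggregate numbers tell you what to fix; the per-question rows in results.csv tell you where. Here is a minimal sketch for pulling out the questions with the weakest retrieval (it assumes the CSV columns use the metric names shown above, such as llm_context_precision_with_reference; adjust if your RAGAS version labels them differently):

```python
# Inspect per-question scores to see which queries retrieve the worst contexts.
# Column names are assumed to match the metric names above; adjust as needed.
import pandas as pd

df = pd.read_csv("results.csv")

worst = df.sort_values("llm_context_precision_with_reference").head(3)
print(worst[["user_input", "llm_context_precision_with_reference", "context_recall"]])
```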

Improving the relevancy of responses is an ongoing challenge, and tools like RAGAS provide a solid foundation for evaluating and enhancing answer quality. In our upcoming posts, we’ll explore advanced techniques to improve our RAG app. Stay tuned for insights on making your system more accurate, reliable, and efficient!

If there’s any topic you’d like me to cover in more detail in a practical way or if you have suggestions for future posts, feel free to share them in the comments. I’d love to hear your thoughts and make this space even more helpful for you.

Thanks for reading, and Happy Coding!