Building and Evaluating a Retrieval Augmented Generation (RAG) Pipeline with LlamaIndex, Azure AI Search, Azure OpenAI, Literal AI, and RAGAS

💡
DISCLAIMER: This post is intended for educational purposes only. Actual results may vary depending on your dataset, environment, and configurations.

Retrieval Augmented Generation (RAG) solutions combine Large Language Models (LLMs) with external retrieval sources to return factually grounded, context-aware answers. While constructing a RAG pipeline is a major step, rigorous evaluation is what empowers you to iteratively refine, tune parameters, and optimize your retrieval system for production-scale performance.

In this post, we’ll walk through a practical, code-driven example showing how to:

  • Construct a RAG pipeline with LlamaIndex and Azure AI Search

  • Integrate Azure OpenAI as the LLM provider

  • Use Literal AI to log and visualize queries and responses for troubleshooting and auditing

  • Employ RAGAS to evaluate multiple performance metrics—such as answer relevancy, context precision, context recall, and faithfulness

Key Insight: By experimenting with different retrieval modes and tuning top_k values, we found that Semantic-Hybrid retrieval mode with top_k=50 produced the highest answer relevancy. This underscores how systematic evaluation and data-driven insights can guide your configuration choices for Azure AI Search retrieval.


Why These Tools?

Azure AI Search:
Azure AI Search serves as a flexible vector store and search platform that supports multiple retrieval modes (keyword, vector, hybrid, semantic-hybrid). It lets you index documents, create embeddings, and efficiently surface the best chunks for your LLM to consume. It’s essential for building scalable RAG pipelines that handle diverse content types and query variations.

LlamaIndex:
LlamaIndex orchestrates the retrieval workflow. It manages how documents are chunked, embedded, and indexed, making it easier to experiment with different vector stores (like Azure AI Search) and LLM providers. LlamaIndex acts as the glue between your data layer and the LLM.

Azure OpenAI Service:
Azure OpenAI provides state-of-the-art LLMs (including variants of GPT models) with enterprise-level reliability and security. It’s the language engine that generates the final responses, ensuring that once we’ve retrieved the most relevant content, we produce coherent, contextually accurate answers.

Literal AI:
Literal AI is a logging and observability platform for LLM pipelines. It tracks each query, retrieval step, and response. By visualizing these steps, you can diagnose issues, understand how retrieval changes affect outputs, and confidently iterate on improvements.

RAGAS:
RAGAS offers a structured evaluation framework. Beyond simple relevancy, it quantifies how faithful answers are to the retrieved content and how precise and complete the returned contexts are. This advanced metric suite ensures that you’re not just tuning blindly—you’re refining based on concrete, multi-dimensional feedback.
Check out the RAGAS documentation for more details on metrics and usage.

Prerequisites

  • Azure AI Search: For indexing and retrieval

  • Azure OpenAI: For LLM-based generation

  • Python 3.8+ and Jupyter environment

  • Basic knowledge of Python, LLM fundamentals, and environment configuration

Package Versions:

  • llama-index: 0.11.17 (ensure >0.10.28 for Literal AI usage)

  • ragas: 0.1.21

  • azure-search-documents: 11.6.0b8

  • llama-index-vector-stores-azureaisearch: 0.3.0

  • llama-index-llms-azure-openai: 0.2.1

  • llama-index-embeddings-azure-openai: 0.2.5

  • literalai: 0.0.623

  • datasets: 3.0.0

  • nest_asyncio: 1.6.0


Step-by-Step Setup and Code Walkthrough

1. Installing Dependencies

Start by installing required packages in your Jupyter environment or terminal:

!pip install llama-index==0.11.17
!pip install ragas==0.1.21
!pip install azure-search-documents==11.6.0b8
!pip install llama-index-vector-stores-azureaisearch==0.3.0
!pip install llama-index-llms-azure-openai==0.2.1
!pip install llama-index-embeddings-azure-openai==0.2.5
!pip install literalai==0.0.623
!pip install datasets==3.0.0
!pip install nest_asyncio==1.6.0

These packages allow us to create indexes via LlamaIndex orchestration, integrate Azure AI Search, call Azure OpenAI, observe and log with Literal AI, and evaluate with RAGAS.

2. Importing and Initializing Components

import os
import nest_asyncio
from dotenv import load_dotenv
from llama_index.core import Settings, StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from literalai import LiteralClient
# Use the RAGAS LlamaIndex integration so we can evaluate a query engine directly
from ragas.integrations.llama_index import evaluate as ragas_evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# Allow nested event loops in Jupyter (needed for RAGAS' async evaluation)
nest_asyncio.apply()

load_dotenv()

# Load environment variables for Azure OpenAI, Azure AI Search, and Literal AI
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME")
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME")
SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
SEARCH_SERVICE_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
LITERAL_API_KEY = os.getenv('LITERAL_API_KEY')

# Initialize LLM and Embedding
llm = AzureOpenAI(
    model=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
    deployment_name=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version="2024-10-01-preview"
)
embed_model = AzureOpenAIEmbedding(
    model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
    deployment_name=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version="2024-10-01-preview"
)

# Register the Azure OpenAI models as the LlamaIndex defaults so indexing and querying use them
Settings.llm = llm
Settings.embed_model = embed_model

# Initialize Literal AI client and instrument LlamaIndex
literalai_client = LiteralClient(api_key=LITERAL_API_KEY)
literalai_client.instrument_llamaindex()

What’s happening here?

  • We load configurations from .env (a quick sanity check follows this list).

  • Set up the LLM (Azure OpenAI) and embeddings, and register them as the LlamaIndex defaults via Settings.

  • Instantiate Literal AI for logging and instrumentation.
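
Before going further, it can save a debugging headache to confirm that every expected variable is actually set. A minimal, optional sketch using the same variable names as above:

import os

# Fail fast if any required environment variable is missing or empty.
required_vars = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "LITERAL_API_KEY",
]
missing = [name for name in required_vars if not os.getenv(name)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")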

3. Loading Data and Building the Vector Index

We assume you have some PDF documents in data/pdf. We’ll continue the series using our Contoso-HR documents. Let’s load and index them with LlamaIndex and Azure AI Search:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from llama_index.vector_stores.azureaisearch import AzureAISearchVectorStore, IndexManagement

credential = AzureKeyCredential(SEARCH_SERVICE_API_KEY)
index_client = SearchIndexClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=credential)
search_client = SearchClient(endpoint=SEARCH_SERVICE_ENDPOINT, index_name="llamaindex-azure-aisearch-rag-literal-ai", credential=credential)

# Load documents
documents = SimpleDirectoryReader('data/pdf').load_data()

# Create Azure AI Search vector store
vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    index_name="llamaindex-azure-aisearch-rag-literal-ai",
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    embedding_dimensionality=1536, 
    # Additional config...
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Build a query engine with top_k=1 initially
query_engine = index.as_query_engine(similarity_top_k=1)

Context: Here, we use LlamaIndex to chunk and embed PDF documents, store their embeddings in Azure AI Search, and create a query engine that can retrieve documents using different modes (keyword, hybrid, semantic-hybrid). This sets up the retrieval layer for the LLM to ground its responses.
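
As a quick sanity check (and to put the search_client created above to use), you can confirm that chunks actually landed in the index. A minimal sketch using the Azure AI Search SDK:

# Count the documents (chunks) currently in the index; note the count can lag
# briefly right after indexing while documents are still being committed.
doc_count = search_client.get_document_count()
print(f"Indexed chunk count: {doc_count}")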

Figure 1: Azure AI Search Explorer showing indexed documents and embeddings.

4. Testing a Simple Query

Try out a quick query:

response = query_engine.query("Do my benefits cover scuba diving?")
print(response)

You’ll get a baseline answer. The real magic happens when we start experimenting, logging, and scoring these answers systematically.
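
It's also worth peeking at which chunks the retriever actually used to ground that answer. A small sketch using the response's source nodes:

# Inspect the retrieved chunks (and their similarity scores) behind the answer.
for node_with_score in response.source_nodes:
    print(f"score: {node_with_score.score}")
    print(node_with_score.node.get_content()[:200], "...")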

5. Preparing a RAG Evaluation Dataset

We define a set of Q&A pairs to evaluate whether the model’s answers are correct and faithful. This dataset powers the RAGAS evaluation later:

evaluation_data = [
    {
        "question": "What are out-of-network providers and what are the implications ...",
        "ground_truth": "Out-of-network providers are those who have not contracted..."
    },
    # ... more Q&A pairs ...
]

literal_dataset = literalai_client.api.create_dataset(name="Contoso-HR Evaluation Dataset", description="Evaluation dataset for Contoso-HR")

for item in evaluation_data:
    literal_dataset.create_item(
        input={"content": item["question"]},
        expected_output={"content": item["ground_truth"]}
    )

Yay! We now have a dataset for RAG evaluation.
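
Optionally, a quick check that every Q&A pair was registered on the Literal AI side (step 6 below relies on literal_dataset.items being populated):

# Confirm the number of evaluation items created in the Literal AI dataset.
print(f"Created {len(literal_dataset.items)} evaluation items")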

6. Running RAGAS Evaluation

We’ll experiment with various top_k values and retrieval modes (keyword, hybrid, semantic_hybrid) to see which yields the best performance.

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.vector_stores.types import VectorStoreQueryMode

keyword_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.SPARSE)
hybrid_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.HYBRID)
semantic_hybrid_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID)

keyword_query_engine = RetrieverQueryEngine(retriever=keyword_retriever)
hybrid_query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
semantic_hybrid_query_engine = RetrieverQueryEngine(retriever=semantic_hybrid_retriever)

top_k_values = [1, 3, 5, 10, 50]
all_results = []
query_engines = {
    "keyword": keyword_query_engine,
    "hybrid": hybrid_query_engine,
    "semantic_hybrid": semantic_hybrid_query_engine
}

# Convert dataset items to RAGAS format
from datasets import Dataset
questions = [i.input["content"] for i in literal_dataset.items]
ground_truths = [i.expected_output["content"] for i in literal_dataset.items]
data_samples_set = Dataset.from_dict({"question": questions, "ground_truth": ground_truths})

for qe_name, qe in query_engines.items():
    for top_k in top_k_values:
        qe.retriever.similarity_top_k = top_k
        evaluation_results = ragas_evaluate(
            query_engine=qe,
            dataset=data_samples_set,
            metrics=[answer_relevancy, context_precision, context_recall, faithfulness],
            llm=llm,
            embeddings=embed_model
        )

        results_df = evaluation_results.to_pandas()

        # Log experiment
        experiment = literal_dataset.create_experiment(
            name=f"{qe_name.capitalize()} Experiment - Top {top_k} retrieval",
            params=[{"top_k": top_k}, {"query_engine": qe_name}]
        )

        # Log results per item
        experiment_items = literal_dataset.items
        for i, row in results_df.iterrows():
            scores = [{
                "name": m.name,
                "type": "AI",
                "value": row[m.name]
            } for m in [answer_relevancy, context_precision, context_recall, faithfulness]]
            experiment.log({
                "datasetItemId": experiment_items[i].id,
                "scores": scores,
                "input": {"question": row["question"]},
                "output": {"content": row["answer"]}
            })

        # Compute averages
        avg_answer_relevancy = results_df["answer_relevancy"].mean()
        avg_context_precision = results_df["context_precision"].mean()
        avg_context_recall = results_df["context_recall"].mean()
        avg_faithfulness = results_df["faithfulness"].mean()

        all_results.append({
            "engine": qe_name,
            "top_k": top_k,
            "answer_relevancy": avg_answer_relevancy,
            "context_precision": avg_context_precision,
            "context_recall": avg_context_recall,
            "faithfulness": avg_faithfulness
        })
💡
Semantic Hybrid is simply Hybrid Search combined with Azure AI Search’s built-in semantic reranking model. See Raising the bar for RAG excellence: introducing generative query rewriting and new ranking model
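
Before picking a winner, it helps to see the whole grid of results in one table. A minimal sketch using pandas over the all_results list built above:

import pandas as pd

# One row per (engine, top_k) combination, sorted so the most relevant answers come first.
summary_df = pd.DataFrame(all_results).sort_values("answer_relevancy", ascending=False)
print(summary_df.to_string(index=False))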

7. Identifying the Best Configuration

import pprint
best_config = max(all_results, key=lambda x: x["answer_relevancy"])
print("Best configuration by answer_relevancy:")
pprint.pprint(best_config)

Expected Output:

Best configuration by answer_relevancy:
{
 'engine': 'semantic_hybrid',
 'top_k': 50,
 'answer_relevancy': 0.9214,
 'context_precision': ...,
 'context_recall': ...,
 'faithfulness': ...
}

This confirms that leveraging Azure AI Search’s semantic-hybrid retrieval mode with a higher top_k gives the best answer relevancy, aligning with the improved retrieval capabilities announced in Raising the Bar for RAG Excellence: Query Rewriting and New Semantic Ranker.

Let’s also take a look at the Literal AI Dashboard on the Experiments blade.

Figure 2: Detailed Experiment Items and Scores in Literal AI

Figure 2 shows per-item metrics, highlighting how each question’s answer scored on metrics like answer_relevancy, context_precision, etc.

Visualizing Results with Literal AI

Literal AI captures every query, retrieval step, and final answer token, making it easier to understand and debug the pipeline. RAGAS then quantifies performance across multiple metrics (answer relevancy, context precision, context recall, faithfulness).

Figure 3: Single Query Analysis in Literal AI

Figure 3 shows a single query ("Does the deductible roll over into the next year?") and how Literal AI displays the retrieved documents, the final answer, and intermediate steps.

Figure 4: Single query analysis shows the exact retrieval steps and final answer.

In Figure 4, you can click into each step to view the inputs, outputs, and retrieved documents as JSON or YAML.

Going back to the Experiments blade, we can also click “Compare” to compare two experiments (only two are supported at this time). I’d love a holistic view that visualizes all of my experiments at once.

Figure 5: Comparison of top_k results: Semantic_Hybrid @ top_k=50 vs. Keyword @ top_k=1 (worst answer relevancy)

💡
Literal AI, if you’re reading this: I didn’t love the green vs. red colors in the comparison bar charts, since they read as good vs. bad. More neutral colors when comparing, or color coding driven by the evaluation metric being optimized, would be a nice improvement.

These figures illustrate how Literal AI presents scores and experiment comparisons, enabling quick insights into how retrieval strategies impact performance.


Guidance on Azure AI Search Retrieval Configuration

For RAG pipelines, start by testing semantic-hybrid retrieval with a moderate top_k (e.g., 10), then incrementally increase top_k to see whether answer quality improves. Semantic-hybrid mode benefits from internal query rewriting and the new semantic ranker enhancements, enabling more robust retrieval across diverse content types, languages, and query complexities. Consider the following (a starting-point sketch follows the list):

  • Hybrid or Semantic-Hybrid for complex queries and heterogeneous data

  • Adjusting top_k as an iterative tuning step—higher values can improve recall and faithfulness, but also be mindful of latency and cost

  • Experimentation and Iteration: Incorporate evaluation frameworks like RAGAS early to confirm which changes drive improvements.
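
As a concrete starting point, here is a minimal sketch of that recommendation, reusing the index and imports from step 6: a semantic-hybrid query engine with similarity_top_k=10 that you can re-run with larger values as you tune.

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.vector_stores.types import VectorStoreQueryMode

# Baseline: semantic-hybrid retrieval with a moderate top_k; raise top_k iteratively
# and re-run the RAGAS evaluation to confirm whether quality actually improves.
baseline_retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID,
    similarity_top_k=10,
)
baseline_query_engine = RetrieverQueryEngine(retriever=baseline_retriever)
print(baseline_query_engine.query("Does the deductible roll over into the next year?"))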

For more nuanced guidance on query rewriting and the new semantic ranker, read the “Raising the Bar for RAG Excellence” blog.

Conclusion and Next Steps

Evaluating RAG performance is not a one-and-done exercise. By systematically measuring relevancy, context precision, recall, and faithfulness—and by incrementally tuning retrieval modes and parameters—you can continuously enhance user experience. Tools like RAGAS, combined with Literal AI’s observability and Azure AI Search’s advanced retrieval capabilities, make it straightforward to implement a practical, data-driven evaluation loop. Ultimately, this iterative process ensures that your RAG solution consistently delivers reliable and contextually accurate answers.

Happy experimenting!

💡
For the full code and more details, check the provided GitHub notebook: azure-ai-search-literal-ai-ragas.ipynb
