Building and Evaluating a Retrieval Augmented Generation (RAG) Pipeline with LlamaIndex, Azure AI Search, Azure OpenAI, Literal AI, and RAGAS
Table of contents
- Why These Tools?
- Prerequisites
- Step-by-Step Setup and Code Walkthrough
  - 1. Installing Dependencies
  - 2. Importing and Initializing Components
  - 3. Loading Data and Building the Vector Index
  - 4. Testing a Simple Query
  - 5. Preparing a RAG Evaluation Dataset
  - 6. Running RAGAS Evaluation
  - 7. Identifying the Best Configuration
- Visualizing Results with Literal AI
- Guidance on Azure AI Search Retrieval Configuration
- Conclusion and Next Steps
Retrieval Augmented Generation (RAG) solutions combine Large Language Models (LLMs) with external retrieval sources to return factually grounded, context-aware answers. While constructing a RAG pipeline is a major step, rigorous evaluation is what empowers you to iteratively refine, tune parameters, and optimize your retrieval system for production-scale performance.
In this post, we’ll walk through a practical, code-driven example showing how to:
- Construct a RAG pipeline with LlamaIndex and Azure AI Search
- Integrate Azure OpenAI as the LLM provider
- Use Literal AI to log and visualize queries and responses for troubleshooting and auditing
- Employ RAGAS to evaluate multiple performance metrics, such as answer relevancy, context precision, context recall, and faithfulness
Key Insight: By experimenting with different retrieval modes and tuning top_k values, we found that semantic-hybrid retrieval with top_k=50 produced the highest answer relevancy. This underscores how systematic evaluation and data-driven insights can guide your configuration choices for Azure AI Search retrieval.
Why These Tools?
Azure AI Search:
Azure AI Search serves as a flexible vector store and search platform that supports multiple retrieval modes (keyword, vector, hybrid, semantic-hybrid). It lets you index documents, create embeddings, and efficiently surface the best chunks for your LLM to consume. It’s essential for building scalable RAG pipelines that handle diverse content types and query variations.
LlamaIndex:
LlamaIndex orchestrates the retrieval workflow. It manages how documents are chunked, embedded, and indexed, making it easier to experiment with different vector stores (like Azure AI Search) and LLM providers. LlamaIndex acts as the glue between your data layer and the LLM.
Azure OpenAI Service:
Azure OpenAI provides state-of-the-art LLMs (including variants of GPT models) with enterprise-level reliability and security. It’s the language engine that generates the final responses, ensuring that once we’ve retrieved the most relevant content, we produce coherent, contextually accurate answers.
Literal AI:
Literal AI is a logging and observability platform for LLM pipelines. It tracks each query, retrieval step, and response. By visualizing these steps, you can diagnose issues, understand how retrieval changes affect outputs, and confidently iterate on improvements.
RAGAS:
RAGAS offers a structured evaluation framework. Beyond simple relevancy, it quantifies how faithful answers are to the retrieved content and how precise and complete the returned contexts are. This advanced metric suite ensures that you’re not just tuning blindly—you’re refining based on concrete, multi-dimensional feedback.
Check out the RAGAS documentation for more details on metrics and usage.
Prerequisites
- Azure AI Search: For indexing and retrieval
- Azure OpenAI: For LLM-based generation
- Python 3.8+ and a Jupyter environment
- Basic knowledge of Python, LLM fundamentals, and environment configuration
Package Versions:
- llama-index: 0.11.17 (ensure >0.10.28 for Literal AI usage)
- ragas: 0.1.21
- azure-search-documents: 11.6.0b8
- llama-index-vector-stores-azureaisearch: 0.3.0
- llama-index-llms-azure-openai: 0.2.1
- llama-index-embeddings-azure-openai: 0.2.5
- literalai: 0.0.623
- datasets: 3.0.0
- nest_asyncio: 1.6.0
Step-by-Step Setup and Code Walkthrough
1. Installing Dependencies
Start by installing required packages in your Jupyter environment or terminal:
!pip install llama-index==0.11.17
!pip install ragas==0.1.21
!pip install azure-search-documents==11.6.0b8
!pip install llama-index-vector-stores-azureaisearch==0.3.0
!pip install llama-index-llms-azure-openai==0.2.1
!pip install llama-index-embeddings-azure-openai==0.2.5
!pip install literalai==0.0.623
!pip install datasets==3.0.0
!pip install nest_asyncio==1.6.0
These packages allow us to create indexes via LlamaIndex orchestration, integrate Azure AI Search, call Azure OpenAI, observe and log with Literal AI, and evaluate with RAGAS.
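If you want to confirm that the pinned versions resolved correctly before moving on, a quick sanity check works well (a minimal sketch; add or drop package names as you see fit):
# Print the installed versions of the key packages
from importlib.metadata import version
for pkg in ["llama-index", "ragas", "azure-search-documents", "literalai", "datasets"]:
    print(pkg, version(pkg))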
2. Importing and Initializing Components
import os
from dotenv import load_dotenv
from llama_index.core import Settings, StorageContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from literalai import LiteralClient
from ragas.integrations.llama_index import evaluate as ragas_evaluate  # LlamaIndex integration: evaluates a query engine against a dataset
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness
load_dotenv()
# Load environment variables for Azure OpenAI, Azure AI Search, and Literal AI
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME")
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME")
SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
SEARCH_SERVICE_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
LITERAL_API_KEY = os.getenv('LITERAL_API_KEY')
# Initialize LLM and Embedding
llm = AzureOpenAI(
model=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
deployment_name=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
api_key=AZURE_OPENAI_API_KEY,
azure_endpoint=AZURE_OPENAI_ENDPOINT,
api_version="2024-10-01-preview"
)
embed_model = AzureOpenAIEmbedding(
model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
deployment_name=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
api_key=AZURE_OPENAI_API_KEY,
azure_endpoint=AZURE_OPENAI_ENDPOINT,
api_version="2024-10-01-preview"
)
# Register the Azure OpenAI models as LlamaIndex defaults so indexing and querying use them
Settings.llm = llm
Settings.embed_model = embed_model
# Initialize Literal AI client and instrument LlamaIndex
literalai_client = LiteralClient(api_key=LITERAL_API_KEY)
literalai_client.instrument_llamaindex()
What’s happening here?
- We load configuration values from .env (see the sample .env below).
- We set up the LLM (Azure OpenAI) and the embedding model, and register them as LlamaIndex defaults via Settings.
- We instantiate the Literal AI client and instrument LlamaIndex for logging.
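For reference, a sample .env might look like the following. The variable names match the ones read above; every value is a placeholder to replace with your own endpoints, keys, and deployment names:
AZURE_OPENAI_ENDPOINT=https://<your-openai-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME=<your-chat-deployment-name>
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME=<your-embedding-deployment-name>
AZURE_SEARCH_SERVICE_ENDPOINT=https://<your-search-service>.search.windows.net
AZURE_SEARCH_ADMIN_KEY=<your-search-admin-key>
LITERAL_API_KEY=<your-literal-ai-api-key>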
3. Loading Data and Building the Vector Index
We assume you have some PDF documents in data/pdf. Continuing the series, we’ll use our Contoso-HR documents. Let’s load and index them with LlamaIndex and Azure AI Search:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from llama_index.vector_stores.azureaisearch import AzureAISearchVectorStore, IndexManagement
credential = AzureKeyCredential(SEARCH_SERVICE_API_KEY)
index_client = SearchIndexClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=credential)
search_client = SearchClient(endpoint=SEARCH_SERVICE_ENDPOINT, index_name="llamaindex-azure-aisearch-rag-literal-ai", credential=credential)
# Load documents
documents = SimpleDirectoryReader('data/pdf').load_data()
# Create Azure AI Search vector store
vector_store = AzureAISearchVectorStore(
search_or_index_client=index_client,
index_name="llamaindex-azure-aisearch-rag-literal-ai",
index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
embedding_dimensionality=1536,
# Additional config...
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
# Build a query engine with top_k=1 initially
query_engine = index.as_query_engine(similarity_top_k=1)
Context: Here, we use LlamaIndex to chunk and embed PDF documents, store their embeddings in Azure AI Search, and create a query engine that can retrieve documents using different modes (keyword, hybrid, semantic-hybrid). This sets up the retrieval layer for the LLM to ground its responses.
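If you want tighter control over chunking, LlamaIndex lets you pass a node parser as a transformation when building the index. A minimal sketch that would replace the from_documents call above (the chunk_size and chunk_overlap values are illustrative, not tuned):
from llama_index.core.node_parser import SentenceSplitter
# Illustrative chunking settings; tune these for your own documents
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[splitter],
)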
Figure 1: Azure AI Search Explorer showing indexed documents and embeddings.
4. Testing a Simple Query
Try out a quick query:
response = query_engine.query("Do my benefits cover scuba diving?")
print(response)
You’ll get a baseline answer. The real magic happens when we start experimenting, logging, and scoring these answers systematically.
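Before scoring anything, it’s worth peeking at what the retriever actually returned. The response object exposes the retrieved chunks and their scores, so a quick inspection might look like this:
# Inspect the retrieved chunks backing the answer
for source_node in response.source_nodes:
    print("score:", source_node.score)
    print(source_node.node.get_content()[:200])
    print("---")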
5. Preparing a RAG Evaluation Dataset
We define a set of Q&A pairs to evaluate whether the model’s answers are correct and faithful. This dataset powers the RAGAS evaluation later:
evaluation_data = [
{
"question": "What are out-of-network providers and what are the implications ...",
"ground_truth": "Out-of-network providers are those who have not contracted..."
},
# ... more Q&A pairs ...
]
literal_dataset = literalai_client.api.create_dataset(name="Contoso-HR Evaluation Dataset", description="Evaluation dataset for Contoso-HR")
for item in evaluation_data:
literal_dataset.create_item(
input={"content": item["question"]},
expected_output={"content": item["ground_truth"]}
)
Yay! We now have a dataset for RAG evaluation.
6. Running RAGAS Evaluation
We’ll experiment with various top_k values and retrieval modes (keyword, hybrid, semantic_hybrid) to see which yields the best performance.
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.vector_stores.types import VectorStoreQueryMode
keyword_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.SPARSE)
hybrid_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.HYBRID)
semantic_hybrid_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID)
keyword_query_engine = RetrieverQueryEngine(retriever=keyword_retriever)
hybrid_query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)
semantic_hybrid_query_engine = RetrieverQueryEngine(retriever=semantic_hybrid_retriever)
top_k_values = [1, 3, 5, 10, 50]
all_results = []
query_engines = {
"keyword": keyword_query_engine,
"hybrid": hybrid_query_engine,
"semantic_hybrid": semantic_hybrid_query_engine
}
# Convert dataset items to RAGAS format
from datasets import Dataset
questions = [i.input["content"] for i in literal_dataset.items]
ground_truths = [i.expected_output["content"] for i in literal_dataset.items]
data_samples_set = Dataset.from_dict({"question": questions, "ground_truth": ground_truths})
for qe_name, qe in query_engines.items():
for top_k in top_k_values:
qe.retriever.similarity_top_k = top_k
evaluation_results = ragas_evaluate(
query_engine=qe,
dataset=data_samples_set,
metrics=[answer_relevancy, context_precision, context_recall, faithfulness],
llm=llm,
embeddings=embed_model
)
results_df = evaluation_results.to_pandas()
# Log experiment
experiment = literal_dataset.create_experiment(
name=f"{qe_name.capitalize()} Experiment - Top {top_k} retrieval",
params=[{"top_k": top_k}, {"query_engine": qe_name}]
)
# Log results per item
experiment_items = literal_dataset.items
for i, row in results_df.iterrows():
scores = [{
"name": m.name,
"type": "AI",
"value": row[m.name]
} for m in [answer_relevancy, context_precision, context_recall, faithfulness]]
experiment.log({
"datasetItemId": experiment_items[i].id,
"scores": scores,
"input": {"question": row["question"]},
"output": {"content": row["answer"]}
})
# Compute averages
avg_answer_relevancy = results_df["answer_relevancy"].mean()
avg_context_precision = results_df["context_precision"].mean()
avg_context_recall = results_df["context_recall"].mean()
avg_faithfulness = results_df["faithfulness"].mean()
all_results.append({
"engine": qe_name,
"top_k": top_k,
"answer_relevancy": avg_answer_relevancy,
"context_precision": avg_context_precision,
"context_recall": avg_context_recall,
"faithfulness": avg_faithfulness
})
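With every combination evaluated, it helps to line up the aggregated results before picking a winner. A small sketch using pandas:
import pandas as pd
# Aggregate metrics across retrieval modes and top_k values
summary_df = pd.DataFrame(all_results)
print(summary_df.sort_values("answer_relevancy", ascending=False))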
7. Identifying the Best Configuration
import pprint
best_config = max(all_results, key=lambda x: x["answer_relevancy"])
print("Best configuration by answer_relevancy:")
pprint.pprint(best_config)
Expected Output:
Best configuration by answer_relevancy:
{
'engine': 'semantic_hybrid',
'top_k': 50,
'answer_relevancy': 0.9214,
'context_precision': ...,
'context_recall': ...,
'faithfulness': ...
}
This confirms that leveraging Azure AI Search’s semantic-hybrid retrieval mode with a higher top_k gives the best answer relevancy, aligning with the improved retrieval capabilities announced in Raising the Bar for RAG Excellence: Query Rewriting and New Semantic Ranker.
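To carry this winning configuration forward into your application, you can build a dedicated query engine with the same retriever pattern used in step 6 (a minimal sketch; the mode and top_k mirror the best configuration above):
# Query engine using the best-performing configuration: semantic-hybrid with top_k=50
best_retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID,
    similarity_top_k=50,
)
best_query_engine = RetrieverQueryEngine(retriever=best_retriever)
print(best_query_engine.query("Does the deductible roll over into the next year?"))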
Let’s also take a look at the Literal AI Dashboard on the Experiments blade.
Figure 2: Detailed Experiment Items and Scores in Literal AI
Figure 2 shows per-item metrics, highlighting how each question’s answer scored on metrics like answer_relevancy, context_precision, etc.
Visualizing Results with Literal AI
Literal AI captures every query, retrieval step, and final answer token, making it easier to understand and debug the pipeline. RAGAS then quantifies performance across multiple metrics (answer relevancy, context precision, context recall, faithfulness).
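Because we instrumented the client in step 2, LlamaIndex calls are logged automatically. If you’d like related queries grouped together in the dashboard, the literalai client also offers a thread context manager; a hedged sketch, assuming your literalai version exposes thread() (the thread name is arbitrary):
# Group related queries under a named thread in Literal AI (name is illustrative)
with literalai_client.thread(name="Contoso-HR benefits questions"):
    print(query_engine.query("Does the deductible roll over into the next year?"))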
Figure 3: Single Query Analysis in Literal AI
Figure 3 shows a single query ("Does the deductible roll over into the next year?") and how Literal AI displays the retrieved documents, the final answer, and intermediate steps.
Figure 4: Single query analysis shows the exact retrieval steps and final answer.
In Figure 4, you can click into each step to visualize in JSON or YAML the inputs / outputs and the retrieved documents.
Going back to the Experiments blade, we can also click “Compare” to compare two experiments (only two are supported at this time). I’d love a holistic view that visualizes all of my experiments at once.
Figure 5: Comparison of different top_k results: semantic_hybrid @ top_k=50 vs. keyword @ top_k=1 (worst answer relevancy)
These figures illustrate how Literal AI presents scores and experiment comparisons, enabling quick insights into how retrieval strategies impact performance.
Guidance on Azure AI Search Retrieval Configuration
For RAG pipelines, start by testing semantic-hybrid retrieval with a moderate top_k (e.g., 10), then incrementally increase top_k to see if answer quality improves. Semantic-hybrid mode benefits from internal query rewriting and new semantic ranker enhancements, enabling more robust retrieval across diverse content types, languages, and query complexities. Consider:
- Hybrid or semantic-hybrid retrieval for complex queries and heterogeneous data
- Adjusting top_k as an iterative tuning step: higher values can improve recall and faithfulness, but be mindful of latency and cost
- Experimentation and iteration: incorporate evaluation frameworks like RAGAS early to confirm which changes drive improvements
For more nuanced guidance on query rewriting and the new semantic ranker, read the “Raising the Bar for RAG Excellence” blog.
Conclusion and Next Steps
Evaluating RAG performance is not a one-and-done exercise. By systematically measuring relevancy, context precision, recall, and faithfulness—and by incrementally tuning retrieval modes and parameters—you can continuously enhance user experience. Tools like RAGAS, combined with Literal AI’s observability and Azure AI Search’s advanced retrieval capabilities, make it straightforward to implement a practical, data-driven evaluation loop. Ultimately, this iterative process ensures that your RAG solution consistently delivers reliable and contextually accurate answers.
Happy experimenting!