Enhancing RAG with Maximum Marginal Relevance (MMR) in Azure AI Search

💡
DISCLAIMER: The content below is intended for educational purposes. Actual performance and outcomes will depend on your dataset, indexing strategies, and the task you’re trying to solve with your RAG pipeline. Experimentation is strongly recommended.

Retrieval-Augmented Generation (RAG) solutions combine Large Language Models (LLMs) with external data sources to produce factually grounded, context-aware answers. A critical challenge in RAG pipelines, however, is redundancy: the top retrieved documents may be too similar to one another, limiting the diversity of information fed into the LLM. This often results in answers that are thorough along one dimension but miss critical nuances or alternative perspectives.

Maximum Marginal Relevance (MMR) offers a solution. MMR is a post-processing technique that re-ranks retrieved documents to balance both relevance and diversity. By applying MMR, you ensure that each selected document contributes unique, valuable context to the final synthesized answer. The result is more comprehensive, diverse, and potentially more useful responses from the LLM.

Why MMR in RAG Pipelines?

Key Benefits:

  • Mitigate Redundancy: Avoid feeding multiple near-duplicate documents into the LLM.

  • Improve Contextual Quality: Every chunk in the LLM’s context window counts; MMR helps ensure each one adds new insight.

  • Comprehensive Answers: Particularly important for complex queries where multiple angles are needed.

MMR balances relevance (how close to the query) and diversity (how different from already chosen docs) to select a richer set of documents.

💡
In this blog, we won’t evaluate or run a full end-to-end RAG pipeline; instead, we’ll focus on comparing the retrieved results before and after reranking.

Core Concepts of MMR

  • Relevance: Document’s similarity to the query.

  • Diversity: Ensures the returned set of documents is not overly similar to one another.

  • Balancing Parameter (λ): Controls the trade-off between the two:

    • λ = 1: Pure relevance

    • λ = 0: Maximum diversity

    • λ ∈ (0,1): Balances between both

MMR Formula:

MMR(Dᵢ) = λ · Sim(Dᵢ, Q) − (1 − λ) · max₍Dⱼ ∈ S₎ Sim(Dᵢ, Dⱼ)

Where:

  • Sim(Dᵢ, Q) = similarity between candidate document Dᵢ and query Q

  • max₍Dⱼ ∈ S₎ Sim(Dᵢ, Dⱼ) = maximum similarity between candidate document Dᵢ and any already selected document Dⱼ in the selected set S

  • λ = trade-off parameter between relevance and diversity (0 ≤ λ ≤ 1)
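
To make the trade-off concrete, here is a tiny worked example (a minimal sketch with made-up similarity numbers, not from the original post). Candidate A is highly relevant but nearly duplicates an already selected document; candidate B is slightly less relevant but brings new information:

# Hypothetical MMR scores for two candidates (illustrative numbers only)
def mmr_score(relevance: float, max_sim_to_selected: float, lam: float) -> float:
    return lam * relevance - (1 - lam) * max_sim_to_selected

for lam in (1.0, 0.7, 0.3):
    score_a = mmr_score(relevance=0.80, max_sim_to_selected=0.90, lam=lam)  # near-duplicate
    score_b = mmr_score(relevance=0.70, max_sim_to_selected=0.20, lam=lam)  # distinct
    print(f"lambda={lam}: A={score_a:.2f}, B={score_b:.2f}")

With λ = 1 the near-duplicate A wins on raw relevance; at λ = 0.7 or 0.3, the diversity penalty pushes the more distinct candidate B ahead.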

Integrating MMR with Azure AI Search

Azure AI Search provides a scalable retrieval system and vector store for indexing and retrieving documents. After retrieving the top results from Azure AI Search, apply MMR as a post-processing step:

  1. Retrieve Documents with Vector Search: Get top-N documents using Azure AI Search + Azure OpenAI embeddings.

  2. Compute Relevance & Diversity Scores: For each candidate document, calculate its similarity to the query and to already selected docs.

  3. Apply MMR Re-Ranking: Iteratively pick documents that balance relevance and diversity.

  4. Send Re-Ranked Results to the LLM: Feed these improved documents into your LLM for a final, more comprehensive answer.

Step-by-Step Code Walkthrough

Prerequisites:

  • azure-search-documents, openai, and python-dotenv packages installed.

  • Environment variables set for AZURE_SEARCH_SERVICE_ENDPOINT, AZURE_SEARCH_ADMIN_KEY, AZURE_OPENAI_ENDPOINT, and AZURE_OPENAI_API_KEY.
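
For reference, a minimal .env file for these variables might look like the following (placeholder values; substitute your own service names and keys):

AZURE_SEARCH_SERVICE_ENDPOINT=https://<your-search-service>.search.windows.net
AZURE_SEARCH_ADMIN_KEY=<your-search-admin-key>
AZURE_OPENAI_ENDPOINT=https://<your-openai-resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-azure-openai-key>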

1. Install Dependencies

!pip install azure-search-documents==11.6.0b8
!pip install openai==1.45.0
!pip install python-dotenv

2. Imports and Client Configuration

import os
import numpy as np
from typing import List, Dict, Any
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()

# Load environment variables
AZURE_SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
AZURE_SEARCH_ADMIN_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = "2024-10-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME = "text-embedding-3-large"

credential = AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY)
index_name = "your-index-name"
search_client = SearchClient(endpoint=AZURE_SEARCH_SERVICE_ENDPOINT, index_name=index_name, credential=credential)

client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)

def generate_embeddings(text: str):
    embeddings_response = client.embeddings.create(
        model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, 
        input=text, 
        dimensions=3072
    )
    return embeddings_response.data[0].embedding

3. Retrieve Documents with Vector Search

from azure.search.documents.models import VectorizedQuery

query = "What did Prime Minister Tony Blair say about climate change?"
query_embedding = generate_embeddings(query)

vector_query = VectorizedQuery(
    vector=query_embedding, 
    k_nearest_neighbors=50, 
    fields="embedding"
)

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    top=50
)

results_list = list(results)
print(f"Number of results retrieved: {len(results_list)}")

We now have the top 50 most relevant documents (based on vector similarity). Next step: apply MMR.

4. Define MMR Reranking Functions

def cosine_similarity(vec1: List[float], vec2: List[float]) -> float:
    vec1_np = np.array(vec1)
    vec2_np = np.array(vec2)
    denom = (np.linalg.norm(vec1_np) * np.linalg.norm(vec2_np) + 1e-9)
    return float(np.dot(vec1_np, vec2_np) / denom)

def mmr_reranking(
    query_embedding: List[float],
    search_results: List[Dict[str, Any]],
    lambda_param: float = 0.5,
    top_k: int = 10
) -> List[Dict[str, Any]]:
    """Re-rank search results with Maximum Marginal Relevance (MMR)."""
    if not search_results:
        print("Warning: Empty search results.")
        return []

    selected = []
    remaining = search_results.copy()

    # Greedily pick the document with the highest MMR score until we have top_k
    while len(selected) < top_k and remaining:
        mmr_scores = {}
        for i, doc in enumerate(remaining):
            # Relevance: similarity between the candidate document and the query
            relevance_score = cosine_similarity(query_embedding, doc['embedding'])
            # Diversity penalty: highest similarity to any already selected document
            if selected:
                similarities = [cosine_similarity(doc['embedding'], sel_doc['embedding']) for sel_doc in selected]
                diversity_penalty = max(similarities)
            else:
                diversity_penalty = 0.0

            mmr_score = lambda_param * relevance_score - (1 - lambda_param) * diversity_penalty
            mmr_scores[i] = mmr_score

        # Move the best-scoring candidate from 'remaining' to 'selected'
        next_idx = max(mmr_scores.items(), key=lambda x: x[1])[0]
        selected.append(remaining[next_idx])
        remaining.pop(next_idx)

    return selected

5. Apply MMR Reranking

reranked_results = mmr_reranking(
    query_embedding=query_embedding,
    search_results=results_list,
    lambda_param=0.7,  # Slight bias towards relevance
    top_k=10
)

print("\nOriginal Top 10 Titles:")
for i, doc in enumerate(results_list[:10], 1):
    print(f"{i}. {doc['title']}")

print("\nMMR Reranked Top 10 Titles:")
for i, doc in enumerate(reranked_results, 1):
    print(f"{i}. {doc['title']}")

Visualizing the Impact of MMR

To understand how MMR affects diversity, let’s visualize the pairwise document similarity using heatmaps, one for the original top results and one for the MMR-reranked set (described conceptually below, with a plotting sketch at the end of this section):

Interpretation:

  • The left heatmap (Original Results Similarity) often shows clusters of very similar documents.

  • The right heatmap (MMR Reranked Results Similarity) has a more dispersed similarity pattern, indicating a more diverse selection.

Diversity Score:

  • Calculated by averaging the off-diagonal pairwise similarities (see the sketch after this list).

  • Lower after MMR, confirming a more diverse document set.
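
One way to compute that score is sketched below, reusing the cosine_similarity helper defined earlier; the diversity_score function name and exact formulation are illustrative rather than the original post’s code:

def diversity_score(docs: List[Dict[str, Any]]) -> float:
    # Average pairwise (off-diagonal) cosine similarity; lower means more diverse
    sims = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            sims.append(cosine_similarity(docs[i]['embedding'], docs[j]['embedding']))
    return float(np.mean(sims)) if sims else 0.0

print("Original top 10 diversity score:", round(diversity_score(results_list[:10]), 3))
print("MMR reranked diversity score:", round(diversity_score(reranked_results), 3))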

Logic Behind the Heatmap:

  • We compute pairwise cosine similarities for the selected documents.

  • Use seaborn and matplotlib to plot heatmaps.

  • Add text annotations for diversity scores.
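
Here is a rough sketch of that plotting logic, assuming seaborn and matplotlib are installed and reusing the diversity_score helper sketched above; it is not the exact code behind the original figures:

import seaborn as sns
import matplotlib.pyplot as plt

def similarity_matrix(docs):
    # Pairwise cosine similarities between document embeddings
    embs = np.array([doc['embedding'] for doc in docs])
    normalized = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-9)
    return normalized @ normalized.T

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, docs, title in [
    (axes[0], results_list[:10], "Original Results Similarity"),
    (axes[1], reranked_results, "MMR Reranked Results Similarity"),
]:
    sns.heatmap(similarity_matrix(docs), ax=ax, cmap="viridis", vmin=0, vmax=1)
    ax.set_title(f"{title}\nDiversity score: {diversity_score(docs):.3f}")
plt.tight_layout()
plt.show()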

Business and Technical Value

For Business Use Cases:

  • Legal, financial, or regulatory queries benefit from multiple perspectives.

  • Product comparisons, healthcare advice, or travel recommendations get richer, more balanced answers.

For Technical Stakeholders:

  • MMR is easy to integrate as a modular post-processing step.

  • Adjust λ to tune the emphasis on diversity vs. relevance.

  • Pair with metrics-based evaluations (like Azure AI Foundry Evaluations or RAGAS) to quantify improvements.

MMR in RAG Agents

As RAG pipelines serve as the backbone of “Agents”—LLM-driven solutions that use external tools and knowledge sources—MMR ensures these agents reason over a more complete information set. The agent’s outputs become more trustworthy and aligned with user needs, especially for multi-dimensional or ambiguous queries.

Next Steps

  1. Experiment:
    Try different λ values, top_k sizes, and retrieval modes (pure vector, hybrid, semantic-hybrid) in Azure AI Search; a simple λ sweep is sketched after this list.

  2. Evaluate:
    Use evaluation frameworks to measure improvements in answer quality, context precision, and faithfulness to see which queries benefit from diversity in results.

  3. Iterate and Improve:
    MMR is a lever for quality. Combine it with other techniques—query rewriting, semantic reranking, or advanced RAG techniques—to continuously refine your RAG pipeline.
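
As a starting point for the experimentation step above, a simple λ sweep might look like this (a minimal sketch reusing mmr_reranking and the illustrative diversity_score helper from earlier):

# Compare how the diversity of the reranked set changes as lambda varies
for lam in [1.0, 0.9, 0.7, 0.5, 0.3]:
    reranked = mmr_reranking(
        query_embedding=query_embedding,
        search_results=results_list,
        lambda_param=lam,
        top_k=10
    )
    print(f"lambda={lam}: diversity score = {diversity_score(reranked):.3f}")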

Conclusion

MMR provides a powerful method to enhance your Retrieval-Augmented Generation pipelines by ensuring a more diverse set of documents. By plugging MMR into your Azure AI Search workflow, you create richer context windows that help LLMs produce more nuanced and comprehensive answers.

Key Takeaways:

  • MMR balances relevance and diversity, reducing redundancy.

  • Integrates seamlessly as a post-processing step after retrieval.

  • Directly improves the quality and credibility of LLM-generated responses in RAG pipelines.

Try out MMR in your environment, visualize the results, and measure the performance improvements. With iterative tuning and evaluation, MMR can significantly elevate the fidelity and breadth of your generated answers.
