Revolutionizing Document Ingestion & RAG with Docling, Azure AI Search, and Azure OpenAI

In today's AI landscape, building reliable knowledge systems requires more than just powerful language models. Enter Retrieval-Augmented Generation (RAG) – a pattern that enhances AI responses with contextual knowledge. In this guide, we'll build a production-grade RAG pipeline using Docling, Azure AI Search, and Azure OpenAI, taking you from concept to deployment with practical examples and best practices.

Understanding the RAG Architecture

RAG has emerged as a crucial pattern for grounding AI responses in reliable information. Let's understand why this matters and how our tools work together to create a robust solution.

The RAG Pipeline at a Glance

Our pipeline moves documents from raw files to grounded answers in four stages.

Each component in this pipeline serves a specific purpose:

  1. Docling handles document processing and chunking

  2. Azure OpenAI creates semantic embeddings

  3. Azure AI Search manages vector storage and retrieval

  4. The RAG prompt combines retrieved context with user queries

Why These Tools?

Let's examine what makes each component essential for production systems:

Docling's Advanced Document Processing:

  • Handles complex formats (PDFs, PPTX, DOCX) with structure preservation

  • Provides OCR capabilities for image-heavy documents

  • Implements GPU-accelerated processing for speed

  • Maintains hierarchical document structure during chunking

Azure AI Search's Vector Capabilities:

  • Offers efficient HNSW-based vector search

  • Supports hybrid retrieval (combining semantic and keyword search)

  • Provides automatic scaling and maintenance

  • Integrates seamlessly with Azure OpenAI

Azure OpenAI's Features:

  • Delivers state-of-the-art embedding models

  • Ensures enterprise-grade reliability

  • Offers cost-effective API pricing

  • Provides managed inference endpoints

Implementation Deep Dive

Let's walk through each stage of the pipeline with concrete examples and implementation details.

1. Document Processing: From Raw Files to Structured Content

Here's the minimal code needed to process a document with Docling:

from docling.document_converter import DocumentConverter

# Initialize with GPU acceleration if available
converter = DocumentConverter()

# Process a document (supports local files or URLs)
result = converter.convert("path/to/document.pdf")

# Preview the structured content
print(result.document.export_to_markdown()[:500])

When processing documents, Docling handles various challenges (the configuration sketch after this list shows how to make these options explicit):

  • Layout analysis for complex PDFs

  • Table structure preservation

  • Image extraction and OCR when needed

  • Metadata retention for context
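
By default DocumentConverter picks sensible options, but you can make OCR and table-structure recovery explicit when you know your corpus is scan- or table-heavy. A minimal sketch, assuming a recent docling release (exact option names can vary between versions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Turn on OCR for scanned pages and table-structure recovery for table-heavy PDFs.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
# GPU acceleration can also be configured via accelerator options on recent releases.

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/document.pdf")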

2. Hierarchical Chunking: Preserving Context and Structure

The chunking stage is crucial for effective retrieval. Here's how to implement it:

from docling_core.transforms.chunker import HierarchicalChunker
from rich.console import Console

console = Console()  # rich console used for formatted output throughout

chunker = HierarchicalChunker()
doc_chunks = list(chunker.chunk(result.document))

all_chunks = []
for idx, c in enumerate(doc_chunks):
    chunk_text = c.text
    all_chunks.append((f"chunk_{idx}", chunk_text))

console.print(f"Total chunks from PDF: {len(all_chunks)}")

3. Vector Search Setup: Optimizing for Retrieval

Setting up Azure AI Search vector index:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
)

VECTOR_DIM = 1536  # Adjust based on your chosen embeddings model

index_client = SearchIndexClient(AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY))

def create_search_index(index_name: str):
    fields = [
        SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            filterable=False,
            sortable=False,
            facetable=False,
            vector_search_dimensions=VECTOR_DIM,
            vector_search_profile_name="default",
        ),
    ]

    vector_search = VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default")],
        profiles=[
            VectorSearchProfile(
                name="default",
                algorithm_configuration_name="default",
                vectorizer_name="default",
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                vectorizer_name="default",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=AZURE_OPENAI_ENDPOINT,
                    deployment_name=AZURE_OPENAI_EMBEDDINGS,
                    model_name="text-embedding-3-small",
                    api_key=AZURE_OPENAI_API_KEY,
                ),
            )
        ],
    )

    new_index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search
    )

    # Drop any existing index with this name so we start from a clean slate
    try:
        index_client.delete_index(index_name)
    except Exception:
        pass

    index_client.create_or_update_index(new_index)
    console.print(f"Index '{index_name}' created.")

create_search_index(AZURE_SEARCH_INDEX_NAME)
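
As a quick sanity check before uploading anything, you can fetch the index back and confirm the fields it exposes; a small sketch using the same index client:

# Fetch the index definition and list its field names
created_index = index_client.get_index(AZURE_SEARCH_INDEX_NAME)
console.print([f.name for f in created_index.fields])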

4. Efficient Batch Processing

Next, we generate an embedding for each chunk and upload the results to Azure AI Search in batches:

from openai import AzureOpenAI
from azure.search.documents import SearchClient
import uuid

search_client = SearchClient(AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY))
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)

def embed_text(text: str):
    response = openai_client.embeddings.create(
        input=text,
        model=AZURE_OPENAI_EMBEDDINGS
    )
    return response.data[0].embedding

upload_docs = []
for chunk_id, chunk_text in all_chunks:
    embedding_vector = embed_text(chunk_text)
    upload_docs.append(
        {
            "chunk_id": str(uuid.uuid4()),
            "content": chunk_text,
            "content_vector": embedding_vector,
        }
    )

BATCH_SIZE = 250
for i in range(0, len(upload_docs), BATCH_SIZE):
    subset = upload_docs[i : i + BATCH_SIZE]
    resp = search_client.upload_documents(documents=subset)
    console.print(
        f"Uploaded batch {i} -> {i+len(subset)}; success: {resp[0].succeeded}, status code: {resp[0].status_code}"
    )

console.print("All chunks uploaded to Azure Search.")

5. RAG Query Implementation

Here's a complete example of implementing RAG queries:

from azure.search.documents.models import VectorizableTextQuery

def generate_chat_response(prompt: str, system_message: str = None):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})

    completion = openai_client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_MODEL,
        messages=messages,
        temperature=0.7
    )
    return completion.choices[0].message.content

user_query = "in 2024, AI companies reached how many $$$ in value?"
user_embed = embed_text(user_query)

vector_query = VectorizableTextQuery(
    text=user_query, # passing in text for a hybrid search
    k_nearest_neighbors=5,
    fields="content_vector"
)

search_results = search_client.search(
    search_text=user_query,
    vector_queries=[vector_query],
    select=["content"],
    top=10
)

retrieved_chunks = []
for result in search_results:
    snippet = result["content"]
    retrieved_chunks.append(snippet)

context_str = "\n---\n".join(retrieved_chunks)
rag_prompt = f"""
You are an AI assistant helping answer questions about the State of AI 2024 Report.
Use ONLY the text below to answer the user's question.
If the answer isn't in the text, say you don't know.

Context:
{context_str}

Question: {user_query}
Answer:
"""

final_answer = generate_chat_response(rag_prompt)

console.print(Panel(rag_prompt, title="RAG Prompt", style="bold red"))
console.print(Panel(final_answer, title="RAG Response", style="bold green"))
💡 Note: the query text is vectorized automatically by Azure AI Search (integrated vectorization via the index's vectorizer), so we never have to embed the query ourselves at query time.

Sample output:

Answer: AI companies reached $9T in value in 2024.
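
If your index doesn't define a vectorizer, you can embed the query yourself and pass the vector explicitly with VectorizedQuery instead; a minimal sketch reusing embed_text from earlier:

from azure.search.documents.models import VectorizedQuery

# Precompute the query embedding and pass the raw vector to the search call.
vector_query = VectorizedQuery(
    vector=embed_text(user_query),
    k_nearest_neighbors=5,
    fields="content_vector",
)
search_results = search_client.search(
    search_text=user_query,  # still hybrid: keyword + vector
    vector_queries=[vector_query],
    select=["content"],
    top=10,
)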

Going Further

Want to explore more? Here are some advanced topics to consider:

  1. Enhanced Retrieval:

    • Experiment with scoring profiles

    • Add re-ranking strategies such as Semantic Ranker in Azure AI Search (see the sketch after this list)

    • Try Query Rewriting (preview) in Azure AI Search

  2. Quality Improvements:

    • Add relevance feedback loops

    • Implement chunk quality scoring

    • Monitor and tune retrieval performance
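
For the semantic ranker idea, here's a sketch of what the query might look like, assuming your index also defines a semantic configuration (the name "my-semantic-config" below is hypothetical) and your search service tier supports semantic ranking:

# Re-rank the hybrid results with Azure AI Search's semantic ranker.
results = search_client.search(
    search_text=user_query,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",  # hypothetical name
    select=["content"],
    top=10,
)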

Conclusion

This RAG pipeline streamlines your end-to-end process—from ingesting documents with Docling to generating final answers with Azure OpenAI. It’s a robust and flexible foundation to build upon for content-heavy domains like legal, medical, finance, or any scenario where document ingestion is mission-critical.
