Revolutionizing Document Ingestion & RAG with Docling, Azure AI Search, and Azure OpenAI

In today's AI landscape, building reliable knowledge systems requires more than just powerful language models. Enter Retrieval-Augmented Generation (RAG) – a pattern that enhances AI responses with contextual knowledge. In this guide, we'll build a production-grade RAG pipeline using Docling, Azure AI Search, and Azure OpenAI, taking you from concept to deployment with practical examples and best practices.

Understanding the RAG Architecture

RAG has emerged as a crucial pattern for grounding AI responses in reliable information. Let's understand why this matters and how our tools work together to create a robust solution.

The RAG Pipeline at a Glance

Our pipeline moves documents from raw files to grounded answers in four stages.

Each component in this pipeline serves a specific purpose:

  1. Docling handles document processing and chunking

  2. Azure OpenAI creates semantic embeddings

  3. Azure AI Search manages vector storage and retrieval

  4. The RAG prompt combines retrieved context with user queries

Why These Tools?

Let's examine what makes each component essential for production systems:

Docling's Advanced Document Processing:

  • Handles complex formats (PDFs, PPTX, DOCX) with structure preservation

  • Provides OCR capabilities for image-heavy documents

  • Implements GPU-accelerated processing for speed

  • Maintains hierarchical document structure during chunking

Azure AI Search's Vector Capabilities:

  • Offers efficient HNSW-based vector search

  • Supports hybrid retrieval (combining semantic and keyword search)

  • Provides automatic scaling and maintenance

  • Integrates seamlessly with Azure OpenAI

Azure OpenAI's Features:

  • Delivers state-of-the-art embedding models

  • Ensures enterprise-grade reliability

  • Offers cost-effective API pricing

  • Provides managed inference endpoints

Implementation Deep Dive

Let's walk through each stage of the pipeline with concrete examples and implementation details.

1. Document Processing: From Raw Files to Structured Content

Here's the minimal code needed to process a document with Docling:

from docling.document_converter import DocumentConverter

# Initialize with GPU acceleration if available
converter = DocumentConverter()

# Process a document (supports local files or URLs)
result = converter.convert("path/to/document.pdf")

# Preview the structured content
print(result.document.export_to_markdown()[:500])

When processing documents, Docling handles various challenges (the configuration sketch after this list shows how to make these options explicit):

  • Layout analysis for complex PDFs

  • Table structure preservation

  • Image extraction and OCR when needed

  • Metadata retention for context
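
By default DocumentConverter picks sensible options, but you can make OCR and table-structure recovery explicit when you know your corpus is scan- or table-heavy. A minimal sketch, assuming a recent docling release (exact option names can vary between versions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Turn on OCR for scanned pages and table-structure recovery for table-heavy PDFs.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
# GPU acceleration can also be configured via accelerator options on recent releases.

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("path/to/document.pdf")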

2. Hierarchical Chunking: Preserving Context and Structure

The chunking stage is crucial for effective retrieval. Here's how to implement it:

from docling_core.transforms.chunker import HierarchicalChunker
from rich.console import Console

console = Console()  # rich console used for formatted output throughout

chunker = HierarchicalChunker()
doc_chunks = list(chunker.chunk(result.document))

all_chunks = []
for idx, c in enumerate(doc_chunks):
    chunk_text = c.text
    all_chunks.append((f"chunk_{idx}", chunk_text))

console.print(f"Total chunks from PDF: {len(all_chunks)}")

3. Vector Search Setup: Optimizing for Retrieval

Setting up Azure AI Search vector index:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
)

VECTOR_DIM = 1536  # Adjust based on your chosen embeddings model

index_client = SearchIndexClient(AZURE_SEARCH_ENDPOINT, AzureKeyCredential(AZURE_SEARCH_KEY))

def create_search_index(index_name: str):
    fields = [
        SimpleField(name="chunk_id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            filterable=False,
            sortable=False,
            facetable=False,
            vector_search_dimensions=VECTOR_DIM,
            vector_search_profile_name="default",
        ),
    ]

    vector_search = VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default")],
        profiles=[
            VectorSearchProfile(
                name="default",
                algorithm_configuration_name="default",
                vectorizer_name="default",
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                vectorizer_name="default",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=AZURE_OPENAI_ENDPOINT,
                    deployment_name=AZURE_OPENAI_EMBEDDINGS,
                    model_name="text-embedding-3-small",
                    api_key=AZURE_OPENAI_API_KEY,
                ),
            )
        ],
    )

    new_index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search
    )

    # Drop any existing index with this name so we start from a clean slate
    try:
        index_client.delete_index(index_name)
    except Exception:
        pass

    index_client.create_or_update_index(new_index)
    console.print(f"Index '{index_name}' created.")

create_search_index(AZURE_SEARCH_INDEX_NAME)
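
As a quick sanity check before uploading anything, you can fetch the index back and confirm the fields it exposes; a small sketch using the same index client:

# Fetch the index definition and list its field names
created_index = index_client.get_index(AZURE_SEARCH_INDEX_NAME)
console.print([f.name for f in created_index.fields])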

4. Efficient Batch Processing

Next, we generate an embedding for each chunk and upload the results to Azure AI Search in batches:

from openai import AzureOpenAI
from azure.search.documents import SearchClient
import uuid

search_client = SearchClient(AZURE_SEARCH_ENDPOINT, AZURE_SEARCH_INDEX_NAME, AzureKeyCredential(AZURE_SEARCH_KEY))
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)

def embed_text(text: str):
    response = openai_client.embeddings.create(
        input=text,
        model=AZURE_OPENAI_EMBEDDINGS
    )
    return response.data[0].embedding

upload_docs = []
for chunk_id, chunk_text in all_chunks:
    embedding_vector = embed_text(chunk_text)
    upload_docs.append(
        {
            "chunk_id": str(uuid.uuid4()),
            "content": chunk_text,
            "content_vector": embedding_vector,
        }
    )

BATCH_SIZE = 250
for i in range(0, len(upload_docs), BATCH_SIZE):
    subset = upload_docs[i : i + BATCH_SIZE]
    resp = search_client.upload_documents(documents=subset)
    console.print(
        f"Uploaded batch {i} -> {i+len(subset)}; success: {resp[0].succeeded}, status code: {resp[0].status_code}"
    )

console.print("All chunks uploaded to Azure Search.")

5. RAG Query Implementation

Here's a complete example of implementing RAG queries:

from azure.search.documents.models import VectorizableTextQuery

def generate_chat_response(prompt: str, system_message: str = None):
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": prompt})

    completion = openai_client.chat.completions.create(
        model=AZURE_OPENAI_CHAT_MODEL,
        messages=messages,
        temperature=0.7
    )
    return completion.choices[0].message.content

user_query = "in 2024, AI companies reached how many $$$ in value?"
user_embed = embed_text(user_query)

vector_query = VectorizableTextQuery(
    text=user_query, # passing in text for a hybrid search
    k_nearest_neighbors=5,
    fields="content_vector"
)

search_results = search_client.search(
    search_text=user_query,
    vector_queries=[vector_query],
    select=["content"],
    top=10
)

retrieved_chunks = []
for result in search_results:
    snippet = result["content"]
    retrieved_chunks.append(snippet)

context_str = "\n---\n".join(retrieved_chunks)
rag_prompt = f"""
You are an AI assistant helping answer questions about the State of AI 2024 Report.
Use ONLY the text below to answer the user's question.
If the answer isn't in the text, say you don't know.

Context:
{context_str}

Question: {user_query}
Answer:
"""

final_answer = generate_chat_response(rag_prompt)

console.print(Panel(rag_prompt, title="RAG Prompt", style="bold red"))
console.print(Panel(final_answer, title="RAG Response", style="bold green"))
💡 Note: the query text is vectorized automatically by Azure AI Search (integrated vectorization via the index's vectorizer), so we never have to embed the query ourselves at query time.

Sample output:

Answer: AI companies reached $9T in value in 2024.
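
If your index doesn't define a vectorizer, you can embed the query yourself and pass the vector explicitly with VectorizedQuery instead; a minimal sketch reusing embed_text from earlier:

from azure.search.documents.models import VectorizedQuery

# Precompute the query embedding and pass the raw vector to the search call.
vector_query = VectorizedQuery(
    vector=embed_text(user_query),
    k_nearest_neighbors=5,
    fields="content_vector",
)
search_results = search_client.search(
    search_text=user_query,  # still hybrid: keyword + vector
    vector_queries=[vector_query],
    select=["content"],
    top=10,
)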

Going Further

Want to explore more? Here are some advanced topics to consider:

  1. Enhanced Retrieval:

    • Experiment with scoring profiles

    • Add re-ranking strategies such as Semantic Ranker in Azure AI Search (see the sketch after this list)

    • Try Query Rewriting (preview) in Azure AI Search

  2. Quality Improvements:

    • Add relevance feedback loops

    • Implement chunk quality scoring

    • Monitor and tune retrieval performance
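
For the semantic ranker idea, here's a sketch of what the query might look like, assuming your index also defines a semantic configuration (the name "my-semantic-config" below is hypothetical) and your search service tier supports semantic ranking:

# Re-rank the hybrid results with Azure AI Search's semantic ranker.
results = search_client.search(
    search_text=user_query,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",  # hypothetical name
    select=["content"],
    top=10,
)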

Conclusion

This RAG pipeline streamlines your end-to-end process—from ingesting documents with Docling to generating final answers with Azure OpenAI. It’s a robust and flexible foundation to build upon for content-heavy domains like legal, medical, finance, or any scenario where document ingestion is mission-critical.
