Unlocking Powerful Multimodal Retrieval with voyage-multimodal-3 Embeddings in Azure AI Search

Voyage AI recently announced a major breakthrough in multimodal embeddings with the release of voyage-multimodal-3. This state-of-the-art model captures both textual and visual features in a unified vector space, enabling seamless retrieval-augmented generation (RAG) and semantic search over complex documents containing interleaved text and images. In benchmark evaluations, voyage-multimodal-3 significantly outperformed existing models like OpenAI CLIP and Cohere multimodal v3.

In this post, we'll walk through how to leverage voyage-multimodal-3 with Azure AI Search to unlock powerful new RAG capabilities for your applications. We'll explore how to generate multimodal embeddings, perform mixed content analysis, and integrate these embeddings into Azure AI Search for efficient retrieval.

Setup

First, ensure you have the following prerequisites:

  • Azure Subscription with access to Azure AI Search

  • Voyage API Key (sign up at Voyage AI)
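If you keep these credentials in a local .env file (loaded below with python-dotenv), it could look like the following; the values are placeholders, and the variable names match what the code reads:

# .env (placeholder values, fill in your own key and endpoint)
VOYAGE_API_KEY=<your-voyage-api-key>
AZURE_SEARCH_SERVICE_ENDPOINT=https://<your-search-service>.search.windows.net
AZURE_SEARCH_ADMIN_KEY=<your-search-admin-key>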

Install the Required Python Packages

Install the required Python packages:

!pip install "voyageai>=0.3.0" azure-search-documents python-dotenv pandas PyMuPDF pillow numpy tenacity

Authenticate Clients

Authenticate the Voyage and Azure clients using your API keys and endpoints:

import os
import voyageai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
import numpy as np
from PIL import Image
from io import BytesIO
import urllib.request
import fitz  # PyMuPDF
from dotenv import load_dotenv

load_dotenv()

VOYAGE_API_KEY = os.getenv("VOYAGE_API_KEY")
AZURE_SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
AZURE_SEARCH_ADMIN_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
INDEX_NAME = "multimodal-voyage-index"

azure_search_credential = AzureKeyCredential(AZURE_SEARCH_ADMIN_KEY)
index_client = SearchIndexClient(
    endpoint=AZURE_SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential
)
search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE_ENDPOINT,
    index_name=INDEX_NAME,
    credential=azure_search_credential,
)

voyage_client = voyageai.Client(api_key=VOYAGE_API_KEY)
MODEL_NAME = "voyage-multimodal-3"

Generating Multimodal Embeddings

A key advantage of voyage-multimodal-3 is its ability to embed interleaved text and images together in a shared vector space. Previous approaches like CLIP process text and images through separate encoder networks.

Let's embed some sample documents mixing text and images in different orders.

Load Image and Prepare Text

# Helper function to load and resize images from URL
def load_image_from_url(url: str, size: tuple = (256, 256)) -> Image.Image:
    with urllib.request.urlopen(url) as response:
        data = BytesIO(response.read())
    return Image.open(data).resize(size)

# Image URL of cows grazing
image_url = "https://portal.vision.cognitive.azure.com/dist/assets/ImageCaptioningSample1-bbe41ac5.png"  
image = load_image_from_url(image_url)

# Text description
text = "The image showcases a peaceful rural scene with several cows leisurely grazing in a sunlit pasture."

Here's a lovely image of cows on a farm, loaded from the URL above, that we'll use in this example.

Generate Embeddings

# Generate embeddings for different input types
documents = [
    [text],           # Text only
    [image],          # Image only
    [text, image],    # Text followed by image
    [image, text],    # Image followed by text
]

result = voyage_client.multimodal_embed(
    inputs=documents, model=MODEL_NAME, input_type="document"
)

Analyze Cosine Similarities

Comparing the cosine similarities between the different embeddings shows that voyage-multimodal-3 effectively preserves semantic meaning whether the image or text comes first.

def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Compare similarities between different combinations of documents
similarities = []
for i in range(len(result.embeddings)):
    for j in range(i + 1, len(result.embeddings)):
        sim = cosine_similarity(result.embeddings[i], result.embeddings[j])
        similarities.append((i, j, sim))

# Display similarity results
print("\nSimilarity Analysis:")
for i, j, sim in similarities:
    print(f"Similarity between document {i} and {j}: {sim:.4f}")

Output:

Similarity Analysis:
Similarity between document 0 and 1: 0.6536
Similarity between document 0 and 2: 0.8529
Similarity between document 0 and 3: 0.8049
Similarity between document 1 and 2: 0.8546
Similarity between document 1 and 3: 0.9080
Similarity between document 2 and 3: 0.9675

Analysis:

  • Document 0 (Text only) and Document 1 (Image only) have a moderate similarity, showing the model's ability to associate textual descriptions with visual content.

  • The high similarity between Document 2 and Document 3 indicates that the order of text and image has minimal impact on the embedding's semantic meaning.

  • Overall, voyage-multimodal-3 captures the semantic relationships effectively across different modalities.


Mixed Content Analysis

Voyage-multimodal-3 can embed documents containing a mix of structured and unstructured data. Here, we'll use a real-world example featuring my new favorite video game, Call of Duty: Black Ops 6, on the Xbox Marketplace: we'll embed a product metadata JSON object along with description text and an image.

Prepare Product Metadata

import json

# Define product metadata
product_metadata = {
    "title": "Call of Duty: Black Ops 6 Vault Edition Upgrade (Windows)",
    "description": "Experience intense violence, strong language, and thrilling multiplayer action.",
    "image_url": "https://store-images.s-microsoft.com/image/apps.25689.13909193928944040.073f4f1c-2a1f-4b23-bd2c-278a9a4a4755.72a4aa62-4cda-4a17-874a-abe1caed1db8?q=90&w=177&h=265",
    "minimum_requirements": {
        "os": "Windows 10 version 18362.0 or higher",
        "architecture": "x64",
    },
    # Additional metadata omitted for brevity...
}

# Convert JSON to string
product_metadata_str = json.dumps(product_metadata, indent=4)

# Load product image
product_image = load_image_from_url(product_metadata["image_url"])

# Mixed-content document
mixed_document = [
    [product_metadata["description"], product_image, product_metadata_str]
]

Here’s the real listing on xbox.com: Buy Call of Duty®: Black Ops 6 - Vault Edition Upgrade (Windows) | Xbox

Generate Embedding

# Generate embedding for the mixed-content document
mixed_embedding = voyage_client.multimodal_embed(
    inputs=mixed_document, model=MODEL_NAME, input_type="document"
).embeddings[0]

Compare with Mixed Queries

# Define mixed query archetypes
query_image_url = "https://store-images.s-microsoft.com/image/apps.25689.13909193928944040.073f4f1c-2a1f-4b23-bd2c-278a9a4a4755.72a4aa62-4cda-4a17-874a-abe1caed1db8?q=90&w=177&h=265"
query_feature_image_url = "https://images.unsplash.com/photo-1605899435973-ca2d1a8861cf?w=500&auto=format&fit=crop&q=60&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxzZWFyY2h8NHx8eGJveHxlbnwwfHwwfHx8MA%3D%3D" # Image of Xbox Gaming Controller

mixed_queries = [
    ["Find games like Call of Duty"],  # Text-only query
    ["Show me intense shooter games"],  # Another text query
    [load_image_from_url(query_image_url)],  # Image-only query
    ["Games compatible with this console", load_image_from_url(query_feature_image_url)]  # Text + image query
]

# Generate query embeddings
query_embeddings = voyage_client.multimodal_embed(
    inputs=mixed_queries, model=MODEL_NAME, input_type="query"
).embeddings

# Calculate similarities
print("\nMixed Content Query Analysis:")
for i, query_embedding in enumerate(query_embeddings):
    sim = cosine_similarity(mixed_embedding, query_embedding)
    print(f"Query {i + 1} similarity: {sim:.4f}")

Output:

Mixed Content Query Analysis:
Query 1 similarity: 0.3488
Query 2 similarity: 0.2914
Query 3 similarity: 0.7156
Query 4 similarity: 0.4130

Analysis:

  • Query 3 (Image-only query) has the highest similarity, showing the model's strength in matching visual content.

  • Textual queries also show reasonable similarity, demonstrating the model's ability to understand mixed content.


PDF Screenshot Analysis

Voyage-multimodal-3 offers a powerful way to search visually complex documents like PDFs without manual parsing. By rendering each page as an image, we can generate a searchable embedding for every page of the PDF.

Convert PDF to Images

import urllib.parse  # for percent-encoding the PDF URL

def pdf_to_images(pdf_url):
    """Convert PDF pages to images."""
    # Percent-encode spaces and other unsafe characters so urlopen accepts the URL
    safe_url = urllib.parse.quote(pdf_url, safe=":/")
    pdf_data = urllib.request.urlopen(safe_url).read()
    pdf_stream = BytesIO(pdf_data)
    pdf = fitz.open(stream=pdf_stream, filetype="pdf")
    images = []
    for page in pdf:
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        images.append(img)
    pdf.close()
    return images

# PDF URL 
pdf_url = "https://ignite2024demo.blob.core.windows.net/state-of-ai-2024/State of AI Report 2024.pdf"

# Convert PDF to images
pdf_images = pdf_to_images(pdf_url)

Generate Page Embeddings

# Generate embeddings for each page
page_embeddings = voyage_client.multimodal_embed(
    inputs=[[img] for img in pdf_images], model=MODEL_NAME, input_type="document"
).embeddings
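
If a single request is too large for your PDF, or you run into API limits, you can embed the pages in smaller batches instead. Here's a minimal sketch; the batch size is an arbitrary choice, not a documented limit:

# Alternative: embed pages in smaller batches (BATCH_SIZE is an arbitrary choice)
BATCH_SIZE = 32
page_embeddings = []
for start in range(0, len(pdf_images), BATCH_SIZE):
    batch = pdf_images[start : start + BATCH_SIZE]
    result = voyage_client.multimodal_embed(
        inputs=[[img] for img in batch], model=MODEL_NAME, input_type="document"
    )
    page_embeddings.extend(result.embeddings)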

Find Relevant Pages

# Query
query = ["Find sections with bar charts about AI trends"]
query_embedding = voyage_client.multimodal_embed(
    inputs=[query], model=MODEL_NAME, input_type="query"
).embeddings[0]

# Calculate similarities
similarities = [cosine_similarity(query_embedding, page_emb) for page_emb in page_embeddings]
most_relevant = np.argsort(similarities)[::-1]

print("\nMost relevant PDF pages:")
for page_idx in most_relevant[:3]:
    print(f"Page {page_idx + 1}: Similarity = {similarities[page_idx]:.4f}")

Output:

Most relevant PDF pages:
Page 149: Similarity = 0.3926
Page 109: Similarity = 0.3892
Page 108: Similarity = 0.3764

Here are screenshots of those pages for reference: page_149.jpg, page_109.jpg, and page_108.jpg.

Analysis: The model successfully identifies the pages most relevant to the query, even though the query references visual elements (bar charts) that are not present in the pages' raw text.


Azure AI Search Integration

Let's integrate these capabilities with Azure AI Search for efficient, scalable retrieval.

Create a Search Index

from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SimpleField(name="pdf_url", type=SearchFieldDataType.String),
    SimpleField(name="page_number", type=SearchFieldDataType.Int32, filterable=True),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1024,
        vector_search_profile_name="vector_profile",
    ),
]

vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="vector_config"  # Hierarchical Navigable Small World (HNSW) graph
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="vector_profile", algorithm_configuration_name="vector_config"
        )
    ],
)

index = SearchIndex(
    name=INDEX_NAME, fields=fields, vector_search=vector_search
)
index_client.create_or_update_index(index)
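
Optionally, you can sanity-check that the index exists and inspect its fields before uploading documents:

# Optional sanity check: fetch the index definition back from the service
created_index = index_client.get_index(INDEX_NAME)
print(f"Index '{created_index.name}' fields: {[f.name for f in created_index.fields]}")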

Upload Documents

documents = [
    {
        "id": f"state_of_ai_2024_page_{page_num + 1}",
        "pdf_url": pdf_url,
        "page_number": page_num + 1,
        "embedding": embedding.tolist(),  # Convert numpy array to list
    }
    for page_num, embedding in enumerate(page_embeddings)
]

result = search_client.upload_documents(documents=documents)
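
For larger documents you may want to upload in batches; Azure AI Search accepts up to 1,000 documents per indexing request. A minimal sketch, where the batch size here is an arbitrary choice:

# Upload in batches to stay well under the per-request document limit
BATCH_SIZE = 500
for start in range(0, len(documents), BATCH_SIZE):
    batch = documents[start : start + BATCH_SIZE]
    batch_result = search_client.upload_documents(documents=batch)
    print(f"Uploaded {len(batch)} docs, all succeeded: {all(r.succeeded for r in batch_result)}")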

Implement Search Function

from azure.search.documents.models import VectorizedQuery

def search_pdf_pages(query_text: str, top_k: int = 3):
    # Generate query embedding
    query_result = voyage_client.multimodal_embed(
        inputs=[[query_text]], model=MODEL_NAME, input_type="query"
    )
    query_vector = query_result.embeddings[0]  # already a plain list of floats

    # Create vector query
    vector_query = VectorizedQuery(
        vector=query_vector, k_nearest_neighbors=top_k, fields="embedding"
    )

    # Perform a vector-only search
    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["id", "pdf_url", "page_number"],
    )

    return list(results)

# Perform search
print("\nSearching for pages with bar charts about AI trends:")
results = search_pdf_pages("Find sections with bar charts about AI trends")
for r in results:
    print(f"Page {r['page_number']}, Score: {r['@search.score']}")

Output:

Searching for pages with bar charts about AI trends:
Page 149, Score: 0.62212956
Page 109, Score: 0.6208101
Page 108, Score: 0.61590874

Analysis: The search successfully finds the top relevant pages, demonstrating how voyage-multimodal-3 embeddings enhance Azure AI Search.

💡
Note that the search scores here differ from the raw cosine similarities because Azure AI Search applies a transformation so that the scoring function is monotonically decreasing: scores always get smaller as similarity gets worse. See Vector relevance and ranking - Azure AI Search | Microsoft Learn.
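
As a quick illustration (assuming the index uses the cosine similarity metric), the reported score is 1 / (1 + d), where d = 1 - cosine similarity. Plugging in the page similarities computed earlier reproduces the scores above:

# Assuming the cosine metric: @search.score = 1 / (1 + d), where d = 1 - cosine similarity
for cos_sim in [0.3926, 0.3892, 0.3764]:
    d = 1 - cos_sim
    print(f"cosine similarity {cos_sim:.4f} -> score {1 / (1 + d):.4f}")
# cosine similarity 0.3926 -> score 0.6221  (Page 149)
# cosine similarity 0.3892 -> score 0.6208  (Page 109)
# cosine similarity 0.3764 -> score 0.6159  (Page 108)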

Conclusion

Voyage-multimodal-3 enables exciting new RAG search capabilities over complex, visually rich documents. By embedding both textual and visual elements in a shared vector space, it surfaces the most relevant information without manual parsing, tagging, or indexing.

Integrating voyage-multimodal-3 with Azure AI Search is straightforward, allowing you to build scalable, high-performance multimodal search applications with minimal code. This opens up transformative possibilities for enterprise search, knowledge management, question-answering, and more.

References

The information and analysis in this blog post were inspired by insights from the State of AI Report 2024 by Air Street Capital. The State of AI Report analyses the most interesting developments in AI, aiming to trigger an informed conversation about the state of AI and its implications for the future. Now in its seventh year, the State of AI Report 2024 is reviewed by leading AI practitioners in industry and research, covering key dimensions such as research, industry, politics, safety, and predictions.

Follow me on LinkedIn and GitHub for more insights and tutorials.
