Leveraging OpenAI's New Text-Embedding-3 Large Model in Azure AI Search

How do variable embedding dimensions affect vector storage size?

Introduction

Azure AI Search, combined with OpenAI's latest text-embedding-3-large model via Azure OpenAI Service, offers remarkable capabilities for embedding generation and storage optimization. This article demonstrates how to use Azure OpenAI Service to generate embeddings and store them efficiently within Azure AI Search.

Objectives of This Exploration

Our journey today revolves around three pivotal areas:

  1. Introducing Azure OpenAI Service: How you can seamlessly take advantage of OpenAI's latest embedding models within your own Azure environment.

  2. Using Azure AI Search as a Vector Store: The latest addition to Azure AI Search capabilities, enabling efficient storage and retrieval of high-dimensional vector embeddings.

  3. Discovering OpenAI's new Embedding Model: We dive into the new model's capabilities, focusing on its variable dimensions feature and how it revolutionizes data representation and retrieval processes.

Setting Up Your Environment

Start by installing the necessary libraries to interact with Azure AI Search and Azure OpenAI's services:

!pip install azure-search-documents azure-identity
!pip install openai 
!pip install python-dotenv 
!pip install langchain tiktoken

Authentication Methods

Ensure your Azure OpenAI credentials are correctly set up. Customers often ask about the best way to authenticate to Azure AI services, so in this example I'll show both Azure Active Directory (AAD) and API key authentication.

import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
from openai import AzureOpenAI

# Set up OpenAI client based on environment variables
load_dotenv()
AZURE_OPENAI_ENDPOINT: str = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY: str = os.getenv("AZURE_OPENAI_API_KEY")
# The `dimensions` parameter used later requires a newer API version than 2023-05-15
AZURE_OPENAI_API_VERSION: str = "2024-02-01"
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME: str = os.getenv(
    "AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME"
)


credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# Set this flag to True if you are using Azure Active Directory
use_aad_for_aoai = False

if use_aad_for_aoai:
    # Use Azure Active Directory (AAD) authentication
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_version=AZURE_OPENAI_API_VERSION,
        azure_ad_token_provider=token_provider,
    )
else:
    # Use API key authentication
    client = AzureOpenAI(
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
    )
💡
I like using API keys for quick sandbox development, but for production workloads I prefer Managed Identity. For more information, visit How to configure Azure OpenAI Service with managed identities - Azure OpenAI | Microsoft Learn.

Generating Embeddings of Various Sizes

I like t-shirt sizing things in groups of three. Using the text-embedding-3-large model, let's generate embeddings in three t-shirt sizes: small (256 dimensions), medium (1536), and large (3072).

def generate_small_embedding(text: str):
    # Generate 256d embeddings for the provided text
    embeddings_response = client.embeddings.create(
        model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, input=text, dimensions=256
    )
    return embeddings_response.data[0].embedding

def generate_medium_embedding(text: str):
    # Generate 1536d embeddings for the provided text
    embeddings_response = client.embeddings.create(
        model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, input=text, dimensions=1536
    )
    return embeddings_response.data[0].embedding

def generate_large_embedding(text: str):
    # Generate 3072d embeddings for the provided text
    embeddings_response = client.embeddings.create(
        model=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, input=text, dimensions=3072
    )
    return embeddings_response.data[0].embedding
💡
Note: The text-embedding-3-large model uses a technique called Matryoshka Representation Learning during training to enable variable dimensions. I may dive into this at a later date... Read more: [2205.13147] Matryoshka Representation Learning (arxiv.org)
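
To build some intuition for what Matryoshka-style training implies, here is a small sketch of my own (not from the original walkthrough): truncating the full 3072-dimension vector to its first 256 components and re-normalizing should closely approximate the 256-dimension embedding returned by the API. It assumes the generate_* helper functions defined above and numpy.

# Sketch: compare a truncated-and-renormalized large embedding with the native small one
import numpy as np

full = np.array(generate_large_embedding("Hello, world!"))   # 3072 dimensions
small = np.array(generate_small_embedding("Hello, world!"))  # 256 dimensions

truncated = full[:256]
truncated = truncated / np.linalg.norm(truncated)  # re-normalize to unit length

# Cosine similarity should be close to 1.0 if the Matryoshka property holds as expected
cosine = float(np.dot(truncated, small) / (np.linalg.norm(truncated) * np.linalg.norm(small)))
print(f"Cosine similarity between truncated-large and native-small: {cosine:.4f}")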

Verify the dimensions

Let's check the dimensionality of our embeddings to ensure correctness:

print(len(generate_small_embedding("Hello, world!"))) # Outputs: 256
print(len(generate_medium_embedding("Hello, world!"))) # Outputs: 1536
print(len(generate_large_embedding("Hello, world!"))) # Outputs: 3072

Evaluating Embedding Sizes

It's crucial to understand the storage implications of embeddings:

import sys

def print_vector_size(text: str, generator_function, description: str):
    # Generate an embedding for the text using the specified generator function
    embedding_vector = generator_function(text)

    # Measure the in-memory size of the Python list holding the embedding
    # (sys.getsizeof reports the list container only, not the float objects inside it)
    vector_size_bytes = sys.getsizeof(embedding_vector)

    # Print the size
    print(f"Size of the {description} vector embedding in bytes: {vector_size_bytes}")

# Example text
text = "Hello, world!"

# Print the size for each type of embedding
print_vector_size(text, generate_small_embedding, "small")
print_vector_size(text, generate_medium_embedding, "medium")
print_vector_size(text, generate_large_embedding, "large")

Let's visualize the output:
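
For reference, the printed sizes come out approximately as follows (these are the byte counts used in the relative-difference calculations below; exact values may vary slightly by Python version):

Size of the small vector embedding in bytes: 2104
Size of the medium vector embedding in bytes: 12344
Size of the large vector embedding in bytes: 24632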

Relative Differences

When comparing the embedding sizes relative to the medium (1536d) embedding:

  • The small (256d) embedding is significantly smaller than the medium embedding. Specifically, it is approximately ((12344 - 2104) / 12344) * 100 ≈ 82.96% smaller. This highlights the efficiency of using smaller embeddings for applications where memory usage is a critical constraint, though it may come at the cost of losing some detail or expressive power.

  • The large (3072d) embedding is larger than the medium embedding. It is approximately ((24632 - 12344) / 12344) * 100 ≈ 99.55% larger, essentially double the size. The increased size of the large embeddings suggests a higher level of detail or expressive power, which can be beneficial for tasks requiring high accuracy. (A quick way to reproduce these percentages is shown below.)
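
The percentages above can be reproduced with a few lines of Python (my addition; the byte counts are the sys.getsizeof figures from the previous section):

# Reproduce the relative-difference percentages from the measured byte counts
small_bytes, medium_bytes, large_bytes = 2104, 12344, 24632

print(f"Small vs. medium: {(medium_bytes - small_bytes) / medium_bytes:.2%} smaller")
print(f"Large vs. medium: {(large_bytes - medium_bytes) / medium_bytes:.2%} larger")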

💡
To learn more about OpenAI embeddings, please visit the OpenAI documentation: Embeddings - OpenAI API

Configure an Azure AI Search vector index

Let's take a look at how we can use these embeddings in a real-world vector store with Azure AI Search, the go-to vector database on Azure.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    VectorSearchAlgorithmKind,
)
from azure.identity import DefaultAzureCredential
import os

# Load environment variables
search_service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
search_service_api_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")

Just as we authenticated with Azure OpenAI Service, we must do the same for Azure AI Search. Here are two options, Managed Identity or API keys; take your pick!

# Authentication method flag
use_aad_for_search = False  # Set based on your authentication method

# Choose the correct credential based on your authentication method
credential = (
    DefaultAzureCredential()
    if use_aad_for_search
    else AzureKeyCredential(search_service_api_key)
)

Create indexes with the required dimensions:

# Initialize the SearchIndexClient for creating indexes
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential
)

# Index names and their corresponding vector dimensions
index_info = {
    "small": {"name": "index-256d", "dimension": 256},
    "medium": {"name": "index-1536d", "dimension": 1536},
    "large": {"name": "index-3072d", "dimension": 3072},
}

def create_or_update_index(index_name, dimensions):
    fields = [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="text", type=SearchFieldDataType.String),
        SearchField(
            name="vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            vector_search_dimensions=dimensions,
            vector_search_profile_name="my-vector-profile",
        ),
    ]
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="my-hnsw", kind=VectorSearchAlgorithmKind.HNSW
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="my-vector-profile", algorithm_configuration_name="my-hnsw"
            )
        ],
    )
    index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
    result = index_client.create_or_update_index(index=index)
    print(f"Index '{result.name}' created or updated")

# Create or update each index and initialize SearchClient instances
search_clients = {}
for size, info in index_info.items():
    create_or_update_index(info["name"], info["dimension"])
    search_clients[size] = SearchClient(
        endpoint=search_service_endpoint, index_name=info["name"], credential=credential
    )

We have now successfully created the following three indexes:

Index Name     "Vector" Field Dimensions
index-256d     256
index-1536d    1536
index-3072d    3072
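
As an optional sanity check (my addition, not part of the original walkthrough), you can list the index names on the search service to confirm they exist:

# List all index names on the search service to confirm the three indexes were created
for name in index_client.list_index_names():
    print(name)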

Managing and Loading Data

Load and chunk the dataset for optimal embedding generation:

with open('./data/state_of_the_union.txt', 'r', encoding='utf-8') as file:
    text_content = file.read()

from langchain.text_splitter import CharacterTextSplitter

# Initialize the text splitter with a specific chunk size
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(text_content)
💡
Note: I'm using LangChain here for ingestion and chunking based on token size. While chunking documents is a critical part of Generative AI applications using the RAG pattern, in this example, I'm using quite a basic approach. For more advanced chunking strategies, I recommend checking out LangChain and LlamaIndex as they have some great approaches.
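
Before generating embeddings, a quick sanity check on the chunks can be helpful (my addition, assuming the texts list produced by the splitter above):

# Inspect the chunking result: how many chunks, and what does the first one look like?
print(f"Number of chunks: {len(texts)}")
print(texts[0][:200])  # preview the first 200 characters of the first chunk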

Generate Embeddings for each chunk

Now that I have chunks, I want to generate embeddings and organize them into DataFrames.

import pandas as pd

def create_embedding_df(texts, embedding_function):
    data = []
    for i, text in enumerate(texts):
        # Generate embedding
        vector = embedding_function(text)
        # Append the data with an ID, the original text, and the generated vector
        data.append({"id": str(i), "text": text, "vector": vector})
    # Create a DataFrame
    df = pd.DataFrame(data, columns=["id", "text", "vector"])
    return df

# Process the texts with each embedding function and create the DataFrames
df_small = create_embedding_df(texts, generate_small_embedding)
df_medium = create_embedding_df(texts, generate_medium_embedding)
df_large = create_embedding_df(texts, generate_large_embedding)

# Convert DataFrames to lists of dictionaries to get into an upsertable format for AI Search
documents_small = df_small.to_dict('records')
documents_medium = df_medium.to_dict('records')
documents_large = df_large.to_dict('records')

Upload the document embeddings to your Azure AI Search indexes; a quick way to verify success is shown after the loop below:

# Define a dictionary mapping index sizes to their corresponding documents
documents_dict = {
    "small": documents_small,
    "medium": documents_medium,
    "large": documents_large,
}

# Loop through each index size and upload the documents
for size, documents in documents_dict.items():
    upload_result = search_clients[size].upload_documents(documents=documents)
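
To confirm the uploads succeeded, each result returned by upload_documents exposes a succeeded flag. Here is a variant of the loop above that prints a per-index summary (my sketch, not the original code):

# Upload and report how many documents were indexed successfully per index
for size, documents in documents_dict.items():
    results = search_clients[size].upload_documents(documents=documents)
    ok = sum(1 for r in results if r.succeeded)
    print(f"{size}: {ok}/{len(documents)} documents uploaded successfully")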

Understanding Index Statistics

Retrieve and display index statistics to analyze storage and vector index sizes:

def fetch_index_statistics(index_info):
    # Prepare a list to hold the index statistics
    stats_data = []

    # Iterate over each index in index_info
    for key, value in index_info.items():
        index_name = value["name"]
        # Fetch the statistics for the current index
        stats = index_client.get_index_statistics(index_name=index_name)

        # Append the statistics to our list
        stats_data.append(
            {
                "Index Name": index_name,
                "Storage Size (bytes)": stats["storage_size"],
                "Vector Index Size (bytes)": stats["vector_index_size"],
                "Document Count": stats["document_count"],
            }
        )

    # Convert the list of statistics into a DataFrame for nicer display
    return pd.DataFrame(stats_data)


# Fetch the statistics for each index and store in a DataFrame
df_stats = fetch_index_statistics(index_info)

# Display the DataFrame
print(df_stats)

Visualize the storage and vector index sizes to assess the impact of embedding dimensionality:
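
One simple way to do this is a bar chart over the statistics DataFrame (a minimal sketch of my own, assuming matplotlib is installed):

# Plot storage size and vector index size side by side for each index
import matplotlib.pyplot as plt

ax = df_stats.plot(
    x="Index Name",
    y=["Storage Size (bytes)", "Vector Index Size (bytes)"],
    kind="bar",
    figsize=(8, 5),
    title="Storage vs. vector index size by embedding dimensionality",
)
ax.set_ylabel("Size (bytes)")
plt.tight_layout()
plt.show()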

Concluding Insights

  • Storage size and vector index size figures in MB are approximations, converted from bytes (1 MB = 1,048,576 bytes). See the Vector index size documentation to learn more.

  • The Document Count indicates the total number of chunks/documents stored in each index.

  • These statistics highlight the relationship between the dimensionality of the embeddings and the storage requirements. As the dimensionality increases, so does the storage and vector index size, reflecting the trade-off between detail (or accuracy) and resource utilization.

💡
You should evaluate the impact on your own dataset; this is just a tiny example index.

Relative differences

  • The index-1536d is approximately 4.6x larger in storage size and 5.7x larger in vector index size than the index-256d.

  • The index-3072d is approximately 2x larger in both storage size and vector index size than the index-1536d, showcasing the significant impact of increasing the embedding dimensionality on storage and indexing resource requirements. (These ratios can be computed directly from the statistics DataFrame, as shown after this list.)
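
A small helper to reproduce these ratios from df_stats (my addition, using the column names defined earlier):

# Compute storage and vector index size ratios between indexes
s = df_stats.set_index("Index Name")
for col in ["Storage Size (bytes)", "Vector Index Size (bytes)"]:
    r_medium = s.loc["index-1536d", col] / s.loc["index-256d", col]
    r_large = s.loc["index-3072d", col] / s.loc["index-1536d", col]
    print(f"{col}: 1536d/256d = {r_medium:.1f}x, 3072d/1536d = {r_large:.1f}x")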

These comparative insights demonstrate how embedding dimensionality influences not just the accuracy and detail of stored embeddings but also the practical aspects of storage and indexing within Azure AI Search, emphasizing the need to balance embedding detail against resource efficiency.

Join me in my next blog where we will evaluate retrieval quality metrics of this cutting-edge new embedding model!

