Azure AI Search: Integrating Vectorization and OpenAI Embeddings for CSV Files

A Step-by-Step Tutorial on Setting Up and Configuring CSV Data Sources

Introduction

In this tutorial, we'll explore how to use Azure AI Search's integrated vectorization feature to perform advanced searches over a CSV file. By the end of this guide, you'll be able to automate data orchestration from data ingestion to advanced retrieval, making your data easily searchable and usable in applications like GPT-4.

Overview of Azure AI Search and Integrated Vectorization

Azure AI Search's integrated vectorization lets you use indexers and skillsets to ingest, enrich, and search your data seamlessly. This feature is particularly useful for large datasets, offering low overhead and automated updates. Let's dive into how this works.

One common request from customers is setting up the indexer to search over a CSV. In this example, we'll show you how to search over a 10,000-row CSV file from the AG News dataset, using Azure OpenAI's text-embedding-3-large model for vectorization.

💡
Pro Tip: Using integrated vectorization can significantly reduce the manual effort required for data updates and enrichments.

Dataset Overview

The AG_news_samples.csv dataset, sourced from the OpenAI Cookbook, is used in this example. It consists of news articles with the fields title, description, label_int, and label.

Here's a snippet of the dataset:

title,description,label_int,label
World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases.,1,World
Nvidia Puts a Firewall on a Motherboard (PC World),PC World - Upcoming chip set will include built-in security features for your PC.,4,Sci/Tech
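Before wiring up the indexer, it can help to sanity-check the column layout locally. Here is a minimal sketch using Python's built-in csv module on the two sample rows above (descriptions abbreviated for readability):

```python
import csv
import io

# Two sample rows from AG_news_samples.csv (descriptions abbreviated).
SAMPLE = (
    "title,description,label_int,label\n"
    'World Briefings,"BRITAIN: BLAIR WARNS OF CLIMATE THREAT ...",1,World\n'
    'Nvidia Puts a Firewall on a Motherboard (PC World),"PC World - Upcoming chip set ...",4,Sci/Tech\n'
)

# DictReader maps each row to the header names, mirroring how the
# delimitedText parsing mode maps CSV columns to index fields.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(f"{row['title']!r} -> label={row['label']} (label_int={row['label_int']})")
```

Confirming the headers here pays off later: the indexer's delimitedText parsing mode maps these same column names onto index fields.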

Setting Up Your Environment

Install Required Libraries

Install the necessary libraries using pip:

!pip install --pre azure-search-documents
!pip install azure-identity azure-storage-blob openai python-dotenv

Load Environment Variables

Set up your environment variables in a .env file:
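For reference, a .env file for this tutorial might look like the following. All values below are placeholders; substitute your own endpoints, keys, and resource IDs:

```
AZURE_OPENAI_ENDPOINT=https://<your-openai-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME=<your-gpt-4-deployment-name>
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME=<your-embedding-deployment-name>
BLOB_CONNECTION_STRING=<your-storage-connection-string>
BLOB_RESOURCE_ID=ResourceId=/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>;
BLOB_CONTAINER_NAME=<your-container-name>
AZURE_SEARCH_SERVICE_ENDPOINT=https://<your-search-service>.search.windows.net
AZURE_SEARCH_ADMIN_KEY=<your-search-admin-key>
```

The ResourceId= form of BLOB_RESOURCE_ID is the connection-string format Azure AI Search accepts when using a managed identity instead of account keys.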

import os

from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Environment Variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = "2024-02-01"  
BLOB_CONNECTION_STRING = os.getenv("BLOB_CONNECTION_STRING")
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME")
BLOB_RESOURCE_ID = os.getenv("BLOB_RESOURCE_ID")
BLOB_CONTAINER_NAME = os.getenv("BLOB_CONTAINER_NAME")
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME")
SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
SEARCH_SERVICE_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")  
INDEX_NAME = "csv-sample"
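A missing or misnamed variable tends to surface much later as an opaque authentication or connection error. A small helper to fail fast can be useful; this is a hypothetical convenience function, not part of any Azure SDK:

```python
import os

def check_required_env(names):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in names if not os.getenv(name)]

# Variables this tutorial depends on
REQUIRED = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "BLOB_CONNECTION_STRING",
    "BLOB_CONTAINER_NAME",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
]

missing = check_required_env(REQUIRED)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this immediately after load_dotenv() points you at the .env file rather than at a misleading stack trace from the SDK.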

Authentication and Configuration

Authenticate with Azure AI Search using either Microsoft Entra ID (formerly Azure AD, abbreviated AAD below) or API keys. For this example, we'll use AAD:

# User-specified parameter
USE_AAD_FOR_SEARCH = True  

def authenticate_azure_search(api_key=None, use_aad_for_search=False):
    if use_aad_for_search:
        print("Using AAD for authentication.")
        credential = DefaultAzureCredential()
    else:
        print("Using API keys for authentication.")
        if api_key is None:
            raise ValueError("API key must be provided if not using AAD for authentication.")
        credential = AzureKeyCredential(api_key)
    return credential

azure_search_credential = authenticate_azure_search(api_key=SEARCH_SERVICE_API_KEY, use_aad_for_search=USE_AAD_FOR_SEARCH)

For more information on authentication methods, see the Azure AI Search security documentation.

Uploading CSV File to Azure Blob Storage

File Upload

Upload your CSV file to Azure Blob Storage:

def upload_file_to_blob(connection_string, container_name, file_path):
    """Upload a file to the specified blob container."""
    try:
        # Initialize the BlobServiceClient
        blob_service_client = BlobServiceClient.from_connection_string(connection_string)

        # Get the container client
        container_client = blob_service_client.get_container_client(container_name)

        # Create the container only if it doesn't already exist
        # (create_container() raises ResourceExistsError otherwise)
        if not container_client.exists():
            container_client.create_container()

        # Upload the file
        file_name = os.path.basename(file_path)
        blob_client = container_client.get_blob_client(file_name)
        with open(file_path, "rb") as data:
            blob_client.upload_blob(data, overwrite=True)

        print(f"Uploaded blob: {file_name} to container: {container_name}")

    except Exception as e:
        print(f"Error: {e}")

# Main workflow
CSV_FILE_PATH = os.path.join("data", "csv", "AG_news_samples.csv")

upload_file_to_blob(BLOB_CONNECTION_STRING, BLOB_CONTAINER_NAME, CSV_FILE_PATH)
💡
Pro Tip: Use Azure Storage Explorer for a graphical interface to easily manage and upload your files to Azure Blob Storage. This tool provides a user-friendly way to interact with your storage accounts and perform operations such as uploads, downloads, and setting permissions. You can download it from the Azure Storage Explorer page on the Microsoft Azure website.

Creating the Blob Data Source Connector

In this section, we configure a data source connection for Azure AI Search to connect to Azure Blob Storage. The create_or_update_data_source function helps create or update the data source connection, using the SearchIndexerDataSourceConnection class.

def create_or_update_data_source(indexer_client, container_name, resource_id, index_name):
    """Create or update a data source connection for Azure AI Search using a connection string."""
    try:
        container = SearchIndexerDataContainer(name=container_name)

        data_source_connection = SearchIndexerDataSourceConnection(
            name=f"{index_name}-blob",
            type=SearchIndexerDataSourceType.AZURE_BLOB,
            connection_string=resource_id,
            container=container
        )
        data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

        print(f"Data source '{data_source.name}' created or updated")
    except Exception as e:
        print(f"Failed to create or update data source: {e}")

# Initialize the SearchIndexerClient with a credential
indexer_client = SearchIndexerClient(SEARCH_SERVICE_ENDPOINT, azure_search_credential)

# Call the function to create or update the data source
create_or_update_data_source(indexer_client, BLOB_CONTAINER_NAME, BLOB_RESOURCE_ID, INDEX_NAME)
💡
You must have at least the 'Storage Blob Data Reader' role on your Blob Storage account to use managed identity.

Why these parameters were chosen:

  • container_name: Specifies the blob container name in your Azure Blob Storage where the CSV file is stored.

  • resource_id: The Resource ID is essential for setting up the connection securely and ensuring the correct blob storage resource is targeted.

  • index_name: Used to name the data source connection uniquely to avoid conflicts with other data sources.

For more details, refer to Search over Azure Blob Storage content - Azure AI Search.
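For reference, the SDK call above serializes to roughly the following REST data source definition (field names per the Azure AI Search REST API; exact shape may vary by API version). Passing the storage account's resource ID as the connectionString is what enables managed-identity access:

```json
{
  "name": "csv-sample-blob",
  "type": "azureblob",
  "credentials": {
    "connectionString": "ResourceId=/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>;"
  },
  "container": { "name": "<your-container-name>" }
}
```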

Creating and Configuring the Search Index

Define the Search Index Fields


def create_fields():
    """Creates the fields for the search index based on the specified schema."""
    return [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
        SearchField(name="title", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="description", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="label", type=SearchFieldDataType.String, facetable=True, filterable=True),
        SearchField(name="label_int", type=SearchFieldDataType.Int32, sortable=True, filterable=True, facetable=True),
        SearchField(
            name="vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            vector_search_dimensions=3072,
            vector_search_profile_name="myHnswProfile",
            hidden=False,
            stored=True
        ),
    ]

Create the Vector Search Configuration

def create_vector_search_configuration():
    """Creates the vector search configuration."""
    return VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                parameters=HnswParameters(
                    m=4,
                    ef_construction=400,
                    ef_search=500,
                    metric=VectorSearchAlgorithmMetric.COSINE,
                ),
            ),
            ExhaustiveKnnAlgorithmConfiguration(
                name="myExhaustiveKnn",
                parameters=ExhaustiveKnnParameters(
                    metric=VectorSearchAlgorithmMetric.COSINE,
                ),
            ),
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
                vectorizer="myOpenAI",
            ),
            VectorSearchProfile(
                name="myExhaustiveKnnProfile",
                algorithm_configuration_name="myExhaustiveKnn",
                vectorizer="myOpenAI",
            ),
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                name="myOpenAI",
                kind="azureOpenAI",
                azure_open_ai_parameters=AzureOpenAIParameters(
                    resource_uri=AZURE_OPENAI_ENDPOINT,
                    deployment_id=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
                    api_key=AZURE_OPENAI_API_KEY,
                    model_name=AzureOpenAIModelName.TEXT_EMBEDDING3_LARGE
                ),
            ),
        ],
    )
💡
Pro Tip: To save on storage costs and improve search performance, consider using vector quantization for your embeddings. Quantization can significantly reduce the storage size of your vectors without sacrificing much accuracy. For detailed guidance on configuring compression and storage, see Azure AI Search Vector Compression.

Create the Semantic Ranker Configuration

def create_semantic_search_configuration():
    """Creates the semantic search configuration."""
    return SemanticSearch(configurations=[
        SemanticConfiguration(
            name="mySemanticConfig",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="title"),
                content_fields=[SemanticField(field_name="description")]
            ),
        )
    ])

Create the Search Index

def create_search_index(index_name, fields, vector_search, semantic_search):
    """Creates or updates the search index."""
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search,
        semantic_search=semantic_search
    )
    try:
        result = index_client.create_or_update_index(index)
        print(f"{result.name} created")
    except Exception as e:
        print(f"Failed to create or update index: {e}")

index_client = SearchIndexClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential)
fields = create_fields()
vector_search = create_vector_search_configuration()
semantic_search = create_semantic_search_configuration()

# Create or update the index using the fields and configurations defined above
create_search_index(INDEX_NAME, fields, vector_search, semantic_search)

Setting up and Running the Skillset

Creating an Embedding Skill

The embedding skill in Azure AI Search leverages Azure OpenAI to generate embeddings for text during indexing.

def create_embedding_skill(azure_openai_endpoint, azure_openai_embedding_deployment, azure_openai_key):
    """Defines the embedding skill for generating embeddings via Azure OpenAI."""
    return AzureOpenAIEmbeddingSkill(
        description="Skill to generate embeddings via Azure OpenAI",
        context="/document",
        resource_uri=azure_openai_endpoint,
        deployment_id=azure_openai_embedding_deployment,
        model_name=AzureOpenAIModelName.TEXT_EMBEDDING3_LARGE,
        api_key=azure_openai_key,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/description"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="embedding")
        ],
    )

Why these parameters were chosen:

  • resource_uri: Specifies the endpoint of your Azure OpenAI resource.

  • deployment_id: Identifies the specific deployment of the Azure OpenAI embedding model you are using.

  • model_name: The name of the embedding model (e.g., text-embedding-3-large) so Azure AI Search can perform validation.

  • inputs and outputs: Defines the source of the input text and where the generated embeddings will be stored.

To learn more about configuring the embedding skill, visit Azure OpenAI Embedding skill.
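For reference, the skill defined above serializes to approximately the following REST definition (shape per the Azure AI Search REST API; exact fields vary by API version, and the endpoint and deployment values below are placeholders):

```json
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "description": "Skill to generate embeddings via Azure OpenAI",
  "context": "/document",
  "resourceUri": "https://<your-openai-resource>.openai.azure.com",
  "deploymentId": "<your-embedding-deployment-name>",
  "modelName": "text-embedding-3-large",
  "inputs": [
    { "name": "text", "source": "/document/description" }
  ],
  "outputs": [
    { "name": "embedding" }
  ]
}
```

Note that the context is /document, so the skill runs once per CSV row and reads that row's description field as its input text.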

Create a skillset

def create_skillset(client, skillset_name, embedding_skill):
    """Creates or updates the skillset with an embedding skill."""
    skillset = SearchIndexerSkillset(
        name=skillset_name,
        description="Skillset for generating embeddings",
        skills=[embedding_skill],
    )
    try:
        client.create_or_update_skillset(skillset)
        print(f"{skillset.name} created")
    except Exception as e:
        print(f"Failed to create or update skillset {skillset_name}: {e}")

# Example usage
skillset_name = f"{INDEX_NAME}-skillset"
client = SearchIndexerClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential)

embedding_skill = create_embedding_skill(AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, AZURE_OPENAI_API_KEY)

create_skillset(client, skillset_name, embedding_skill)

Creating and Configuring the Indexer

The indexer automates data import and indexing from the data source into the search index. The create_and_run_indexer function sets up and runs the indexer.

def create_and_run_indexer(
    indexer_client, indexer_name, skillset_name, index_name, data_source_name
):
    """
    Creates an indexer, applies it to a given index, and runs the indexing process.
    """
    try:
        indexer = SearchIndexer(
            name=indexer_name,
            description="Indexer to index documents and generate embeddings",
            skillset_name=skillset_name,
            target_index_name=index_name,
            data_source_name=data_source_name,
            # Indexing parameters to correctly parse CSV files
            parameters=IndexingParameters(
                batch_size=100,  # Adjust based on your content size and requirements
                configuration=IndexingParametersConfiguration(
                    parsing_mode=BlobIndexerParsingMode.DELIMITED_TEXT,
                    first_line_contains_headers=True,
                    query_timeout=None,
                ),
            ),
            output_field_mappings=[FieldMapping(source_field_name="/document/embedding", target_field_name="vector")]
        )

        # Create or update the indexer
        indexer_client.create_or_update_indexer(indexer)
        print(f"{indexer_name} created or updated.")

        # Run the indexer
        indexer_client.run_indexer(indexer_name)
        print(
            f"{indexer_name} is running. If queries return no results, please wait a bit and try again."
        )
    except Exception as e:
        print(f"Failed to create or run indexer {indexer_name}: {e}")

data_source_name = f"{INDEX_NAME}-blob"
indexer_name = f"{INDEX_NAME}-indexer"
indexer_client = SearchIndexerClient(
    endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential
)

create_and_run_indexer(
    indexer_client, indexer_name, skillset_name, INDEX_NAME, data_source_name
)

Why these parameters were chosen:

  • batch_size: Specifies the number of documents to process in each batch, optimizing performance based on content size.

  • parsing_mode: Set to BlobIndexerParsingMode.DELIMITED_TEXT to correctly parse CSV files.

  • first_line_contains_headers: Indicates that the first line of the CSV file contains headers, crucial for correctly mapping CSV columns to index fields.

  • output_field_mappings: Maps the generated embeddings to the appropriate vector field in the search index.

For more details on configuring indexers, refer to Create an indexer in Azure AI Search.

Polling for Indexer Completion

import time

indexer_last_result = indexer_client.get_indexer_status(indexer_name).last_result
indexer_status = IndexerExecutionStatus.IN_PROGRESS if indexer_last_result is None else indexer_last_result.status

while indexer_status == IndexerExecutionStatus.IN_PROGRESS:
    time.sleep(10)  # Wait between polls to avoid hammering the service
    indexer_last_result = indexer_client.get_indexer_status(indexer_name).last_result
    indexer_status = IndexerExecutionStatus.IN_PROGRESS if indexer_last_result is None else indexer_last_result.status
    print(f"Indexer '{indexer_name}' is still running. Current status: '{indexer_status}'.")

print(f"Indexer '{indexer_name}' finished with status '{indexer_status}'.")
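As a more general pattern, the status check can be wrapped in a small polling helper with an interval and timeout. This is a hypothetical convenience, not part of the SDK:

```python
import time

def poll_until(get_status, is_done, interval_seconds=5, timeout_seconds=600):
    """Poll get_status() every interval_seconds until is_done(status) is true.

    Returns the final status, or raises TimeoutError when the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still '{status}' after {timeout_seconds}s")
        time.sleep(interval_seconds)

# Demo with a fake status source; against the real service you would pass
# lambda: indexer_client.get_indexer_status(indexer_name).last_result.status
statuses = iter(["inProgress", "inProgress", "success"])
final_status = poll_until(lambda: next(statuses), lambda s: s != "inProgress", interval_seconds=0)
print(final_status)  # success
```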

Perform a simple vector search

# Pure Vector Search
query = "What did Prime Minister Tony Blair say about climate change?"  

search_client = SearchClient(SEARCH_SERVICE_ENDPOINT, INDEX_NAME, credential=azure_search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
# Use the query below instead to pass in a raw query vector rather than relying on integrated vectorization
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")

results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    top=1
)  

for result in results:  
    print(f"title: {result['title']}")  
    print(f"description: {result['description']}")  
    print(f"label: {result['label']}")
title: World Briefings
description: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases.
label: World

Perform RAG Using Your Data and GPT-4

import openai

client = openai.AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-02-01",
)

completion = client.chat.completions.create(
    model=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
    messages=[
        {
            "role": "user",
            "content": query, # Reusing the same query from the vector search above
        },
    ],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": SEARCH_SERVICE_ENDPOINT,
                    "index_name": INDEX_NAME,
                    "authentication": {
                        "type": "api_key",
                        "key": SEARCH_SERVICE_API_KEY,
                    },
                    "query_type": "vector_semantic_hybrid",
                    "embedding_dependency": {
                        "type": "deployment_name",
                        "deployment_name": AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
                    },
                    "semantic_configuration": "mySemanticConfig",
                },
            }
        ],
    },
)

import textwrap
if completion.choices:
    message_content = completion.choices[0].message.content
    wrapped_message_content = textwrap.fill(message_content, width=100)
    print(f"AI Assistant (GPT-4): {wrapped_message_content}")
AI Assistant (GPT-4): Prime Minister Tony Blair urged the international community to consider global warming a dire threat
and to agree on a plan of action to curb the "alarming" growth of greenhouse gases [doc1].

For more information on how to use your Azure AI Search data source for RAG with Azure OpenAI Service, please visit Use your own data with Azure OpenAI Service - Azure OpenAI | Microsoft Learn.

Conclusion

Summarizing the Workflow

In this tutorial, we explored how to use Azure AI Search's integrated vectorization feature to perform advanced searches over a CSV file. We walked through the process from setting up the environment, uploading the CSV file to Azure Blob Storage, creating and configuring the search index, setting up and running the skillset, and performing a simple vector search. This approach allows for seamless data orchestration and advanced retrieval, making it easier to integrate with applications like GPT-4.
