Azure AI Search: Integrating Vectorization and OpenAI Embeddings for CSV Files
A Step-by-Step Tutorial on Setting Up and Configuring CSV Data Sources
Introduction
In this tutorial, we'll explore how to use Azure AI Search's integrated vectorization feature to perform advanced searches over a CSV file. By the end of this guide, you'll be able to automate data orchestration from data ingestion to advanced retrieval, making your data easily searchable and usable in applications like GPT-4.
Overview of Azure AI Search and Integrated Vectorization
Azure AI Search's integrated vectorization allows you to leverage Indexer and Skillsets to ingest, enrich, and search your data seamlessly. This feature is particularly useful for handling large datasets, providing low-overhead and automated updates. Let's dive into how this works.
Automating Data Orchestration with Azure AI Search
One common request from customers is setting up an indexer to search over a CSV file. In this example, we'll show you how to search over a 10K-row CSV file from the AG News dataset, using Azure OpenAI's text-embedding-3-large model for vectorization.
Dataset Overview
The AG News dataset (AG_news_samples.csv), sourced from the OpenAI Cookbook, is used in this example. It consists of news articles with the fields title, description, label_int, and label.
Here's a snippet of the dataset:
title,description,label_int,label
World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases.,1,World
Nvidia Puts a Firewall on a Motherboard (PC World),PC World - Upcoming chip set will include built-in security features for your PC.,4,Sci/Tech
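Before uploading, it can help to sanity-check the file locally. The snippet below is a minimal sketch using pandas with an inline copy of the two sample rows above; point read_csv at your local AG_news_samples.csv to inspect the full 10K rows.

```python
import io

import pandas as pd

# Inline copy of the sample rows shown above (descriptions abridged);
# swap the StringIO for "data/csv/AG_news_samples.csv" to load the real file.
sample_csv = '''title,description,label_int,label
World Briefings,"BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged ...",1,World
Nvidia Puts a Firewall on a Motherboard (PC World),"PC World - Upcoming chip set will include built-in security features for your PC.",4,Sci/Tech
'''

df = pd.read_csv(io.StringIO(sample_csv))
print(df.shape)           # (2, 4)
print(list(df.columns))   # ['title', 'description', 'label_int', 'label']
```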
Setting Up Your Environment
Install Required Libraries
Install the necessary libraries using pip:
!pip install --pre azure-search-documents
!pip install azure-identity azure-storage-blob openai
Load Environment Variables
Set up your environment variables in a .env file:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
# Environment Variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_API_VERSION = "2024-02-01"
BLOB_CONNECTION_STRING = os.getenv("BLOB_CONNECTION_STRING")
AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME")
BLOB_RESOURCE_ID = os.getenv("BLOB_RESOURCE_ID")
BLOB_CONTAINER_NAME = os.getenv("BLOB_CONTAINER_NAME")
AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME")
SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
SEARCH_SERVICE_API_KEY = os.getenv("AZURE_SEARCH_ADMIN_KEY")
INDEX_NAME = "csv-sample"
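As an optional sketch, you can fail fast when a required variable is missing instead of hitting a confusing authentication error later (the helper below is my own, not part of any SDK):

```python
import os

def missing_env_vars(names):
    """Return the subset of environment variable names that are unset or empty."""
    return [name for name in names if not os.getenv(name)]

missing = missing_env_vars([
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "BLOB_CONNECTION_STRING",
    "BLOB_RESOURCE_ID",
    "BLOB_CONTAINER_NAME",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```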
Authentication and Configuration
Authenticate with Azure AI Search using either AAD or API keys. For this example, we'll use AAD:
# User-specified parameter
USE_AAD_FOR_SEARCH = True
def authenticate_azure_search(api_key=None, use_aad_for_search=False):
if use_aad_for_search:
print("Using AAD for authentication.")
credential = DefaultAzureCredential()
else:
print("Using API keys for authentication.")
if api_key is None:
raise ValueError("API key must be provided if not using AAD for authentication.")
credential = AzureKeyCredential(api_key)
return credential
azure_search_credential = authenticate_azure_search(api_key=SEARCH_SERVICE_API_KEY, use_aad_for_search=USE_AAD_FOR_SEARCH)
For more information, see the Azure AI Search documentation on authentication methods.
Uploading CSV File to Azure Blob Storage
File Upload
Upload your CSV file to Azure Blob Storage:
def upload_file_to_blob(connection_string, container_name, file_path):
"""Upload a file to the specified blob container."""
try:
# Initialize the BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
# Get the container client
container_client = blob_service_client.get_container_client(container_name)
        # Create the container if it doesn't already exist
        if not container_client.exists():
            container_client.create_container()
# Upload the file
file_name = os.path.basename(file_path)
blob_client = container_client.get_blob_client(file_name)
with open(file_path, "rb") as data:
blob_client.upload_blob(data, overwrite=True)
print(f"Uploaded blob: {file_name} to container: {container_name}")
except Exception as e:
print(f"Error: {e}")
# Main workflow
CSV_FILE_PATH = os.path.join("data", "csv", "AG_news_samples.csv")
upload_file_to_blob(BLOB_CONNECTION_STRING, BLOB_CONTAINER_NAME, CSV_FILE_PATH)
Creating the Blob Data Source Connector
In this section, we configure a data source connection for Azure AI Search to connect to Azure Blob Storage. The create_or_update_data_source function creates or updates the data source connection using the SearchIndexerDataSourceConnection class.
def create_or_update_data_source(indexer_client, container_name, resource_id, index_name):
"""Create or update a data source connection for Azure AI Search using a connection string."""
try:
container = SearchIndexerDataContainer(name=container_name)
data_source_connection = SearchIndexerDataSourceConnection(
name=f"{index_name}-blob",
type=SearchIndexerDataSourceType.AZURE_BLOB,
connection_string=resource_id,
container=container
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)
print(f"Data source '{data_source.name}' created or updated")
except Exception as e:
print(f"Failed to create or update data source: {e}")
# Initialize the SearchIndexerClient with a credential
indexer_client = SearchIndexerClient(SEARCH_SERVICE_ENDPOINT, azure_search_credential)
# Call the function to create or update the data source
create_or_update_data_source(indexer_client, BLOB_CONTAINER_NAME, BLOB_RESOURCE_ID, INDEX_NAME)
Why these parameters were chosen:

- container_name: Specifies the blob container in your Azure Blob Storage account where the CSV file is stored.
- resource_id: The resource ID is essential for setting up the connection securely and ensuring the correct blob storage resource is targeted.
- index_name: Used to name the data source connection uniquely, avoiding conflicts with other data sources.
For more details, refer to Search over Azure Blob Storage content - Azure AI Search.
Creating and Configuring the Search Index
Define the Search Index Fields
def create_fields():
"""Creates the fields for the search index based on the specified schema."""
return [
SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
SearchField(name="title", type=SearchFieldDataType.String, searchable=True),
SearchField(name="description", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="label", type=SearchFieldDataType.String, facetable=True, filterable=True),
SearchField(name="label_int", type=SearchFieldDataType.Int32, sortable=True, filterable=True, facetable=True),
SearchField(
name="vector",
type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
vector_search_dimensions=3072,
vector_search_profile_name="myHnswProfile",
hidden=False,
stored=True
),
]
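To make the schema concrete, here is a sketch of one indexed document (all values are illustrative): the first five fields come straight from the CSV columns, while vector is populated later by the embedding skill. This is why vector_search_dimensions=3072 must match the output size of text-embedding-3-large.

```python
# Illustrative document only; real keys and vectors are produced by the indexer.
sample_document = {
    "id": "1",                        # document key (illustrative)
    "title": "World Briefings",
    "description": "BRITAIN: BLAIR WARNS OF CLIMATE THREAT ...",
    "label": "World",
    "label_int": 1,
    "vector": [0.0] * 3072,           # placeholder for the text-embedding-3-large embedding
}

assert len(sample_document["vector"]) == 3072
```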
Create the Vector Search Configuration
def create_vector_search_configuration():
"""Creates the vector search configuration."""
return VectorSearch(
algorithms=[
HnswAlgorithmConfiguration(
name="myHnsw",
parameters=HnswParameters(
m=4,
ef_construction=400,
ef_search=500,
metric=VectorSearchAlgorithmMetric.COSINE,
),
),
ExhaustiveKnnAlgorithmConfiguration(
name="myExhaustiveKnn",
parameters=ExhaustiveKnnParameters(
metric=VectorSearchAlgorithmMetric.COSINE,
),
),
],
profiles=[
VectorSearchProfile(
name="myHnswProfile",
algorithm_configuration_name="myHnsw",
vectorizer="myOpenAI",
),
VectorSearchProfile(
name="myExhaustiveKnnProfile",
algorithm_configuration_name="myExhaustiveKnn",
vectorizer="myOpenAI",
),
],
vectorizers=[
AzureOpenAIVectorizer(
name="myOpenAI",
kind="azureOpenAI",
azure_open_ai_parameters=AzureOpenAIParameters(
resource_uri=AZURE_OPENAI_ENDPOINT,
deployment_id=AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
api_key=AZURE_OPENAI_API_KEY,
model_name=AzureOpenAIModelName.TEXT_EMBEDDING3_LARGE
),
),
],
)
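Both algorithm configurations use VectorSearchAlgorithmMetric.COSINE, the usual choice for OpenAI embeddings. As a quick refresher, cosine similarity compares vector direction rather than magnitude; a minimal sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (parallel)
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 6))  # 0.0 (orthogonal)
```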
Create the Semantic Ranker Configuration
def create_semantic_search_configuration():
"""Creates the semantic search configuration."""
return SemanticSearch(configurations=[
SemanticConfiguration(
name="mySemanticConfig",
prioritized_fields=SemanticPrioritizedFields(
title_field=SemanticField(field_name="title"),
content_fields=[SemanticField(field_name="description")]
),
)
])
Create the Search Index
def create_search_index(index_name, fields, vector_search, semantic_search):
"""Creates or updates the search index."""
index = SearchIndex(
name=index_name,
fields=fields,
vector_search=vector_search,
semantic_search=semantic_search
)
try:
result = index_client.create_or_update_index(index)
print(f"{result.name} created")
except Exception as e:
print(f"Failed to create or update index: {e}")
index_client = SearchIndexClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential)
fields = create_fields()
vector_search = create_vector_search_configuration()
semantic_search = create_semantic_search_configuration()
create_search_index(INDEX_NAME, fields, vector_search, semantic_search)
Setting up and Running the Skillset
Creating an Embedding Skill
The embedding skill in Azure AI Search leverages Azure OpenAI to generate embeddings for text during indexing.
def create_embedding_skill(azure_openai_endpoint, azure_openai_embedding_deployment, azure_openai_key):
"""Defines the embedding skill for generating embeddings via Azure OpenAI."""
return AzureOpenAIEmbeddingSkill(
description="Skill to generate embeddings via Azure OpenAI",
context="/document",
resource_uri=azure_openai_endpoint,
deployment_id=azure_openai_embedding_deployment,
model_name=AzureOpenAIModelName.TEXT_EMBEDDING3_LARGE,
api_key=azure_openai_key,
inputs=[
InputFieldMappingEntry(name="text", source="/document/description"),
],
outputs=[
OutputFieldMappingEntry(name="embedding")
],
)
Why these parameters were chosen:

- resource_uri: Specifies the endpoint of your Azure OpenAI resource.
- deployment_id: Identifies the specific deployment of the Azure OpenAI embedding model you are using.
- model_name: The name of the embedding model (e.g., text-embedding-3-large) so Azure AI Search can perform validation.
- inputs and outputs: Define the source of the input text and where the generated embeddings will be stored.
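The inputs and outputs use skillset annotation paths into the enriched document tree. A toy sketch of how such a path is resolved (resolve is my own illustrative helper, not an SDK function):

```python
# Toy model of the enriched document the skill reads from and writes to.
enriched = {"description": "BRITAIN: BLAIR WARNS OF CLIMATE THREAT ..."}

def resolve(document, path):
    """Resolve a skillset-style annotation path such as '/document/description'."""
    node = {"document": document}
    for part in path.strip("/").split("/"):
        node = node[part]
    return node

# The skill reads its input text from /document/description ...
text_input = resolve(enriched, "/document/description")
print(text_input[:7])  # BRITAIN

# ... and writes its output under /document/embedding, which the indexer's
# output_field_mappings later copies into the index's "vector" field.
enriched["embedding"] = [0.0] * 3072  # placeholder for the real embedding
```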
To learn more about configuring the embedding skill, visit Azure OpenAI Embedding skill.
Create a Skillset
def create_skillset(client, skillset_name, embedding_skill):
"""Creates or updates the skillset with an embedding skill."""
skillset = SearchIndexerSkillset(
name=skillset_name,
description="Skillset for generating embeddings",
skills=[embedding_skill],
)
try:
client.create_or_update_skillset(skillset)
print(f"{skillset.name} created")
except Exception as e:
print(f"Failed to create or update skillset {skillset_name}: {e}")
# Example usage
skillset_name = f"{INDEX_NAME}-skillset"
client = SearchIndexerClient(endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential)
embedding_skill = create_embedding_skill(AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME, AZURE_OPENAI_API_KEY)
create_skillset(client, skillset_name, embedding_skill)
Creating and Configuring the Indexer
The indexer automates data import and indexing from the data source into the search index. The create_and_run_indexer function sets up and runs the indexer.
def create_and_run_indexer(
indexer_client, indexer_name, skillset_name, index_name, data_source_name
):
"""
Creates an indexer, applies it to a given index, and runs the indexing process.
"""
try:
indexer = SearchIndexer(
name=indexer_name,
description="Indexer to index documents and generate embeddings",
skillset_name=skillset_name,
target_index_name=index_name,
data_source_name=data_source_name,
# Indexing parameters to correctly parse CSV files
parameters=IndexingParameters(
batch_size=100, # Adjust based on your content size and requirements
configuration=IndexingParametersConfiguration(
parsing_mode=BlobIndexerParsingMode.DELIMITED_TEXT,
first_line_contains_headers=True,
query_timeout=None,
),
),
output_field_mappings=[FieldMapping(source_field_name="/document/embedding", target_field_name="vector")]
)
# Create or update the indexer
indexer_client.create_or_update_indexer(indexer)
print(f"{indexer_name} created or updated.")
# Run the indexer
indexer_client.run_indexer(indexer_name)
print(
f"{indexer_name} is running. If queries return no results, please wait a bit and try again."
)
except Exception as e:
print(f"Failed to create or run indexer {indexer_name}: {e}")
data_source_name = f"{INDEX_NAME}-blob"
indexer_name = f"{INDEX_NAME}-indexer"
indexer_client = SearchIndexerClient(
endpoint=SEARCH_SERVICE_ENDPOINT, credential=azure_search_credential
)
create_and_run_indexer(
indexer_client, indexer_name, skillset_name, INDEX_NAME, data_source_name
)
Why these parameters were chosen:

- batch_size: Specifies the number of documents to process in each batch; tune it based on your content size and performance requirements.
- parsing_mode: Set to BlobIndexerParsingMode.DELIMITED_TEXT to correctly parse CSV files.
- first_line_contains_headers: Indicates that the first line of the CSV file contains headers, which is crucial for correctly mapping CSV columns to index fields.
- output_field_mappings: Maps the generated embeddings to the vector field in the search index.
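To see what batch_size=100 implies for the 10K-row file, here is a sketch of the chunking arithmetic (the indexer performs this internally; batched below is purely illustrative):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks, mirroring how the indexer groups documents."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

rows = list(range(10_000))           # stand-in for the 10K CSV rows
batches = list(batched(rows, 100))
print(len(batches))                  # 100 batches
print(len(batches[-1]))              # 100 rows in the final batch
```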
For more details on configuring indexers, refer to Create an indexer in Azure AI Search.
Polling for Indexer Completion
import time

# Poll until the indexer run completes
indexer_last_result = indexer_client.get_indexer_status(indexer_name).last_result
indexer_status = IndexerExecutionStatus.IN_PROGRESS if indexer_last_result is None else indexer_last_result.status
while indexer_status == IndexerExecutionStatus.IN_PROGRESS:
    print(f"Indexer '{indexer_name}' is still running. Current status: '{indexer_status}'.")
    time.sleep(10)  # avoid hammering the service with status requests
    indexer_last_result = indexer_client.get_indexer_status(indexer_name).last_result
    indexer_status = IndexerExecutionStatus.IN_PROGRESS if indexer_last_result is None else indexer_last_result.status
print(f"Indexer '{indexer_name}' finished with status '{indexer_status}'.")
Perform a simple vector search
# Pure Vector Search
query = "What did Prime Minister Tony Blair say about climate change?"
search_client = SearchClient(SEARCH_SERVICE_ENDPOINT, INDEX_NAME, credential=azure_search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
# Use the below query to pass in the raw vector query instead of the query vectorization
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")
results = search_client.search(
search_text=None,
vector_queries= [vector_query],
top=1
)
for result in results:
print(f"title: {result['title']}")
print(f"description: {result['description']}")
print(f"label: {result['label']}")
title: World Briefings
description: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the quot;alarming quot; growth of greenhouse gases.
label: World
Perform RAG Using Your Data and GPT-4
import openai
client = openai.AzureOpenAI(
azure_endpoint=AZURE_OPENAI_ENDPOINT,
api_key=AZURE_OPENAI_API_KEY,
api_version="2024-02-01",
)
completion = client.chat.completions.create(
model=AZURE_OPENAI_CHAT_COMPLETION_DEPLOYED_MODEL_NAME,
messages=[
{
"role": "user",
            "content": query,  # Reuse the same query from the vector search above
},
],
extra_body={
"data_sources": [
{
"type": "azure_search",
"parameters": {
"endpoint": SEARCH_SERVICE_ENDPOINT,
"index_name": INDEX_NAME,
"authentication": {
"type": "api_key",
"key": SEARCH_SERVICE_API_KEY,
},
"query_type": "vector_semantic_hybrid",
"embedding_dependency": {
"type": "deployment_name",
"deployment_name": AZURE_OPENAI_EMBEDDING_DEPLOYED_MODEL_NAME,
},
"semantic_configuration": "mySemanticConfig",
},
}
],
},
)
import textwrap
if completion.choices:
message_content = completion.choices[0].message.content
wrapped_message_content = textwrap.fill(message_content, width=100)
print(f"AI Assistant (GPT-4): {wrapped_message_content}")
AI Assistant (GPT-4): Prime Minister Tony Blair urged the international community to consider global warming a dire threat
and to agree on a plan of action to curb the "alarming" growth of greenhouse gases [doc1].
For more information on how to use your Azure AI Search Data Source for RAG with Azure OpenAI Service, please visit Use your own data with Azure OpenAI Service - Azure OpenAI | Microsoft Learn
Conclusion
Summarizing the Workflow
In this tutorial, we explored how to use Azure AI Search's integrated vectorization feature to perform advanced searches over a CSV file. We walked through the process from setting up the environment, uploading the CSV file to Azure Blob Storage, creating and configuring the search index, setting up and running the skillset, and performing a simple vector search. This approach allows for seamless data orchestration and advanced retrieval, making it easier to integrate with applications like GPT-4.