Using Perplexity Contextualized Embeddings with Hybrid Search in Azure AI Search

Principal Product Manager at Microsoft CoreAI leading Retrieval-Augmented Generation (RAG) and knowledge retrieval capabilities, powering enterprise AI applications with grounded, production-scale search and retrieval.

Perplexity just launched new SOTA embedding models: pplx-embed. In this blog, I'll examine what I think is novel about them and show you how to use them with Azure AI Search.

The headline feature is contextualized embeddings. If you've built a RAG pipeline, you know the pain: you chunk a PDF, embed each chunk independently, and your search retrieves the wrong chunk because the embedding model had zero awareness of surrounding context. A chunk that reads "The results are shown in Table 2" embeds as a generic table reference; the model never saw what Table 2 contains.

Perplexity's approach is to embed all chunks from a document together in a single API call. Each chunk's vector encodes its relationship to every other chunk, what Perplexity calls the "golden chunk" insight. We paired this with Azure AI Search's hybrid retrieval (BM25 + vector + semantic reranker) to build a search pipeline over two arxiv papers. Here's how.

What Are Contextualized Embeddings?

Standard embedding models take a single string and return a vector:

# OpenAI: each chunk embedded independently
embed("chunk about attention heads")  # -> vector in isolation

Perplexity's API takes a nested array containing all chunks from a document, in reading order:

# Perplexity: chunks share document context
embed([
    "introduction",
    "related work",
    "chunk about attention heads",
    "experiments and results"
])
# -> each vector *knows* its position in the document

The chunk about attention heads now produces a different (better) vector because the model saw that it follows related work and precedes experiments. A query like "how does multi-head attention work?" retrieves the right chunk even when the text alone is ambiguous.

Multi-Document Support

The API supports multiple documents in a single call via nested arrays:

input = [
    [doc1_chunk1, doc1_chunk2, doc1_chunk3],  # Paper A
    [doc2_chunk1, doc2_chunk2],                # Paper B
]

Chunks within the same inner array share context. Paper A's chunks don't "see" Paper B's chunks. This is architecturally correct, since they're separate documents.
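Here's a minimal sketch of building that nested input from chunk records, plus a parallel list that lets you map each returned vector back to its document and position. The helper names and the `doc_id` / `chunks` record layout are assumptions for illustration; only the nested-list shape comes from the API description above.

```python
from typing import Dict, List, Tuple

def build_nested_input(docs: List[Dict]) -> List[List[str]]:
    """Return one inner list of chunk texts per document, in reading order."""
    return [[chunk["text"] for chunk in doc["chunks"]] for doc in docs]

def index_positions(docs: List[Dict]) -> List[Tuple[str, int]]:
    """List (doc_id, chunk_position) pairs in the same order as the nested input."""
    return [
        (doc["doc_id"], position)
        for doc in docs
        for position, _ in enumerate(doc["chunks"])
    ]

# Hypothetical records standing in for two chunked papers.
docs = [
    {"doc_id": "paper-a", "chunks": [{"text": "A1"}, {"text": "A2"}, {"text": "A3"}]},
    {"doc_id": "paper-b", "chunks": [{"text": "B1"}, {"text": "B2"}]},
]

nested = build_nested_input(docs)   # [["A1", "A2", "A3"], ["B1", "B2"]]
positions = index_positions(docs)   # [("paper-a", 0), ..., ("paper-b", 1)]
```

Keeping the position list next to the nested input makes it harder to attach a vector to the wrong chunk when several documents go out in one call.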

Why Not Just Use Larger Chunks?

You might think: "I'll just use bigger chunks so each one has more context." That trades precision for recall. A 4,000-token chunk might contain the answer, but it also contains a lot of noise that dilutes the embedding and wastes your LLM's context window during generation.

Contextualized embeddings give you both: small, precise chunks with document-wide context baked into the vector.

The Pipeline

PDFs → Page-based chunks → Contextualized embeddings → Azure AI Search → Hybrid search

We indexed two arxiv papers to test cross-document retrieval:

  • "Attention Is All You Need": the original Transformer paper (16 chunks)

  • "From Local to Global: A Graph RAG Approach": Microsoft's GraphRAG paper (31 chunks)

Step 1: Chunk the PDFs

We use one chunk per page, with a sentence-boundary split for pages exceeding 4,000 characters. The simplified listing here shows only the per-page step, so chunk_index is always 0. Order matters because the model relies on sequential context:

from pypdf import PdfReader

def chunk_pdf(pdf_path):
    reader = PdfReader(str(pdf_path))
    chunks = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if not text:
            continue
        chunks.append({
            "page_number": page_num,
            "chunk_index": 0,
            "text": text,
        })
    return chunks
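The sentence-boundary split for long pages could look something like the sketch below. This is my own illustration rather than the notebook's code: the regex-based sentence splitter and the `split_on_sentences` name are assumptions, and a single sentence longer than the limit is left as one oversized piece.

```python
import re

MAX_CHARS = 4000  # page-size threshold mentioned in the text

def split_on_sentences(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into pieces of at most max_chars characters."""
    if len(text) <= max_chars:
        return [text]
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            pieces.append(current)   # close the current piece
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

# Tiny limit so the behavior is visible on a short string.
print(split_on_sentences("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Inside chunk_pdf, you would enumerate the returned pieces and use the position as chunk_index, which keeps sub-page chunks in reading order.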

Step 2: Generate Contextualized Embeddings

Perplexity returns INT8 quantized embeddings as base64 strings by default. We decode to float32 for Azure AI Search:

import base64
import numpy as np

def decode_base64_int8(b64_string: str) -> list[float]:
    raw_bytes = base64.b64decode(b64_string)
    int8_array = np.frombuffer(raw_bytes, dtype=np.int8)
    return int8_array.astype(np.float32).tolist()
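To see why the signed dtype matters, here's a small round trip that uses only the standard library (`struct` instead of NumPy, purely so the example has no dependencies). The sample values are made up; the point is what happens to negative components when the bytes are read as unsigned.

```python
import base64
import struct

def decode_signed(b64_string: str) -> list[int]:
    """Interpret each byte as a signed int8 in [-128, 127]."""
    raw = base64.b64decode(b64_string)
    return list(struct.unpack(f"{len(raw)}b", raw))

def decode_unsigned(b64_string: str) -> list[int]:
    """Interpret each byte as unsigned [0, 255] (the mistake to avoid)."""
    return list(base64.b64decode(b64_string))

# Made-up int8 components, packed and base64-encoded like an API payload.
original = [-128, -1, 0, 1, 127]
payload = base64.b64encode(struct.pack(f"{len(original)}b", *original)).decode("ascii")

print(decode_signed(payload))    # [-128, -1, 0, 1, 127]
print(decode_unsigned(payload))  # [128, 255, 0, 1, 127]
```

Reading the bytes as unsigned flips every negative component into a large positive one, and the resulting vector points somewhere else entirely without raising any error.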

The embedding call sends each document's chunks as a nested array. For free tier users, we batch into groups of 15 chunks with 60-second pauses to stay within the token-per-minute limit:

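import time

from perplexity import Perplexity

client = Perplexity(api_key=API_KEY)  # API_KEY: your Perplexity API key
CHUNKS_PER_BATCH = 15
PAUSE_SECONDS = 60  # free-tier pause between calls

calls_made = 0
for doc in docs:  # docs: one dict per paper, each with a "chunks" list
    texts = [c["text"] for c in doc["chunks"]]
    doc["vectors"] = []
    for batch_start in range(0, len(texts), CHUNKS_PER_BATCH):
        if calls_made:
            time.sleep(PAUSE_SECONDS)  # stay under the token-per-minute limit
        batch = texts[batch_start:batch_start + CHUNKS_PER_BATCH]
        response = client.contextualized_embeddings.create(
            input=[batch],  # nested array: one document
            model="pplx-embed-context-v1-0.6b",
            dimensions=1024,
        )
        calls_made += 1
        # response.data[0] holds the vectors for the single document we sent
        doc["vectors"].extend(
            decode_base64_int8(e.embedding) for e in response.data[0].data
        )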
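Once the vectors are collected, each one needs to be paired with its chunk before upload. A minimal sketch, assuming an index with `id`, `content`, `page_number`, and `content_vector` fields (only `content_vector` appears in the query code in this post; the other field names and the `build_search_documents` helper are hypothetical):

```python
def build_search_documents(doc_id: str, chunks: list[dict], vectors: list[list[float]]) -> list[dict]:
    """Pair each chunk with its vector, preserving reading order."""
    if len(chunks) != len(vectors):
        raise ValueError(
            f"{doc_id}: got {len(vectors)} vectors for {len(chunks)} chunks"
        )
    return [
        {
            "id": f"{doc_id}-{position}",      # unique key per chunk
            "content": chunk["text"],          # text used by BM25 and the reranker
            "page_number": chunk["page_number"],
            "content_vector": vector,          # decoded float vector
        }
        for position, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]

# Hypothetical two-chunk document with tiny placeholder vectors.
sample = build_search_documents(
    "paper-a",
    [{"text": "x", "page_number": 1}, {"text": "y", "page_number": 2}],
    [[0.1, 0.2], [0.3, 0.4]],
)
```

The resulting list can be passed to the SDK's `search_client.upload_documents(documents=...)`. The length check is there because a silent mismatch would attach vectors to the wrong text.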

Step 3: Create the Azure AI Search Index

One critical detail: Perplexity embeddings are unnormalized. You must use cosine as the HNSW metric; using dotProduct will give wrong results:

from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, HnswParameters,
)

vector_config = HnswAlgorithmConfiguration(
    name="pplx-hnsw",
    parameters=HnswParameters(
        m=4,
        ef_construction=400,
        ef_search=500,
        metric="cosine",  # Required for unnormalized embeddings
    ),
)
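A toy illustration of why the metric matters when vectors are unnormalized. The 2-D vectors below are made up, not real embeddings: one points in exactly the query's direction but is short, the other is 45 degrees off but long.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

query = [1.0, 0.0]
aligned_small = [2.0, 0.0]     # same direction as the query, small magnitude
off_axis_large = [50.0, 50.0]  # 45 degrees off, large magnitude

# Dot product rewards magnitude, so the off-axis vector wins.
print(dot(query, aligned_small), dot(query, off_axis_large))  # 2.0 50.0

# Cosine ignores magnitude, so the aligned vector wins.
print(cosine(query, aligned_small) > cosine(query, off_axis_large))  # True
```

If you needed dotProduct for some reason, an alternative would be to run vectors through something like `normalize` before indexing and querying, since dot product and cosine agree on unit-length vectors. Setting the metric to cosine is the simpler route.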

We also configure a semantic ranker to get the full hybrid retrieval pipeline: BM25 text matching, cosine vector similarity, and Microsoft's neural reranker all working together.
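For orientation, here's roughly how the pieces fit together in a single index definition, written as the JSON body you would send to the REST API rather than as SDK objects. Treat it as a sketch: the index name, the `pplx-profile` profile name, and the `id` / `content` / `page_number` fields are assumptions, and the property names reflect my understanding of the REST index schema, so check them against the API version you use.

```python
import json

VECTOR_DIMENSIONS = 1024  # matches dimensions=1024 in the embedding calls

index_definition = {
    "name": "pplx-papers",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "page_number", "type": "Edm.Int32", "filterable": True},
        {
            "name": "content_vector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": VECTOR_DIMENSIONS,
            "vectorSearchProfile": "pplx-profile",
        },
    ],
    "vectorSearch": {
        "algorithms": [
            {
                "name": "pplx-hnsw",
                "kind": "hnsw",
                "hnswParameters": {
                    "m": 4,
                    "efConstruction": 400,
                    "efSearch": 500,
                    "metric": "cosine",  # required for unnormalized embeddings
                },
            }
        ],
        # The profile ties the vector field to the HNSW configuration.
        "profiles": [{"name": "pplx-profile", "algorithm": "pplx-hnsw"}],
    },
    "semantic": {
        "configurations": [
            {
                "name": "pplx-semantic-ranker-config",
                "prioritizedFields": {
                    "prioritizedContentFields": [{"fieldName": "content"}]
                },
            }
        ]
    },
}

print(json.dumps(index_definition, indent=2))
```

The semantic configuration name is the one the query references through semantic_configuration_name, and the vector field name must match the fields value in the vector query.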

Queries are embedded the same way, wrapped as [[query]] (a single-chunk, single-document input), to stay in the same vector space:

from azure.search.documents.models import VectorizedQuery

# Embed the query
query_vector = client.contextualized_embeddings.create(
    input=[["How does multi-head attention work?"]],
    model="pplx-embed-context-v1-0.6b",
    dimensions=1024,
).data[0].data[0].embedding

# Hybrid search: BM25 + vector + semantic reranker
results = search_client.search(
    search_text="How does multi-head attention work?",
    vector_queries=[VectorizedQuery(
        vector=decode_base64_int8(query_vector),
        k_nearest_neighbors=5,
        fields="content_vector",
    )],
    query_type="semantic",
    semantic_configuration_name="pplx-semantic-ranker-config",
    top=5,
)

With two papers indexed, queries correctly route to the right document: Transformer questions surface Transformer chunks, and GraphRAG questions surface GraphRAG chunks. Ambiguous queries like "What are the key innovations compared to previous approaches?" return results from both.
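If you're wondering how the BM25 and vector result lists get combined in a hybrid query: Azure AI Search documents Reciprocal Rank Fusion (RRF) as the merging step, with the semantic ranker then rescoring the top results. The sketch below shows the RRF idea only. The chunk IDs are made up, k=60 is an assumed constant, and this is not the service's actual implementation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical top-3 results from each retriever.
bm25_ranking = ["chunk-a", "chunk-b", "chunk-c"]
vector_ranking = ["chunk-b", "chunk-c", "chunk-d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print([doc_id for doc_id, _ in fused])  # ['chunk-b', 'chunk-c', 'chunk-a', 'chunk-d']
```

Chunks that appear in both lists rise to the top even if neither retriever ranked them first, which is why hybrid search tends to be more robust than either signal alone.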

Gotchas We Hit

Unnormalized embeddings: This one's easy to miss. OpenAI pre-normalizes, so dotProduct and cosine are equivalent. Perplexity doesn't normalize, so dotProduct silently gives worse results. Always use cosine.

Free tier rate limits: The free tier has both a request-per-minute limit (20 RPM) and a token-per-minute limit. A 31-chunk paper (~22K tokens) will hit the TPM limit even if you're under 20 RPM. The fix is batching into ~15 chunks per call with 60-second pauses.

INT8 decoding: The base64 string decodes to signed int8 ([-128, 127]), not unsigned. Use np.int8, not np.uint8.
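For the rate-limit gotcha, a fixed batch size works, but you can also plan batches against an estimated token budget. A rough sketch: the 4-characters-per-token ratio and the 10,000-token budget are assumptions I picked for illustration, not published limits, and `plan_batches` is a hypothetical helper.

```python
from typing import List

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return max(1, len(text) // chars_per_token)

def plan_batches(texts: List[str], max_chunks: int = 15, token_budget: int = 10_000) -> List[List[str]]:
    """Group consecutive chunks so each batch respects both caps."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        # Flush when adding this chunk would break either cap.
        if current and (len(current) >= max_chunks or used + cost > token_budget):
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches

# 31 synthetic chunks of ~700 estimated tokens each (~22K tokens total).
texts = ["x" * 2800] * 31
batches = plan_batches(texts)
print([len(b) for b in batches])  # [14, 14, 3]
```

One trade-off to keep in mind: chunks only share context within a single inner array, so splitting a document across calls presumably means chunks in different batches don't see each other. On a paid tier, sending the whole document in one call avoids that.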

Perplexity vs. OpenAI Embeddings

| Feature | Perplexity pplx-embed | OpenAI text-embedding-3-large |
| --- | --- | --- |
| Contextualized | Chunks share document context | Each chunk independent |
| Multi-doc in one call | Nested arrays per document | Flat array of strings |
| MRL (flexible dims) | 128–2560 | 256–3072 |
| Default quantization | INT8 (base64) | float32 |
| Normalization | Unnormalized (cosine only) | Pre-normalized (dot product OK) |

The right choice depends on your data. If you're embedding structured documents (PDFs, research papers, legal filings, medical records) where inter-chunk context matters, Perplexity's contextualized embeddings are a clear win. For independent text snippets like product descriptions or FAQ entries, OpenAI works great and is simpler to integrate.

Try It Yourself

The full notebook is in the Azure AI Search - Perplexity AI Contextual Embeddings Cookbook repo. It downloads two arxiv papers, generates contextualized embeddings, indexes them in Azure AI Search, and runs hybrid queries, all in one notebook.

Happy coding!
