Index file content and metadata by using Azure AI Search after chunking

Tho Le 40 Reputation points
2024-09-13T03:08:37.5733333+00:00

Hi, I have been following the approach provided at Index file content and metadata by using Azure AI Search to combine two sources of data: file content from Blob Storage/ADLS Gen2, and metadata from Azure Table Storage. The idea of this approach is that:

  1. Use a blob indexer to build an index for the file content, where each file/document is uniquely identified by its storage path. This unique storage path becomes the key in the resulting index.
  2. Then use a table indexer to index the file metadata into the same index, using the unique file storage path to map each file's metadata to its exact file content.
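
To illustrate, the two indexer definitions in that approach boil down to roughly the following (a simplified sketch; the index name, data source names, key field "id" and the FilePath column are placeholders for my own resources):

    # Simplified sketch of the two indexer definitions (REST API JSON payloads,
    # written here as Python dicts). Names are placeholders.
    blob_indexer = {
        "name": "files-content-indexer",
        "dataSourceName": "files-blob-ds",
        "targetIndexName": "files-index",
        "fieldMappings": [
            {
                # the storage path becomes the document key, base64-encoded so it is key-safe
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }
    table_indexer = {
        "name": "files-metadata-indexer",
        "dataSourceName": "files-table-ds",
        "targetIndexName": "files-index",  # same index as the blob indexer
        "fieldMappings": [
            {
                # table column holding the same storage path, encoded the same way
                "sourceFieldName": "FilePath",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }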

However, I am running into an issue where the file content is large, so in the blob indexer I have to use the Text Split skill to break the text into chunks before embedding them. This process automatically generates multiple chunk documents, each with its own "chunk_id", for every original document, and chunk_id becomes the unique key in the resulting index. This makes it impossible for me to map the metadata from the Azure Table into the same index: the metadata table still uses the original file storage path as its unique key, while the index now keys on chunk_id, and I have no control over how these chunk_id values are generated.

For example, here is an error message showing that the mapping from the table indexer to the existing index no longer works because of this chunk_id:

(screenshot of the indexer error message)

Could anyone provide some possible solutions for this scenario? I would really appreciate it! Thanks in advance.

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Amira Bedhiafi 39,106 Reputation points Volunteer Moderator
    2025-10-20T18:01:28.31+00:00

    Hello Tho Le!

    Thank you for posting on Microsoft Learn Q&A.

    I think you should keep one indexer (Blob/ADLS) and pull the table metadata during enrichment, so each chunk gets the file metadata.

    You need to add a Web API skill that receives metadata_storage_path and looks up the metadata in your Azure Table (via an Azure Function), and then use a ShaperSkill to merge the returned metadata object onto each chunk object.
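
    For illustration only, here is a minimal sketch of what such a function could look like (Python, Functions v2 model). It assumes the metadata table keys its rows by the storage path (or an encoded form of it); the table name "FileMetadata", the PartitionKey "files" and the route name are placeholders:

        # Custom-skill function: receives {"values": [{"recordId", "data": {"path": ...}}]}
        # from the skillset and returns the looked-up metadata in the same envelope.
        import json
        import azure.functions as func
        from azure.data.tables import TableClient

        app = func.FunctionApp()

        @app.route(route="lookup-metadata", auth_level=func.AuthLevel.FUNCTION)
        def lookup_metadata(req: func.HttpRequest) -> func.HttpResponse:
            table = TableClient.from_connection_string(
                conn_str="<storage-connection-string>", table_name="FileMetadata")
            results = []
            for record in req.get_json().get("values", []):
                path = record["data"]["path"]
                try:
                    # assumption: rows are keyed by the (encoded) storage path
                    entity = table.get_entity(partition_key="files", row_key=path)
                    metadata = {k: v for k, v in entity.items()
                                if k not in ("PartitionKey", "RowKey")}
                    results.append({"recordId": record["recordId"],
                                    "data": {"metadata": metadata},
                                    "errors": None, "warnings": None})
                except Exception as ex:
                    results.append({"recordId": record["recordId"], "data": {},
                                    "errors": [{"message": str(ex)}], "warnings": None})
            return func.HttpResponse(json.dumps({"values": results}),
                                     mimetype="application/json")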

    The output field mappings then write both the chunk fields (text/vector) and the replicated metadata fields into the same index document. The result: one index, one indexer, chunk-level documents with all the metadata, and no key conflict.
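
    To make the skillset shape concrete, a simplified fragment of the extra skills could look like this (again JSON written as Python dicts). It assumes your Split skill already writes chunks to /document/chunks/*; the function URL, the "department" metadata column and the target names are placeholders:

        # WebApiSkill looks up the metadata once per document; ShaperSkill then
        # attaches it to every chunk. Placeholders throughout.
        extra_skills = [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "context": "/document",
                "uri": "https://<your-function-app>.azurewebsites.net/api/lookup-metadata",
                "httpMethod": "POST",
                "inputs": [{"name": "path", "source": "/document/metadata_storage_path"}],
                "outputs": [{"name": "metadata", "targetName": "fileMetadata"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
                "context": "/document/chunks/*",
                "inputs": [
                    {"name": "text", "source": "/document/chunks/*"},
                    {"name": "department", "source": "/document/fileMetadata/department"},
                ],
                "outputs": [{"name": "output", "targetName": "chunkWithMetadata"}],
            },
        ]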

    You can store metadata with the content source so you can still use a single blob indexer:

    • Option A: put the file-level metadata directly on the blobs as custom blob metadata, so the blob indexer picks it up (see the short sketch after this list).
    • Option B: drop a sidecar JSON per file that your skillset reads and merges before or while chunking.

    Either way you end up with one indexer and chunk documents that already include the metadata.
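
    As a small sketch of option A (Python SDK; connection string, container, blob name and metadata keys are placeholders), the metadata is attached directly to each blob so the indexer can surface it as fields:

        # Attach file-level metadata to the blob itself; the blob indexer can then
        # map these values into index fields. Placeholders throughout.
        from azure.storage.blob import BlobServiceClient

        service = BlobServiceClient.from_connection_string("<storage-connection-string>")
        blob = service.get_blob_client(container="documents", blob="reports/2024/q1.pdf")
        blob.set_blob_metadata({"department": "finance", "doc_type": "report"})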

    You can chunk the documents before indexing (Databricks/ADF/Azure Function) and write a JSONL/Parquet where each row is:

    key: "<metadata_storage_path>#<chunk_no>"
    path: "<metadata_storage_path>"
    chunk_no: <n>
    text: ...
    vector: ...
    … + all metadata columns
    

    Point a single indexer at this dataset. You fully control the key (path#chunk_no) and avoid joining altogether.
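
    A rough sketch of that pre-chunking step (the simple_chunks splitter, the embed callable and the metadata dict stand in for whatever text splitter, embedding model and Azure Table lookup you use; note that index keys only allow letters, digits, underscores, dashes and equal signs, so the path part of the key is base64-encoded here):

        # Pre-chunk one file and write one JSONL row per chunk, with metadata replicated.
        import base64
        import json

        def simple_chunks(text, size=2000, overlap=200):
            # naive fixed-size chunking with overlap; swap in your own splitter
            step = size - overlap
            return [text[i:i + size] for i in range(0, len(text), step)] or [""]

        def write_rows(path, text, metadata, embed, out_file):
            # metadata: dict of columns pulled from the Azure Table for this file
            # embed: callable returning an embedding vector for a chunk
            safe_path = base64.urlsafe_b64encode(path.encode()).decode()
            for n, chunk in enumerate(simple_chunks(text)):
                row = {
                    "key": f"{safe_path}_{n}",  # you fully control this key
                    "path": path,
                    "chunk_no": n,
                    "text": chunk,
                    "vector": embed(chunk),
                    **metadata,                 # all metadata columns replicated per chunk
                }
                out_file.write(json.dumps(row) + "\n")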

    If you must keep two indexes, have the chunk index documents include the original path as a field in addition to chunk_id, and keep a separate metadata index keyed by metadata_storage_path.

    At query time, run the vector search on the chunk index, take the top chunks, and join them to the metadata index in your app by metadata_storage_path. This avoids re-indexing, but the join happens in application code.
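
    A minimal sketch of that application-side join with the Python SDK (azure-search-documents 11.4+; index names, field names and the query vector are placeholders, and the path field must be filterable in the metadata index):

        # Vector search on the chunk index, then join the top chunks to their file
        # metadata by path. Placeholders throughout.
        from azure.core.credentials import AzureKeyCredential
        from azure.search.documents import SearchClient
        from azure.search.documents.models import VectorizedQuery

        endpoint = "https://<service>.search.windows.net"
        credential = AzureKeyCredential("<query-key>")
        chunk_client = SearchClient(endpoint, "chunk-index", credential)
        meta_client = SearchClient(endpoint, "metadata-index", credential)

        query_vector = [0.0] * 1536  # replace with the embedding of the user's query

        hits = list(chunk_client.search(
            search_text=None,
            vector_queries=[VectorizedQuery(vector=query_vector,
                                            k_nearest_neighbors=10,
                                            fields="vector")],
            select=["chunk_id", "text", "path"],
            top=10,
        ))

        # look up the metadata documents for the matched paths in one filtered query
        paths = {hit["path"] for hit in hits}
        flt = "search.in(metadata_storage_path, '{}', '|')".format("|".join(paths))
        metadata_by_path = {doc["metadata_storage_path"]: doc
                            for doc in meta_client.search(search_text="*", filter=flt)}
        for hit in hits:
            hit["file_metadata"] = metadata_by_path.get(hit["path"])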

