Index file content and metadata by using Azure AI Search after chunking

Tho Le 40 Reputation points
2024-09-13T03:08:37.5733333+00:00

Hi, I have been following the approach provided at Index file content and metadata by using Azure AI Search to combine two sources of data: file content from Blob Storage/ADLS Gen2, and metadata from Azure Table Storage. The idea of this approach is that:

  1. Use a blob indexer to build an index for the file content, where each file/document is uniquely identified by its storage path. This unique storage path becomes the key in the resulting index.
  2. Then use a table indexer to index the file metadata into the same index, using the unique file storage path to map each file's metadata to its exact file content.
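
To illustrate, the two indexer definitions in that approach boil down to roughly the following (a simplified sketch; the index name, data source names, key field "id" and the FilePath column are placeholders for my own resources):

    # Simplified sketch of the two indexer definitions (REST API JSON payloads,
    # written here as Python dicts). Names are placeholders.
    blob_indexer = {
        "name": "files-content-indexer",
        "dataSourceName": "files-blob-ds",
        "targetIndexName": "files-index",
        "fieldMappings": [
            {
                # the storage path becomes the document key, base64-encoded so it is key-safe
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }
    table_indexer = {
        "name": "files-metadata-indexer",
        "dataSourceName": "files-table-ds",
        "targetIndexName": "files-index",  # same index as the blob indexer
        "fieldMappings": [
            {
                # table column holding the same storage path, encoded the same way
                "sourceFieldName": "FilePath",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }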

However, I am running into an issue where the file content is large, so in the blob indexer I have to use the Text Split skill to break the text into chunks before embedding them. This process automatically generates multiple chunk documents, each with its own "chunk_id", for every original document, and chunk_id becomes the unique key in the resulting index. This makes it impossible for me to map the metadata from the Azure Table into the same index: the metadata table still uses the original file storage path as its unique key, while the index now keys on chunk_id, and I have no control over how these chunk_id values are generated.

For example, here is an error message showing that the mapping from the table indexer to the existing index no longer works because of this chunk_id:

(screenshot of the indexer error message)

Could anyone provide some possible solutions for this scenario? I would really appreciate it! Thanks in advance.

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Amira Bedhiafi 39,106 Reputation points Volunteer Moderator
    2025-10-20T18:01:28.31+00:00

    Hello Tho Le!

    Thank you for posting on Microsoft Learn Q&A.

    I think you should keep one indexer (Blob/ADLS) and pull the table metadata during enrichment, so each chunk gets the file metadata.

    You need to add a Web API skill that receives metadata_storage_path and looks up the metadata in your Azure Table (via an Azure Function), and then use a ShaperSkill to merge the returned metadata object onto each chunk object.
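
    For illustration only, here is a minimal sketch of what such a function could look like (Python, Functions v2 model). It assumes the metadata table keys its rows by the storage path (or an encoded form of it); the table name "FileMetadata", the PartitionKey "files" and the route name are placeholders:

        # Custom-skill function: receives {"values": [{"recordId", "data": {"path": ...}}]}
        # from the skillset and returns the looked-up metadata in the same envelope.
        import json
        import azure.functions as func
        from azure.data.tables import TableClient

        app = func.FunctionApp()

        @app.route(route="lookup-metadata", auth_level=func.AuthLevel.FUNCTION)
        def lookup_metadata(req: func.HttpRequest) -> func.HttpResponse:
            table = TableClient.from_connection_string(
                conn_str="<storage-connection-string>", table_name="FileMetadata")
            results = []
            for record in req.get_json().get("values", []):
                path = record["data"]["path"]
                try:
                    # assumption: rows are keyed by the (encoded) storage path
                    entity = table.get_entity(partition_key="files", row_key=path)
                    metadata = {k: v for k, v in entity.items()
                                if k not in ("PartitionKey", "RowKey")}
                    results.append({"recordId": record["recordId"],
                                    "data": {"metadata": metadata},
                                    "errors": None, "warnings": None})
                except Exception as ex:
                    results.append({"recordId": record["recordId"], "data": {},
                                    "errors": [{"message": str(ex)}], "warnings": None})
            return func.HttpResponse(json.dumps({"values": results}),
                                     mimetype="application/json")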

    The output field mappings then write both the chunk fields (text/vector) and the replicated metadata fields into the same index document. The result: one index, one indexer, chunk-level documents with all the metadata, and no key conflict.
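
    To make the skillset shape concrete, a simplified fragment of the extra skills could look like this (again JSON written as Python dicts). It assumes your Split skill already writes chunks to /document/chunks/*; the function URL, the "department" metadata column and the target names are placeholders:

        # WebApiSkill looks up the metadata once per document; ShaperSkill then
        # attaches it to every chunk. Placeholders throughout.
        extra_skills = [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "context": "/document",
                "uri": "https://<your-function-app>.azurewebsites.net/api/lookup-metadata",
                "httpMethod": "POST",
                "inputs": [{"name": "path", "source": "/document/metadata_storage_path"}],
                "outputs": [{"name": "metadata", "targetName": "fileMetadata"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
                "context": "/document/chunks/*",
                "inputs": [
                    {"name": "text", "source": "/document/chunks/*"},
                    {"name": "department", "source": "/document/fileMetadata/department"},
                ],
                "outputs": [{"name": "output", "targetName": "chunkWithMetadata"}],
            },
        ]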

    You can store metadata with the content source so you can still use a single blob indexer:

    • Option A: put the file-level metadata directly on the blobs as custom blob metadata, so the blob indexer picks it up (see the short sketch after this list).
    • Option B: drop a sidecar JSON per file that your skillset reads and merges before or while chunking.

    Either way you end up with one indexer and chunk documents that already include the metadata.
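
    As a small sketch of option A (Python SDK; connection string, container, blob name and metadata keys are placeholders), the metadata is attached directly to each blob so the indexer can surface it as fields:

        # Attach file-level metadata to the blob itself; the blob indexer can then
        # map these values into index fields. Placeholders throughout.
        from azure.storage.blob import BlobServiceClient

        service = BlobServiceClient.from_connection_string("<storage-connection-string>")
        blob = service.get_blob_client(container="documents", blob="reports/2024/q1.pdf")
        blob.set_blob_metadata({"department": "finance", "doc_type": "report"})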

    You can chunk the documents before indexing (Databricks/ADF/Azure Function) and write a JSONL/Parquet where each row is:

    key: "<metadata_storage_path>#<chunk_no>"
    path: "<metadata_storage_path>"
    chunk_no: <n>
    text: ...
    vector: ...
    … + all metadata columns
    

    Point a single indexer at this dataset. You fully control the key (path#chunk_no) and avoid joining altogether.
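
    A rough sketch of that pre-chunking step (the simple_chunks splitter, the embed callable and the metadata dict stand in for whatever text splitter, embedding model and Azure Table lookup you use; note that index keys only allow letters, digits, underscores, dashes and equal signs, so the path part of the key is base64-encoded here):

        # Pre-chunk one file and write one JSONL row per chunk, with metadata replicated.
        import base64
        import json

        def simple_chunks(text, size=2000, overlap=200):
            # naive fixed-size chunking with overlap; swap in your own splitter
            step = size - overlap
            return [text[i:i + size] for i in range(0, len(text), step)] or [""]

        def write_rows(path, text, metadata, embed, out_file):
            # metadata: dict of columns pulled from the Azure Table for this file
            # embed: callable returning an embedding vector for a chunk
            safe_path = base64.urlsafe_b64encode(path.encode()).decode()
            for n, chunk in enumerate(simple_chunks(text)):
                row = {
                    "key": f"{safe_path}_{n}",  # you fully control this key
                    "path": path,
                    "chunk_no": n,
                    "text": chunk,
                    "vector": embed(chunk),
                    **metadata,                 # all metadata columns replicated per chunk
                }
                out_file.write(json.dumps(row) + "\n")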

    If you must keep two indexes, have the chunk index documents include the original path as a field in addition to chunk_id, and keep a separate metadata index keyed by metadata_storage_path.

    At query time, run the vector search on the chunk index, take the top chunks, and join them to the metadata index in your app by metadata_storage_path. This avoids re-indexing, but the join happens in application code.
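
    A minimal sketch of that application-side join with the Python SDK (azure-search-documents 11.4+; index names, field names and the query vector are placeholders, and the path field must be filterable in the metadata index):

        # Vector search on the chunk index, then join the top chunks to their file
        # metadata by path. Placeholders throughout.
        from azure.core.credentials import AzureKeyCredential
        from azure.search.documents import SearchClient
        from azure.search.documents.models import VectorizedQuery

        endpoint = "https://<service>.search.windows.net"
        credential = AzureKeyCredential("<query-key>")
        chunk_client = SearchClient(endpoint, "chunk-index", credential)
        meta_client = SearchClient(endpoint, "metadata-index", credential)

        query_vector = [0.0] * 1536  # replace with the embedding of the user's query

        hits = list(chunk_client.search(
            search_text=None,
            vector_queries=[VectorizedQuery(vector=query_vector,
                                            k_nearest_neighbors=10,
                                            fields="vector")],
            select=["chunk_id", "text", "path"],
            top=10,
        ))

        # look up the metadata documents for the matched paths in one filtered query
        paths = {hit["path"] for hit in hits}
        flt = "search.in(metadata_storage_path, '{}', '|')".format("|".join(paths))
        metadata_by_path = {doc["metadata_storage_path"]: doc
                            for doc in meta_client.search(search_text="*", filter=flt)}
        for hit in hits:
            hit["file_metadata"] = metadata_by_path.get(hit["path"])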

