Hello PGS-7643,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
Having read through your question and the earlier response:
Document Intelligence does not automatically split repeated instances of the same document type within a single file. You either need to use the classifier's splitting feature correctly, or pre-process the file so that each input contains exactly one invoice.

Recommended: a Document Intelligence classification + splitting pipeline. The goal is to detect invoice boundaries inside the yearly PDF, split it into individual invoice documents (page ranges), and then run invoice extraction on each invoice.

Fallback: OCR text detection + heuristic splitting. If classification + split isn't practical because you can’t train a classifier (for example, you lack labeled multi-document inputs), and the invoices have clear repeating headers or keywords (e.g., “Invoice No:” appears at the top of each invoice), a robust heuristic pre-splitter can work.
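To make the fallback concrete, here is a minimal sketch of a heuristic pre-splitter. It assumes you already have the OCR text of each page as a list of strings (for example from a read/layout pass), and that a marker such as “Invoice No:” appears near the top of the first page of every invoice. The function name, the regular expression, and the `header_lines` parameter are illustrative assumptions, not part of any Document Intelligence API.

```python
import re
from typing import List, Tuple

# Assumed marker: adjust to whatever reliably appears on the first page of each invoice.
INVOICE_START = re.compile(r"invoice\s*no\s*[:#]", re.IGNORECASE)


def split_into_page_ranges(page_texts: List[str], header_lines: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per detected invoice.

    A page starts a new invoice when the marker is found in its first
    `header_lines` lines. Pages before the first marker (e.g. a cover page)
    are not assigned to any invoice.
    """
    starts = []
    for page_number, text in enumerate(page_texts, start=1):
        top_of_page = "\n".join(text.splitlines()[:header_lines])
        if INVOICE_START.search(top_of_page):
            starts.append(page_number)

    ranges = []
    for i, first in enumerate(starts):
        # An invoice runs until the page before the next start, or to the end of the file.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else len(page_texts)
        ranges.append((first, last))
    return ranges


if __name__ == "__main__":
    pages = [
        "Invoice No: 001\nJanuary services",
        "Line items continued",
        "Invoice No: 002\nFebruary services",
    ]
    print(split_into_page_ranges(pages))  # [(1, 2), (3, 3)]
```

Each resulting range can then be physically split out of the PDF, or formatted as a page-range string such as "1-2" and analyzed separately. Treat the marker and the header window as things to tune against your own files, and route files where no marker is found to manual review.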
So, in a nutshell:
- Don’t simply send the whole yearly PDF to a single extractor and expect one invoice per month; design a split + extract pipeline instead. Use classification + split where possible, and train the classifier with multi-invoice files so it learns page-range boundaries. https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/train/custom-model?view=doc-intel-4.0.0
- Also, for each detected invoice page-range, run a dedicated invoice extractor and apply post-processing (table joining, date→month mapping, confidence checks). https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/prebuilt/invoice?view=doc-intel-4.0.0
- Lastly, build monitoring, fallback human review, and batching to handle limits and template drift. https://www.webnethelper.com/2024/11/azure-ai-document-intelligence.html
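As a rough illustration of the post-processing and human-review steps above, the sketch below maps each extracted invoice to a month and sets aside low-confidence results. The `ExtractedInvoice` type, its field names, and the 0.8 threshold are hypothetical; in practice you would fill them from the date field and confidence score your extractor returns for each page range.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Optional, Tuple


class ExtractedInvoice(NamedTuple):
    """Hypothetical container for the per-invoice values we care about."""
    pages: str                     # page range the invoice came from, e.g. "1-2"
    invoice_date: Optional[date]   # None when the extractor found no date
    date_confidence: float         # extractor confidence for the date, 0.0 to 1.0


def group_by_month(
    invoices: List[ExtractedInvoice], min_confidence: float = 0.8
) -> Tuple[Dict[str, List[ExtractedInvoice]], List[ExtractedInvoice]]:
    """Group invoices by 'YYYY-MM'; return (by_month, needs_review)."""
    by_month: Dict[str, List[ExtractedInvoice]] = defaultdict(list)
    needs_review: List[ExtractedInvoice] = []
    for inv in invoices:
        # Missing or low-confidence dates go to human review instead of a month bucket.
        if inv.invoice_date is None or inv.date_confidence < min_confidence:
            needs_review.append(inv)
        else:
            by_month[inv.invoice_date.strftime("%Y-%m")].append(inv)
    return dict(by_month), needs_review
```

With this shape, a month that ends up with zero or several invoices is easy to spot, which is a cheap signal that a split boundary was missed.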
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting this response and accepting it as the answer if it is helpful.