Azure Document Intelligence - Automating Monthly Invoice Extraction from Yearly PDFs

Question


PGS-7643 1

I need to extract invoice data from multi-page yearly PDFs in which each set of pages corresponds to a different month. The PDFs can contain images. Is there a better option than splitting the PDF on fixed page numbers and sending each part to Document Intelligence? The problem with that approach is that it breaks when the invoice template changes.

Sridhar M 1,220 Reputation points Microsoft External Staff Moderator

2025-10-20T04:02:55.2066667+00:00
Hi PGS-7643,

If you're working with yearly PDFs that contain multiple monthly invoices, and each invoice has a different layout or number of pages, splitting the PDF by fixed page numbers is not a reliable solution. It can break when the invoice format or length changes.

Fixed page splitting assumes each invoice is the same length. But in reality, invoices often vary. Some months may have more items or a different layout. This makes the fixed approach hard to maintain and easy to break.
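To make the failure mode concrete, here is a minimal sketch. The page counts are made up for illustration, and `fixed_ranges` is a hypothetical helper, not part of any Azure SDK. It shows that one longer invoice shifts every chunk that follows it.

```python
def fixed_ranges(total_pages, pages_per_invoice):
    """Split pages 1..total_pages into equal-sized chunks (1-based, inclusive)."""
    return [
        (start, min(start + pages_per_invoice - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_invoice)
    ]


# Hypothetical yearly PDF excerpt: the second invoice runs to three pages.
actual = [(1, 2), (3, 5), (6, 7)]

# What a fixed two-pages-per-invoice split would assume instead.
assumed = fixed_ranges(total_pages=7, pages_per_invoice=2)

# Every chunk after the longer invoice no longer matches a real invoice.
misaligned = [chunk for chunk in assumed if chunk not in actual]
print(assumed)     # [(1, 2), (3, 4), (5, 6), (7, 7)]
print(misaligned)  # [(3, 4), (5, 6), (7, 7)]
```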

Custom Model in Azure Document Intelligence

Instead of splitting the PDF, you can use Azure Document Intelligence’s Custom Model to extract the data directly. This model learns from your actual invoice samples, so it works better with different layouts and structures.

If your PDF has scanned images, don’t worry. Azure’s built-in OCR (Optical Character Recognition) can read and extract text from images, allowing the model to still find key data like invoice numbers and totals.

Steps to Set It Up

1. Prepare your invoices – Collect samples with different formats.

2. Train a custom model – Use Document Intelligence Studio to upload, label fields (like Invoice Date, Total), and train the model.

3. Test and refine – Run tests with other PDFs to make sure it works well.

4. Use the model in your workflow – Upload full PDFs and let the model extract the invoice data automatically.

Reference:

https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/prebuilt/invoice?view=doc-intel-4.0.0&wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider

https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/?view=doc-intel-4.0.0

https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0&wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#models-and-development-options

Thank you!
Sridhar M 1,220 Reputation points Microsoft External Staff Moderator

2025-10-21T08:08:42.7+00:00

Hi PGS-7643

Did you get a chance to review the above response?

Thank you!
Sridhar M 1,220 Reputation points Microsoft External Staff Moderator

2025-10-22T08:27:58.3833333+00:00

Hi PGS-7643

We haven’t heard from you on the last response and are just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

Answer 1

Hello PGS-7643,

Welcome to Microsoft Q&A, and thank you for posting your question here.

Reading through the question and response:

Document Intelligence does not automatically split repeated instances of the same document type within a single file unless you use the classification/splitting features correctly or pre-process the file so that each input contains one invoice.

Therefore, use a Document Intelligence classification + splitting pipeline (recommended). The goal is to detect invoice boundaries inside the yearly PDF, split it into individual invoice documents (page ranges), and then run invoice extraction on each invoice.

If classification + splitting isn't practical, pre-process with OCR text detection + heuristic splitting (fallback). That is, if you can’t train a classifier (for example, because you lack labeled multi-document inputs) and the invoices have clear repeating headers or keywords (e.g., “Invoice No:” appears at the top of each invoice), a robust heuristic pre-splitter can work.

So, in a nutshell:

Don’t simply send the whole yearly PDF to a single extractor and expect one invoice per month; design a split + extract pipeline. Use classification + splitting where possible, and train the classifier with multi-invoice files so that it learns page-range boundaries. https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/train/custom-model?view=doc-intel-4.0.0
Also, for each detected invoice page range, run a dedicated invoice extractor and apply post-processing (table joining, date→month mapping, confidence checks). https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/prebuilt/invoice?view=doc-intel-4.0.0
Lastly, build monitoring, fallback human review, and batching to handle service limits and template drift. https://www.webnethelper.com/2024/11/azure-ai-document-intelligence.html
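The post-processing and human-review points above could look something like the sketch below. The input shape is a simplified, hypothetical dictionary that you would build yourself from the extractor output (it is not the SDK's own result type), dates are assumed to be normalized to ISO `YYYY-MM-DD` strings, and the 0.80 threshold is an arbitrary placeholder.

```python
from collections import defaultdict
from datetime import date

MIN_CONFIDENCE = 0.80  # hypothetical threshold; tune it against your own data


def route_invoices(invoices, min_confidence=MIN_CONFIDENCE):
    """Group invoices by 'YYYY-MM' and collect the ones needing human review.

    Each invoice is assumed to look like:
    {"pages": (1, 2),
     "fields": {"InvoiceDate": {"value": "2024-01-15", "confidence": 0.97}}}
    """
    by_month = defaultdict(list)
    needs_review = []
    for inv in invoices:
        fields = inv["fields"]
        # Confidence check: flag every field below the threshold.
        reasons = [
            f"low confidence: {name}"
            for name, field in fields.items()
            if field["confidence"] < min_confidence
        ]
        # Date-to-month mapping: a missing or malformed date also needs review.
        try:
            month = date.fromisoformat(fields["InvoiceDate"]["value"]).strftime("%Y-%m")
        except (KeyError, TypeError, ValueError):
            month = None
            reasons.append("missing or unparseable InvoiceDate")
        if reasons:
            needs_review.append((inv["pages"], reasons))
        else:
            by_month[month].append(inv)
    return dict(by_month), needs_review


sample = [
    {"pages": (1, 2), "fields": {
        "InvoiceDate": {"value": "2024-01-15", "confidence": 0.97},
        "InvoiceTotal": {"value": 120.0, "confidence": 0.95}}},
    {"pages": (3, 4), "fields": {
        "InvoiceDate": {"value": "2024-02-10", "confidence": 0.91},
        "InvoiceTotal": {"value": 80.0, "confidence": 0.42}}},
    {"pages": (5, 5), "fields": {
        "InvoiceTotal": {"value": 50.0, "confidence": 0.99}}},
]
grouped, review = route_invoices(sample)
print(sorted(grouped))  # ['2024-01']
print(review)
```

A rising share of invoices landing in the review list is also a cheap signal of template drift, which is usually the cue to relabel samples and retrain.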

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close out the thread by upvoting this response and accepting it as an answer if it is helpful.
