Azure Document Intelligence not processing all pages when there is a blank page

Question


Ewan Davies 45

I am using Document Intelligence to convert PDF documents to markdown, which is then passed to an LLM. I noticed that one document had only 2 of its 12 pages processed. This turns out to be because page 3 is blank, which Document Intelligence presumably treats as the end of the file.

Is this intended behaviour? It seems like an issue to me since surely it should recognise that there is content on the remaining pages.

I can get Document Intelligence to process the whole document if I specify the pages to skip the blank one ("1-2,4-12"), but I'm not sure I want to go down the route of trying to identify blank pages before processing. Do you have any other suggested workarounds?

I'm considering using Content Understanding, which doesn't have the same issue, but it's not ideal because it is only available in a limited number of data centres.

Here's a document I'm able to share that reproduces the issue (presumably any PDF with a blank page would): Lorem Ipsum.pdf

1 answer


Answer 1

Hi,

Thanks for reaching out to Microsoft Q&A.

No, this is not the intended behaviour of Document Intelligence, but it seems to be a known limitation in certain layout or prebuilt models. When a completely blank page (page 3 in your PDF) appears, the service may assume it marks the end of meaningful content, especially if the blank page occurs early in the document. That would explain the premature truncation during processing.

It may be that Document Intelligence's page segmentation logic mistakenly treats the blank page as a signal to stop processing, especially with models that are optimized for structured documents such as invoices, contracts, or layouts.

Workarounds you can try:

Preprocess the PDF to remove blank pages (recommended)

Use a preprocessing step to automatically remove blank pages before uploading:

With PyMuPDF, pdfplumber, or pdfminer.six in Python, you can detect and drop pages with no text or pixel content. This keeps your pipeline dynamic and avoids hardcoding page ranges.
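A rough sketch of that preprocessing step, assuming PyMuPDF is installed (`pip install pymupdf`). The helper names (`to_page_ranges`, `non_blank_pages`, `write_without_blank_pages`) are made up for illustration, and the blank-page test is only a heuristic: a scanned blank page still contains an image, so it would need an extra pixel check before it could be dropped.

```python
from typing import Iterable, List


def to_page_ranges(pages: Iterable[int]) -> str:
    """Compress 1-based page numbers into a range string such as "1-2,4-12"."""
    ordered = sorted(set(pages))
    if not ordered:
        return ""
    parts: List[str] = []
    start = prev = ordered[0]
    for number in ordered[1:]:
        if number == prev + 1:
            # Still inside a consecutive run, so just extend it.
            prev = number
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = number
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def non_blank_pages(pdf_path: str) -> List[int]:
    """Return 1-based numbers of pages that have text, images, or drawings."""
    import fitz  # PyMuPDF, imported lazily so to_page_ranges works without it

    keep: List[int] = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            has_text = bool(page.get_text().strip())
            has_graphics = bool(page.get_images()) or bool(page.get_drawings())
            if has_text or has_graphics:
                keep.append(index + 1)
    return keep


def write_without_blank_pages(src_path: str, dst_path: str) -> List[int]:
    """Save a copy of the PDF that contains only the non-blank pages."""
    import fitz  # PyMuPDF

    keep = non_blank_pages(src_path)
    if not keep:
        raise ValueError("every page looks blank; nothing to write")
    with fitz.open(src_path) as doc:
        doc.select([number - 1 for number in keep])  # select() is 0-based
        doc.save(dst_path)
    return keep
```

If you would rather leave the original file untouched, `to_page_ranges(non_blank_pages(path))` builds a pages string in the same form as the "1-2,4-12" value you are already passing, without hardcoding it.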

Switch to the Read or Layout model

If you are not using custom models and only need text, try the Read model or the Layout model, which tend to be more tolerant of blank pages and process all pages unless explicitly told to skip some.

Use OCR fallback logic

If only some pages are missed, run a secondary OCR pass (Azure’s Read API or Tesseract) on the missing pages and stitch the results together manually.
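The bookkeeping for that fallback could look something like the sketch below. It is plain Python with no service calls. Both function names are hypothetical, and it assumes you have already collected the page numbers that came back from the first pass (each page in the analysis result carries its page number) along with per-page text from both passes.

```python
from typing import Dict, Iterable, List


def find_missing_pages(total_pages: int, returned_pages: Iterable[int]) -> List[int]:
    """Return the 1-based page numbers that are absent from an analysis result."""
    seen = set(returned_pages)
    return [number for number in range(1, total_pages + 1) if number not in seen]


def stitch_pages(primary: Dict[int, str], fallback: Dict[int, str]) -> str:
    """Merge per-page text in page order; the primary pass wins on overlap."""
    merged = {**fallback, **primary}
    return "\n\n".join(merged[number] for number in sorted(merged))


# Example: the first pass returned only pages 1 and 2 of a 12-page document.
missing = find_missing_pages(12, [1, 2])
```

The pages listed in `missing` are the ones to send through the secondary pass. A genuinely blank page will simply come back with empty text, which is harmless when stitching.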

Please 'Upvote' (thumbs-up) and 'Accept' the answer if the reply was helpful. This will benefit other community members who face the same issue.
