Full PDF Not Being Extracted (Large Handwritten Scanned Lab Notebooks)

Siddhant Kumta 60 Reputation points
2025-10-22T14:47:17.5033333+00:00

I am currently working on a universal document text extractor. The goal is to make a large backlog of files, papers, etc. efficiently searchable instead of manually going through them whenever something is needed. The current pipeline extracts the documents with the read model in Document Intelligence, chunks the output into an AI Search index, and then queries that index to get results. At the current stage I can query the documents fairly successfully with the information that is in the index.

On further inspection, though, I realized that only about 80% of the data per document was being parsed. My current test documents are 8 handwritten lab notebooks with scientific information. They are unstructured: not every notebook is written the same way, and there are notes at the bottom of pages, writing in random places, etc. They are 180-page scanned PDF documents where each PDF page contains 2 notebook pages. My pricing tier is S0 and I have automatic scaling enabled.

Rate Limiting Implementation:

  • Gradual TPS ramp-up from 3 to 12 TPS over 30 seconds
  • Exponential backoff retry policy for 429 errors (up to 5 retries)
  • Separate rate limiting for analyze calls vs. polling GET requests
  • 2-second polling intervals with rate limit checks
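
For reference, a simplified sketch of the kind of 429 backoff wrapper described above (the function name and defaults here are illustrative, not my exact code):

    import random
    import time
    from azure.core.exceptions import HttpResponseError

    def call_with_backoff(func, max_retries=5, base_delay=2.0):
        # Retry a callable on 429 (throttling) with exponential backoff plus
        # jitter; any other error is re-raised immediately.
        for attempt in range(max_retries + 1):
            try:
                return func()
            except HttpResponseError as e:
                if e.status_code != 429 or attempt == max_retries:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))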

Questions:

1)Is the way the extraction is being done the issue?

2)What is the maximum processing time limit for the prebuilt-read model before automatic termination?

  1. When poller.done returns true, does this guarantee processing is complete?

4)What is the recommended polling interval for documents with 100+ scanned pages?

5)Should I implement a maximum polling duration, and if so, what timeout is appropriate for 120-page documents?

6)Are there known issues with scanned PDFs that have dual-page layouts (2 physical pages per PDF page)?

7)Is there a recommended maximum file size or page count per API call?

8)For 150+-page scanned PDFs, what is the recommended approach:

  • Single API call with extended timeout?
  • Split document into smaller chunks?
  • Use batch processing API (if available)?

9)Could aggressive polling (every 2 seconds) cause the service to deprioritize or terminate long-running operations?

  1. *How can I figure out where the issue is may it be service-side, memory limit, processing error, etc?

Would appreciate answers on any of the following questions or any other recommendations outside of this. Please let me know if more information is needed. Thank you in advance!

Azure AI Document Intelligence

1 answer

  1. Manas Mohanty 11,690 Reputation points Microsoft External Staff Moderator
    2025-10-22T17:02:45.4666667+00:00

    Hi Siddhant Kumta

    Thank you for sharing the context on your use case. That is a lot of questions for one thread, though.

    Here are answers to your queries.

    1. “Is the way the extraction is being done the issue?”

    Possibly. For large, handwritten, scanned lab notebooks:

    • Ensure the documents meet input requirements: file size ≤ 500 MB (S0), dimensions 50×50 to 10,000×10,000 px, text height ≥ ~12 px at 1024×768 (≈ 8‑pt at 150 dpi). Handwritten text is supported by Read/Layout. [free.blessedness.top]
    • Pre‑processing helps:
      • Deskew/denoise scans, increase DPI (≥ 300 dpi recommended for small handwriting), remove borders, normalize contrast.
      • Split very large PDFs into logical batches (e.g., 100–150 pages each) to avoid long single operations; feed page ranges to the API.
      • Remove blank pages when they cause truncation (see Q10 reference). [free.blessedness.top]
      • For unstructured notebook layouts, try prebuilt-read for pure OCR, or prebuilt-layout if you need block/paragraph structure. The Read model runs at higher resolution than Vision Read and is optimized for dense text. [free.blessedness.top]
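
    As an illustration of the deskew/denoise step, here is a minimal sketch assuming OpenCV and NumPy are available; thresholds and the angle convention may need tuning for your scans and OpenCV version:

    import cv2
    import numpy as np

    def preprocess_scan(image_path, output_path):
        # Grayscale -> denoise -> Otsu binarize -> estimate skew from the
        # text mask -> rotate the page upright before OCR.
        gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
        gray = cv2.fastNlMeansDenoising(gray, h=10)
        _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        coords = np.column_stack(np.where(mask > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle > 45:  # OpenCV angle conventions vary by version; adjust if needed
            angle -= 90
        h, w = gray.shape
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        deskewed = cv2.warpAffine(gray, m, (w, h), flags=cv2.INTER_CUBIC,
                                  borderMode=cv2.BORDER_REPLICATE)
        cv2.imwrite(output_path, deskewed)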

    2 & 3. “Maximum processing time limit before automatic termination? Does poller.done guarantee processing is complete?”
    • Azure’s SDK long‑running operation (LRO) poller completes when the server marks the operation finished (success or failure). So poller.done() means the operation has ended and results are ready (or an error is returned).
    • There isn’t a published hard “max time” per operation in docs. Observed timeouts with very large files are usually client‑side timeouts (SDK default timeouts, network) or service-side throttling/limits; a GitHub issue notes timeouts for ~390+ pages in older API versions when the client timeout wasn’t increased. Raise client timeouts appropriately. [github.com]

    Recommendations

    • Increase SDK client request timeout (e.g., azure.core.pipeline.policies or per call kwargs).
    • Implement overall operation timeout on your side (e.g., 20–40 minutes for 150–200 pages), and fallback to chunked processing if exceeded.
    • Log and handle HttpResponseError distinctly from 429/5xx.
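
    A minimal sketch of that error triage (the helper name is illustrative):

    from azure.core.exceptions import HttpResponseError, ServiceRequestError

    def classify_analyze_error(exc):
        # Distinguish throttling, service-side faults, and input problems when
        # begin_analyze_document / poller.result() raises.
        if isinstance(exc, HttpResponseError):
            if exc.status_code == 429:
                return "throttled"        # back off and retry
            if exc.status_code and exc.status_code >= 500:
                return "service-error"    # retry with backoff; check Azure status
            return "bad-request"          # 4xx: fix input (size, pages, format)
        if isinstance(exc, ServiceRequestError):
            return "network-or-timeout"   # the request never got a response
        return "unknown"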

    4. “Recommended polling interval for 100+ scanned pages?”
    • Start at 5–10 seconds between polls for large jobs. The service supports asynchronous analysis; too‑aggressive polling yields little benefit and adds load.
    • Use SDK defaults where possible—SDKs already implement efficient polling with exponential backoff patterns.

    5. “Should I implement a maximum polling duration? What timeout for 120 pages?”

    Yes. Implement an overall timeout guard to keep workers healthy. Typical ranges:

    • Text‑heavy scans (120 pages): try 15–25 minutes overall timeout; if exceeded, re‑submit by page ranges (e.g., 1–60, 61–120).
    • Record per‑page throughput metrics to refine this for your corpus (handwriting, noise level, DPI).
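
    A tiny sketch of that per-page throughput metric (helper name illustrative), using the elapsed time and the AnalyzeResult of a finished batch:

    import time

    def log_throughput(pages_range, result, start_time):
        # result.pages has one entry per processed page; seconds-per-page measured
        # on your own corpus is the best basis for choosing overall timeouts.
        elapsed = time.time() - start_time
        processed = len(result.pages)
        per_page = elapsed / processed if processed else float("inf")
        print(f"Range {pages_range}: {processed} pages in {elapsed:.0f}s ({per_page:.1f}s/page)")
        return per_page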

    6. “Known issues with dual‑page layouts (2 physical pages per PDF page)?”
    • Dual‑page scans aren’t an explicit limitation, but they can reduce effective text height and introduce layout artifacts (fold shadows, notebook binding). If small handwriting is under the minimum text height, OCR can drop content. Preprocess to increase DPI, crop to single pages, or split the PDF into single‑page images. [free.blessedness.top]
    • Also watch for blank intermediary pages causing truncation in some scenarios; if observed, skip or remove them. [free.blessedness.top]
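
    If you decide to split the dual-page scans, here is a minimal sketch assuming PyMuPDF (fitz) is available and the two notebook pages sit side by side on each sheet:

    import fitz  # PyMuPDF

    def split_dual_pages(pdf_path, out_path, dpi=300):
        # Render each scanned sheet at higher DPI, then write the left and
        # right halves as separate pages so each notebook page stands alone.
        src, out = fitz.open(pdf_path), fitz.open()
        zoom = dpi / 72.0
        for page in src:
            r = page.rect
            mid = r.x0 + r.width / 2
            for clip in (fitz.Rect(r.x0, r.y0, mid, r.y1), fitz.Rect(mid, r.y0, r.x1, r.y1)):
                pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
                new_page = out.new_page(width=clip.width * zoom, height=clip.height * zoom)
                new_page.insert_image(new_page.rect, pixmap=pix)
        out.save(out_path)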

    7. “Recommended maximum file size or page count per API call?”
    • S0 limits: 500 MB per file and up to 2,000 pages per PDF/TIFF; the free tier processes only the first 2 pages. If you are near these limits, prefer chunking with page ranges. [free.blessedness.top]

    8. “Best approach for 150+ page scanned PDFs?”
    • Prefer page-range batching (e.g., 1–100, 101–200, …) over a single call with an extended timeout, and parallelize the batches prudently within your rate limits; see the example code further down.
    9. “Could aggressive polling (every 2 seconds) cause deprioritization or termination?”
    • The service won’t “penalize” you for short polls, but over‑polling increases client/server load and offers no speed gains; it can contribute to 429s or network contention. Prefer 5–10 s polling and backoff on 429s (you already do this).

    1. "How to identify Server error from out of memory or size operation"

    If the smaller pdfs are also throwing 500 internal server error, there is outage or service health issue, you can verify against Azure status for verification.

    If the smaller quality pdf parses and bigger files fail, then it might be out of memory issue.
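
    One way to narrow this down is to probe the same document with a tiny page range first. A sketch (the document payload is passed positionally because its keyword name differs across SDK versions):

    def probe_small_range(client, pdf_path, model_id="prebuilt-read"):
        # If even a 2-page request fails with 5xx, suspect a service/health
        # issue; if it succeeds while full-document runs fail, suspect size,
        # memory, or timeout limits on the larger requests.
        with open(pdf_path, "rb") as f:
            poller = client.begin_analyze_document(
                model_id, f, pages="1-2", content_type="application/octet-stream"
            )
            return poller.result()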

    Practical pipeline recommendations

    1. Pre‑processing
      • Deskew/denoise, binarize, increase DPI when handwriting is tiny.
      • Detect & drop blank pages proactively to avoid truncation. [free.blessedness.top]
      • Optionally split dual‑page scans into separate pages.
    2. Analysis model choice
      • Use prebuilt-read for highest OCR coverage of handwriting; if you need structure, also try prebuilt-layout and merge outputs. [free.blessedness.top]
    3. Batching & retries
      • Submit page ranges: e.g., pages="1-100", "101-200", …
      • Retries with exponential backoff for 429/5xx; respect rate limits across analyze vs. polling calls (you already separate these—good).
    4. Polling & timeouts
      • Poll every 5–10 s; set client request timeout to a generous value.
      • Add an overall operation timeout per batch and fall back to smaller ranges if exceeded. [github.com]
    5. Validation & diagnostics
      • Inspect result.pages, content, paragraphs/lines/words to find where content stops.
      • Log page counts processed vs. input; if a blank page triggers early stop, re‑run with pages skipping it. [free.blessedness.top]
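
    A small sketch of that validation step (helper name illustrative), printing per-page line/word counts from an AnalyzeResult so you can see where extraction thins out or stops:

    def report_page_coverage(result):
        # Sparse or missing pages show up as low line/word counts here.
        for page in result.pages:
            lines = len(page.lines or [])
            words = len(page.words or [])
            print(f"page {page.page_number}: {lines} lines, {words} words")
        print(f"total characters in result.content: {len(result.content or '')}")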

    Example: Python (v4) with page‑range batching & robust polling

    from io import BytesIO
    import time

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential
    from azure.core.exceptions import HttpResponseError

    ENDPOINT = "<your_endpoint>"
    KEY = "<your_key>"

    client = DocumentIntelligenceClient(ENDPOINT, AzureKeyCredential(KEY))

    def analyze_pdf_in_batches(pdf_path, batch_ranges, model_id="prebuilt-read",
                               poll_interval_sec=8, overall_timeout_sec=1800):
        """Analyze a large scanned PDF in page-range batches, enforcing an overall timeout per batch."""
        results = []
        with open(pdf_path, "rb") as f:
            pdf_bytes = f.read()

        for pages in batch_ranges:
            start = time.time()
            try:
                # The document payload is passed positionally because its keyword
                # name differs across SDK versions; `pages` restricts the analysis
                # to a range such as "1-60".
                poller = client.begin_analyze_document(
                    model_id,
                    BytesIO(pdf_bytes),
                    pages=pages,
                    content_type="application/octet-stream",
                )

                # Check completion on a fixed interval and enforce an overall
                # wall-clock timeout for this batch.
                while not poller.done():
                    if time.time() - start > overall_timeout_sec:
                        raise TimeoutError(f"Batch {pages} exceeded {overall_timeout_sec}s")
                    time.sleep(poll_interval_sec)

                result = poller.result()
                results.append((pages, result))
                print(f"Completed pages {pages}: {len(result.pages)} pages processed")

            except (HttpResponseError, TimeoutError) as e:
                print(f"Batch {pages} failed: {e}")
                # Optional: fall back to smaller ranges, or re-queue the batch

        return results

    # Example usage: split a 180-page notebook into 3 batches
    batches = ["1-60", "61-120", "121-180"]
    _ = analyze_pdf_in_batches("lab_notebook_scan.pdf", batches,
                               model_id="prebuilt-read",
                               poll_interval_sec=8, overall_timeout_sec=1800)
    
    

    Notes:

    • Use pages to control ranges; adjust poll_interval_sec and overall_timeout_sec to your corpus.
    • If some batches return fewer pages than expected, investigate blank pages or dimension/text-height constraints; re-run with preprocessing or skip pages as needed. [free.blessedness.top], [free.blessedness.top]

    Summary

    • Service limits: On paid S0, PDFs/TIFFs up to 2,000 pages and 500 MB are supported; free tier processes only first 2 pages. [free.blessedness.top]
    • Handwriting & scans: Use prebuilt-read (or layout) with appropriate input constraints; very large, noisy scans benefit from pre‑processing (deskew, denoise, split). [free.blessedness.top]
    • Polling: Use SDK pollers; poller.done() means the operation has completed (success or failure). Poll 5–10 s for 100+ pages, with overall timeout guards in your app.
    • Chunking strategy: For 150+ page scanned PDFs, prefer page-range batching (e.g., 1–100, 101–200, …) and parallelize prudently.
    • Dual-page scans: Not a problem per se, but layout irregularities, very small text, or blank pages can cause partial extraction—preprocess or skip blank pages if observed. [free.blessedness.top]
    • Diagnose missing content: Compare pages, content, lines/words in JSON across page ranges; raise SDK/client timeouts; check size/dimensions constraints. [free.blessedness.top]

    Please accept this answer, or let me know if you have a follow-up question on it.

    Thank you


