Thank you for sharing the context on your use case. That is a lot of questions for one thread, so here are answers to each of them.
- “Is the way the extraction is being done the issue?”
Possibly. For large, handwritten, scanned lab notebooks:
- Ensure the documents meet input requirements: file size ≤ 500 MB (S0), dimensions 50×50 to 10,000×10,000 px, text height ≥ ~12 px at 1024×768 (≈ 8‑pt at 150 dpi). Handwritten text is supported by the Read and Layout models. [free.blessedness.top]
- Pre‑processing helps:
- Deskew/denoise scans, increase DPI (≥ 300 dpi recommended for small handwriting), remove borders, normalize contrast.
- Split very large PDFs into logical batches (e.g., 100–150 pages each) to avoid long single operations, and feed page ranges to the API; a splitting sketch follows this list.
- Remove blank pages when they cause truncation (see Q10 reference). [free.blessedness.top]
- For unstructured notebook layouts, try prebuilt-read for pure OCR or layout if you need block/paragraph structure. The Read model runs at higher resolution than Vision Read and is optimized for dense text. [free.blessedness.top]
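For the splitting step, here is a minimal sketch; pypdf and the 120-page batch size are my assumptions, and any PDF library with page-level access works the same way:

```python
# Split a large scanned PDF into fixed-size batches before analysis.
# Sketch assumes pypdf (pip install pypdf); swap in your preferred library.
from pypdf import PdfReader, PdfWriter

def split_pdf(src_path, pages_per_batch=120):
    reader = PdfReader(src_path)
    out_paths = []
    for start in range(0, len(reader.pages), pages_per_batch):
        writer = PdfWriter()
        end = min(start + pages_per_batch, len(reader.pages))
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        out_path = f"{src_path}.part{start // pages_per_batch + 1}.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        out_paths.append(out_path)
    return out_paths
```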
- “Maximum processing time limit before automatic termination? Does `poller.done()` guarantee processing is complete?”
- Azure’s SDK long‑running operation (LRO) poller completes when the server marks the operation finished (success or failure). So `poller.done()` means the operation has ended and results are ready (or an error is returned).
- There isn’t a published hard “max time” per operation in the docs. Observed timeouts with very large files are usually client‑side (SDK default timeouts, network) or service‑side throttling/limits; a GitHub issue notes timeouts for ~390+ pages in older API versions when the client timeout wasn’t increased. Raise client timeouts appropriately. [github.com]
Recommendations
- Increase the SDK client request timeout (via `azure.core` transport keywords or per‑call kwargs); a sketch follows this list.
- Implement an overall operation timeout on your side (e.g., 20–40 minutes for 150–200 pages), and fall back to chunked processing if it is exceeded.
- Log and handle `HttpResponseError` by status: treat 429/5xx as transient and retry; treat other errors as client‑side problems.
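A minimal sketch of both ideas, assuming the v4 Python SDK; `connection_timeout` and `read_timeout` are standard azure-core transport keywords, and the specific values are assumptions to tune:

```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError

client = DocumentIntelligenceClient("<your_endpoint>", AzureKeyCredential("<your_key>"))

with open("lab_notebook_scan.pdf", "rb") as f:
    pdf_bytes = f.read()

try:
    poller = client.begin_analyze_document(
        "prebuilt-read",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        pages="1-60",
        connection_timeout=60,  # seconds to establish the connection
        read_timeout=600,       # seconds to wait on a response
    )
    result = poller.result()
except HttpResponseError as e:
    if e.status_code == 429 or (e.status_code or 0) >= 500:
        ...  # transient: back off and retry
    else:
        raise  # bad input, auth, or limits: retrying won't help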
- “Recommended polling interval for 100+ scanned pages?”
- Start at 5–10 seconds between polls for large jobs. The service supports asynchronous analysis; too‑aggressive polling yields little benefit and adds load.
- Use SDK defaults where possible; the SDKs already implement efficient polling with backoff patterns. You can also raise the default interval for large jobs, as in the sketch below.
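For example, reusing the client and `pdf_bytes` from the earlier sketch (`polling_interval` is the standard azure-core LRO keyword, to the best of my knowledge):

```python
# Let the SDK poller do the waiting, but poll less often for large jobs.
poller = client.begin_analyze_document(
    "prebuilt-read",
    AnalyzeDocumentRequest(bytes_source=pdf_bytes),
    pages="1-120",
    polling_interval=10,  # seconds between polls when no Retry-After header is sent
)
result = poller.result()  # blocks until the operation completes
```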
- “Should I implement a maximum polling duration? What timeout for 120 pages?”
Yes. Implement an overall timeout guard to keep workers healthy. Typical ranges:
- Text‑heavy scans (120 pages): try 15–25 minutes overall timeout; if exceeded, re‑submit by page ranges (e.g., 1–60, 61–120).
- Record per‑page throughput metrics to refine this for your corpus (handwriting, noise level, DPI); see the timing sketch below.
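A minimal way to capture that throughput, given a poller from one submitted batch:

```python
import time

start = time.time()
result = poller.result()  # wait for one submitted batch to finish
elapsed = time.time() - start
per_page = elapsed / max(len(result.pages), 1)
# Persist these per batch to calibrate overall timeouts for your corpus
print(f"{len(result.pages)} pages in {elapsed:.0f}s ({per_page:.1f} s/page)")
```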
- “Known issues with dual‑page layouts (2 physical pages per PDF page)?”
- Dual‑page scans aren’t an explicit limitation, but they can reduce effective text height and introduce layout artifacts (fold shadows, notebook binding). If small handwriting is under the minimum text height, OCR can drop content. Preprocess to increase DPI, crop to single pages, or split the PDF into single‑page images (a cropping sketch follows). [free.blessedness.top]
- Also watch for blank intermediary pages causing truncation in some scenarios; if observed, skip or remove them. [free.blessedness.top]
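A minimal cropping sketch, assuming Pillow and a vertical fold at the center of the image; adjust the split point if the binding is offset:

```python
from PIL import Image

def split_dual_page(image_path):
    """Crop a two-page spread into left/right halves at the center line."""
    img = Image.open(image_path)
    w, h = img.size
    left = img.crop((0, 0, w // 2, h))   # box is (left, top, right, bottom)
    right = img.crop((w // 2, 0, w, h))
    return left, right

# left, right = split_dual_page("spread_017.png"); left.save("p017a.png")
```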
- “Recommended maximum file size or page count per API call?”
- S0 limits: 500 MB per file and up to 2,000 pages per PDF/TIFF. (Free tier only first 2 pages.) If near limits, prefer chunking (page ranges). [free.blessedness.top]
- “Best approach for 150+ page scanned PDFs?”
- Batch by page ranges (e.g., 75–150‑page batches) and submit parallel jobs within your S0 throughput envelope; a parallel‑submission sketch follows this list. This improves resiliency and simplifies retries.
- A single call with an extended timeout can work but is riskier: one failure loses the whole job, and partial misses are harder to diagnose.
- Consider the batch analysis API if you have many files: https://free.blessedness.top/en-us/azure/ai-services/document-intelligence/prebuilt/batch-analysis?view=doc-intel-4.0.0#batch-analysis-limits. It supports up to 10,000 files per request; refer to the service limits documentation for file and other limits.
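A parallel‑submission sketch, reusing the client and `pdf_bytes` from the earlier snippet; the cap of 3 workers is an assumption to tune against your tier's throughput limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze_range(pages):
    poller = client.begin_analyze_document(
        "prebuilt-read",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        pages=pages,
    )
    return pages, poller.result()

batches = ["1-75", "76-150", "151-180"]
with ThreadPoolExecutor(max_workers=3) as pool:  # cap concurrency to respect S0 throttling
    futures = {pool.submit(analyze_range, r): r for r in batches}
    for fut in as_completed(futures):
        pages, result = fut.result()
        print(f"Pages {pages}: {len(result.pages)} extracted")
```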
- “Could aggressive polling (every 2 seconds) cause deprioritization or termination?”
- The service won’t “penalize” you for short polls, but over‑polling increases client/server load and offers no speed gains; it can contribute to 429s or network contention. Prefer 5–10 s polling and backoff on 429s (you already do this).
- "How to identify Server error from out of memory or size operation"
If the smaller pdfs are also throwing 500 internal server error, there is outage or service health issue, you can verify against Azure status for verification.
If the smaller quality pdf parses and bigger files fail, then it might be out of memory issue.
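A heuristic probe sketch, not a rule: submit just the first two pages; if even that fails with 5xx, suspect service health; if it succeeds but the full file fails, suspect size or complexity:

```python
def probe(pdf_bytes):
    """Analyze only pages 1-2 to separate service issues from size issues."""
    try:
        poller = client.begin_analyze_document(
            "prebuilt-read",
            AnalyzeDocumentRequest(bytes_source=pdf_bytes),
            pages="1-2",
        )
        poller.result()
        return "probe-ok"  # small request works: full-file failure is likely size-related
    except HttpResponseError as e:
        return f"probe-failed: {e.status_code}"  # also check the Azure status page
```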
Practical pipeline recommendations
- Pre‑processing
- Deskew/denoise, binarize, increase DPI when handwriting is tiny.
- Detect & drop blank pages proactively to avoid truncation. [free.blessedness.top]
- Optionally split dual‑page scans into separate pages.
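A blank‑page detection sketch for scanned PDFs (which usually have no text layer): render each page and measure ink coverage. PyMuPDF and the 0.5% threshold are my assumptions; calibrate on your scans:

```python
import fitz  # PyMuPDF (pip install pymupdf)

def blank_pages(pdf_path, ink_threshold=0.005):
    """Return 1-based page numbers whose rendered ink coverage is near zero."""
    blanks = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            # Render at quarter scale in grayscale; enough to judge blankness
            pix = page.get_pixmap(matrix=fitz.Matrix(0.25, 0.25), colorspace=fitz.csGRAY)
            dark = sum(1 for v in pix.samples if v < 128)
            if dark / len(pix.samples) < ink_threshold:
                blanks.append(number)
    return blanks  # pages to exclude via the `pages` parameter
```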
- Analysis model choice
- Use prebuilt-read for highest OCR coverage of handwriting; if you need structure, also try prebuilt-layout and merge outputs. [free.blessedness.top]
- Batching & retries
- Submit page ranges: e.g., `pages="1-100"`, `"101-200"`, …
- Retry with exponential backoff for 429/5xx (sketch below); respect rate limits across analyze vs. polling calls (you already separate these, which is good).
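A backoff wrapper sketch; the attempt counts and delays are assumptions, and it honors a Retry-After header when the service sends one:

```python
import time
from azure.core.exceptions import HttpResponseError

def with_backoff(submit_fn, max_attempts=5, base_delay=2.0):
    """Call submit_fn, retrying transient failures (429/5xx) with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return submit_fn()
        except HttpResponseError as e:
            transient = e.status_code == 429 or (e.status_code or 0) >= 500
            if not transient or attempt == max_attempts - 1:
                raise
            retry_after = e.response.headers.get("Retry-After") if e.response else None
            delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
            time.sleep(delay)
```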
- Polling & timeouts
- Poll every 5–10 s; set client request timeout to a generous value.
- Add an overall operation timeout per batch and fall back to smaller ranges if exceeded. [github.com]
- Validation & diagnostics
- Inspect `result.pages`, `content`, and `paragraphs`/`lines`/`words` to find where content stops (see the coverage sketch below).
- Log page counts processed vs. input; if a blank page triggers an early stop, re‑run with `pages` skipping it. [free.blessedness.top]
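A small coverage sketch over the v4 result object's `pages`/`words` fields:

```python
def page_coverage(result, expected_page_numbers):
    """Print per-page word counts to locate where extraction stops."""
    seen = {p.page_number: len(p.words or []) for p in result.pages}
    for n in expected_page_numbers:
        count = seen.get(n)
        if count is None:
            print(f"page {n}: missing from results")
        elif count == 0:
            print(f"page {n}: present but no words (blank, or OCR dropped it)")
        else:
            print(f"page {n}: {count} words")

# e.g., page_coverage(result, range(1, 61)) after analyzing pages "1-60"
```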
Example: Python (v4) with page‑range batching & robust polling
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
import time

ENDPOINT = "<your_endpoint>"
KEY = "<your_key>"

client = DocumentIntelligenceClient(ENDPOINT, AzureKeyCredential(KEY))

def analyze_pdf_in_batches(pdf_path, batch_ranges, model_id="prebuilt-read",
                           poll_interval_sec=8, overall_timeout_sec=1800):
    results = []
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    for pages in batch_ranges:
        start = time.time()
        try:
            # The whole file is uploaded each time; the service analyzes
            # only the pages in the requested range.
            poller = client.begin_analyze_document(
                model_id,
                AnalyzeDocumentRequest(bytes_source=pdf_bytes),
                pages=pages,        # e.g., "1-100"
                read_timeout=600,   # azure-core transport read timeout (seconds)
            )
            # Manual polling loop to enforce an overall timeout
            while not poller.done():
                if time.time() - start > overall_timeout_sec:
                    # This abandons the client-side wait; the service-side
                    # operation itself is not cancelled.
                    raise TimeoutError(f"Batch {pages} exceeded {overall_timeout_sec}s")
                time.sleep(poll_interval_sec)
            result = poller.result()
            results.append((pages, result))
            print(f"Completed pages {pages}: {len(result.pages)} pages processed")
        except (HttpResponseError, TimeoutError) as e:
            print(f"Batch {pages} failed: {e}")
            # Optional: fall back to smaller ranges, or re-queue
    return results

# Example usage: split a 180-page notebook into 3 batches
batches = ["1-60", "61-120", "121-180"]
_ = analyze_pdf_in_batches("lab_notebook_scan.pdf", batches, model_id="prebuilt-read",
                           poll_interval_sec=8, overall_timeout_sec=1800)
```
Notes:
- Use `pages` to control ranges; adjust `read_timeout` and `poll_interval_sec` for your corpus.
- If some batches return fewer pages than expected, investigate blank pages or dimension/text‑height constraints; re‑run with preprocessing or skip pages as needed. [free.blessedness.top], [free.blessedness.top]
References
- Read model OCR / input requirements / limits (v4): Read model OCR data extraction
- FAQ & capabilities: Document Intelligence FAQ (v4)
- Blank page truncation scenario: DI not processing all pages with blank page
- Free tier only 2 pages: Form Recognizer just read page 1–2
- Large-file timeouts (client-side): GitHub issue – begin_analyze_document timeout on large files
Summary
- Service limits: On paid S0, PDFs/TIFFs up to 2,000 pages and 500 MB are supported; free tier processes only first 2 pages. [free.blessedness.top]
- Handwriting & scans: Use prebuilt-read (or layout) with appropriate input constraints; very large, noisy scans benefit from pre‑processing (deskew, denoise, split). [free.blessedness.top]
- Polling: Use SDK pollers; `poller.done()` means the operation has completed (success or failure). Poll every 5–10 s for 100+ pages, with overall timeout guards in your app.
- Chunking strategy: For 150+ page scanned PDFs, prefer page‑range batching (e.g., 1–100, 101–200, …) and parallelize prudently.
- Dual-page scans: Not a problem per se, but layout irregularities, very small text, or blank pages can cause partial extraction—preprocess or skip blank pages if observed. [free.blessedness.top]
- Diagnose missing content: Compare `pages`, `content`, and `lines`/`words` in the JSON across page ranges; raise SDK/client timeouts; check size/dimension constraints. [free.blessedness.top]
Please accept this answer, or let me know if you have a follow‑up query on it.
Thank you