Hello Riddhi Patel,
This is a common need when working with scanned PDFs that contain both text and visual content. You are absolutely right that traditional OCR tools like Tesseract are limited to text extraction and don't handle visual segmentation of diagrams or images.
Thanks to our community member Jerald Felix for the prompt response.
Azure AI Document Intelligence provides a Layout model (prebuilt-layout) that detects both text regions and figures, so it can serve as a workaround for you:
Step 1: Use Document Intelligence Layout Analysis
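from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="your-document-intelligence-endpoint",
    credential=AzureKeyCredential("your-key"),
)

# Analyze with the layout model to get text regions and figures.
# Assumes azure-ai-documentintelligence 1.0.0 or later, where the document is
# passed as the second argument; earlier previews used other parameter names.
poller = client.begin_analyze_document(
    "prebuilt-layout",
    AnalyzeDocumentRequest(bytes_source=your_pdf_bytes),
    output_content_format="text",
    output=["figures"],  # Ask the service to generate cropped figure images
)
result = poller.result()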
Key Capabilities:
- Detects text regions with precise bounding boxes
- Identifies figure/image areas automatically
- Provides coordinates for each detected element
- Extracts images as separate files (available via UI and API)
- Supports scanned PDFs with high accuracy
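As a rough illustration of how the returned coordinates can be used, the sketch below converts a polygon into a pixel crop box. It assumes the polygon is a flat list of x, y values ([x1, y1, x2, y2, ...]) expressed in inches, which is how the layout model reports PDF pages; `polygon_to_pixel_box` and the `dpi` value are hypothetical names chosen for this example, and the DPI should match whatever resolution you render the page at.

```python
import math
from typing import Sequence, Tuple


def polygon_to_pixel_box(polygon: Sequence[float], dpi: int = 200) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) in pixels for a flat [x1, y1, x2, y2, ...] polygon.

    Assumes the polygon coordinates are in inches (the unit reported for PDF pages)
    and that the page image was rendered at `dpi` dots per inch.
    """
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    # Round outward so the crop never cuts into the detected region.
    left = math.floor(min(xs) * dpi)
    top = math.floor(min(ys) * dpi)
    right = math.ceil(max(xs) * dpi)
    bottom = math.ceil(max(ys) * dpi)
    return (left, top, right, bottom)


# Example: a rectangle from (1.0in, 2.0in) to (3.0in, 4.0in) on a 200 DPI render.
box = polygon_to_pixel_box([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 1.0, 4.0], dpi=200)
print(box)
```

The resulting tuple follows the (left, top, right, bottom) order that most imaging libraries expect for a crop operation. If your input is an image rather than a PDF, the service reports pixels already, so no scaling is needed.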
Step 2: Process Detected Regions
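# Extract text regions
for page in result.pages:
    for line in page.lines or []:
        print(f"Text: {line.content}")
        print(f"Bounding box: {line.polygon}")

# Extract figure/image regions (result.figures can be None if none are detected)
for figure in result.figures or []:
    caption = figure.caption.content if figure.caption else None
    print(f"Figure caption: {caption}")
    for region in figure.bounding_regions or []:
        print(f"Page {region.page_number}: {region.polygon}")
    # The cropped figure image is not part of this result; it is downloaded separately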
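Once you have both kinds of regions, one possible way to separate body text from text that belongs to a diagram is to check whether each line's center falls inside a figure's bounding box. This is only a sketch under assumptions: polygons are flat [x1, y1, x2, y2, ...] lists, the lines and figures passed in come from the same page, and `split_lines_by_figures` is a hypothetical helper, not part of the SDK.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def bounding_box(polygon: Sequence[float]) -> Box:
    """Axis-aligned box around a flat [x1, y1, x2, y2, ...] polygon."""
    xs, ys = polygon[0::2], polygon[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def center_inside(inner: Box, outer: Box) -> bool:
    """True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def split_lines_by_figures(
    lines: Sequence[Tuple[str, Sequence[float]]],
    figure_polygons: Sequence[Sequence[float]],
) -> Tuple[List[str], List[str]]:
    """Split (content, polygon) lines into body text and text located inside a figure."""
    figure_boxes = [bounding_box(p) for p in figure_polygons]
    body: List[str] = []
    in_figure: List[str] = []
    for content, polygon in lines:
        box = bounding_box(polygon)
        if any(center_inside(box, fig) for fig in figure_boxes):
            in_figure.append(content)
        else:
            body.append(content)
    return body, in_figure


# Example with made-up coordinates: one paragraph line and one label inside a figure.
sample_lines = [
    ("Intro paragraph", [1.0, 1.0, 5.0, 1.0, 5.0, 1.5, 1.0, 1.5]),
    ("Axis label", [2.0, 4.0, 3.0, 4.0, 3.0, 4.2, 2.0, 4.2]),
]
sample_figures = [[1.5, 3.0, 6.0, 3.0, 6.0, 7.0, 1.5, 7.0]]
body_text, figure_text = split_lines_by_figures(sample_lines, sample_figures)
```

With the real result, you would build `lines` from `(line.content, line.polygon)` for one page and `figure_polygons` from the `polygon` of each bounding region on that same page.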
For reference:
If this helps, please accept the answer and upvote it, so that other Q&A community members with a similar issue can find it. 😊