How to differentiate text and diagrams/images from a scanned PDF and crop the images?

Riddhi Patel 30 Reputation points
2025-07-06T06:38:51.5133333+00:00

I'm working with scanned PDF files and want to process them using computer vision techniques to:

Differentiate between text and diagrams/images.

Accurately detect and crop the images/diagrams from the PDF.

Optionally, keep track of the position of each image so that I can later replace it in the text with a reference or URL.

I want to do this using vision-based approaches, not just OCR like Tesseract (which only gives me the text). Are there any proven methods, models, or open-source tools (in Python or any language) that can help identify and extract visual (non-text) elements from a scanned PDF?

Any insights or code samples would be really helpful!

Computer Vision
An Azure artificial intelligence service that analyzes content in images and video.

4 answers

Sort by: Most helpful
  1. Nikhil Jha (Accenture International Limited) 2,220 Reputation points Microsoft External Staff Moderator
    2025-09-05T10:00:25.34+00:00

    Hello Riddhi Patel,

    This is a common need when working with scanned PDFs that contain both text and visual content. You are absolutely right that traditional OCR tools like Tesseract are limited to text extraction and don't handle visual segmentation of diagrams or images.
    Thanks to our community member Jerald Felix for the prompt response.

    Azure provides the Document Intelligence Layout model, which can do exactly this for you:

    Step 1: Use Document Intelligence Layout Analysis

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint="your-document-intelligence-endpoint",
        credential=AzureKeyCredential("your-key")
    )

    # Analyze with the layout model to get text regions and figures
    with open("your_scanned.pdf", "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            body=f,  # the current azure-ai-documentintelligence SDK takes `body=`, not `document=`
            output_content_format="text",
            output=["figures"]  # also produce cropped figure images
        )
    result = poller.result()

    Key Capabilities:

    • Detects text regions with precise bounding boxes
    • Identifies figure/image areas automatically
    • Provides coordinates for each detected element
    • Extracts images as separate files (available via UI and API)
    • Supports scanned PDFs with high accuracy

    Step 2: Process Detected Regions

    # Extract text regions
    for page in result.pages:
        for line in page.lines:
            print(f"Text: {line.content}")
            print(f"Bounding box: {line.polygon}")

    # Extract figure/image regions
    for figure in result.figures or []:
        if figure.caption:
            print(f"Figure caption: {figure.caption.content}")
        print(f"Bounding regions: {figure.bounding_regions}")
    # The cropped figure image itself can be downloaded separately
    # (via the get-figure API) when output=["figures"] was requested

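    One practical detail if you want to crop the detected figures yourself from a rendered page image: for PDF input, the Layout model reports `polygon` coordinates in inches, so they must be scaled to pixels at whatever DPI you render the page (e.g., with PyMuPDF). A minimal sketch; the helper name and the `pad` parameter are my own, not part of the SDK:

```python
def polygon_to_pixel_box(polygon, dpi=200, pad=0):
    """Convert a flat [x1, y1, x2, y2, ...] polygon in inches
    (as the Layout model returns for PDF input) into a
    (left, top, right, bottom) pixel box at the given DPI,
    optionally padded by `pad` pixels on each side."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    left = max(int(min(xs) * dpi) - pad, 0)
    top = max(int(min(ys) * dpi) - pad, 0)
    right = int(max(xs) * dpi) + pad
    bottom = int(max(ys) * dpi) + pad
    return (left, top, right, bottom)

# Example: a 1"x1" figure polygon starting at (2", 3"), page rendered at 100 DPI
box = polygon_to_pixel_box([2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 2.0, 4.0], dpi=100)
```

    The resulting box can be passed directly to `PIL.Image.crop`, as long as the page was rendered at the same DPI used for the conversion.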
    For reference:

    1. Azure AI Document Intelligence
    2. Q&A thread

    If this answer resolved your issue, please accept it and upvote, so other community members with a similar issue can find it. 😊

    1 person found this answer helpful.

  2. Jerald Felix 7,910 Reputation points
    2025-07-06T14:09:57.7+00:00

    Hi there,

    Great question! Differentiating text from diagrams/images in scanned PDFs with Azure AI Document Intelligence (formerly Form Recognizer) depends on the approach you take.

    🧠 Option 1: Layout Model (Prebuilt)

    The Layout model can analyze scanned PDFs and images to extract:

    Text (lines, words, tables)

    Bounding box coordinates for each line or word

    Information about selection marks and reading order

    However, it does not directly tag images or diagrams. You can still infer non-text regions (i.e., diagrams/images) by detecting areas where no text was recognized, for example large gaps between the extracted text bounding boxes.
    👉 Use this API:

    https://<endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze
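    To make the "infer non-text regions" idea concrete, here is a minimal sketch in plain Python (no Azure dependency). It takes the text bounding boxes from the layout output, already converted to pixels, and finds horizontal bands of the page with no recognized text; those bands are candidate diagram/image zones. The function name and the `min_height` threshold are illustrative, not part of any SDK:

```python
def find_nontext_bands(page_h, text_boxes, min_height=50):
    """Return (top, bottom) horizontal bands of a page, page_h pixels tall,
    that contain no recognized text. text_boxes is a list of
    (left, top, right, bottom) pixel boxes for the extracted text lines."""
    covered = [False] * page_h
    for _, top, _, bottom in text_boxes:
        for y in range(max(top, 0), min(bottom, page_h)):
            covered[y] = True

    bands, start = [], None
    for y in range(page_h):
        if not covered[y] and start is None:
            start = y                      # a text-free band begins
        elif covered[y] and start is not None:
            if y - start >= min_height:    # ignore thin gaps between lines
                bands.append((start, y))
            start = None
    if start is not None and page_h - start >= min_height:
        bands.append((start, page_h))
    return bands
```

    In a real pipeline you would build `text_boxes` from the `polygon` values of each extracted line, then crop the returned bands from the rendered page image.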
    🧠 Option 2: Custom Neural Model with Image Classifier (Hybrid Approach)

    If you want to go further:

    Combine Azure Document Intelligence for text extraction AND

    Use Azure Computer Vision or Custom Vision to classify image areas (detect diagrams, logos, illustrations, etc.)

    For example:

    Use Document Intelligence to get page layout and text bounding boxes.

    Use that info to crop the non-text zones.

    Send those cropped areas to Azure Vision APIs to classify them as diagrams or images.
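    The three steps above can be sketched as follows. This assumes you have already rendered the scanned page to a PIL image (e.g., with PyMuPDF) and computed candidate non-text pixel boxes from the layout output; the Azure Vision call at the end is shown only as a comment, since it needs a live endpoint:

```python
from PIL import Image

def crop_regions(page_image, boxes):
    """Crop each (left, top, right, bottom) pixel box from the rendered
    page image. The crops can then be sent to an image classifier."""
    return [page_image.crop(box) for box in boxes]

# Synthetic example: an 800x1000 white "page" with one candidate zone
page = Image.new("RGB", (800, 1000), "white")
crops = crop_regions(page, [(100, 200, 400, 500)])

# Each crop could then be classified, e.g. (hypothetical wiring):
# import io
# for crop in crops:
#     buf = io.BytesIO()
#     crop.save(buf, format="PNG")
#     # POST buf.getvalue() to your Azure AI Vision / Custom Vision endpoint
```

    Saving each crop to disk at this point also gives you the file you would later reference or upload when replacing the region in the extracted text.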

    🧠 Option 3: Use Page Content Tags (if using PDF SDK or AI Indexing)

    Some advanced pipelines (like Azure AI Search + Cognitive Skills) allow "image content detection" by chaining:

    OCR Skill

    Layout Skill

    Image Analysis Skill

    This may help tag and extract diagram zones, especially in scanned engineering or academic documents.
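    For illustration, a minimal skillset fragment chaining the OCR and Image Analysis skills over the images the indexer extracts. This assumes the indexer is configured with `"imageAction": "generateNormalizedImages"`; the skillset name and target field names here are illustrative:

```json
{
  "name": "image-extraction-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "inputs": [ { "name": "image", "source": "/document/normalized_images/*" } ],
      "outputs": [ { "name": "text", "targetName": "text" } ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Vision.ImageAnalysisSkill",
      "context": "/document/normalized_images/*",
      "visualFeatures": [ "tags", "description" ],
      "inputs": [ { "name": "image", "source": "/document/normalized_images/*" } ],
      "outputs": [ { "name": "tags", "targetName": "imageTags" } ]
    }
  ]
}
```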

    🧪 Tip:

    When working with scanned PDFs, always ensure that the PDF is readable and OCR-enabled (or set "readingOrder": "natural" in the layout API).

    Let me know if you'd like a working Python or REST sample showing how to extract and infer these regions. And if this helps, please click “Accept Answer” so others can benefit too 😊

    Best Regards,

    Jerald Felix


  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct.

  4. Deleted

    This answer has been deleted due to a violation of our Code of Conduct.
