How to differentiate text and diagrams/images from a scanned PDF and crop the images?

Riddhi Patel 30 Reputation points
2025-07-06T06:38:51.5133333+00:00

I'm working with scanned PDF files and want to process them using computer vision techniques to:

Differentiate between text and diagrams/images.

Accurately detect and crop the images/diagrams from the PDF.

Optionally, keep track of the position of each image so that I can later replace it in the text with a reference or URL.

I want to do this using vision-based approaches, not just OCR like Tesseract (which only gives me the text). Are there any proven methods, models, or open-source tools (in Python or any language) that can help identify and extract visual (non-text) elements from a scanned PDF?

Any insights or code samples would be really helpful!

Computer Vision
An Azure artificial intelligence service that analyzes content in images and video.

4 answers

Sort by: Most helpful
  1. Nikhil Jha (Accenture International Limited) 2,220 Reputation points Microsoft External Staff Moderator
    2025-09-05T10:00:25.34+00:00

    Hello Riddhi Patel,

    This is a common need when working with scanned PDFs that contain both text and visual content. You are absolutely right that traditional OCR tools like Tesseract are limited to text extraction and don't handle visual segmentation of diagrams or images.
    Thanks to our community member Jerald Felix for the prompt response.

    Azure provides the Document Intelligence Layout model, which can do exactly this for you:

    Step 1: Use Document Intelligence Layout Analysis

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint="your-document-intelligence-endpoint",
        credential=AzureKeyCredential("your-key")
    )

    # Analyze with the layout model to get text regions and figures
    with open("your_scanned.pdf", "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            body=f,  # the current azure-ai-documentintelligence SDK takes `body=`, not `document=`
            output_content_format="text",
            output=["figures"]  # also produce cropped figure images
        )
    result = poller.result()

    Key Capabilities:

    • Detects text regions with precise bounding boxes
    • Identifies figure/image areas automatically
    • Provides coordinates for each detected element
    • Extracts images as separate files (available via UI and API)
    • Supports scanned PDFs with high accuracy

    Step 2: Process Detected Regions

    # Extract text regions
    for page in result.pages:
        for line in page.lines:
            print(f"Text: {line.content}")
            print(f"Bounding box: {line.polygon}")

    # Extract figure/image regions
    for figure in result.figures or []:
        if figure.caption:
            print(f"Figure caption: {figure.caption.content}")
        print(f"Bounding regions: {figure.bounding_regions}")
    # The cropped figure image itself can be downloaded separately
    # (via the get-figure API) when output=["figures"] was requested

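    One practical detail if you want to crop the detected figures yourself from a rendered page image: for PDF input, the Layout model reports `polygon` coordinates in inches, so they must be scaled to pixels at whatever DPI you render the page (e.g., with PyMuPDF). A minimal sketch; the helper name and the `pad` parameter are my own, not part of the SDK:

```python
def polygon_to_pixel_box(polygon, dpi=200, pad=0):
    """Convert a flat [x1, y1, x2, y2, ...] polygon in inches
    (as the Layout model returns for PDF input) into a
    (left, top, right, bottom) pixel box at the given DPI,
    optionally padded by `pad` pixels on each side."""
    xs = polygon[0::2]
    ys = polygon[1::2]
    left = max(int(min(xs) * dpi) - pad, 0)
    top = max(int(min(ys) * dpi) - pad, 0)
    right = int(max(xs) * dpi) + pad
    bottom = int(max(ys) * dpi) + pad
    return (left, top, right, bottom)

# Example: a 1"x1" figure polygon starting at (2", 3"), page rendered at 100 DPI
box = polygon_to_pixel_box([2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 2.0, 4.0], dpi=100)
```

    The resulting box can be passed directly to `PIL.Image.crop`, as long as the page was rendered at the same DPI used for the conversion.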
    For reference:

    1. Azure AI Document Intelligence
    2. Q&A thread

    If this answer resolved your issue, please accept it and upvote, so other community members with a similar issue can find it. 😊

    1 person found this answer helpful.

  2. Jerald Felix 7,910 Reputation points
    2025-07-06T14:09:57.7+00:00

    Hi there,

    Great question! Differentiating text from diagrams/images in scanned PDFs with Azure AI Document Intelligence (formerly Form Recognizer) depends on the approach you take.

    🧠 Option 1: Layout Model (Prebuilt)

    The Layout model can analyze scanned PDFs and images to extract:

    Text (lines, words, tables)

    Bounding box coordinates for each line or word

    Information about selection marks and reading order

    However, it does not directly tag images or diagrams. You can still infer non-text regions (i.e., diagrams/images) by detecting areas where no text was recognized, for example large gaps between the extracted text bounding boxes.
    👉 Use this API:

    https://<endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze
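    To make the "infer non-text regions" idea concrete, here is a minimal sketch in plain Python (no Azure dependency). It takes the text bounding boxes from the layout output, already converted to pixels, and finds horizontal bands of the page with no recognized text; those bands are candidate diagram/image zones. The function name and the `min_height` threshold are illustrative, not part of any SDK:

```python
def find_nontext_bands(page_h, text_boxes, min_height=50):
    """Return (top, bottom) horizontal bands of a page, page_h pixels tall,
    that contain no recognized text. text_boxes is a list of
    (left, top, right, bottom) pixel boxes for the extracted text lines."""
    covered = [False] * page_h
    for _, top, _, bottom in text_boxes:
        for y in range(max(top, 0), min(bottom, page_h)):
            covered[y] = True

    bands, start = [], None
    for y in range(page_h):
        if not covered[y] and start is None:
            start = y                      # a text-free band begins
        elif covered[y] and start is not None:
            if y - start >= min_height:    # ignore thin gaps between lines
                bands.append((start, y))
            start = None
    if start is not None and page_h - start >= min_height:
        bands.append((start, page_h))
    return bands
```

    In a real pipeline you would build `text_boxes` from the `polygon` values of each extracted line, then crop the returned bands from the rendered page image.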
    🧠 Option 2: Custom Neural Model with Image Classifier (Hybrid Approach)

    If you want to go further:

    Combine Azure Document Intelligence for text extraction AND

    Use Azure Computer Vision or Custom Vision to classify image areas (detect diagrams, logos, illustrations, etc.)

    For example:

    Use Document Intelligence to get page layout and text bounding boxes.

    Use that info to crop the non-text zones.

    Send those cropped areas to Azure Vision APIs to classify them as diagrams or images.
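    The three steps above can be sketched as follows. This assumes you have already rendered the scanned page to a PIL image (e.g., with PyMuPDF) and computed candidate non-text pixel boxes from the layout output; the Azure Vision call at the end is shown only as a comment, since it needs a live endpoint:

```python
from PIL import Image

def crop_regions(page_image, boxes):
    """Crop each (left, top, right, bottom) pixel box from the rendered
    page image. The crops can then be sent to an image classifier."""
    return [page_image.crop(box) for box in boxes]

# Synthetic example: an 800x1000 white "page" with one candidate zone
page = Image.new("RGB", (800, 1000), "white")
crops = crop_regions(page, [(100, 200, 400, 500)])

# Each crop could then be classified, e.g. (hypothetical wiring):
# import io
# for crop in crops:
#     buf = io.BytesIO()
#     crop.save(buf, format="PNG")
#     # POST buf.getvalue() to your Azure AI Vision / Custom Vision endpoint
```

    Saving each crop to disk at this point also gives you the file you would later reference or upload when replacing the region in the extracted text.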

    🧠 Option 3: Use Page Content Tags (if using PDF SDK or AI Indexing)

    Some advanced pipelines (like Azure AI Search + Cognitive Skills) allow "image content detection" by chaining:

    OCR Skill

    Layout Skill

    Image Analysis Skill

    This may help tag and extract diagram zones, especially in scanned engineering or academic documents.
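    For illustration, a minimal skillset fragment chaining the OCR and Image Analysis skills over the images the indexer extracts. This assumes the indexer is configured with `"imageAction": "generateNormalizedImages"`; the skillset name and target field names here are illustrative:

```json
{
  "name": "image-extraction-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "inputs": [ { "name": "image", "source": "/document/normalized_images/*" } ],
      "outputs": [ { "name": "text", "targetName": "text" } ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Vision.ImageAnalysisSkill",
      "context": "/document/normalized_images/*",
      "visualFeatures": [ "tags", "description" ],
      "inputs": [ { "name": "image", "source": "/document/normalized_images/*" } ],
      "outputs": [ { "name": "tags", "targetName": "imageTags" } ]
    }
  ]
}
```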

    🧪 Tip:

    When working with scanned PDFs, always ensure that the PDF is readable and OCR-enabled (or set "readingOrder": "natural" in the layout API).

    Let me know if you'd like a working Python or REST sample showing how to extract and infer these regions. And if this helps, please click “Accept Answer” so others can benefit too 😊

    Best Regards,

    Jerald Felix


  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct.

  4. Deleted

    This answer has been deleted due to a violation of our Code of Conduct.
