How to create a perfect searchable PDF from Azure Document Intelligence JSON when letters have irregular spacing?

Nehoray Marziano 5 Reputation points
2025-09-25T12:48:38.89+00:00

I'm working on a project where I need to create a searchable PDF from a scanned document. My workflow is:

  1. Take a scanned PDF (image only).

Send it to Azure Document Intelligence (prebuilt-read model).

Crucially, I must use the JSON output that gives me word-level content and their bounding polygons. I cannot use Azure's direct "output searchable PDF" option.

Use this JSON to create a new searchable PDF by adding an invisible text layer on top of the original scanned image.

This works fine for "normal" text. However, I'm running into a big problem with documents that have irregular spacing between letters in a word.

For example, a word like "EXAMPLE" might appear in the scan as "E  X  A  M  P  L  E".

Azure's JSON output is incredibly accurate. It gives me a single word element for "EXAMPLE" with a tight 4-point polygon [[x0,y0], [x1,y1], [x2,y2], [x3,y3]] that perfectly encloses the entire stretched-out word.

My goal is to place the text "EXAMPLE" invisibly so that when a user searches for it in a PDF viewer, the highlight rectangle perfectly matches the visual word on the page.

The Problem I'm Facing

My approach has been to take the word's bounding box and try to fit the text into it. I'm using Python with libraries like PyMuPDF (fitz). My logic is something like this:

Get the word's bounding rectangle from the polygon.

Calculate the required fontsize to make the word (e.g., "EXAMPLE") fit the rectangle's width.

Insert the text invisibly (render_mode=3) at that font size.

This fails with letter-spaced words. Because the font's natural letter spacing doesn't match the weird spacing in the image, the text either overflows the box or is too small. When I search the final PDF, the highlight is offset and looks sloppy—it might only cover "E X A M" or be shifted to the side.

User's image

snippet of a script that draws the coordinates of each word, directly from the response json

User's image

one of my attempts, with as visible text layer

User's image

incorrect highlights when searching for 'ro' because of the offsets

The Big Question: How does Azure do it so well?

Here's the kicker. If I do request the searchable PDF directly from Azure (which I'm not allowed to use for my final output), it's flawless. The search highlights are perfect, even on these stretched-out words. This proves it's possible using the same underlying data.

I suspect they aren't just fitting text with a font size. They must be using a more advanced PDF technique, maybe applying a transformation matrix (Tm) to each word to stretch the text object itself to fit the exact polygon.

Has anyone here successfully tackled this?

How can I use the 4-point polygon from Azure's JSON to perfectly map my text string onto it?

Is there a way in Python (or another language) to define an affine transformation for each text object that says "map this string to this exact quadrilateral"?

Am I thinking about this the right way with transformation matrices, or is there another PDF-native trick I'm missing?

Any code snippets (especially with PyMuPDF/fitz, pikepdf, or reportlab) or high-level guidance would be a massive help. This problem is driving me crazy because I can see the "perfect" output from Azure, but I have to replicate it myself from the JSON

Azure AI Document Intelligence
0 comments No comments
{count} vote

1 answer

Sort by: Most helpful
  1. Jerald Felix 7,520 Reputation points
    2025-10-25T11:31:27.74+00:00

    Hello Nehoray Marziano •,

    You’re working on creating a perfect searchable PDF from Azure Document Intelligence JSON output, but running into problems where words with irregular letter spacing (e.g., “E X A M P L E”) are not being highlighted correctly on search in the resulting PDF. The bounding box from Azure is accurate, but simply adjusting font size doesn’t solve the mismatch since the PDF viewer’s highlight doesn’t fit the stretched-out word.

    What Azure does differently: Azure’s “output searchable PDF” feature produces flawless highlights even on stretched or spaced text, indicating it doesn’t just fit text by font size—rather, it applies a more advanced PDF technique, likely using the text transformation matrix (Tm). This matrix can scale, stretch, and map text exactly to a 4-point polygon so the highlight always matches the visual text.

    How you can replicate this:

    Affine Transformations: In PDF, you can use a transformation matrix to map text onto an exact quadrilateral, stretching letters as needed. This is beyond basic font placement and requires low-level PDF drawing.

    PyMuPDF/Fitz Approach: As of now, PyMuPDF doesn’t directly expose APIs for per-word affine text placement. It can use rectangles but not arbitrary quadrilaterals/matrices per text object. You could fake the effect by placing each letter individually at coordinates spaced according to the word’s bounding polygon, but this is complex and fragile.

    PDF Libraries: Libraries like reportlab or pikepdf allow more fine-grained control by using PDF commands directly (e.g., Tm). You can construct PDF objects that use a transformation matrix so your invisible text layer matches the shape of the word exactly.

    Reverse Engineering: You can inspect Azure’s generated PDFs (with a PDF inspector) and see the matrix used for text objects. Try to replicate this by manipulating the low-level PDF structure in your workflow.

    High-Level Steps:

    Convert each word’s polygon to an affine transformation matrix (map from horizontal text-space to polygon).

    Use a PDF library capable of writing custom matrices for text (e.g., pikepdf or directly with PDF syntax using Tm).

    Insert your invisible text with the matrix for each word.

    References & Guidance:

    Review answers and examples in advanced PDF forums and look for “custom text matrices” or “PDF Tm command for stretching text.”

    See related Microsoft documentation and community suggestions for OCR search accuracy and invisible text mapping.learn.microsoft

    This approach is non-trivial and may require custom code for polygon-to-matrix conversion and PDF object manipulation, but it’s the path Azure uses internally for perfect searchable highlights.

    Best Regards,

    Jerald Felix

    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.