Hello Nehoray Marziano •,
You’re working on creating a perfect searchable PDF from Azure Document Intelligence JSON output, but running into problems where words with irregular letter spacing (e.g., “E X A M P L E”) are not being highlighted correctly on search in the resulting PDF. The bounding box from Azure is accurate, but simply adjusting font size doesn’t solve the mismatch since the PDF viewer’s highlight doesn’t fit the stretched-out word.
What Azure does differently: Azure’s “output searchable PDF” feature produces flawless highlights even on stretched or spaced text, indicating it doesn’t just fit text by font size—rather, it applies a more advanced PDF technique, likely using the text transformation matrix (Tm). This matrix can scale, stretch, and map text exactly to a 4-point polygon so the highlight always matches the visual text.
How you can replicate this:
Affine Transformations: In PDF, you can use a transformation matrix to map text onto an exact quadrilateral, stretching letters as needed. This is beyond basic font placement and requires low-level PDF drawing.
PyMuPDF/Fitz Approach: As of now, PyMuPDF doesn’t directly expose APIs for per-word affine text placement. It can use rectangles but not arbitrary quadrilaterals/matrices per text object. You could fake the effect by placing each letter individually at coordinates spaced according to the word’s bounding polygon, but this is complex and fragile.
PDF Libraries: Libraries like reportlab or pikepdf allow more fine-grained control by using PDF commands directly (e.g., Tm). You can construct PDF objects that use a transformation matrix so your invisible text layer matches the shape of the word exactly.
Reverse Engineering: You can inspect Azure’s generated PDFs (with a PDF inspector) and see the matrix used for text objects. Try to replicate this by manipulating the low-level PDF structure in your workflow.
High-Level Steps:
Convert each word’s polygon to an affine transformation matrix (map from horizontal text-space to polygon).
Use a PDF library capable of writing custom matrices for text (e.g., pikepdf or directly with PDF syntax using Tm).
Insert your invisible text with the matrix for each word.
References & Guidance:
Review answers and examples in advanced PDF forums and look for “custom text matrices” or “PDF Tm command for stretching text.”
See related Microsoft documentation and community suggestions for OCR search accuracy and invisible text mapping.learn.microsoft
This approach is non-trivial and may require custom code for polygon-to-matrix conversion and PDF object manipulation, but it’s the path Azure uses internally for perfect searchable highlights.
Best Regards,
Jerald Felix