Several indexer-supported data sources, including Azure Blob Storage, Azure Data Lake Storage Gen2, and SharePoint, contain standalone files or embedded objects of various content types. Many of those content types have metadata properties that can be useful to index. Just as you can create search fields for standard blob properties like metadata_storage_name, you can create fields in a search index for metadata properties that are specific to a document format.
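As a minimal sketch of that idea, the following Python builds a few field definitions as plain dictionaries in the JSON shape an index definition uses. The property names come from this article. The helper name `metadata_field`, the choice of attributes, and the `Edm.String` type for each field are illustrative assumptions, not a prescribed schema. The output is only a fragment: a real index also needs a key field.

```python
import json


def metadata_field(name, edm_type="Edm.String", searchable=True, filterable=False):
    """Return one field definition as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        "type": edm_type,          # assumed type; adjust to your data
        "searchable": searchable,
        "filterable": filterable,
    }


# One standard blob property plus two format-specific properties.
fields = [
    metadata_field("metadata_storage_name"),
    metadata_field("metadata_author", filterable=True),
    metadata_field("metadata_title"),
]

# Print the fragment that would sit inside an index definition.
print(json.dumps({"fields": fields}, indent=2))
```

Keeping the definitions as dictionaries makes it easy to review the JSON before it is sent to the service by whatever client you use.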
Supported document formats
Azure AI Search supports blob indexing and SharePoint document indexing for the following document formats:
- CSV (see Indexing CSV blobs)
- EML
- EPUB
- GZ
- HTML
- JSON (see Indexing JSON blobs)
- KML (XML for geographic representations)
- Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 WORD XML)
- Open Document formats: ODT, ODS, ODP
- PDF
- Plain text files (see also Indexing plain text)
- RTF
- XML
- ZIP
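If you want a rough pre-check of which files in a container fall into these formats, a sketch like the following can help. The extension set is transcribed from the formats named in this article. Mapping plain text to `.txt` and HTML to `.html` only is an assumption, and `is_supported` is a hypothetical helper. An extension check is only a heuristic and says nothing about how the indexer itself determines content type.

```python
from pathlib import PurePosixPath

# Extensions for the formats named in this article.
# ".txt" for plain text and ".html" for HTML are assumptions.
SUPPORTED_EXTENSIONS = {
    ".csv", ".eml", ".epub", ".gz", ".html", ".json", ".kml",
    ".docx", ".doc", ".docm", ".xlsx", ".xls", ".xlsm",
    ".pptx", ".ppt", ".pptm", ".msg", ".xml",
    ".odt", ".ods", ".odp", ".pdf", ".txt", ".rtf", ".zip",
}


def is_supported(blob_name: str) -> bool:
    """Return True if the blob's final extension is in the set (case-insensitive)."""
    return PurePosixPath(blob_name).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["reports/2024/summary.DOCX", "photos/team.png", "logs/archive.tar.gz"]:
    print(name, is_supported(name))
```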
Document format properties
The following table summarizes processing for each document format, and describes the metadata properties extracted by a blob indexer and the SharePoint Online indexer.
| Document format / content type | Extracted metadata | Processing details |
|---|---|---|
| CSV (text/csv) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a CSV blob, see Index CSV blobs. |
| DOC (application/msword) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| DOCM (application/vnd.ms-word.document.macroenabled.12) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| EML (message/rfc822) | metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject | Extract text, including attachments |
| EPUB (application/epub+zip) | metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher | Extract text from all documents in the archive |
| GZ (application/gzip) | metadata_content_type | Extract text from all documents in the archive |
| HTML (text/html or application/xhtml+xml) | metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title | Strip HTML elements and extract text |
| JSON (application/json) | metadata_content_type, metadata_content_encoding | Extract text. NOTE: If you need to extract multiple document fields from a JSON blob, see Index JSON blobs. |
| KML (application/vnd.google-earth.kml+xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML elements and extract text |
| MSG (application/vnd.ms-outlook) | metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject | Extract text, including text extracted from attachments. metadata_message_to_email, metadata_message_cc_email, and metadata_message_bcc_email are string collections. The rest of the fields are strings. |
| ODP (application/vnd.oasis.opendocument.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_title | Extract text, including embedded documents |
| ODS (application/vnd.oasis.opendocument.spreadsheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| ODT (application/vnd.oasis.opendocument.text) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text, including embedded documents |
| PDF (application/pdf) | metadata_content_type, metadata_language, metadata_author, metadata_title, metadata_creation_date | Extract text, including embedded documents (excluding images) |
| Plain text (text/plain) | metadata_content_type, metadata_content_encoding, metadata_language | Extract text |
| PPT (application/vnd.ms-powerpoint) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title | Extract text, including embedded documents |
| RTF (application/rtf) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Extract text |
| WORD 2003 XML (application/vnd.ms-wordml) | metadata_content_type, metadata_author, metadata_creation_date | Strip XML elements and extract text |
| WORD XML (application/vnd.ms-word2006ml) | metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count | Strip XML elements and extract text |
| XLS (application/vnd.ms-excel) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XLSM (application/vnd.ms-excel.sheet.macroenabled.12) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet) | metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified | Extract text, including embedded documents |
| XML (application/xml) | metadata_content_type, metadata_content_encoding, metadata_language | Strip XML elements and extract text |
| ZIP (application/zip) | metadata_content_type | Extract text from all documents in the archive |
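When planning an index, it can help to treat the table as a lookup from content type to extracted properties. The sketch below transcribes three rows (CSV, PDF, and MSG) and computes the union of properties for a set of content types. It also applies the MSG note that three of the email fields are string collections. The names `EXTRACTED_METADATA`, `fields_for`, and `msg_field_type` are hypothetical, and mapping "string collection" to `Collection(Edm.String)` and "string" to `Edm.String` is an assumption about how you would type the index fields.

```python
# Three rows transcribed from the table above, keyed by content type.
EXTRACTED_METADATA = {
    "text/csv": ["metadata_content_type", "metadata_content_encoding"],
    "application/pdf": [
        "metadata_content_type", "metadata_language", "metadata_author",
        "metadata_title", "metadata_creation_date",
    ],
    "application/vnd.ms-outlook": [
        "metadata_content_type", "metadata_message_from",
        "metadata_message_from_email", "metadata_message_to",
        "metadata_message_to_email", "metadata_message_cc",
        "metadata_message_cc_email", "metadata_message_bcc",
        "metadata_message_bcc_email", "metadata_creation_date",
        "metadata_last_modified", "metadata_subject",
    ],
}

# Per the MSG row, these three are string collections; the rest are strings.
MSG_COLLECTION_FIELDS = {
    "metadata_message_to_email",
    "metadata_message_cc_email",
    "metadata_message_bcc_email",
}


def fields_for(content_types):
    """Union of metadata properties for the given content types, in first-seen order."""
    seen = []
    for content_type in content_types:
        for name in EXTRACTED_METADATA.get(content_type, []):
            if name not in seen:
                seen.append(name)
    return seen


def msg_field_type(name):
    """Assumed index field type for a property extracted from an MSG file."""
    return "Collection(Edm.String)" if name in MSG_COLLECTION_FIELDS else "Edm.String"


print(fields_for(["text/csv", "application/pdf"]))
print(msg_field_type("metadata_message_to_email"))
```

Content types that are not in the dictionary contribute no properties, so extending the sketch to other formats only requires adding their rows.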