You can configure settings for each eDiscovery case to control the following functionality:
- Near duplicates and email threading
- Themes
- Autogenerated review set query
- Ignore text
- Optical character recognition
Configure analytics settings for a case
To configure search and analytics settings for a case:
- Go to the Microsoft Purview portal and sign in with the credentials for a user account assigned eDiscovery permissions.
- Select the eDiscovery solution card, then select Cases in the left nav.
- Select a case, then select Case settings.
- On Case settings, select Search & analytics.
- The case Search & analytics page appears. These settings apply to all review sets in a case.
- After selecting the applicable search and analytics options, select Save.
The following sections describe the analytics settings that you can configure for a case.
Near duplicates and email threading
In this section, set parameters for duplicate detection, near duplicate detection, and email threading.
- Near duplicates/email threading: When you turn on this setting, the workflow includes duplicate detection, near duplicate detection, and email threading when you run analytics on the data in a review set.
- Document and email similarity threshold: If the similarity level for two documents is over the threshold, both documents are in the same near duplicate set.
- Minimum/maximum number of words: These settings specify that near duplicates and email threading analysis are performed only on documents that have at least the minimum number of words and at most the maximum number of words.
Near duplicate detection
Consider a set of documents to review where a subset of documents uses the same template and mostly the same boilerplate language, with a few differences. If a reviewer can identify this subset, review one of them thoroughly, and review the differences for the rest, they don't miss any unique information while taking only a fraction of the time it would take to read all documents cover to cover. Near duplicate detection groups textually similar documents together to help you make your review process more efficient.
When you run near duplicate detection, the system parses every document that contains text. It then compares each document against every other document to determine whether their similarity is greater than the set threshold. If it is, the system groups the documents together. After all documents are compared and grouped, the system marks one document from each group as the "pivot." When you review your documents, you can review a pivot first and then review the other documents in the same near duplicate set, focusing on the differences between the pivot and the document in review.
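As a rough illustration of the grouping described above, the following Python sketch filters documents by word count and then places each document in the first group whose pivot it resembles closely enough. It is not the eDiscovery implementation. The `similarity` metric (a word-level `difflib` ratio), the function names, the sample threshold of 0.65, the sample word limits, and the shortcut of treating the first document in each group as the pivot are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]. A stand-in metric, assumed for this sketch."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def group_near_duplicates(docs, threshold, min_words, max_words):
    """Group documents whose similarity to a group's pivot is over the threshold.

    docs: dict mapping a document ID to its extracted text.
    Returns a list of (pivot_id, member_ids) tuples.
    """
    # Only documents within the word-count limits take part in the analysis.
    eligible = {
        doc_id: text
        for doc_id, text in docs.items()
        if min_words <= len(text.split()) <= max_words
    }
    groups = []
    for doc_id, text in eligible.items():
        for pivot_id, members in groups:
            if similarity(eligible[pivot_id], text) > threshold:
                members.append(doc_id)
                break
        else:
            # No existing pivot is similar enough, so this document starts a new group.
            groups.append((doc_id, [doc_id]))
    return groups


docs = {
    "a": "The quarterly report template covers revenue costs and outlook for the north region this year",
    "b": "The quarterly report template covers revenue costs and outlook for the south region this year",
    "c": "Lunch menu for Friday includes soup salad pasta and a choice of two desserts for everyone",
    "d": "too short",
}
groups = group_near_duplicates(docs, threshold=0.65, min_words=10, max_words=1000)
# groups == [("a", ["a", "b"]), ("c", ["c"])]
```

In this sample, documents "a" and "b" differ by one word and land in the same near duplicate set with "a" as the pivot, "c" forms its own set, and "d" is skipped because it falls below the minimum word count.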
Email threading
Consider an email conversation that goes on for a while. In most cases, the last message in the email thread includes the contents of all the preceding messages. Therefore, reviewing the last message gives a complete context of the conversation that happened in the thread. Email threading identifies such messages so that reviewers can review a fraction of collected documents without losing any context.
Email threading in eDiscovery is the process of organizing a sequence of related emails that are part of the same conversation. This sequence includes the initial email and all subsequent replies and forwards linked to the original email. By grouping these emails into threads, reviewers see the entire context of a conversation, making it easier to understand the flow of communication. This approach helps reviewers identify relevant information more efficiently and eliminates the need to review each email individually. Email messages included in the analytics process have the following metadata populated:
- Is Inclusive: This field identifies whether an email contains all the unique content from a thread, including all previous replies. It ensures that only the most comprehensive email in a thread is reviewed, which is essential for understanding the full context of the conversation without having to review each individual reply.
- Has Unique Attachments: This field marks emails that contain attachments not found in other emails within the same thread. Even if the email content is duplicated, unique attachments are flagged to ensure that all relevant documents are reviewed. This aspect is important in the legal review process to ensure that no unique evidence is overlooked, even if the email body itself is not unique.
How is it different from conversations in Outlook?
At a glance, this process sounds similar to conversation groupings in Outlook. However, there are some important distinctions. Consider an email conversation that forks into two conversations. For instance, someone responds to an email that isn't the latest in the conversation, so the last two emails in the conversation both have unique content.
Outlook still groups the emails into a single conversation. If you read only the last email, you might miss the context of the second-to-last email, which also contains unique content. Because email threading parses each email into individual components and compares them, it marks both of the last two emails as inclusive. As long as you read all emails marked as inclusive, you don't miss any context.
Let's also consider an email thread with multiple replies, where some replies include inline responses that modify the quoted content. If an inline reply alters part of the previous email, the latest reply doesn't fully encompass the content of the earlier email. Both the latest reply and the earlier email with unique content are marked as inclusive. This approach ensures that any unique information from the inline reply is preserved and not overlooked.
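The forked-conversation case can be sketched in Python under a simplifying assumption: each email is modeled as the set of content segments it carries (its own text plus whatever it quotes), and an email counts as inclusive when no other email in the thread contains all of its segments and more. The function name `mark_inclusive` and the segment labels are hypothetical, and the actual analytics likely consider more than this set comparison. Exact duplicate emails are not handled here.

```python
def mark_inclusive(thread):
    """Return a dict of email ID -> True if the email is inclusive.

    thread: dict mapping an email ID to the set of content segments it contains.
    An email is treated as inclusive when its segments are not a strict subset
    of any other email's segments.
    """
    result = {}
    for email_id, segments in thread.items():
        covered = any(
            other_id != email_id and segments < other_segments
            for other_id, other_segments in thread.items()
        )
        result[email_id] = not covered
    return result


# m3 and m4 both reply to m2, so the conversation forks.
thread = {
    "m1": {"A"},
    "m2": {"A", "B"},
    "m3": {"A", "B", "C"},
    "m4": {"A", "B", "D"},
}
inclusive = mark_inclusive(thread)
# inclusive == {"m1": False, "m2": False, "m3": True, "m4": True}
```

An inline reply that alters quoted text fits the same model: the altered quote becomes a different segment, so the earlier email's original segment is no longer contained in the later reply and the earlier email stays inclusive.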
Themes
In this section, you can set the following parameters for themes:
- Themes: When turned on, the workflow performs themes clustering when you run analytics on the data in a review set.
- Maximum number of themes: Specifies the maximum number of themes that the workflow can generate when you run analytics on the data in a review set.
- Include numbers in themes: When turned on, the workflow includes numbers that identify a theme when generating themes.
- Adjust maximum number of themes dynamically: In certain situations, there might not be enough documents in a review set to produce the desired number of themes. When this setting is enabled, eDiscovery adjusts the maximum number of themes dynamically rather than attempting to enforce the maximum number of themes.
When you create a new document, you generally start with one or more ideas that you want to convey, and then compose the document using words that align with these ideas. The more prevalent an idea is, the more frequently the words related to that idea tend to appear. This method also aligns with how readers consume documents. The important things to understand from reading a document are the main ideas that the document is trying to convey. This understanding also includes which ideas appear where and what the relationships between the ideas are.
This process can be extended to how an eDiscovery reviewer wants to consume a set of documents in a case. They want to see which ideas are present in the review sets and which documents discuss those ideas. If they find a particular document of interest, they want to be able to see documents that discuss similar ideas.
The Themes functionality in eDiscovery attempts to mimic how humans reason about documents, by analyzing the themes that are discussed in a review set and assigning a theme to documents in the review set. In eDiscovery, Themes goes one step further and identifies the dominant theme in each review set and document. The dominant theme is the one that appears the most often in a document.
How do themes work?
The Themes functionality analyzes documents with text in a review set to parse out common themes that appear across all the documents in the review set. eDiscovery assigns those themes to the documents in which they appear. It also labels each theme with the words used in the documents that are representative of the theme. Because a document can contain various types of subject matter, eDiscovery often assigns multiple themes to review sets and documents. This assignment is referred to as the Themes list. The theme that appears most prominently in a review set or document is designated as its dominant theme.
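To make the Themes list and dominant theme concrete, here is a minimal Python sketch. It assumes the themes and their representative words are already known and skips the clustering step that discovers them. The function name `assign_themes`, the keyword sets, and the simple word-count scoring are hypothetical and are not how eDiscovery computes themes.

```python
import re
from collections import Counter


def assign_themes(text, theme_keywords):
    """Return (themes_list, dominant_theme) for one document.

    theme_keywords: dict mapping a theme label to its set of representative words.
    A theme is assigned when any of its words appears in the text. The dominant
    theme is the one whose words appear most often.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        theme: sum(words[keyword] for keyword in keywords)
        for theme, keywords in theme_keywords.items()
    }
    themes_list = sorted(
        (theme for theme, score in scores.items() if score > 0),
        key=lambda theme: scores[theme],
        reverse=True,
    )
    dominant = themes_list[0] if themes_list else None
    return themes_list, dominant


theme_keywords = {
    "budget": {"budget", "cost", "invoice"},
    "travel": {"flight", "hotel", "trip"},
    "hiring": {"candidate", "interview"},
}
text = "The trip budget covers one flight and the hotel. Send the hotel invoice after the trip."
themes_list, dominant = assign_themes(text, theme_keywords)
# themes_list == ["travel", "budget"], dominant == "travel"
```

The sample document mentions both travel and budget words, so both themes appear in its Themes list, and "travel" is dominant because its words occur most often.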
Configuring Themes
Themes are supported for cases and apply to all the review sets within them. You can configure the settings for themes when you create a new case or you can update the theme settings for an existing case.
To configure themes in a case, complete the following steps:
- Go to the Microsoft Purview portal and sign in with the credentials for a user account assigned eDiscovery permissions.
- Select the eDiscovery solution card and then select Cases (preview) in the left nav.
- Select a case, then select Case settings.
- On Case settings, select Search & analytics.
- Select the following theme options as applicable:
  - Max number of themes: Specifies the maximum number of themes that the workflow can generate when you run analytics on the data in review sets included in a case. For more information on limits, see Limits in eDiscovery.
  - Include numbers in themes: Numbers that identify a theme are included when generating themes.
  - Adjust maximum number of themes dynamically: In certain situations, there might not be enough documents in a review set to produce the desired number of themes for the case. When this setting is enabled, the maximum number of themes is adjusted dynamically rather than attempting to enforce the maximum number of themes.
- If you need to exclude keywords associated with themes, enter the text or regular expression needed in the Ignore text field. In the Apply to field, select Themes to apply the text or regular expression to all themes.
- Select Save.
After you create a new case, the workflow automatically runs analytics on the data when you add the review sets to the case. The workflow generates themes for the review sets as part of the analytics processing.
Review set query
If you select the Automatically create a For Review saved search after analytics checkbox, eDiscovery autogenerates a review set query named For Review.
This query filters out duplicate items from the review set, so you can quickly review the unique items in the review set. This query is created only when you run analytics for a review set in the case. For more information about review set queries, see Query the data in a review set.
Ignore text
Certain text can diminish the quality of analytics, such as lengthy disclaimers that get added to email messages regardless of the content of the email. If you know of text that should be ignored, you can exclude it from analytics by specifying the text string and the analytics functionality (near duplicates, email threading, themes, and relevance) from which the text should be excluded. You can also use regular expressions (RegEx) to specify the text to ignore.
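Before you enter a regular expression in the Ignore text field, it can help to try the pattern against sample text. The Python sketch below shows the general idea of removing a boilerplate disclaimer before analysis. The disclaimer wording, the pattern, and the `strip_ignored_text` helper are made up for this example, and the regular expression syntax that eDiscovery accepts might differ from Python's `re` module, so treat this as a way to reason about a pattern and not as a guarantee of product behavior.

```python
import re


def strip_ignored_text(text, ignore_patterns):
    """Remove every match of each pattern from the text.

    re.DOTALL lets "." match newlines so a pattern can span a multi-line
    disclaimer. That flag is a choice made for this sketch.
    """
    for pattern in ignore_patterns:
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()


# Hypothetical disclaimer that is appended to every outgoing message.
ignore_patterns = [r"CONFIDENTIALITY NOTICE:.*?END OF NOTICE\."]
message = (
    "Please see the attached contract.\n\n"
    "CONFIDENTIALITY NOTICE: This message is intended only for the addressee. "
    "END OF NOTICE."
)
cleaned = strip_ignored_text(message, ignore_patterns)
# cleaned == "Please see the attached contract."
```

The non-greedy `.*?` keeps the match from running past the end of the first disclaimer when a message contains more than one.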
Optical character recognition (OCR)
When you turn on this setting, OCR processing runs on image files. When OCR is applied to image files, text in these files is available in search results. OCR runs only on items processed during Advanced indexing (if you select this option in the search query).
For example, if a large PDF file that is partially indexed or had other indexing errors is processed during Advanced indexing, OCR is applied. OCR processing occurs only on files that are reindexed during the Advanced indexing process. This means there might be situations where content is added to a review set, but some email attachments aren't processed for OCR because these files aren't processed during Advanced indexing.
After you add data to a review set, you can review, search, tag, and analyze image text. You can view the extracted text in the Text viewer of the selected image file in the review set. For more information, see: