Exchange Online data resiliency

Exchange is one of the most heavily utilized Microsoft online services. It also serves as the long-term data storage for many other Microsoft 365 services such as Teams. For this reason, Exchange is robustly architected to ensure high resiliency in terms of data integrity and availability in the face of unforeseen disruptions.

Operational resiliency

Database Availability Groups

Every mailbox database in Microsoft 365 is hosted in a database availability group (DAG) and replicated to geographically separate datacenters within the same region. The most common configuration is three database copies in three datacenters; however, some regions have fewer datacenters (two datacenters in Australia and Japan). In all cases, every mailbox database has at least three copies distributed across multiple datacenters, ensuring that mailbox data is protected from software, hardware, and even datacenter failures.
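As a rough illustration of this placement rule, the following Python sketch (not Exchange Online code; the type and function names are hypothetical) checks that a set of database copies meets the minimum-copy and multi-datacenter constraints described above:

```python
# Hypothetical sketch: validate DAG copy placement constraints.
# Illustrates the rule "at least three copies, spread across more than
# one datacenter in the same region"; not the actual provisioning logic.

from dataclasses import dataclass

@dataclass
class DatabaseCopy:
    database: str
    datacenter: str
    lagged: bool = False

def placement_is_resilient(copies: list) -> bool:
    """Return True if the copies meet the minimum resiliency constraints."""
    if len(copies) < 3:                       # at least three copies
        return False
    datacenters = {c.datacenter for c in copies}
    return len(datacenters) >= 2              # spread across multiple datacenters

copies = [
    DatabaseCopy("DB01", "datacenter-a"),
    DatabaseCopy("DB01", "datacenter-b"),
    DatabaseCopy("DB01", "datacenter-b", lagged=True),
]
assert placement_is_resilient(copies)
```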

All of these copies, including the lagged copy, are configured as highly available. The lagged database copy isn't intended for individual mailbox recovery or mailbox item recovery; its purpose is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption.

Exchange Online uses available lag copies, which combine the resilience of traditional lagged copies with the activation readiness of highly available (HA) copies. The lag copy maintains snapshots, allowing the database to be taken back to restore points within the last seven days. This model improves availability while still supporting recovery for rare logical corruption scenarios. Unlike traditional lagged copies, available lag copies are part of the active redundancy set and can be promoted automatically.

Note

If a mailbox is hard deleted, the available lag copy doesn't help recover it. Instead, the Exchange store keeps the mailbox in a soft-deleted state so that accidental mailbox deletions can be caught and recovered.

Transport resilience

Exchange Online includes two primary transport resilience features: Shadow Redundancy and Safety Net. Shadow Redundancy keeps a redundant copy of a message while it is in transit. Safety Net keeps a redundant copy of a message after the message is successfully delivered.

With Shadow Redundancy, each Exchange Online transport server makes a copy of each message it receives before it acknowledges successfully receiving the message to the sending server. This approach makes all messages in the transport pipeline redundant while in transit. If Exchange Online determines the original message was lost in transit, a redundant copy of the message is redelivered.
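The following minimal sketch illustrates the shadow redundancy idea in Python. It isn't the Exchange Online transport implementation; the classes and queues are hypothetical and only show the "persist a redundant copy before acknowledging" pattern:

```python
# Conceptual sketch of shadow redundancy: a receiving server persists a
# shadow copy of each message before acknowledging receipt to the sender,
# so a message in transit is never held by only one server.

class TransportServer:
    def __init__(self, name):
        self.name = name
        self.shadow_queue = []     # redundant copies of in-transit messages
        self.delivery_queue = []

    def receive(self, message, shadow_server):
        # Persist a redundant copy on a different server first ...
        shadow_server.shadow_queue.append(message)
        self.delivery_queue.append(message)
        # ... and only then acknowledge receipt to the sending server.
        return "ACK"

    def recover_lost(self, shadow_server):
        # If this server loses a message before delivery, the shadow copy
        # is resubmitted instead of being lost.
        for message in shadow_server.shadow_queue:
            if message not in self.delivery_queue:
                self.delivery_queue.append(message)

primary = TransportServer("transport-1")
shadow = TransportServer("transport-2")
assert primary.receive({"id": "m1", "body": "hello"}, shadow) == "ACK"
```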

Safety Net is a transport queue that is associated with the Transport service on a Mailbox server. This queue stores copies of messages that the server successfully processes. When a mailbox database or server failure requires activating an out-of-date copy of the mailbox database, messages in the Safety Net queue are automatically resubmitted to the new active copy of the mailbox database. Safety Net is also redundant, thereby eliminating transport as a single point of failure. It uses the concept of a Primary Safety Net and a Shadow Safety Net. If the Primary Safety Net is unavailable for more than 12 hours, resubmit requests become shadow resubmit requests, and messages are redelivered from the Shadow Safety Net.

Message resubmissions from Safety Net are automatically initiated by the Active Manager component of the Microsoft Exchange Replication service that manages DAGs and mailbox database copies. No manual actions are required to resubmit messages from Safety Net.
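A simplified sketch of the Safety Net resubmission logic is shown below. All names and data shapes are assumptions for illustration; the real behavior is implemented by the transport services and Active Manager:

```python
# Hypothetical sketch of Safety Net resubmission: messages delivered after
# the point in time covered by the newly activated (out-of-date) database
# copy are redelivered from the Safety Net queue. If the Primary Safety Net
# has been unavailable for more than 12 hours, the Shadow Safety Net
# answers the resubmit request instead.

from datetime import timedelta

PRIMARY_UNAVAILABLE_THRESHOLD = timedelta(hours=12)

def resubmit_from_safety_net(primary_queue, shadow_queue,
                             primary_unavailable_since, failover_point, now):
    """Return the delivered messages that must be redelivered after failover."""
    use_shadow = (
        primary_unavailable_since is not None
        and now - primary_unavailable_since > PRIMARY_UNAVAILABLE_THRESHOLD
    )
    queue = shadow_queue if use_shadow else primary_queue
    # Redeliver everything the out-of-date database copy has not yet seen.
    return [m for m in queue if m["delivered_at"] > failover_point]
```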

Failover

Failovers occur at different levels. At the database level, different signals are used to automatically move the active copy of the database; those signals include heartbeats, network throughput, hardware errors, and administrative actions. Rack-level failovers deactivate all active databases on a physical rack and move them to copies hosted elsewhere. Site switchovers occur when all of a site's racks are switched over at once, typically manually as part of disaster recovery.
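As a loose illustration only (Active Manager's actual selection algorithm is more sophisticated and isn't described here), a database-level failover decision can be thought of as picking the healthiest passive copy when the active copy looks unhealthy:

```python
# Hypothetical sketch of a database failover decision, not the real
# Active Manager logic: when the active copy is unhealthy, activate the
# passive copy with the least replication backlog.

def pick_new_active(copies):
    """copies: list of dicts like {"server": ..., "healthy": ..., "copy_queue": ...}."""
    candidates = [c for c in copies if c["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy passive copy available")
    return min(candidates, key=lambda c: c["copy_queue"])
```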

Site switchover

Site switchover is Exchange Online's recovery mechanism for maintaining continuity of service in large-scale datacenter failure scenarios. For incidents that impact entire datacenters, all racks in a site often need to be manually switched over at the same time. Examples where site switchovers are necessary include fiber cuts, power outages, or cooling failures that impact the entire location.

Modern services like Exchange Online are hosted in an Active/Active cloud architecture in which all sites take traffic and load simultaneously, data is replicated constantly, load balancing automatically distributes load, and each site can handle the majority of the load within its geo/DAG. This means that a single site or region being offline doesn't significantly impact service availability and thus shouldn't impact the end user's experience.

We conduct location switchover exercises to prepare for disaster scenarios. Conducting these exercises allows us to identify bugs, validate the resilience of our service, and measure actual performance against expectations. These exercises ensure confidence in our preparedness and our ability to handle disasters when they strike.

Corruption prevention and correction

An In-Place Hold preserves all mailbox content, including deleted items and original versions of modified items. All such mailbox items are returned in an In-Place eDiscovery search. When you place an In-Place Hold on a user's mailbox, the contents in the corresponding archive mailbox (if it's enabled) are also placed on hold and returned in an eDiscovery search.

Two types of corruption can affect an Exchange database: physical corruption and logical corruption. Physical corruption is typically caused by hardware problems, especially storage hardware. Logical corruption occurs due to other factors. Generally, two types of logical corruption can occur within an Exchange database:

  • Database logical corruption - The database page checksum matches, but the data on the page is wrong logically. This corruption can occur when the database engine (the Extensible Storage Engine (ESE)) attempts to write a database page and even though the operating system returns a success message, the data is either never written to the disk or it's written to the wrong place. This issue is referred to as a lost flush. ESE includes numerous features and safeguards that are designed to prevent physical corruption of a database and other data loss scenarios. To prevent lost flushes from losing data, ESE includes a lost flush detection mechanism in the database along with a feature (single page restore) to correct it.
  • Store logical corruption - Data is added, deleted, or manipulated in a way that the user doesn't expect. These cases are typically caused by third-party applications, and it's usually corruption only in the sense that the user views it as corruption. The Exchange store considers the transaction that produced the logical corruption to be a series of valid MAPI operations. The In-Place Hold feature in Exchange Online provides protection from store logical corruption because it prevents content from being permanently deleted by a user or an application.

Exchange Online performs several consistency checks on replicated log files during both log inspection and log replay. These consistency checks prevent physical corruption from being replicated by the system. For example, during log inspection, there's a physical integrity check that verifies the log file and validates that the checksum recorded in the log file matches the checksum generated in memory. In addition, the log file header is examined to make sure the log file signature recorded in the log header matches that of the log file. During log replay, the log file undergoes further scrutiny. For example, the database header also contains the log signature that is compared with the log file's signature to ensure they match.
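The sketch below mirrors the two log inspection checks just described: verifying a recorded checksum and matching the log signature. The data layout is hypothetical and doesn't reflect the actual ESE log format:

```python
# Illustrative sketch of log inspection (hypothetical structures, not the
# ESE on-disk format): verify that the checksum recorded for the log file
# matches one computed over its contents, and that the signature in the log
# header matches the expected signature.

import zlib

def inspect_log_file(log_bytes: bytes, recorded_checksum: int,
                     header_signature: bytes, expected_signature: bytes) -> bool:
    if zlib.crc32(log_bytes) != recorded_checksum:
        return False                      # physical integrity check failed
    if header_signature != expected_signature:
        return False                      # log signature mismatch
    return True
```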

Protection against corruption of mailbox data in Exchange Online is achieved by using Exchange Native Data Protection, a resiliency strategy that uses application-level replication across multiple servers and multiple datacenters along with other features that help protect data from being lost due to corruption or other reasons. These features include native features that are managed by Microsoft or the Exchange Online application itself, such as:

  • Database Availability Groups
  • Single Bit Correction
  • Online Database Scanning
  • Lost Flush Detection
  • Single Page Restore
  • Mailbox Replication Service
  • Log File Checks
  • Deployment on Resilient File System

The following sections provide additional details on the native features listed previously. In addition to these native features, Exchange Online also includes data resiliency features that customers can manage themselves.

Single-bit correction

ESE includes a mechanism to detect and resolve single-bit CRC errors (also known as single-bit flips) that result from hardware errors. These errors represent physical corruption. When these errors occur, ESE automatically corrects them and logs an event in the event log.
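Conceptually, a single-bit CRC error can be corrected by finding the one bit whose flip restores the recorded checksum. The following sketch demonstrates that idea; it is not ESE's actual algorithm:

```python
# Conceptual sketch of single-bit error correction: if a page's checksum
# doesn't match, try flipping each bit in turn; if a flip restores the
# recorded checksum, the error was a single-bit flip and is corrected in place.

import zlib

def correct_single_bit(page: bytearray, recorded_checksum: int) -> bool:
    if zlib.crc32(page) == recorded_checksum:
        return True                               # page is already consistent
    for byte_index in range(len(page)):
        for bit in range(8):
            page[byte_index] ^= 1 << bit          # flip one bit
            if zlib.crc32(page) == recorded_checksum:
                return True                       # corrected; an event would be logged
            page[byte_index] ^= 1 << bit          # undo the flip
    return False                                  # not a single-bit error
```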

Online database scanning

Online database scanning (also known as database check summing) is the process where ESE uses a database consistency checker to read each page and check for page corruption. The primary purpose is to detect physical corruption and lost flushes that transactional operations might not detect. Database scanning also performs post-store crash operations. Crashes can cause space leaks, and online database scanning finds and recovers lost space. The system is designed with the expectation that every database is fully scanned once every seven days.
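A scan pass can be pictured as the following sketch: read every page, verify its checksum, and report corrupt pages for repair. The page layout here is hypothetical; the real scan is paced so each database completes a full pass within seven days:

```python
# Rough sketch of an online database scan pass (hypothetical page layout):
# verify the checksum of every page and report corrupt pages so they can be
# repaired (for example, through single page restore).

import zlib

def scan_database(pages):
    """pages: iterable of (page_number, page_bytes, recorded_checksum)."""
    corrupt = []
    for page_number, page_bytes, recorded_checksum in pages:
        if zlib.crc32(page_bytes) != recorded_checksum:
            corrupt.append(page_number)    # candidate for single page restore
    return corrupt
```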

Lost flush detection

A lost flush occurs when a database write operation that the disk subsystem or operating system reports as completed didn't actually get written to disk, or was written in the wrong location. Lost flush incidents can result in database logical corruption. To prevent lost flushes from resulting in lost data, ESE includes a lost flush detection mechanism. As database pages are written to passive copies, the system checks for lost flushes on the active copy. If it detects a lost flush, ESE repairs the corrupt page by using the page patching process.
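The following simplified sketch shows one way a lost flush can be detected: compare what was acknowledged with what is actually on disk. The bookkeeping structure is hypothetical, not the ESE flush-detection format; a detected lost flush would then be repaired through single page restore:

```python
# Simplified illustration of lost flush detection (hypothetical bookkeeping):
# record the expected write generation for each page when the write is
# acknowledged, then compare it with what is actually on disk when the page
# is read back.

class LostFlushDetector:
    def __init__(self):
        self.expected_generation = {}   # page number -> last acknowledged write

    def on_write_acknowledged(self, page_number, generation):
        self.expected_generation[page_number] = generation

    def check_page(self, page_number, on_disk_generation) -> bool:
        """Return True if a lost flush is detected for this page."""
        expected = self.expected_generation.get(page_number)
        return expected is not None and on_disk_generation < expected
```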

Single page restore

Single page restore, also known as page patching, is an automatic process where corrupt database pages are replaced by healthy copies from a replica. The repair process for a corrupt page depends on whether the database copy is active or passive. When an active database copy encounters a corrupted page, it can copy a page from one of its replicas, provided the page it copies is up to date. This process is accomplished by putting a request for the page into the log stream, which is the basis of mailbox database replication. As soon as a replica encounters the page request, it responds by sending a copy of the page to the requesting database copy. Single page restore also provides an asynchronous communication mechanism for the active copy to request a page from replicas, even if the replicas are currently offline.

If corruption occurs in a passive database copy, including a lagged database copy, it's always safe to copy any page from the active copy, because passive copies are always behind their active copy. A passive database copy is by nature highly available, so during the page patching process, log replay is suspended but log copying continues. The passive database copy retrieves a copy of the corrupted page from the active copy, waits until the log file that meets the maximum required log generation requirement has been copied and inspected, and then patches the corrupt page. Once the page is patched, log replay resumes. The process is the same for a lagged database copy, except that the lagged copy first replays all log files necessary to reach a patchable state.
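Schematically, page patching amounts to requesting an up-to-date copy of the corrupt page from a replica and overwriting the bad page with it. The sketch below uses plain dictionaries as stand-ins; the real exchange happens through the replication log stream:

```python
# Schematic sketch of page patching (hypothetical objects): when a copy
# finds a corrupt page, it asks a replica for an up-to-date copy and
# overwrites the corrupt page with the healthy one.

def patch_corrupt_page(target_pages, replicas, page_number):
    """target_pages: dict of page_number -> page bytes for the copy being repaired."""
    for replica in replicas:
        healthy_page = replica.get(page_number)    # replica answers the page request
        if healthy_page is not None:
            target_pages[page_number] = healthy_page
            return True
    return False    # no replica could supply the page; deeper repair needed

active = {1: b"good", 2: b"corrupt!"}
replica_a = {1: b"good", 2: b"good"}
assert patch_corrupt_page(active, [replica_a], 2)
```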

Mailbox Replication Service

Moving mailboxes is a key part of managing a large-scale email service. There are always new technologies, hardware refreshes, and version upgrades to deal with, so it's essential to have a robust, throttled system that lets our engineers accomplish this work while keeping mailbox moves transparent to users (they stay online throughout the process) and that scales up gracefully as mailboxes grow larger and larger.

The Exchange Mailbox Replication Service (MRS) is responsible for moving mailboxes between databases. During the move, MRS performs a consistency check on all items within the mailbox. If it finds a consistency issue, MRS either corrects the problem or skips the corrupted items, thereby removing the corruption from the mailbox.
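The per-item behavior during a move can be sketched as follows. The helper callables are hypothetical; they stand in for MRS's internal consistency checks and repair logic:

```python
# Conceptual sketch of per-item handling during a mailbox move (hypothetical
# helpers, not the MRS implementation): every item is checked; items that can
# be repaired are repaired, and items that can't are skipped so corruption
# isn't carried into the destination mailbox.

def move_mailbox(items, is_consistent, try_repair):
    moved, skipped = [], []
    for item in items:
        if is_consistent(item):
            moved.append(item)
            continue
        repaired = try_repair(item)
        if repaired is not None:
            moved.append(repaired)
        else:
            skipped.append(item)     # corruption is left behind, not copied
    return moved, skipped
```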

Because MRS is a component of Exchange Online, we can make changes in its code to address new forms of corruption that are detected in the future. For example, if we detect a consistency issue that MRS isn't able to fix, we can analyze the corruption, change the MRS code, and correct the inconsistency (if we understand how to).

Log file checks

All transaction log files generated by an Exchange database undergo several forms of consistency checks. When a log file is created, the system first writes a bit pattern and then performs a series of log writes. This structure enables Exchange Online to execute a series of checks (lost flush, CRC, and other checks) to validate each log file as it's written, and again as it's replicated.

Deployment on Resilient File System

To help prevent corruption at the file system level, Exchange Online is deployed on Resilient File System (ReFS) partitions. This deployment provides improved recovery capabilities. ReFS is a file system in Windows Server 2012 and later that's designed to be more resilient against data corruption, thereby maximizing data availability and integrity. Specifically, ReFS brings improvements in the way that metadata is updated, which offers better protection for data and reduces data corruption cases. It also uses checksums to verify the integrity of file data and metadata, ensuring that data corruption is easily found and repaired.
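In the spirit of ReFS integrity checks (this is an illustration of checksum-on-read, not the ReFS implementation or its API), the following sketch shows why checksums stored alongside data make corruption detectable deterministically at read time:

```python
# Minimal illustration of checksum-based integrity verification: store a
# checksum alongside each block and verify it on every read, so corruption
# is detected immediately instead of surfacing later as a grey failure.

import zlib

def write_block(store, block_id, data: bytes):
    store[block_id] = (data, zlib.crc32(data))

def read_block(store, block_id) -> bytes:
    data, recorded = store[block_id]
    if zlib.crc32(data) != recorded:
        raise IOError(f"block {block_id} failed its integrity check")
    return data
```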

Exchange Online takes advantage of several ReFS benefits:

  • More resiliency in data integrity means fewer data corruption incidents. Reducing the number of corruption incidents means fewer unnecessary database reseeds.
  • Checksums on metadata enable earlier and more deterministic detection of corruption, allowing us to fix customer data corruption before grey failures occur on data volumes.
  • Designed to work well with large data sets—petabytes and larger—without performance impact.
  • Support for other features used by Exchange Online, such as BitLocker encryption.

Exchange Online also benefits from other ReFS features:

  • Integrity (Integrity Streams) - ReFS stores data in a way that protects it from many of the common errors that can normally cause data loss. Microsoft 365 Search uses Integrity Streams to help with early disk corruption detection and checksums of file content. The feature also reduces corruption incidents caused by 'Torn Writes' (when a write operation doesn't complete due to power outages, etc.).
  • Availability (Salvage) - ReFS prioritizes the availability of data. Historically, file systems were often susceptible to data corruption that would require the system to be taken offline for repair. Although rare, if corruption does occur, ReFS implements salvage, a feature that removes the corrupt data from the namespace on a live volume and ensures that good data isn't adversely affected by nonrepairable corrupt data. Applying the Salvage feature and isolating data corruption to Exchange Online database volumes means that we can keep nonaffected databases on a corrupted volume healthy between the time of corruption and repair action. This structure increases the availability of databases that would normally be affected by such disk corruption issues.