Data resiliency in Microsoft 365

Given the complex nature of cloud computing, Microsoft is mindful that it's not a case of if things go wrong, but rather when. We design our cloud services to maximize reliability and minimize the negative effects on customers when things do go wrong. We moved beyond the traditional strategy of relying on complex physical infrastructure, and we built redundancy directly into our cloud services. We use a combination of less complex physical infrastructure and more intelligent software that builds data resiliency into our services and delivers high availability to our customers.

Resiliency and recoverability are built in

Building in resiliency and recovery starts with the assumption that underlying infrastructure and processes fail at some point: hardware (infrastructure) fails, humans make mistakes, and software has bugs. While it would be incorrect to say that software developers weren't thinking about these things before the cloud, how these issues were handled in a typical IT implementation was different before the cloud:

  • First, hardware and infrastructure protections were significant. Running datacenters at 99.99% reliability required significant power and network redundancy, and servers were implemented with hardware-based clustering, dual power supplies, dual network interfaces, and the like.
  • Second, process was paramount. Operations teams maintained rigorous procedures, change windows were employed, and there was often significant project management overhead.
  • Third, deployment took place at a glacial pace. Deploying code without owning the source meant waiting for patch releases, and major version releases involved hardware replacement and significant capital outlay. Moreover, the only way to correct a problem was to roll back. As a result, most IT organizations deployed only major releases to avoid the work of keeping up to date.
  • Finally, the scale of deployed systems and the level of their interconnectedness were historically much smaller than they are now.

Today, customers expect continuous innovation from Microsoft without compromising quality, and this expectation is one of the reasons why Microsoft's services and software are built with resiliency and recoverability in mind.

Microsoft 365 data resiliency principles

Resiliency refers to the ability of a cloud-based service to withstand certain types of failures and yet remain fully functional from the customers' perspective. Data resiliency means that no matter what failures occur within Microsoft 365, critical customer data remains intact and unaffected. To that end, Microsoft 365 services are designed around five specific resiliency principles:

  • There's critical and noncritical data. Noncritical data (for example, whether a message was read) can be dropped in rare failure scenarios. Critical data (for example, customer data such as email messages) must be protected at extreme cost. As a design goal, delivered mail messages are always critical.
  • Separate copies of customer data into different fault zones, or into as many fault domains as possible, to provide failure isolation. Examples of fault domains are datacenters and anything accessible by a single set of credentials (a process, server, or operator).
  • Monitor critical customer data for violations of any part of atomicity, consistency, isolation, and durability (ACID).
  • Protect customer data from corruption. Actively scan or monitor it, and make it repairable and recoverable.
  • Most data loss results from customer actions, so allow customers to recover on their own by using a GUI that enables them to restore accidentally deleted items.
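
As an illustration only, two of the principles above (separating copies across fault domains, and actively scanning data so that corruption can be repaired) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how Microsoft 365 is implemented: the names `Replica`, `ReplicatedItem`, `place_replicas`, and `scan_and_repair`, the fault domain labels, and the choice of SHA-256 checksums are all hypothetical and chosen for the example.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(payload: bytes) -> str:
    """Return a SHA-256 digest used to detect corruption (an assumed choice)."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Replica:
    fault_domain: str  # hypothetical label, for example a datacenter name
    payload: bytes


@dataclass
class ReplicatedItem:
    expected_checksum: str
    replicas: list[Replica] = field(default_factory=list)


def place_replicas(payload: bytes, fault_domains: list[str], copies: int) -> ReplicatedItem:
    """Store each copy in a different fault domain to provide failure isolation."""
    distinct = list(dict.fromkeys(fault_domains))  # drop duplicates, keep order
    if copies > len(distinct):
        raise ValueError("not enough distinct fault domains for failure isolation")
    item = ReplicatedItem(expected_checksum=checksum(payload))
    for domain in distinct[:copies]:
        item.replicas.append(Replica(domain, payload))
    return item


def scan_and_repair(item: ReplicatedItem) -> list[str]:
    """Scan every replica and overwrite corrupted copies from a healthy one.

    Returns the fault domains whose copies were repaired.
    """
    healthy = [r for r in item.replicas if checksum(r.payload) == item.expected_checksum]
    if not healthy:
        # Nothing left to repair from; a real system would fall back to recovery.
        raise RuntimeError("no healthy replica available")
    repaired = []
    for replica in item.replicas:
        if checksum(replica.payload) != item.expected_checksum:
            replica.payload = healthy[0].payload
            repaired.append(replica.fault_domain)
    return repaired
```

In this sketch, a copy that is corrupted in one fault domain is detected by the checksum mismatch and rewritten from a healthy copy in another domain, which is only possible because the copies were placed in separate domains in the first place.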

By building our cloud services to these principles, coupled with robust testing and validation, Microsoft 365 meets and exceeds the requirements of customers while ensuring a platform for continuous innovation and improvement.