Bookkeeping Service Providers

  • Accounting
  • Bookkeeping
  • US Taxation
  • Financial Planning
  • Accounting Software
  • Small Business Finance
You are here: Home / CLOUD / Advancing Azure Active Directory availability

Advancing Azure Active Directory availability

December 17, 2019 by cbn Leave a Comment

“Continuing our Azure reliability series to be as transparent as possible about key initiatives underway to keep improving availability, today we turn our attention to Azure Active Directory. Microsoft Azure Active Directory (Azure AD) is a cloud identity service that provides secure access to over 250 million monthly active users, connecting over 1.4 million unique applications and processing over 30 billion daily authentication requests. This makes Azure AD not only the largest enterprise Identity and Access Management solution, but easily one of the world’s largest services. The post that follows was written by Nadim Abdo, Partner Director of Engineering, who is leading these efforts.” – Mark Russinovich, CTO, Azure


 

Our customers trust Azure AD to manage secure access to all their applications and services. For us, this means that every authentication request is a mission critical operation. Given the critical nature and the scale of the service, our identity team’s top priority is the reliability and security of the service. Azure AD is engineered for availability and security using a truly cloud-native, hyper-scale, multi-tenant architecture and our team has a continual program of raising the bar on reliability and security.

Azure AD: Core availability principles

Engineering a service of this scale, complexity, and mission criticality to be highly available in a world where everything we build on can and does fail is a complex task.

Our resilience investments are organized around the set of reliability principles below:

o	Azure AD resilience investments are organized around this set of reliability principles

Our availability work adopts a layered defense approach to reduce the possibility of customer visible failure as much as possible; if a failure does occur, scope down the impact of that failure as much as possible, and finally, reduce the time it takes to recover and mitigate a failure as much as possible.

Over the coming weeks and months, we dive deeper into how each of the principles is designed and verified in practice, as well as provide examples of how they work for our customers.

Highly redundant

Azure AD is a global service with multiple levels of internal redundancy and automatic recoverability. Azure AD is deployed in over 30 datacenters around the world leveraging Azure Availability Zones where present. This number is growing rapidly as additional Azure Regions are deployed.

For durability, any piece of data written to Azure AD is replicated to at least 4 and up to 13 datacenters depending on your tenant configuration. Within each data center, data is again replicated at least 9 times for durability but also to scale out capacity to serve authentication load. To illustrate—this means that at any point in time, there are at least 36 copies of your directory data available within our service in our smallest region. For durability, writes to Azure AD are not completed until a successful commit to an out of region datacenter.

This approach gives us both durability of the data and massive redundancy—multiple network paths and datacenters can serve any given authorization request, and the system automatically and intelligently retries and routes around failures both inside a datacenter and across datacenters.

To validate this, we regularly exercise fault injection and validate the system’s resiliency to failure of the system components Azure AD is built on. This extends all the way to taking out entire datacenters on a regular basis to confirm the system can tolerate the loss of a datacenter with zero customer impact.

No single points of failure (SPOF)

As mentioned, Azure AD itself is architected with multiple levels of internal resilience, but our principle extends even further to have resilience in all our external dependencies. This is expressed in our no single point of failure (SPOF) principle.

Given the criticality of our services we don’t accept SPOFs in critical external systems like Distributed Name Service (DNS), content delivery networks (CDN), or Telco providers that transport our multi-factor authentication (MFA), including SMS and Voice. For each of these systems, we use multiple redundant systems configured in a full active-active configuration.

Much of that work on this principle has come to completion over the last calendar year, and to illustrate, when a large DNS provider recently had an outage, Azure AD was entirely unaffected because we had an active/active path to an alternate provider.

Elastically scales

Azure AD is already a massive system running on over 300,000 CPU Cores and able to rely on the massive scalability of the Azure Cloud to dynamically and rapidly scale up to meet any demand. This can include both natural increases in traffic, such as a 9AM peak in authentications in a given region, but also huge surges in new traffic served by our Azure AD B2C which powers some of the world’s largest events and frequently sees rushes of millions of new users.

As an added level of resilience, Azure AD over-provisions its capacity and a design point is that the failover of an entire datacenter does not require any additional provisioning of capacity to handle the redistributed load. This gives us the flexibility to know that in an emergency we already have all the capacity we need on hand.

Safe deployment

Safe deployment ensures that changes (code or configuration) progress gradually from internal automation to internal to Microsoft self-hosting rings to production. Within production we adopt a very graduated and slow ramp up of the percentage of users exposed to a change with automated health checks gating progression from one ring of deployment to the next. This entire process takes over a week to fully rollout a change across production and can at any time rapidly rollback to the last well-known healthy state.

This system regularly catches potential failures in what we call our ‘early rings’ that are entirely internal to Microsoft and prevents their rollout to rings that would impact customer/production traffic.

Modern verification

To support the health checks that gate safe deployment and give our engineering team insight into the health of the systems, Azure AD emits a massive amount of internal telemetry, metrics, and signals used to monitor the health of our systems. At our scale, this is over 11 PetaBytes a week of signals that feed our automated health monitoring systems. Those systems in turn trigger alerting to automation as well as our team of 24/7/365 engineers that respond to any potential degradation in availability or Quality of Service (QoS).

Our journey here is expanding that telemetry to provide optics of not just the health of the services, but metrics that truly represent the end-to-end health of a given scenario for a given tenant. Our team is already alerting on these metrics internally and we’re evaluating how to expose this per-tenant health data directly to customers in the Azure Portal.

Partitioning and fine-grained fault domains

A good analogy to better understand Azure AD are the compartments in a submarine, designed to be able to flood without affecting either other compartments or the integrity of the entire vessel.

The equivalent for Azure AD is a fault domain, the scale units that serve a set of tenants in a fault domain are architected to be completely isolated from other fault domain’s scale units. These fault domains provide hard isolation of many classes of failures such that the ‘blast radius’ of a fault is contained in a given fault domain.

Azure AD up to now has consisted of five separate fault domains. Over the last year, and completed by next summer, this number will increase to 50 fault domains, and many services, including Azure Multi-Factor Authentication (MFA), are moving to become fully isolated in those same fault domains.

This hard-partitioning work is designed to be a final catch all that scopes any outage or failure to no more than 1/50 or ~2% of our users. Our objective is to increase this even further to hundreds of fault domains in the following year.

A preview of what’s to come

The principles above aim to harden the core Azure AD service. Given the critical nature of Azure AD, we’re not stopping there—future posts will cover new investments we’re making including rolling out in production a second and completely fault-decorrelated identity service that can provide seamless fallback authentication support in the event of a failure in the primary Azure AD service.

Think of this as the equivalent to a backup generator or uninterruptible power supply (UPS) system that can provide coverage and protection in the event the primary power grid is impacted. This system is completely transparent and seamless to end users and is now in production protecting a portion of our critical authentication flows for a set of M365 workloads. We’ll be rapidly expanding its applicability to cover more scenarios and workloads.

We look forward to sharing more on our Azure Active Directory Identity Blog, hearing your questions and topics of interest for future posts.

Share on FacebookShare on TwitterShare on Google+Share on LinkedinShare on Pinterest

Filed Under: CLOUD

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Archives

  • September 2025
  • August 2025
  • July 2025
  • June 2025
  • May 2025
  • April 2025
  • March 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023
  • March 2023
  • February 2023
  • January 2023
  • December 2022
  • November 2022
  • October 2022
  • September 2022
  • August 2022
  • July 2022
  • June 2022
  • May 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • May 2021
  • April 2021
  • September 2020
  • August 2020
  • July 2020
  • June 2020
  • May 2020
  • April 2020
  • March 2020
  • February 2020
  • January 2020
  • December 2019
  • November 2019
  • October 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • May 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018
  • August 2018
  • July 2018
  • June 2018
  • May 2018
  • April 2018
  • March 2018
  • February 2018
  • January 2018
  • December 2017
  • November 2017
  • October 2017
  • September 2017
  • August 2017
  • July 2017
  • May 2017
  • April 2017
  • March 2017
  • February 2017
  • January 2017
  • March 2016

Recent Posts

  • FabCon Vienna: Build data-rich agents on an enterprise-ready foundation
  • Agent Factory: Connecting agents, apps, and data with new open standards like MCP and A2A
  • Azure mandatory multifactor authentication: Phase 2 starting in October 2025
  • Microsoft Cost Management updates—July & August 2025
  • Protecting Azure Infrastructure from silicon to systems

Recent Comments

    Categories

    • Accounting
    • Accounting Software
    • BlockChain
    • Bookkeeping
    • CLOUD
    • Data Center
    • Financial Planning
    • IOT
    • Machine Learning & AI
    • SECURITY
    • Uncategorized
    • US Taxation

    Categories

    • Accounting (145)
    • Accounting Software (27)
    • BlockChain (18)
    • Bookkeeping (205)
    • CLOUD (1,321)
    • Data Center (214)
    • Financial Planning (345)
    • IOT (260)
    • Machine Learning & AI (41)
    • SECURITY (620)
    • Uncategorized (1,284)
    • US Taxation (17)

    Subscribe Our Newsletter

     Subscribing I accept the privacy rules of this site

    Copyright © 2025 · News Pro Theme on Genesis Framework · WordPress · Log in