Bookkeeping Service Providers

  • Accounting
  • Bookkeeping
  • US Taxation
  • Financial Planning
  • Accounting Software
  • Small Business Finance
You are here: Home / CLOUD / Advancing Azure Virtual Machine availability monitoring with Project Flash

Advancing Azure Virtual Machine availability monitoring with Project Flash

February 14, 2022 by cbn Leave a Comment

“As we head into the fourth calendar year of the Advancing Reliability blog series, empowering organizations to run their workloads reliably on Azure remains one of our top priorities. We continually invest in evolving the Azure platform to help achieve this on a daily basis. Your ability to monitor virtual machine (VM) availability in a robust and comprehensive way is paramount to ensuring that your applications are available and resilient. For today’s post in the series, I have asked Program Manager, Pujitha Desiraju, from our Azure Core Platform Fundamentals Engineering team to talk about the latest observability enhancements for VM availability monitoring, as well as planned investments to deliver the best monitoring experience.”—Mark Russinovich, CTO, Azure


 

This post was co-authored by Principal Software Engineering Manager, Gaurav Jagtiani.

Flash, as the project is internally known, is a collection of efforts across Azure Engineering, that aims to evolve Azure’s virtual machine (VM) availability monitoring ecosystem into a centralized, holistic, and intelligible solution customers can rely on to meet their specific observability needs. Today, we’re excited to announce the completion of the project’s first two milestones—the preview of VM availability data in Azure Resource Graph, and the private preview of a VM availability metric in Azure Monitor.

What is Project Flash?

Project Flash derives its name from our commitment to building robust and rapid ways to monitor virtual machine (VM) availability as comprehensively as possible—a key prerequisite for efficient application performance. It’s our mission to ensure you can:

  • Consume accurate and actionable data on VM availability disruptions (for example, VM reboots and restarts, application freezes due to network driver updates, and 30-second host OS updates), along with precise failure details (for example, platform versus user-initiated, reboot versus freeze, planned versus unplanned).
  • Analyze and alert on trends in VM availability for quick debugging and month-over-month reporting.
  • Periodically monitor data at scale and build custom dashboards to stay updated on the latest availability states of all resources.
  • Receive automated root cause analyses (RCAs) detailing impacted VMs, downtime cause and duration, consequent fixes, and similar—all to enable targeted investigations and post-mortem analyses.
  • Receive instantaneous notifications on critical changes in VM availability to quickly trigger remediation actions and prevent end-user impact.
  • Dynamically tailor and automate platform recovery policies, based on ever-changing workload sensitivities and failover needs.

With these goals in mind, we’ve divided our execution strategy into two phases—a near-term phase to meet critical current needs, and a long-term phase to deliver the best VM availability monitoring experience. This two-phased approach helps us continually bridge gaps, iterate on service quality, and learn from your feedback at every step along the way.

Announcing new monitoring options

For the first phase, we are providing different options to enable convenient access to VM availability data to address a range of observability needs. We aim to maintain data consistency with similar rigorous quality standards across all of these existing features and solutions, like Resource Health or Activity Log, to deliver a consistent view agnostic of the solution you choose.

Introducing at-scale analysis for VM availability

Today, we’re excited to reach our first Project Flash milestone—with the preview release of VM availability states in Azure Resource Graph for at-scale programmatic consumption.

Azure Resource Graph is a service in Azure that is extensively adopted for its efficient ability to query across many subscriptions, all at once and at low latencies. We’re currently emitting VM availability states (Available, Unavailable, and Unknown) to the Health Resources table in Azure Resource Graph, so you can perform complex Kusto Query Language (KQL) queries for sieving through large datasets at once. This functionality is handy for tracking historical changes in VM availability, for building custom dashboards, and for performing detailed investigations across numerous resource properties spread across multiple tables.

Azure Resource Graph Explorer Window with query and results, to demonstrate fetching data from the HealthResources table.

Figure 1: Azure Resource Graph Explorer Window with query and results, to demonstrate fetching data from the HealthResources table.

    We are planning to add failure details and degraded VM scenarios to the Health Resources table in Azure Resource Graph, later this year. These details will ensure you are properly informed on the cause and impact of any failures—so you can either failover, reboot in place, or take the appropriate mitigations to prevent end-user impact.

    Navigate to Azure Resource Graph Explorer on the Azure portal to get started with any of the KQL queries published for the Health Resources table.

    Introducing VM availability metric in Azure Monitor

    We’re also pleased to announce the private preview of an out-of-box VM availability metric in Azure Monitor, for a curated metric alerting and monitoring experience.

    Metrics in Azure Monitor are great for monitoring and analyzing time series representations of VM availability for quick and easy debugging, receiving scoped alerts on concerning trends, catching early indicators of degraded availability, correlating with other platform metrics, and more.

    The metric allows you to track the pulse of your VMs—during expected behavior, the metric displays a value of 1. In response to any VM availability disruptions, the metric dips to a 0 for the duration of impact. In case of an Azure infrastructure outage, we will emit nulls represented as a dotted line on the portal.

    Screenshot of VM availability metric as seen on Metrics Explorer in the Azure portal, with occasional dips to reflect VM availability disruptions.

    Figure 2: Screenshot of VM availability metric as seen on Metrics Explorer in the Azure portal, with occasional dips to reflect VM availability disruptions.

    We released the private preview of the metric as phase one of our rollout plan, and are currently collecting customer feedback, to further improve our offering. We are planning to add failure details such as metric dimensions and platform logs next year, to allow you to precisely alert on failure scenarios that are impactful.

    Coming soon

    The two monitoring options introduced above are just the beginning for Project Flash! We will continue to build upon our existing solutions by improving data quality and failure attribution. In parallel, we are designing two new monitoring offerings to meet your latency and mitigation needs, while also investing heavily in the underlying platform to make our fault detection more resilient and comprehensive.

    Azure Event Grid for instantaneous notifications

    Successfully running business-critical applications requires hyper-awareness of any VM availability impacting event, so remediation actions can be triggered instantaneously to prevent end-user impact. To support you in your daily operations, we are planning to design a notification mechanism that leverages the low-latency technology of Azure Event Grid. This will allow you to simply subscribe to an Event Grid system topic, and route scoped events via event handlers to any downstream tooling, instantaneously.

    Automate and tailor platform recovery policies

    Considering the numerous ongoing investments to improve your VM availability monitoring experience, Project Flash intends to empower you even further by providing you knobs to customize recovery policies triggered by the platform, in response to cases of VM availability disruptions.

    One such knob we are designing is the ability to opt-out of Service Healing for single-instance VMs, in response to a specific set of unanticipated Availability disruptions. This knob will be made available via the portal or at the time of VM deployment and can be updated dynamically. Note that leveraging this feature will render the usual Azure Virtual Machine availability SLAs ineffective.

    In the future, we will explore introducing knobs to also opt-out of other applicable recovery policies (for example, Live Migration or Tardigrade), to ensure you can easily adapt to your ever-changing mitigation needs.

    Ongoing platform quality investments

    While the first phase is designed to meet your current observability needs, we remain focused on our long-term goal of delivering a world-class observability experience surrounding VM availability. We are extremely excited for all the data enrichments and technology advancements that will contribute to this experience, so here’s an early look at our roadmap of planned investments:

    1. Fault detection and attribution: We are continuously evolving our underlying infrastructure to detect and attribute failures both precisely and instantaneously—so that we can reduce unknown or missing health status reports, emit actionable failure details, and handle platform recovery customizations. This remains our top investment area on which we continue to iterate every cycle.
    2. Root cause analysis (RCA) automation: We are planning to implement easy tracking mechanisms for every unique VM downtime, along with automatic construction and emission of detailed downtime RCA statements to reduce manual tracking and churn on your end.
    3. AIOps integration: We are looking to leverage the tremendous advancements being made in AIOps across Microsoft, for enabling smart insights and anomaly detection and diagnosis across the multitude of data points on VM Availability.
    4. Centralized and cohesive user experience: We acknowledge that a consequence of our near-term approach is that across our different services we have multiple monitoring, alerting, and recovery tools which may lead to a confusing and disparate experience for you. This is a problem we intend to solve with our final phase. Our north star goal is to provide end-users access to distinct and necessary representations of VM availability, consolidated within Azure Monitor, and categorized according to common usage patterns for discoverability, ease of use and intuitive onboarding.

    Learn more

    This list is certainly not exhaustive as we have multiple enrichments planned as part of our long-term strategy. To reiterate, our intention with Project Flash is to make VM availability monitoring extremely intuitive, comprehensive, and seamless—so you are always prepared for and informed about any changes in the health of your workloads, ultimately to maintain your own SLAs and business promises.

    We will continue to share updates on Project Flash through blogs like this, to ensure you stay up to date on the latest. Stay tuned!

    Share on FacebookShare on TwitterShare on Google+Share on LinkedinShare on Pinterest

    Filed Under: CLOUD

    Leave a Reply Cancel reply

    Your email address will not be published. Required fields are marked *

    Archives

    • September 2025
    • August 2025
    • July 2025
    • June 2025
    • May 2025
    • April 2025
    • March 2025
    • February 2025
    • January 2025
    • December 2024
    • November 2024
    • October 2024
    • July 2024
    • June 2024
    • May 2024
    • April 2024
    • March 2024
    • February 2024
    • January 2024
    • December 2023
    • October 2023
    • September 2023
    • August 2023
    • July 2023
    • June 2023
    • May 2023
    • April 2023
    • March 2023
    • February 2023
    • January 2023
    • December 2022
    • November 2022
    • October 2022
    • September 2022
    • August 2022
    • July 2022
    • June 2022
    • May 2022
    • April 2022
    • March 2022
    • February 2022
    • January 2022
    • December 2021
    • November 2021
    • October 2021
    • September 2021
    • August 2021
    • May 2021
    • April 2021
    • September 2020
    • August 2020
    • July 2020
    • June 2020
    • May 2020
    • April 2020
    • March 2020
    • February 2020
    • January 2020
    • December 2019
    • November 2019
    • October 2019
    • September 2019
    • August 2019
    • July 2019
    • June 2019
    • May 2019
    • April 2019
    • March 2019
    • February 2019
    • January 2019
    • December 2018
    • November 2018
    • October 2018
    • September 2018
    • August 2018
    • July 2018
    • June 2018
    • May 2018
    • April 2018
    • March 2018
    • February 2018
    • January 2018
    • December 2017
    • November 2017
    • October 2017
    • September 2017
    • August 2017
    • July 2017
    • May 2017
    • April 2017
    • March 2017
    • February 2017
    • January 2017
    • March 2016

    Recent Posts

    • How Azure Cobalt 100 VMs are powering real-world solutions, delivering performance and efficiency results
    • FabCon Vienna: Build data-rich agents on an enterprise-ready foundation
    • Agent Factory: Connecting agents, apps, and data with new open standards like MCP and A2A
    • Azure mandatory multifactor authentication: Phase 2 starting in October 2025
    • Microsoft Cost Management updates—July & August 2025

    Recent Comments

      Categories

      • Accounting
      • Accounting Software
      • BlockChain
      • Bookkeeping
      • CLOUD
      • Data Center
      • Financial Planning
      • IOT
      • Machine Learning & AI
      • SECURITY
      • Uncategorized
      • US Taxation

      Categories

      • Accounting (145)
      • Accounting Software (27)
      • BlockChain (18)
      • Bookkeeping (205)
      • CLOUD (1,322)
      • Data Center (214)
      • Financial Planning (345)
      • IOT (260)
      • Machine Learning & AI (41)
      • SECURITY (620)
      • Uncategorized (1,284)
      • US Taxation (17)

      Subscribe Our Newsletter

       Subscribing I accept the privacy rules of this site

      Copyright © 2025 · News Pro Theme on Genesis Framework · WordPress · Log in