Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed

July 26, 2022 by cbn

This blog was written in collaboration with the DeepSpeed team, the Azure ML team, and the Azure HPC team at Microsoft.

Large-scale transformer-based deep learning models trained on large amounts of data have shown great results in recent years in several cognitive tasks and are behind new products and features that augment human capabilities. These models have grown by several orders of magnitude over the last five years, from the few million parameters of the original transformer model all the way to the latest 530 billion-parameter Megatron-Turing (MT-NLG 530B) model, as shown in Figure 1. There is a growing need for customers to train and fine-tune large models at an unprecedented scale.

Hardware is unable to match the 200+ times growth in AI models. DeepSpeed enables scaling AI training across thousands of nodes to achieve a 4000+ times speedup.

Figure 1: Landscape of large models and hardware capabilities.

Azure Machine Learning (AzureML) brings large fleets of the latest GPUs powered by the InfiniBand interconnect to tackle large-scale AI training. We already train some of the largest models including Megatron/Turing and GPT-3 on Azure. Previously, to train these models, users needed to set up and maintain a complex distributed training infrastructure that usually required several manual and error-prone steps. This led to a subpar experience both in terms of usability and performance.

Today, we are proud to announce a breakthrough in our software stack, using DeepSpeed and 1024 A100s to scale the training of a 2T parameter model with a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML (including a fully optimized PyTorch environment) that offers great performance and an easy-to-use interface for large-scale training.

Customers can now use DeepSpeed on Azure with simple-to-use training pipelines that utilize either the recommended AzureML recipes or bash scripts for VMSS-based environments. As shown in Figure 2, Microsoft is taking a full-stack optimization approach in which all the necessary pieces, including the hardware, the OS, the VM image, the Docker image (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing Azure ML APIs, have been optimized, integrated, and well-tested for excellent performance and scalability without unnecessary complexity.

Stack diagram of different layers in Azure AI software.

Figure 2: Microsoft full-stack optimizations for scalable distributed training on Azure.

This optimized stack enabled us to efficiently scale training of large models using DeepSpeed on Azure. We are happy to share our performance results supporting 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute throughput/GPU (150 TFLOPs vs. 81 TFLOPs) compared to those published on other cloud providers.
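The comparison above reduces to three simple ratios. The short sketch below recomputes them from the figures quoted in the paragraph; the dictionary names and keys are illustrative only and do not come from any published script.

```python
# Figures quoted in the paragraph above: this work vs. results
# published on other cloud providers.
azure = {"params_trillions": 2.0, "gpus": 1024, "tflops_per_gpu": 150.0}
other = {"params_trillions": 1.0, "gpus": 512, "tflops_per_gpu": 81.0}

# Ratio of each Azure figure to the corresponding published figure.
ratios = {key: azure[key] / other[key] for key in azure}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

The throughput ratio works out to roughly 1.85, which the text reports as "up to 1.8x".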

We offer near-linear scalability both as the model size increases and as the number of GPUs increases. As shown in Figure 3a, by combining DeepSpeed ZeRO-3, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we were able to maintain an efficient, nearly constant throughput/GPU (>157 TFLOPs) as the model size increased from 175 billion parameters to 2 trillion parameters. Likewise, for a given model size, for example 175B, we achieve near-linear scaling as we increase the number of GPUs from 128 all the way to 1024, as shown in Figure 3b. The key takeaway from the results presented in this blog is that Azure and DeepSpeed together are breaking the GPU memory wall and enabling our customers to easily and efficiently train trillion-parameter models at scale.
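For readers unfamiliar with how ZeRO-3 and CPU offloading are switched on, the following is a minimal sketch of a DeepSpeed configuration that enables ZeRO stage 3 with both optimizer-state and parameter offloading to CPU memory. The values are illustrative assumptions, not the settings used to produce the results in this post; the micro-batch size of 8 simply mirrors the BS/GPU reported for Figure 3a.

```python
import json

# Illustrative DeepSpeed configuration (assumed values, not the
# configuration used for the results reported in this post).
ds_config = {
    # Per-GPU micro-batch size; 8 mirrors BS/GPU in Figure 3a.
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    # Mixed-precision training.
    "fp16": {"enabled": True},
    "zero_optimization": {
        # Stage 3 partitions optimizer states, gradients, and parameters
        # across the data-parallel GPUs.
        "stage": 3,
        # Keep optimizer states in CPU memory instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep model parameters in CPU memory when they are not in use.
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to JSON, the format in which DeepSpeed configs are usually stored.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A dictionary like this can be written to a JSON file and supplied to a DeepSpeed-enabled training script; the tuned settings behind the reported numbers are in the recipes referenced under "Learn more".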

(a) Throughput/GPU measured at ~157 TFLOPs across model sizes from 175 billion to 2 trillion parameters, exhibiting near-perfect scaling.

(b) Training throughput scales linearly with the number of GPUs, exhibiting near-perfect scaling efficiency on 1K GPUs.

Figure 3: (a) Near-perfect throughput/GPU as we increase the model size from 175 billion to 2 trillion parameters (BS/GPU=8), (b) Near-perfect performance scaling with the increase in number of GPU devices for the 175B model (BS/GPU=16). The sequence length is 1024 for both cases.
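To give a sense of what the Figure 3b settings mean per training step, the sketch below computes the number of tokens processed per step at the two GPU counts quoted in the text. It assumes that every GPU acts as its own data-parallel rank and that no gradient accumulation is used; neither assumption is stated in this post, and the function name is hypothetical.

```python
SEQ_LEN = 1024            # sequence length stated in the Figure 3 caption
MICRO_BATCH_PER_GPU = 16  # BS/GPU stated for Figure 3b


def tokens_per_step(num_gpus, micro_batch=MICRO_BATCH_PER_GPU,
                    seq_len=SEQ_LEN, grad_accum_steps=1):
    """Tokens consumed in one optimizer step.

    Assumes pure data parallelism (one data-parallel rank per GPU),
    which is an assumption of this sketch, not a detail from the post.
    """
    return num_gpus * micro_batch * grad_accum_steps * seq_len


# The two endpoints of the GPU range quoted for Figure 3b.
for n in (128, 1024):
    print(f"{n} GPUs: {tokens_per_step(n):,} tokens per step")
```

Under these assumptions the global batch grows in proportion to the GPU count, which is the weak-scaling setting in which per-GPU throughput is expected to stay roughly flat.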

Learn more

To learn more about the optimizations, technologies, and detailed performance trends presented above, please refer to our extended technical blog.

  • Learn more about DeepSpeed, which is part of Microsoft’s AI at Scale initiative.
  • Learn more about Azure HPC + AI.
  • To get started with DeepSpeed on Azure, please follow our getting started tutorial.
  • The results presented in this blog were produced on Azure by following the recipes and scripts published as part of the Megatron-DeepSpeed repository. The recommended and easiest way to run the training experiments is to use the AzureML recipe.
  • If you are running experiments on a custom environment built using Azure VMs or VMSS, please refer to the bash scripts we provide in Megatron-DeepSpeed.