
Microsoft makes it easier to build popular language representation model BERT at large scale

July 17, 2019

This post is co-authored by Rangan Majumder, Group Program Manager, Bing and Maxim Lukiyanov, Principal Program Manager, Azure Machine Learning.

Today we are announcing the open sourcing of our recipe to pre-train BERT (Bidirectional Encoder Representations from Transformers) built by the Bing team, including code that works on Azure Machine Learning, so that customers can unlock the power of training custom versions of BERT-large models using their own data. This will enable developers and data scientists to build their own general-purpose language representation beyond BERT.

The field of natural language processing has seen an incredible amount of innovation over the past few years, one of the most recent advances being BERT. BERT, a language representation model created by Google AI Language research, made significant advances in capturing the intricacies of language and improved the state of the art for many natural language applications, such as text classification, extraction, and question answering. This new language representation lets developers and data scientists use BERT as a stepping-stone to solve specialized language tasks and get much better results than when building natural language processing systems from scratch.

The broad applicability of BERT means that most developers and data scientists can use a pre-trained variant of BERT rather than building a new version from the ground up with new data. While this is a reasonable approach when the domain's data is similar to the original model's training data, it will not deliver best-in-class accuracy when crossing over to a new problem space. For example, analyzing medical notes requires a deep understanding of the medical domain, providing career recommendations depends on insights from a large corpus of text about jobs and candidates, and legal document processing requires training on legal domain data. In these cases, to maximize the accuracy of the natural language processing (NLP) algorithms one needs to go beyond fine-tuning and pre-train the BERT model itself.
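To make the distinction concrete, the sketch below contrasts plain fine-tuning with additional in-domain pre-training. It is an illustrative example only, written against the Hugging Face transformers API rather than the scripts in this repo, and the model names and example sentences are placeholders.

```python
# Minimal sketch (illustrative only, not the repo's code) contrasting plain
# fine-tuning with additional in-domain pre-training, using the Hugging Face
# `transformers` API. Model names and example text are placeholders.
import torch
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          BertForMaskedLM, DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# 1) Fine-tuning: start from a general-purpose checkpoint and train a task head.
#    Works well when the target domain resembles Wikipedia/BookCorpus text.
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
enc = tokenizer(["patient reports acute dyspnea"], return_tensors="pt")
task_loss = clf(**enc, labels=torch.tensor([1])).loss

# 2) Domain pre-training: continue the self-supervised masked-LM objective on
#    in-domain text (medical notes, legal documents, job postings, ...) before
#    fine-tuning, so the representation itself adapts to the domain.
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer("the agreement is governed by delaware law")])
pretrain_loss = mlm(**batch).loss  # needs only raw domain text, no task labels
```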

Additionally, to advance language representation beyond BERT's accuracy, users need to change the model architecture, training data, cost function, tasks, and optimization routines. All of these changes need to be explored at large parameter and training-data scales: BERT-large has 340 million parameters and was trained on roughly 2.5 billion words from Wikipedia and 800 million words from BookCorpus. To train models of this size on Graphics Processing Units (GPUs), the most common hardware for training deep learning-based NLP models, machine learning engineers need distributed training support. However, distributed environments are complex and fragile to configure, and even with expert tuning the resulting models can end up with inferior quality.
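For a sense of what that distributed setup involves, the sketch below shows the minimal PyTorch DistributedDataParallel boilerplate for one-process-per-GPU training. It is a generic illustration rather than the recipe's actual launcher; the tiny linear model and random tensors stand in for BERT and its pre-training corpus.

```python
# Generic sketch of multi-GPU data-parallel training in PyTorch; the open-
# sourced recipe wraps this (plus gradient accumulation, mixed precision, and
# multi-node launch) for you. The model and data below are stand-ins for BERT.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU; a launcher such as torchrun sets RANK/LOCAL_RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-ins for the real BERT model and pre-training corpus.
    model = torch.nn.Linear(128, 2).cuda(local_rank)
    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))

    # DDP all-reduces gradients across GPUs; the sampler gives each rank its own shard.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```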

To address these issues, Microsoft is open sourcing a first-of-its-kind, end-to-end recipe for training custom versions of BERT-large models on Azure. The recipe is stable and predictable, converging to a good optimum, and gives developers and data scientists a solid starting point for their own explorations.

“Fine-tuning BERT was really helpful to improve the quality of various tasks important for Bing search relevance,” says Rangan Majumder, Group Program Manager at Bing, who led the open sourcing of this work.  “But there were some tasks where the underlying data was different from the original corpus BERT was pre-trained on, and we wanted to experiment with modifying the tasks and model architecture.  In order to enable these explorations, our team of scientists and researchers worked hard to solve how to pre-train BERT on GPUs. We could then build improved representations leading to significantly better accuracy on our internal tasks over BERT.  We are excited to open source the work we did at Bing to empower the community to replicate our experiences and extend it in new directions that meet their needs.”

“To get the training to converge to the same quality as the original BERT release on GPUs was non-trivial,” says Saurabh Tiwary, Applied Science Manager at Bing.  “To pre-train BERT we need massive computation and memory, which means we had to distribute the computation across multiple GPUs. However, doing that in a cost effective and efficient way with predictable behaviors in terms of convergence and quality of the final resulting model was quite challenging. We’re releasing the work that we did to simplify the distributed training process so others can benefit from our efforts.”

Results

To test the code, we trained a BERT-large model on a standard dataset and reproduced the results of the original paper on a set of GLUE tasks, as shown in Table 1. To give you an estimate of the compute required: in our case we ran training on an Azure Machine Learning cluster of eight ND40_v2 nodes (64 NVIDIA V100 GPUs in total) for 6 days to reach the accuracy listed in the table. The numbers you see will vary based on your dataset and your choice of pre-trained BERT checkpoint used for the downstream tasks.


Table 1. GLUE test results, evaluated by the provided test script on the GLUE development set. The "Average" column is the simple average over the table results. F1 scores are reported for QQP and MRPC, Spearman correlations for STS-B, and accuracy for the other tasks. Results for tasks with smaller datasets vary significantly and may require multiple fine-tuning runs to reproduce.

The code is available in open source on the Azure Machine Learning BERT GitHub repo. Included in the repo are:

  • A PyTorch implementation of the BERT model from the Hugging Face repo.
  • The raw and pre-processed English Wikipedia dataset.
  • Data preparation scripts.
  • Implementation of optimization techniques such as gradient accumulation and mixed precision (see the sketch after this list).
  • An Azure Machine Learning service Jupyter notebook to launch pre-training of the model.
  • A set of pre-trained models that can be used in fine-tuning experiments.
  • Example code with a notebook to perform fine-tuning experiments.
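As a rough illustration of the two optimization techniques named above, the sketch below combines gradient accumulation (to simulate a large effective batch size) with automatic mixed precision via PyTorch's torch.cuda.amp. The repo ships its own implementation of these techniques, so treat this as a generic example rather than the recipe's code; the tiny model and random data are placeholders for BERT and its corpus.

```python
# Generic sketch of gradient accumulation + mixed precision in PyTorch
# (torch.cuda.amp); the repo ships its own implementation of these techniques.
import torch

model = torch.nn.Linear(128, 2).cuda()    # placeholder for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid fp16 underflow
accumulation_steps = 4                    # effective batch = 4 x per-step batch

for step in range(100):
    x = torch.randn(8, 128, device="cuda")          # placeholder micro-batch
    y = torch.randint(0, 2, (8,), device="cuda")

    with torch.cuda.amp.autocast():       # run the forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(x), y)

    # Accumulate gradients over several micro-batches before each optimizer step.
    scaler.scale(loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)            # unscales gradients, then steps
        scaler.update()
        optimizer.zero_grad()
```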

With a simple “Run All” command, developers and data scientists can train their own BERT model using the provided Jupyter notebook in Azure Machine Learning service. The code, data, scripts, and tooling can also run in any other training environment.
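Outside the notebook, submitting such a run to Azure Machine Learning typically looks something like the sketch below, using the azureml SDK for illustration. The workspace configuration, compute target name, source folder, and script name are placeholders rather than the repo's actual values; the provided notebook handles the multi-node distributed configuration for you.

```python
# Illustrative sketch of submitting a training script to Azure Machine Learning
# with the azureml SDK; workspace config, compute target name, folder, and
# script name are placeholders, not the repo's actual values.
from azureml.core import Workspace, Experiment
from azureml.train.dnn import PyTorch

ws = Workspace.from_config()                         # reads config.json for the workspace
compute_target = ws.compute_targets["gpu-cluster"]   # an existing GPU cluster (placeholder name)

estimator = PyTorch(
    source_directory="./pretrain",                   # folder with the training script (placeholder)
    entry_script="train.py",                         # placeholder script name
    compute_target=compute_target,
    use_gpu=True,
)

run = Experiment(ws, name="bert-pretraining").submit(estimator)
run.wait_for_completion(show_output=True)
```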

Summary

We could not have achieved these results without leveraging the amazing work of the researchers before us, and we hope that the community can take our work and go even further. If you have any questions or feedback, please head over to our GitHub repo and let us know how we can make it better.

Learn how Azure Machine Learning can help you streamline the building, training, and deployment of machine learning models. Start free today.
