Building Xbox game streaming with Site Reliability best practices

Last month, we started sharing the DevOps journey at Microsoft through the stories of several teams and how they approach DevOps adoption. As the next story in this series, we want to share the transition one team made from a classic operations role to a Site Reliability Engineering (SRE) role: the story of the Xbox Reliability Engineering and Operations (xREO) team.

This transition was not easy and came out of necessity when Microsoft decided to bring Xbox games to gamers wherever they are through cloud game streaming (project xCloud). To deliver cutting-edge technology with a top-notch customer experience, the team had to redefine the way it worked: improving collaboration with the development team, investing in automation, and getting involved in the early stages of the application lifecycle. In this blog, we’ll review some of the key learnings the team collected along the way. To explore the full story of the team, see the journey of the xREO team.

Consistent gameplay requirements and the need to collaborate

A consistent experience is crucial to a successful game streaming session. For gamers to enjoy a game streamed from the cloud, it has to feel like it is running on a nearby console. This means creating a globally distributed cloud solution that runs in many data centers, close to end users. Azure’s global infrastructure makes this possible, but operating a system that runs across so many Azure regions is a serious challenge.

The Xbox developers who had started architecting and building this technology understood that they could not just build this system and “throw it over the wall” to operations. Both teams had to come together and collaborate through the entire application lifecycle so that the system could be designed from the start with its operation in a production environment in mind.

Architecting a cloud solution with operations in mind

In many large organizations, it is common to see development and operations teams working in silos. Developers don’t always consider operations when planning and building a system, while operations teams are not empowered to touch code even though they deploy it and operate it in production. With an SRE approach, system reliability is baked into the entire application lifecycle, and the team that operates the system in production is a valued contributor in the planning phase. For xCloud, involving the xREO team in the design phase created a collaborative environment in which both teams made joint technology choices and architected a system that could be operated at the scale required.

Leveraging containers to clearly define ownership

One of the first technological decisions the development and xREO teams made together was to implement a microservices architecture using container technologies. This allowed the development teams to containerize the .NET Core microservices they would own and remove their dependency on the cloud infrastructure running the containers, which was to be owned by the xREO team.

Another technological decision both teams made early on was to use Kubernetes as the underlying container orchestration platform. This allowed the xREO team to leverage Azure Kubernetes Service (AKS), a managed Kubernetes cloud platform that simplifies the deployment of Kubernetes clusters, removing a lot of the operational complexity the team would otherwise face running multiple clusters across several Azure regions. These joint choices made ownership clear: the developers are responsible for everything inside the containers, and the xREO team is responsible for the AKS clusters and the other Azure services that make up the cloud infrastructure hosting these containers. Each team owns the deployment, monitoring, and operation of its respective piece in production.

This kind of approach creates clear accountability and allows for easier incident management in production, something that can be very challenging in a monolithic architecture where infrastructure and application logic have code dependencies and are hard to untangle when things go sideways.

Scaling through infrastructure automation

Another best practice the xREO team invested in was infrastructure automation. Deploying multiple cloud services manually in each Azure region was not scalable and would take too much time. Using a practice known as “infrastructure as code” (IaC), the team used Azure Resource Manager templates to create declarative definitions of cloud environments that allow deployments to multiple Azure regions with minimal effort.
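
To make the idea concrete, here is a minimal sketch of the pattern: one declarative template shared by every region, plus a small parameter file per region. It is an illustration only, not the xREO team's actual templates. The function names, the "streaming" name prefix, the region list, the apiVersion, and the cluster sizing are all assumptions made for the example.

```python
import json

# Schema URLs used by Azure Resource Manager template and parameter files.
TEMPLATE_SCHEMA = "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#"
PARAMETERS_SCHEMA = "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#"


def build_template():
    """Return a single declarative template that every region shares."""
    return {
        "$schema": TEMPLATE_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {
            "location": {"type": "string"},
            "clusterName": {"type": "string"},
        },
        "resources": [
            {
                # Illustrative AKS cluster; apiVersion and sizing are assumptions.
                "type": "Microsoft.ContainerService/managedClusters",
                "apiVersion": "2023-05-01",
                "name": "[parameters('clusterName')]",
                "location": "[parameters('location')]",
                "identity": {"type": "SystemAssigned"},
                "properties": {
                    "dnsPrefix": "[parameters('clusterName')]",
                    "agentPoolProfiles": [
                        {
                            "name": "agentpool",
                            "count": 3,
                            "vmSize": "Standard_DS2_v2",
                            "mode": "System",
                        }
                    ],
                },
            }
        ],
    }


def build_parameters(region, name_prefix="streaming"):
    """Return the per-region values; only these differ between regions."""
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {
            "location": {"value": region},
            "clusterName": {"value": f"{name_prefix}-{region}"},
        },
    }


def render_parameter_files(regions):
    """Map a hypothetical file name to the JSON body for each region."""
    return {
        f"parameters.{region}.json": json.dumps(build_parameters(region), indent=2)
        for region in regions
    }


if __name__ == "__main__":
    # Hypothetical region list; adding a region is a one-line change here.
    regions = ["westus2", "northeurope", "southeastasia"]
    print(json.dumps(build_template(), indent=2))
    for filename, body in render_parameter_files(regions).items():
        print(filename)
        print(body)
```

Because the template itself never mentions a specific region, bringing a new region online in this sketch means adding one entry to the list rather than repeating a manual procedure.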

With infrastructure managed as code, it can also be deployed using continuous integration and continuous delivery (CI/CD), which brings further automation to deploying new Azure resources to existing data centers, updating infrastructure definitions, or bringing new Azure regions online when needed. Together, IaC and CI/CD allowed the team to remain lean, avoid repetitive, mundane work, and remove most of the risk of human error that comes with manual steps. Instead of spending time on manual work and checklists, the team can focus on further improving the platform and its resilience.
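
As a rough sketch of what such a pipeline step might look like, the snippet below works out which regions still need to be brought online and builds one Azure CLI deployment command per region. It only builds and prints the commands; it does not run them. The file names, the resource group naming scheme, and the region lists are hypothetical, and the resource groups are assumed to exist already. This is not a description of the team's actual pipeline.

```python
import shlex


def pending_regions(desired, deployed):
    """Return regions that are desired but not yet deployed, keeping order."""
    deployed_set = set(deployed)
    return [region for region in desired if region not in deployed_set]


def deployment_commands(regions, template_file="template.json",
                        group_prefix="streaming-rg"):
    """Build one `az deployment group create` command per region.

    The resource group and parameter file naming is a made-up convention.
    """
    commands = []
    for region in regions:
        commands.append([
            "az", "deployment", "group", "create",
            "--resource-group", f"{group_prefix}-{region}",
            "--template-file", template_file,
            "--parameters", f"@parameters.{region}.json",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical state: the pipeline compares what is wanted with what exists.
    desired = ["westus2", "northeurope", "southeastasia"]
    deployed = ["westus2"]
    for command in deployment_commands(pending_regions(desired, deployed)):
        # A real pipeline step would execute the command instead of printing it.
        print(shlex.join(command))
```

Keeping the command construction in a plain function makes it easy to review and test in the same pipeline that deploys the infrastructure.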

Site Reliability Engineering in action

The journey of the xREO team started with a need to bring the best customer experience to gamers. It is a great example of how teams that want to delight customers with new experiences through cutting-edge innovation must evolve the way they design, build, and operate software. Shifting its approach to operations and collaborating more closely with the development teams was the true transformation the xREO team underwent.

With this new mindset in place, the team is now well positioned to continue building more resilience, scale the system further, and, by doing so, deliver on the promise of cloud game streaming for every gamer.

Resources

The full story of the xREO team
Additional stories: The DevOps journey at Microsoft
Microsoft Game Stack