Office Licensing Service and Azure Cosmos DB part 1: Migrating the production workload

This post is part 1 of a two-part series about how organizations use Azure Cosmos DB to meet real world needs, and the difference it’s making to them. In part 1, we explore the challenges that led the Microsoft Office Licensing Service team to move from Azure Table storage to Azure Cosmos DB, and how it migrated its production workload to the new service. In part 2, we examine the outcomes resulting from the team’s efforts.

The challenge: Limited throughput and other capabilities

At Microsoft, the Office Licensing Service (OLS) supports activation of the Microsoft Office client on millions of devices around the world—including Windows, Mac, tablets, and mobile. It stores information such as machine ID, product ID, activation count, expiration date, and more. OLS is accessed by the Office client more than 240 million times per day, with the first call coming from the client upon license activation and then every 2-3 days thereafter as the client checks to make sure the license is still valid.
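
As a purely illustrative sketch rather than the actual OLS schema, a license record of this kind could be written as a table entity with the azure-data-tables Python SDK; the connection string, table name, key choices, and field names below are hypothetical:

```python
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

# Hypothetical connection string and table name; not the OLS production values.
service = TableServiceClient.from_connection_string("<azure-table-storage-connection-string>")
table = service.create_table_if_not_exists("consumerlicenses")

# A license record keyed by machine and product, with the kinds of fields described above.
table.upsert_entity({
    "PartitionKey": "machine-7f3a9c",                              # machine ID (placeholder)
    "RowKey": "product-o365-personal",                             # product ID (placeholder)
    "ActivationCount": 3,
    "ExpirationDate": datetime(2020, 6, 30, tzinfo=timezone.utc),
    "LicenseCategory": "consumer",
})
```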

Until recently, OLS relied on Azure Table storage for its backend data store, which contained about 5 TB of data spread across 18 tables—with separate tables used for different license categories such as consumer, enterprise, and OEM pre-installation.

In early 2018, after years of continued workload growth, the OLS service began approaching the point where it would require more throughput than Table storage could deliver. If the issue wasn’t addressed, the inherent throughput limit of Table storage would begin to threaten overall service quality to the detriment of millions of users worldwide.

Danny Cheng, a software engineer at Microsoft who leads the OLS development team, explains:

“Each Table storage account has a fixed maximum throughput and doesn’t scale past that. By 2018, OLS was running low on available storage throughput, and, given that we were already maintaining each table in its own Table storage account, there was no way for us to get more throughput to serve more requests from our customers. We were being throttled during peak usage hours for the OLS service, so we had to find a more scalable storage backend soon.”

In looking for a long-term solution to its storage needs, the OLS team wanted more than just additional throughput. “We wanted the ability to deploy OLS in different regions around the world, as a means of minimizing latency by putting copies of the service closer to where our users are,” says Cheng. “But with Table storage, geo-replication capabilities are fairly limited.”

The OLS team also wanted better disaster recovery. With Table storage, they were storing all data in multiple regions within the United States. All reads and writes went to the primary region, and there were no SLAs in place for replication to the two backup regions, which could take up to 60 minutes. If the primary region became unavailable, human intervention would be required and data loss would be likely.

“If a region were to go down, it would be a real panic situation—with 30 to 60 minutes of downtime and a similar window for data loss,” says Cheng.

The solution: A lift-and-shift migration to Azure Cosmos DB

The OLS team chose to move to Azure Cosmos DB, which offered a lift-and-shift migration path from Table storage—making it easy to swap in a premium backend service with turnkey global distribution, low latency, virtually unlimited scalability, guaranteed high availability, and more.

“At first, when we realized we needed a new storage backend, it was intimidating in that we didn’t know how much new code would be needed,” says Cheng. “We looked at several storage options on Azure, and Azure Cosmos DB was the only one that met all our needs. And with its Table API, we wouldn’t even need to write much new code. In many ways, it was an ideal lift-and-shift—delivering the scalability we needed and lots of other benefits with little work.”

Design decisions

In preparing to deploy Azure Cosmos DB, the OLS team had to make a few basic design decisions:

Consistency level, which gave the team options for addressing the fundamental tradeoffs between read consistency and latency, availability, and throughput.

“We picked strong consistency because some of our business logic requires reading from storage immediately after writing to it,” explains Cheng.

Partition key, which dictates how items within an Azure Cosmos DB container are divided into logical partitions—and determines the ultimate scalability of the data store.

“With the Azure Cosmos DB Table API, partition keys naturally map to what we had in Table storage—so we were able to reuse the same partition key,” says Cheng.
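
As a rough sketch of how these two decisions show up in code (assuming the azure-data-tables Python SDK, not the team’s actual implementation), the same PartitionKey/RowKey pair carries over unchanged, and with the account’s default consistency set to Strong a read issued immediately after a write is guaranteed to observe it. All names and connection strings are placeholders:

```python
from azure.data.tables import TableClient

# Placeholder connection strings: the same data-access code runs against either backend.
TABLE_STORAGE_CONN = "<azure-table-storage-connection-string>"
COSMOS_TABLE_CONN = "<azure-cosmos-db-table-api-connection-string>"

def upsert_and_read_back(conn_str):
    table = TableClient.from_connection_string(conn_str, table_name="consumerlicenses")

    # Same partition key scheme as before: the PartitionKey/RowKey pair carries over unchanged.
    table.upsert_entity({
        "PartitionKey": "machine-7f3a9c",
        "RowKey": "product-o365-personal",
        "ActivationCount": 4,
    })

    # Business logic that reads immediately after writing relies on the account's Strong
    # default consistency to observe ActivationCount == 4, never a stale value.
    return table.get_entity(partition_key="machine-7f3a9c", row_key="product-o365-personal")

entity = upsert_and_read_back(COSMOS_TABLE_CONN)
assert entity["ActivationCount"] == 4
```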

Migration process

Although Azure Cosmos DB offered a data migration tool, its use at that time would have entailed some downtime for the OLS service, which wasn’t an option. (Note: Today you can do live migrations without downtime.) To address this, the OLS team built a data migration solution that consisted of three components:

A Data Migrator that moves current data from Table storage to Azure Cosmos DB.
A Dual Writer that writes new database changes to both Table storage and Azure Cosmos DB.
A Consistency Checker that catches any mismatches between Table storage and Azure Cosmos DB.

The Data Migrator component is based on the data migration tool that the Azure Cosmos DB team provides to Microsoft customers.

“To solve the downtime problem, we added Dual Writer and Consistency Checker components, which run on the same production servers as the OLS service itself,” explains Cheng.
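
The following is a minimal sketch of that dual-write-and-verify pattern, assuming the azure-data-tables Python SDK; it is not the OLS team’s code, and every name and connection string is a placeholder:

```python
from azure.data.tables import TableClient

# Placeholder clients for the legacy and new backends.
legacy = TableClient.from_connection_string(
    "<azure-table-storage-connection-string>", table_name="consumerlicenses")
cosmos = TableClient.from_connection_string(
    "<azure-cosmos-db-table-api-connection-string>", table_name="consumerlicenses")

def dual_write(entity):
    """Dual Writer: apply every new change to both backends during migration."""
    legacy.upsert_entity(entity)
    cosmos.upsert_entity(entity)

def is_consistent(partition_key, row_key, fields):
    """Consistency Checker: flag any mismatch between the two backends."""
    old = legacy.get_entity(partition_key=partition_key, row_key=row_key)
    new = cosmos.get_entity(partition_key=partition_key, row_key=row_key)
    return all(old.get(field) == new.get(field) for field in fields)

dual_write({"PartitionKey": "machine-7f3a9c", "RowKey": "product-o365-personal",
            "ActivationCount": 5})
assert is_consistent("machine-7f3a9c", "product-o365-personal", ["ActivationCount"])
```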

The OLS team completed the migration process in late 2019. Today, Azure Cosmos DB is deployed to the same three regions as Table storage, a choice the team made to mimic the Table storage topology as closely as possible during the migration. As before, North Central US is the primary (read/write) region while the other two regions are currently read-only. The Azure Cosmos DB environment has 18 tables containing 5 TB of data and consumes about 1 million request units per second (RU/s), the units used to reserve guaranteed database throughput in Azure Cosmos DB.
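
For illustration only, a comparable three-region Table API account with strong consistency could be provisioned with the azure-mgmt-cosmosdb Python SDK along the following lines. This is a sketch rather than the team’s deployment; the subscription, resource group, account name, and both secondary region names are placeholders, and only North Central US as the primary comes from the scenario above:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import (
    Capability,
    ConsistencyPolicy,
    DatabaseAccountCreateUpdateParameters,
    Location,
)

# All identifiers below are placeholders; the two secondary regions are invented.
client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

params = DatabaseAccountCreateUpdateParameters(
    location="northcentralus",
    locations=[
        Location(location_name="northcentralus", failover_priority=0),        # primary (read/write)
        Location(location_name="<secondary-region-1>", failover_priority=1),  # read-only replica
        Location(location_name="<secondary-region-2>", failover_priority=2),  # read-only replica
    ],
    capabilities=[Capability(name="EnableTable")],  # Table API account
    consistency_policy=ConsistencyPolicy(default_consistency_level="Strong"),
    database_account_offer_type="Standard",
)

client.database_accounts.begin_create_or_update(
    "<resource-group>", "<ols-account-name>", params
).result()
```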

Now that migration is complete, the team plans to turn on multi-master capabilities, which will write-enable all regions instead of just the primary one. Tying into this, the team also plans to scale out globally by replicating its backend store to additional regions around the world—as a means of improving latency from the perspective of the Office client by putting copies of the OLS data closer to where its users are.

In part 2 of this series, we examine the outcomes resulting from the team’s efforts to build its new Office Licensing Service on Azure Cosmos DB.

Get started with Azure Cosmos DB today

Visit Azure Cosmos DB.

See Introduction to Azure Cosmos DB Table API.

Source: Azure

Office Licensing Service and Azure Cosmos DB part 2: Improved performance and availability

This post is part 2 of a two-part series about how organizations use Azure Cosmos DB to meet real world needs, and the difference it’s making to them. In part 1, we explored the challenges that led the Microsoft Office Licensing Service team to move from Azure Table storage to Azure Cosmos DB, and how it migrated its production workload to the new service. In part 2, we examine the outcomes resulting from the team’s efforts.

Strong benefits with minimal effort

The Microsoft Office Licensing Service (OLS) team’s migration from Azure Table storage to Azure Cosmos DB was simple and straightforward, enabling the team to meet all its needs with minimal effort.

An easy migration

Thanks to the Azure Cosmos DB Table API, the OLS team was able to reuse most of its data access code in the move, and the migration engine it wrote to avoid any downtime was fast and easy to build.

Danny Cheng, a software engineer at Microsoft who leads the OLS development team, explains:

“The migration engine was the only real ‘new code’ we had to write. And the code samples for all three parts are publicly available, so it’s not like we had to start from scratch. All in all, the migration tooling we developed took three developers about four weeks each.”

Virtually unlimited throughput

Today, database throughput is no longer an issue for the OLS team. With Table storage, the team faced a throughput limit of 20,000 operations per second per storage account, which forced them to maintain each of their 18 tables in a different storage account to achieve maximum throughput. The team now maintains one Azure Cosmos DB account, which has no upper limit on throughput and can support more than 10 million operations per second per table—all dedicated and backed by SLAs.

Guaranteed high availability

Azure Cosmos DB gives the OLS team a 99.999 percent read availability SLA for all multi-region accounts. This has led to a significant increase in storage quality-of-service (QoS), as shown by metrics the team captured using internally developed tooling.

“During peak traffic hours, Azure Cosmos DB delivers much better storage QoS than we were seeing with Table storage,” says Cheng. “Today we’re seeing five nines, when in the past we were at about three nines.”

Automatic failover

The OLS team can now configure automatic or manual failovers to help protect against the unlikely event of a regional outage, with all SLAs maintained. The team can also prioritize failover order for its multi-region accounts and can manually trigger failover to test the end-to-end availability of OLS.
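
One way such a priority change and failover drill could be scripted is sketched below with the azure-mgmt-cosmosdb Python SDK; this is an illustration under assumptions, not the team’s tooling, and every identifier other than the northcentralus primary is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import FailoverPolicies, FailoverPolicy

# Placeholders throughout; region names other than northcentralus are invented.
client = CosmosDBManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Promoting a secondary region to failover priority 0 makes it the write region,
# which is one way to exercise a failover drill end to end.
client.database_accounts.begin_failover_priority_change(
    "<resource-group>",
    "<ols-account-name>",
    FailoverPolicies(failover_policies=[
        FailoverPolicy(location_name="<secondary-region-1>", failover_priority=0),
        FailoverPolicy(location_name="northcentralus", failover_priority=1),
        FailoverPolicy(location_name="<secondary-region-2>", failover_priority=2),
    ]),
).result()
```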

“We’ve configured automatic failover, but the service is so reliable that we haven’t needed it yet,” says Cheng.

Lower latency

Table storage offered the OLS team no guaranteed upper bound on latency. In contrast, Azure Cosmos DB provides single-digit millisecond latency for reads and writes, backed by a guarantee of <10 millisecond latency for reads and writes at the 99th percentile, at any scale, anywhere in the world. Metrics captured by the team illustrate the difference in latency that the OLS service sees between Table storage and Azure Cosmos DB (in those metrics, DbTable is Azure Table storage and CosmosDbTable is the Azure Cosmos DB Table API).

Turnkey data distribution

With Table storage, options for global distribution were limited. What’s more, the OLS team couldn’t implement failover on its own. With Azure Cosmos DB, the team now enjoys distribution to any number of regions—including multi-master capabilities, which, when enabled, will let any region accept write operations.

“Just by clicking on the map, data can be automatically replicated to any Azure region in the world,” says Cheng. “This feature is very convenient, and we plan to put it to use soon.”

Other technical benefits

In addition to the above, Azure Cosmos DB provides the OLS team with some additional advantages over Table storage:

Automatic indexing. With Table storage, primary indexes are limited to PartitionKey and RowKey, and there are no secondary indexes. Azure Cosmos DB provides automatic and complete indexing on all properties by default, with no index management.

Faster query times. With Table storage, queries use the index only when they filter on the primary key and fall back to scans otherwise. With Azure Cosmos DB, queries can take advantage of automatic indexing on all properties for faster query times, as sketched in the example after this list.

Consistency. With Table storage, the OLS team was limited to strong consistency within the primary region and eventual consistency within the secondary regions. With Azure Cosmos DB, the team can choose from well-defined consistency levels, enabling it to optimize the tradeoff between read consistency and latency, availability, and throughput when designing the solution.
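
To make the indexing and query benefits concrete, here is an illustrative query on non-key properties using the azure-data-tables Python SDK; against the Azure Cosmos DB Table API, such a filter can be served from the automatic index on every property rather than a scan. The table and property names are placeholders:

```python
from azure.data.tables import TableClient

# Illustrative query on non-key properties; table and property names are placeholders.
table = TableClient.from_connection_string(
    "<azure-cosmos-db-table-api-connection-string>", table_name="consumerlicenses")

# Against the Cosmos DB Table API, this filter can be answered from the automatic
# index on every property instead of scanning entities.
results = table.query_entities(
    query_filter="LicenseCategory eq @category and ActivationCount ge @count",
    parameters={"category": "consumer", "count": 3},
)

for entity in results:
    print(entity["PartitionKey"], entity["RowKey"], entity["ActivationCount"])
```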

Get started with Azure Cosmos DB today

Visit Azure Cosmos DB.
See Introduction to Azure Cosmos DB Table API.

Source: Azure