Microsoft Azure News, Entwicklungen, Updates, HowTos - Seite 246 von 255

Azure DocumentDB, Microsoft’s globally replicated, low latency, NoSQL database, is pleased to announce updates to all four of its client-side SDKs. The biggest improvements were made to the Python SDK, including support for Python 3, connection pooling, consistency improvements, and Top/Order By support for partitioned collections.

This article describes the changes made to each of the new SDKs.

DocumentDB Python SDK 2.0.0

The Azure DocumentDB Python SDK now supports Python 3. The DocumentDB Python SDK previously supported Python 2.7, but it now supports Python 3.3, Python 3.4 and Python 3.5, in addition to Python 2.7. But that’s not all! Connection pooling is now built in, so instead of creating a new connection for each request, calls to the same host are now added to the same session, saving the cost of creating a new connection each time. We also added a few enhancements to consistency level support, and we added Top and Order By support for cross-partition queries, so you can retrieve the top results from multiple partitions and order those results based on the property you specify.

To get started, go to the DocumentDB Python SDK page to download the SDK, get the latest release notes, and browse to the API reference content.

DocumentDB .NET SDK 1.10.0

The new DocumentDB .NET SDK has a few specific improvements, the biggest of which is direct connectivity support for partitioned collections. If you're currently using a partitioned collection – this improvement is the go fast button! In addition, the .NET SDK improves performance for the Bounded Staleness consistency level, and adds LINQ support for StringEnumConverter, IsoDateTimeConverter and UnixDateTimeConverter while translating predicates.

You can download the latest DocumentDB .NET SDK, get the latest release notes, and browse to the API reference content from the DocumentDB .NET SDK page.

DocumentDB Java SDK 1.9.0

In the new DocumentDB Java SDK, we’ve changed Top and Order By support to include queries across partitions within a collection.

You can download the Java SDK, get the latest release notes, and browse to the API reference content from the DocumentDB Java SDK page.

DocumentDB Node.js SDK 1.10.0

In the new Node.js SDK, we also changed Top and Order By support to include queries across partitions within a collection.

You can download the Node.js SDK, get the latest release notes, and browse to the API reference content from the DocumentDB Node.js SDK page.

Please upgrade to the latest SDKs and take advantage of all these improvements. And as always, if you need any help or have questions or feedback regarding the new SDKs or anything related to Azure DocumentDB, please reach out to us on the developer forums on Stack Overflow. And stay up-to-date on the latest DocumentDB news and features by following us on Twitter @DocumentDB.
Quelle: Azure

5. Oktober 20165. Oktober 2016

da Agency

Real-Time Feature Engineering for Machine Learning with DocumentDB

Ever wanted to take advantage of your data stored in DocumentDB for machine learning solutions? This blog post demonstrates how to get started with event modeling, featurizing, and maintaining feature data for machine learning applications in Azure DocumentDB.

Machine Learning and RFM

The field of machine learning is pervasive – it is difficult to pinpoint all the ways in which machine learning affects our day-to-day lives. From the recommendation engines that power streaming music services to the models that forecast crop yields, machine learning is employed all around us to make predictions. Machine learning, a method for teaching computers how to think and recognize patterns in data, is increasingly being used to help garner insights from colossal datasets – feats humans do not have the memory capacity and computational power to perform.

In the world of event modeling and machine learning, RFM is no strange concept. Driven by three dimensions (Recency, Frequency, Monetary), RFM is a simple yet powerful method for segmenting customers used often in machine learning models. The reasoning behind RFM is intuitive and consistent across most scenarios: a customer who bought something yesterday is more likely to make another purchase than a customer who has not bought anything in a year. In addition, spendy customers who frequently make purchases are also categorized as valuable using the RFM technique.

Properties of RFM features:

RFM feature values can be calculated using basic database operations.
Raw values can be updated online as new events arrive.
RFM features are valuable in machine learning models.

Because insights drawn from raw data become less useful over time, being able to calculate RFM features in near real-time to aid in decision-making is important [1]. Thus, a general solution that enables one to send event logs and automatically featurize them in near real-time so the RFM features can be employed in a variety of problems is ideal.

Where does DocumentDB fit in?

Azure DocumentDB is a blazing fast, planet-scale NoSQL database service for highly available, globally distributed apps that scales seamlessly with guaranteed low latency and high throughput. Its language integrated, transactional execution of JavaScript permits developers to write stored procedures, triggers, and user defined functions (UDFs) natively in JavaScript.

Thanks to these capabilities, DocumentDB is able to meet the aforementioned time constraints and fill in the missing piece between gathering event logs and arriving at a dataset composed of RFM features in a format suitable for training a machine learning model that accurately segments customers. Because we implemented the featurization logic and probabilistic data structures used to aid the calculation of the RFM features with JavaScript stored procedures, this logic is shipped and executed directly on the database storage partitions. The rest of this post will demonstrate how to get started with event modeling and maintaining feature data in DocumentDB for a churn prediction scenario.

The end-to-end code sample of how to upload and featurize a list of documents to DocumentDB and update RFM feature metadata is hosted on our GitHub.

Scenario

The first scenario we chose to tackle to begin our dive into the machine learning and event modeling space is the problem from the 2015 KDD Cup, an annual Data Mining and Knowledge Discovery competition. The goal of the competition was to predict whether a student will drop out of a course based on his or her prior activities on XuetangX, one of the largest massive open course (MOOC) platforms in China.

The dataset is structured as follows:

Figure 1. We would like to gratefully acknowledge the organizers of KDD Cup 2015 as well as XuetangX for making the datasets available.

Each event details an action a student completed. Examples include watching a video or answering a particular question. All events consist of a timestamp, a course ID (cid), student ID (uid), and enrollment ID (eid) which is unique for each course-student pair.

Approach

Modeling Event Logs

The first step was to determine how to model the event logs as documents in DocumentDB. We considered two main approaches. In the first approach, we used the combination of <entity name, entity value, feature name> as the primary key for each document. An example primary key with this strategy is <”eid”, 1, “cat”>. This means that we created a separate document for each feature we wanted to keep track of when the student enrollment id is 1. In the case of a large number of features, this can result in a multitude of documents to insert. We took a bulk approach in the second iteration, using <entity name, entity value> instead as the primary key. An example primary key with this strategy is <”eid”, 1>. In this approach, we used a single document to keep track of all the feature data when the student enrollment id is 1.

The first approach minimizes the number of conflicts during insertion because there is the additional feature name attribute, making the primary key more unique. The resulting throughput is not optimal, however, in the case of a large number of features because an additional document needs to be inserted for each feature. The second approach maximizes throughput by featurizing and inserting event logs in a bulk manner, increasing the probability of conflicts. For this blog post, we chose to walk through the first approach, which provides for simpler code and fewer conflits.

Step 1

Create the stored procedure responsible for updating the RFM feature metadata.

private static async Task CreateSproc()
{
string scriptFileName = @"updateFeature.js";
string scriptName = "updateFeature";
string scriptId = Path.GetFileNameWithoutExtension(scriptFileName);

var client = new DocumentClient(new Uri(Endpoint), AuthKey);
Uri collectionLink = UriFactory.CreateDocumentCollectionUri(DbName, CollectionName);

var sproc = new StoredProcedure
{
Id = scriptId,
Body = File.ReadAllText(scriptFileName)
};
Uri sprocUri = UriFactory.CreateStoredProcedureUri(DbName, CollectionName, scriptName);

bool needToCreate = false;

try
{
await client.ReadStoredProcedureAsync(sprocUri);
}
catch (DocumentClientException de)
{
if (de.StatusCode != HttpStatusCode.NotFound)
{
throw;
}
else
{
needToCreate = true;
}
}

if (needToCreate)
{
await client.CreateStoredProcedureAsync(collectionLink, sproc);
}
}

Step 2

Featurize each event. In this example, each student action expands into 12 rows of the form { entity: { name: “ “, value: …}, feature: { name: “ “, value: …} } that must be inserted in your DocumentDB collection with the previously created stored procedure. We did this process in batches, the size of which can be configured.

private static string[] Featurize(RfmDoc doc)
{
List<string> result = new List<string>();

var entities = new Tuple<string, object>[] { new Tuple<string, object>("eid", doc.Eid), new Tuple<string, object>("cid", doc.Cid),
new Tuple<string, object>("uid", doc.Uid) };
var features = new Tuple<string, object>[] { new Tuple<string, object>("time", doc.Time), new Tuple<string, object>("src_evt", doc.SourceEvent),
new Tuple<string, object>("cat", doc.Cat), new Tuple<string, object>("obj", doc.Obj) };

foreach (var entity in entities)
{
foreach (var feature in features)
{
StringBuilder eb = new StringBuilder();
StringBuilder fb = new StringBuilder();
StringWriter eWriter = new StringWriter(eb);
StringWriter fWriter = new StringWriter(fb);

JsonSerializer s = new JsonSerializer();
s.Serialize(eWriter, entity.Item2);
string eValue = eb.ToString();

s.Serialize(fWriter, feature.Item2);
string fValue = fb.ToString();

var value = string.Format(CultureInfo.InvariantCulture, "{{"entity":{{"name":"{0}","value":{1}}},"feature":{{"name":"{2}","value":{3}}}}}",
entity.Item1, eValue, feature.Item1, fValue);
result.Add(value);
}
}

return result.ToArray();
}

Step 3

Execute the stored procedure created in step 1.

private static async Task<StoredProcedureResponse<string>> UpdateRFMMetadata(DocumentClient client, string metaDoc)
{
object metaDocObj = JsonConvert.DeserializeObject(metaDoc);

int retryCount = 100;
while (retryCount > 0)
{
try
{
Uri sprocUri = UriFactory.CreateStoredProcedureUri(DbName, CollectionName, "updateFeature");
var task = client.ExecuteStoredProcedureAsync<string>(
sprocUri,
metaDocObj);
return await task;
}
catch (DocumentClientException ex)

The stored procedure takes as input a row of the form { entity: { name: “ ”, value: …}, feature: { name: “ ”, value: …} } and updates the relevant feature metadata to produce a document of the form { entity: { name: "", value: "" }, feature: { name: "", value: …}, isMetadata: true, aggregates: { "count": …, "min": … } }. Depending on the name of the feature in the document that is being inserted into DocumentDB, a subset of predefined aggregates is updated. For example, if the feature name of the document is “cat” (category), the count_unique_hll aggregate is employed to keep track of the unique count of categories. Alternatively, if the feature name of the document is “time”, the minimum and maximum aggregates are utilized. The following code snippet demonstrates how the distinct count and minimum aggregates are updated. See the next section for a more detailed description of the data structures that we are using to maintain these aggregates.

case AGGREGATE.count_unique_hll:
if (aggData === undefined) aggData = metaDoc.aggregates[agg] = new CountUniqueHLLData();
aggData.hll = new HyperLogLog(aggData.hll.std_error, murmurhash3_32_gc, aggData.hll.M);

let oldValue = aggData.value = aggData.hll.count();
aggData.hll.count(doc.feature.value); // add entity to hll
aggData.value = aggData.hll.count();

if (aggData.value !== oldValue && !isUpdated) isUpdated = true;
break;
case AGGREGATE.min:
if (aggData === undefined) aggData = metaDoc.aggregates[agg] = new AggregateData();
if (aggData.value === undefined) aggData.value = doc.feature.value;
else if (doc.feature.value < aggData.value) {
aggData.value = doc.feature.value;
if (!isUpdated) isUpdated = true;
}
break;

Probabilistic Data Structures

We implemented the following three probabilistic data structures in JavaScript, each of which can be updated conditionally as part of the stored procedure created in the previous section.

HyperLogLog

Approximates the number of unique elements in a multiset by applying a hash function to each element in the multiset (obtaining a new multiset of uniformly distributed random numbers with the same cardinality as the original set) and calculating the maximum number of leading zeros in the binary representation of each number in the new set n. The estimated cardinality is 2^n [2].

BloomFilter

Tests whether an element is a member of a set. While false positives are possible, false negatives are not. Rather, a bloom filter either returns maybe in the set or definitely not in the set when asked if an element is a member of a set. To add an element to a bloom filter, the element is fed into k hash functions to arrive at k array positions. The bits at each of those positions are set to 1. To test whether an element is in the set, the element is again fed to each of the k hash functions to arrive at k array positions. If any one of the bits is 0, the element is definitely not in the set [3].

Count-Min Sketch

Ingests a stream of events and counts the frequency of distinct members in the set. The sketch may be queried for the frequency of a specific event type. Similar to the bloom filter, this data structure uses some number of hash functions to map events to values – however, it uses these hash functions to keep track of event frequencies instead of whether or not the event exists in the dataset [4].

Each of the above data structures returns an estimate within a certain range of the true value, with a certain probability. These probabilities are tunable, depending on how much memory you are willing to sacrifice. The following snippet shows how to retrieve the HyperLogLog approximation for the number of unique objects for the student with eid = 1.

private static void OutputResults()
{
var client = new DocumentClient(new Uri(Endpoint), AuthKey);
Uri collectionLink = UriFactory.CreateDocumentCollectionUri(DbName, CollectionName);

string queryText = "select c.aggregates.count_unique_hll["value"] from c where c.id = "_en=eid.ev=1.fn=obj"";
var query = client.CreateDocumentQuery(collectionLink, queryText);

Console.WriteLine("Result: {0}", query.ToList()[0]);
}

Conclusion

The range of scenarios where RFM features can have a positive impact extends far beyond churn prediction. Time and time again, a small number of RFM features have proven to be successful when used in a wide variety of machine learning competitions and customer scenarios.

Combining the power of RFM with DocumentDB’s server-side programming capabilities produces a synergistic effect. In this post, we demonstrate how to get started with event modeling and maintaining feature data with DocumentDB stored procedures. It is our hope that developers are now equipped with the tools to add functionality to our samples hosted on GitHub to maintain additional feature metadata on a case by case basis. Stay tuned for a future post that details how to integrate this type of solution with Azure Machine Learning where you can experiment with a wide variety of machine learning models on your data featurized by DocumentDB.

To learn more about how to write database program application logic that can be shipped and executed directly on the database storage partitions in DocumentDB, see DocumentDB server-side programming: Stored procedures, database triggers, and UDFs. Stay up-to-date on the latest DocumentDB news and features by following us on twitter @DocumentDB.

Lastly, please reach out to us at askdocdb@microsoft.com or leave a comment below for inquiries about additional ML support and to show us how you’re using DocumentDB for machine learning.

References

[1] Oshri, Gal. “RFM: A Simple and Powerful Approach to Event Modeling.” Cortana Intelligence and Machine Learning Blog (2016). https://blogs.technet.microsoft.com/machinelearning/2016/05/31/rfm-a-simple-and-powerful-approach-to-event-modeling/

[2] https://gist.github.com/terrancesnyder/3398489, http://stackoverflow.com/questions/5990713/loglog-and-hyperloglog-algorithms-for-counting-of-large-cardinalities

[4] https://github.com/mikolalysenko/count-min-sketch, The MIT License (MIT), Copyright © 2013 Mikola Lysenko
Quelle: Azure

4. Oktober 20164. Oktober 2016

da Agency

Microsoft Cloud coming to France

Earlier today we announced our plan to offer the Microsoft Cloud, including Microsoft Azure, Office 365, and Dynamics 365 from datacenters located in France with initial availability in 2017. Our local cloud regions help customers to innovate in their industries and move their businesses to the cloud while meeting European data sovereignty, security and compliance needs.

With 36 datacenter regions now announced, the Microsoft Cloud is the global coverage leader with more regions than any other cloud provider. We also offer the broadest range of options for customers to access cloud services locally within Europe from datacenter locations in Austria, Finland, Germany, Ireland, the Netherlands, and the United Kingdom.

In September, our two new United Kingdom regions became generally available to customers worldwide. We also opened two Azure Germany regions to all EU customers, offering a unique cloud model – a physically and logically separate cloud instance with customer data remaining in Germany under the management of a data trustee.

It’s exciting to see our efforts in Europe build on top of our global expansions, including the general availability of our cloud from Canada, new regions in the United States, and the announcement of plans to expand our cloud services from datacenters located in South Korea.

Our regions in Europe currently offer Azure services, Office 365 and Dynamics 365 as well as cloud suites like EMS, OMS, IoT, and additional solutions that are ready to deploy locally. Our datacenters, suites and services offer data residency and local compliance certifications including EU model clause 2010/87/EU, together with country specific certifications. With the largest and most comprehensive set of compliance certifications and attestations in the industry, the Microsoft Cloud gives our customers a trusted, open and complete cloud on which to run their business.

Customers in France, throughout Europe and the world, can get started with the Microsoft Cloud – and Microsoft Azure – today, from our six available regions in Europe. We’ll have more to share about the Microsoft Cloud in France from the Azure Blog in the future.

–Tom
Quelle: Azure

4. Oktober 20164. Oktober 2016

da Agency

Required practice for applications integrating with Azure Active Directory

This post is a follow-up from our previous announcement of the Azure Active Directory certificate rollover.

Continuing on our commitment to protect our customer’s data and building on the momentum of this August 15, 2016 rollover, we will be increasing the frequency with which we roll over Azure Active Directory’s global signing keys (previously referred to as “the Azure Active Directory certificates”).

What does the frequency increase mean for applications?

For applications that support automatic rollover, this frequency increase will have no impact on your application.
For applications that do not support automatic rollover you will have to establish a process to periodically monitor the keys and perform a manual rollover.

The next rollover, scheduled to start on October 10, 2016, is the last rollover we will be announcing.

Going forward, there will not be any announcements and we will only go through the usual steps of making new key available in the metadata and then gradually switching over using that new key. As outlined above, applications that support automatic rollover will seamlessly handle this while applications with the monitoring process will perform a manual rollover when the new key is available.

The guidance for assessing impact remains the same as that from our August rollover.

We do not expect any impact for:

Applications that support automatic rollover as per our best practices
Client applications
Applications added from the Azure Active Directory App Gallery (including “Custom”)
On-premises applications published via Application Proxy
Applications in Azure Active Directory B2C tenants

Put simply, if your application was not impacted by the August rollover, it will not be impacted by the October rollover or any subsequent rollovers.

Application impact

The applications take a dependency on the signing key and are not configured to automatically update the key from the metadata. Follow the information below to assess the impact of the rollover to your applications and how to update them to handle the key rollover if necessary.

Sign in to the Azure classic portal using an administrator account.
Under the Active Directory tab, select your directory.
Select Applications my company owns from the Show dropdown menu then click the checkmark at the right to apply the filter.
Review each of the applications listed using the guidelines on the Signing key rollover in Azure Active Directory documentation and make the recommended changes if required.

If you experience unusual behaviors please contact Azure Support.
Quelle: Azure

3. Oktober 20163. Oktober 2016

da Agency

Redgate delivers efficient migrations for Azure SQL Data Warehouse

A defining characteristic of #cloud computing is elasticity – the ability to rapidly provision and release resources so users pay only for the resources they need for the task at hand. Such just-in-time provisioning can lead to significant savings for customers when their workloads are intermittent and heavily spiked. In the modern enterprise, there are few workloads that have as desperate a need for such elastic capabilities as data warehousing. Most traditional enterprise Data Warehouse (DW) systems are built on-premises with very expensive hardware and software, and have very low utilization except during peak periods of data loading, transformation, and report generation.

With the general availability of the Azure SQL Data Warehouse we are delivering the true promise of cloud elasticity to data warehousing. It is a fully managed, petabyte-scale Data Warehouse service that you can provision in minutes with just a few clicks in the Azure Portal. Our architecture separates compute and storage so that you can independently scale them and use just the right amount of each at any given time. A unique pause feature allows you to suspend or resume compute in seconds, while your data remains intact in Azure storage. And SQL Data Warehouse offers an availability SLA of 99.9% – the only public cloud data warehouse service that offers an availability SLA to customers.

To help users get started with Azure SQL Data Warehouse we have been working with Redgate, a long-time partner that delivers SQL Server tools. Redgate’s Data Platform Studio (DPS) provides a simple and reliable way to migrate on-premises SQL Server databases to Azure SQL Data Warehouse. DPS automates the data upload and applies the most appropriate compatibility fixes and optimizations. It reduces the timeframe for a first data migration from days to hours, giving companies an easy way to explore the potential of the SQL Data Warehouse.

09-01-2016 12 min, 47 sec

The development of Data Platform Studio is a result of Redgate’s own experience of migrating an on-premises database to SQL Data Warehouse. As David Bick, Product Portfolio Lead at Redgate, explains, “We like to think of ourselves as experts in the SQL Server space, but even we hit a few roadblocks on the way. Data Platform Studio removes those blocks because it’s engineered to make smart decisions and automate the migration process. It encapsulates everything we encountered on our own journey, and includes a lot of subsequent learning from Microsoft.”

By making migrations fast and easy, Data Platform Studio allows users to quickly experience how Azure SQL Data Warehouse scales storage and compute and to evaluate how it can meet their needs. Data Platform Studio is free to use for one-off migrations, and can be tried at www.dataplatformstudio.com.

Learn more

Check out the many resources for learning more about SQL Data Warehouse, including:

What is Azure SQL Data Warehouse?

SQL Data Warehouse best practices

Video library

MSDN forum

Stack Overflow forum

Redgate Data Platform Studio
Quelle: Azure