Microsoft and Docker collaborate on new ways to deploy containers on Azure

Now more than ever, developers need agility to meet rapidly increasing demands from customers. Containerization is one key way to increase agility. Containerized applications are built in a consistent and repeatable way, because the desired infrastructure, dependencies, and configuration are defined as code for all stages of the lifecycle. Containers also tend to start and stop faster than traditional applications, which makes it easier to start, stop, scale out, and update workloads quickly in the cloud.

With this in mind, earlier today we announced a new partnership between Microsoft and Docker to integrate Docker Desktop more closely with Microsoft Azure and the Visual Studio line of products.

Docker Desktop’s built-in tools, features, and command-line utilities will provide a way to natively set Azure as a context and run containers in the cloud with a few simple commands. The product integration begins with the ability to create Azure Container Instances (ACI), a solution for any scenario that can run in isolated containers without orchestration.

Let’s take a look at the new product integrations using an example. We have a simple TCP-based Python game server app that already builds and runs on the local developer machine using Docker Desktop. The app is based on a slim Linux image and declares its other dependencies in requirements.txt. The Docker extension in Visual Studio Code provides easy commands to build and run the app on Docker Desktop, and then push the image to a private container registry on Docker Hub. The experience is particularly fast using the new release of WSL2.
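For reference, the same local loop can also be driven from the command line. Here is a minimal sketch, assuming a Dockerfile in the project folder and reusing the image tag shown later in this post:

# Build the image locally with Docker Desktop
$ docker build -t paulyuk/pythontcpgame:1.1 .

# Push it to the private repository on Docker Hub
# (assumes you are already logged in via docker login)
$ docker push paulyuk/pythontcpgame:1.1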

With the updated version of Docker Desktop, coming later this year, we get native commands for creating a Docker context for Azure Container Instances:

$ docker context aci-create paulyuk/webapp-dev
$ docker context use paulyuk/webapp-dev

Contexts make it easy to swap between environments that have a Docker host. As an example, I can have contexts for local (the default), myapp-dev, and myapp-qa. The entire Docker toolchain (including the docker CLI) honors the current context. This makes running a container in Azure easy and consistent with running locally, using the same familiar command:

$ docker run paulyuk/pythontcpgame:1.1

Deploying a container to Azure is as simple as that, using the standard tools in Docker Desktop. Plus, you can bring the whole experience together using Docker Desktop + Visual Studio + WSL2 + GitHub to have a cloud-optimized desktop. I go into more detail about the integrations in this DockerCon LIVE 2020 session.
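One more note on contexts: you can always check which context is active, or hop back to the local engine, with the standard context commands. A quick sketch, reusing the context name from the example above:

# List available contexts; the active one is marked with an asterisk
$ docker context ls

# Switch back to the local Docker Desktop engine
$ docker context use default

# Switch to the Azure context again when deploying to ACI
$ docker context use paulyuk/webapp-dev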

We are thrilled to be expanding our collaboration with Docker and to keep making the experience better for developers.

Learn more

To learn more about the partnership, you can read this press release and blog post from Docker. You can leverage the current VS Code Docker extension with Docker contexts and Docker Desktop with WSL2 integration today. The preview of the ACI integration in Docker Desktop will follow later this year.
Source: Azure

Migrating Apache Hadoop clusters to Google Cloud

Apache Hadoop and the big data ecosystem around it have served businesses well for years, offering a way to tackle big data problems and build actionable analytics. As on-prem deployments of Hadoop, Apache Spark, Presto, and more moved out of experiments and into thousand-node clusters, cost, performance, and governance challenges emerged. While these challenges grew on-prem, Google Cloud emerged as a solution for many Hadoop admins looking to decouple compute from storage to increase performance while only paying for the resources they use. Managing costs and meeting data and analytics SLAs, while still providing secure and governed access to open source innovation, became a balancing act that the public cloud could solve without large upfront machine costs.

How to think about on-prem Hadoop migration costs

There is no one right way to estimate Hadoop migration costs. Some folks will look at their footprint on-prem today, then try to compare directly with the cloud, byte for byte and CPU cycle for CPU cycle. There is nothing wrong with this approach, and when you consider opex and capex, and discounts such as those for sustained compute usage, the cost case will start to look pretty compelling. Cloud economics work!

But what about taking a workload-centric approach? When you run your cloud-based Hadoop and Spark proofs of concept, consider the specific workload by measuring the units billed to run just that workload. Spoiler: it is quite easy when you spin up a cluster, run the data pipeline, and then tear down the cluster after you are finished. Now, consider making a change to that workload. For example, use a later version of Spark and then redeploy. This is a seemingly easy task, but how would you accomplish it today on your on-prem cluster, and what would it have cost to plan and implement such a change? These are all things to consider when you are building a TCO analysis of migrating your entire on-prem Hadoop cluster, or just a piece of it.

Where to begin your on-prem Hadoop migration

It’s important to note that you are not migrating a cluster, but rather users and workloads, from a place where you shoulder the burden of maintaining and operating a cluster to a place where you share that responsibility with Google. Starting with these users and workloads allows you to build a better, more agile experience.

Consider the data engineer who wants to update their pipeline to use the latest Spark APIs. When you migrate their code, you can choose to run it on its own ephemeral cluster; you are not forced to update the code for all your other workloads, which can keep running on their own cluster(s) against the previous version of the Spark APIs. Or consider the data analyst who needs additional resources to run their Hive query in time to meet a reporting deadline: you can choose to enable autoscaling. Or for the data scientist who has been wanting to decrease their ML training job duration, you can provide a familiar notebook interface and spin up a cluster with GPUs attached as needed.

These benefits all sound great, but there is hard work involved in migrating workloads and users. Where should you start? In the Migrating Data Processing Hadoop Workloads to Google Cloud blog post, we start the journey by helping data admins, architects, and engineers consider, plan, and run a data processing job. Spoilers: you can precisely select which APIs and versions are available for any specific workload, and you can size and scale your cluster as needed to meet the workload’s requirements.
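To make that workload-centric model concrete, here is a hedged sketch of the ephemeral, per-workload cluster pattern using the gcloud CLI. The cluster name, region, image version, bucket, and pipeline file are illustrative placeholders:

# Spin up a short-lived cluster pinned to a specific Dataproc image
# version, which in turn pins the Spark version this workload expects
$ gcloud dataproc clusters create nightly-etl \
    --region=us-central1 \
    --image-version=1.5 \
    --num-workers=2

# Run just this pipeline as a job on that cluster
$ gcloud dataproc jobs submit pyspark gs://my-bucket/pipelines/nightly_etl.py \
    --cluster=nightly-etl \
    --region=us-central1

# Tear the cluster down when the job finishes, so billing stops
$ gcloud dataproc clusters delete nightly-etl \
    --region=us-central1 --quiet

Moving this one workload to a later Spark version is then a matter of changing --image-version and redeploying; no other workload is affected.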
Once you’re storing and processing your data in Google Cloud, you’ll want to think about enabling your analysis and exploration tools, wherever they are running, to work with your data. The work required here is all about proxies, networking, and security, but don’t worry: this is well-trodden ground. In Migrating Hadoop clusters to GCP – Visualization Security: Part I – Architecture, we’ll help your architects and admins enable your analysts. For data science workloads and users, we have recently released Dataproc Hub, which enables your data scientists and IT admins to access on-demand clusters tailored to their specific data science needs, as securely as possible.

The Apache Hadoop ecosystem offers some of the best data processing and analytical capabilities out there. A successful migration is one in which we have unleashed them for your users and workloads; one in which the workload defines a cluster, and not the other way around. Get in touch with your Google Cloud contact and we’ll make your Hadoop migration a success together.
Source: Google Cloud Platform

How Kaggle solved a spam problem in 8 days using AutoML

Kaggle is a data science community of nearly 5 million users. In September of 2019, we found ourselves under a sudden siege of spam traffic that threatened to overwhelm visitors to our site. We had to come up with an effective solution, fast. Using AutoML Natural Language on Google Cloud, Kaggle was able to train, test, and deploy a spam detection model to production in just eight days. In this post, we’ll detail our success story about using machine learning to rapidly solve an urgent business dilemma.

A spam dilemma

Malicious users were suddenly creating large numbers of Kaggle accounts in order to leave spammy search engine optimization (SEO) content in the user bio section. Search engines were indexing these bios, and our existing spam detection heuristics were failing to flag them. In short, we faced a growing and embarrassing predicament.

Our problem was context. Kaggle is a community focused on data science and machine learning. As a result of our topical data-science focus, a user bio that seems harmless in isolation may be the work of a spammer. Here is a real example of one such bio:

I am a personal injury lawyer in Chicago. I help individuals and families in cases involving serious injuries and wrongful death. Many of my cases involve car accidents, nursing home abuse, and medical malpractice.

Such a bio may fit in on a forum of legal professionals, but on the Kaggle site it’s the mark of an SEO spammer. This content also lacks the typical keywords and unsavory topics that one might expect to find in spam. This context meant that stopping the spam required more than a generic model; we needed a solution that could take our Kaggle-specific context into account.

We had the intuition that machine learning could handle this problem, but building natural language models to deal with spam was not part of anyone’s day job at Kaggle. We feared weeks of late nights slogging toward a good-enough solution; spam models require very high accuracy because of the high cost of miscategorizing a legitimate user. Even with a usable prototype running in R or Python, there was the looming frustration of deploying it in Kaggle’s C# codebase. As we planned out our options, we had an unconventional idea: what about trying AutoML?

Enter AutoML

True to its name, AutoML performs automated machine learning: evaluating huge numbers of neural network architectures to determine the most effective model for a problem. We first witnessed the potential of the AutoML suite of products when a Google team used it to take second place at the 2019 KaggleDays hackathon. On a whim, we decided to pass our bio problem through the AutoML Natural Language Classification API. We could readily generate a labeled training dataset because we had existing examples of bios belonging to known-legitimate users.

After uploading these bios, clicking the “Start Training” button, and waiting a few hours, we received an email that training was complete. Building models is normally a process that involves many failures, but the results were astoundingly impressive for a first attempt, with precision (how “accurate” the model is) and recall (how “thorough” the model is) above 99%. We manually inspected the performance, ran test examples through the model, and determined it would be immediately suitable to deploy in production.
The model successfully picked up on a wide variety of spammy content types. Returning to our earlier example on the importance of context, it gives the personal injury lawyer’s bio a 98% confidence of being spam, while it has full confidence that the data scientist equivalent is allowable.

On top of being accurate, AutoML afforded a major advantage when the time came to deploy the model. When training was finished, the model was already hosted and exposed via an API. Kaggle simply had to write a quick shim to call this API from our application.

It took only eight days from when we started working on this problem to when we deployed a model serving live traffic. It required no advanced skills in deep learning or natural language processing. The model has since made thousands of correct decisions and greatly reduced our spam-related traffic.

While this story was about spam detection, the takeaway isn’t just that you can use AutoML for spam. AutoML has the potential to replicate this success story across the thousands of bespoke image, text, or tabular problems that businesses face. AutoML can step in when off-the-shelf models are insufficient, when you want to test a hunch but don’t have months to dedicate to it, or if you’re simply not a deep learning expert. The combination of high accuracy, rapid iteration, and smooth deployment can make AutoML an attractive approach to developing machine learning solutions for a wide range of business problems and needs.
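For readers curious what that “quick shim” amounts to, the hosted model is exposed as a REST endpoint that any HTTP client can call. Here is a hedged sketch using curl, where the project ID, model ID, and bio text are placeholders and the request shape reflects the AutoML Natural Language v1 API:

# Send a bio to the deployed AutoML Natural Language model for classification
$ curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "payload": {
            "textSnippet": {
              "content": "I am a personal injury lawyer in Chicago...",
              "mimeType": "text/plain"
            }
          }
        }' \
    "https://automl.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/models/MODEL_ID:predict"
# The JSON response includes a confidence score for each label,
# which the application can compare against its own threshold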
Source: Google Cloud Platform