What a year! Google Cloud Platform in 2017

By Alex Barrett and Barrett Williams, Google Cloud blog editors

The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:

You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions. 
How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.
Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.
You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.

So, without further ado, we present to you the most-read stories of 2017.

Cutting-edge infrastructure
If you subscribe to the “bigger is always better” theory of cloud infrastructure, then you were a happy camper this year. Early in 2017, we announced that GCP would be the first cloud provider to offer the Intel Skylake architecture; GPUs for Compute Engine and Cloud Machine Learning became generally available; and Shazam talked about why cloud GPUs made sense for them. In the spring, you devoured a piece on the performance of TPUs, and another about the then-largest cloud-based compute cluster. We announced yet more new GPU models, and topping it all off, Compute Engine began offering machine types with a whopping 96 vCPUs and 624GB of memory.

It wasn’t just our chip offerings that grabbed your attention — you were pretty jazzed about Google Cloud network infrastructure too. You read deep dives about Espresso, our peering-edge architecture, TCP BBR congestion control and improved Compute Engine latency with Andromeda 2.1. You also dug stories about new networking features: Dedicated Interconnect, Network Service Tiers and GCP’s unique take on sneakernet: Transfer Appliance.

What’s the use of great infrastructure without somewhere to put it? 2017 was also a year of major geographic expansion. We started out the year with six regions and ended it with 13, adding Northern Virginia, Singapore, Sydney, London, Germany, São Paulo and Mumbai. This was also the year that we shed our Earthly shackles and expanded to Mars ;)

Security above all

Google has historically gone to great lengths to secure our infrastructure, and this was the year we discussed some of those advanced techniques in our popular Security in plaintext series. Among them: 7 ways we harden our KVM hypervisor, Fuzzing PCI Express and Titan in depth.

You also grooved on new GCP security services: Cloud Key Management and managed SSL certificates for App Engine applications. Finally, you took heart in a white paper on how to implement BeyondCorp as a more secure alternative to VPN, and support for the European GDPR data protection laws across GCP.

Open, hybrid development

When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

But our open source efforts aren’t limited to Kubernetes — we also made significant contributions to Spinnaker 1.0, and helped launch the Istio and Grafeas projects. You ate up our “Partnering on open source” series, featuring the likes of HashiCorp, Chef, Ansible and Puppet. Availability-minded developers loved our Customer Reliability Engineering (CRE) team’s missive on release canaries, and with API design: Choosing between names and identifiers in URLs, our Apigee team showed them a nifty way to have their proverbial cake and eat it too.

Google innovation

In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and published a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

2017 was also the year machine learning became accessible. For those of you with large datasets, we showed you how to use Cloud Dataprep, Dataflow, and BigQuery to clean up and organize unstructured data. It turns out you don’t need a PhD to learn to use TensorFlow, and for visual learners, we explained how to visualize a variety of neural net architectures with TensorFlow Playground. One Google Developer Advocate even taught his middle-school son TensorFlow and basic linear algebra, as applied to a game of rock-paper-scissors.

Natural language processing also became a mainstay of machine learning-based applications; here, we highlighted it with a lighthearted and relatable example. We launched the Video Intelligence API and showed how Cloud Machine Learning Engine simplifies the process of training a custom object detector. And the makers among you really went for a post that shows you how to add machine learning to your IoT projects with the Google AIY Voice Kit. Talk about accessible!

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful holiday season. And be sure to rest up and visit us again next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

Quelle: Google Cloud Platform

Consequences of SLO violations — CRE life lessons

By Alex Bramley, Customer Reliability Engineer

Previous episodes of CRE life lessons have talked in detail about the importance of quantifying a service’s availability and using SLOs to manage the competing priorities of features-focused development teams (“devs”) versus a reliability-focused SRE team. Good SLOs can help reduce organizational friction and maintain development velocity without sacrificing reliability. But what should happen when SLOs are violated?

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy. Future posts will go over an example taken from an SRE team here at Google, and work through some scenarios that put that policy into action.

Features or reliability?

In the ideal world (assuming spherical SREs in a vacuum), an SLO represents the dividing line between two binary states: developing new features when there’s error budget to spare, and improving service reliability when there isn’t. Most real engineering organizations will instead vary their effort on a spectrum between these two extremes as business priorities dictate. Even when a service is operating well within its SLOs, choosing to do some proactive reliability work may reduce the risk of future outages, improve efficiency and provide cost savings; conversely it’s rare to find an organization that completely drops all in-flight feature development as soon as an SLO is violated.

Describing key inflection points from that spectrum in a policy document is an important part of the relationship between an SRE team and the dev teams with whom they partner. This ensures that all parts of the organization have roughly the same understanding around what is expected of them when responding to (soon to be) violated SLOs, and – most importantly – that the consequences of not responding are clearly communicated to all parties. The exact choice of inflection points and consequences will be specific to the organization and its business priorities.

Inflection points

Having a strong culture of blameless postmortems and fixing root causes should eventually mean that most SLO violations are unique – informally, “we are in the business of novel outages.” It follows that the response to each violation will also be unique; making judgment calls around these is part of an SRE’s job when responding to the violation. But a large variance in the range of possible responses results in inconsistency of outcomes, people trying to game the system and uncertainty for the engineering organization.

For the purposes of an escalation policy, we recommend that SLO violations be grouped into a few buckets of increasing severity based on the cumulative impact of the violation over time (i.e., how much error budget has been burned over what time horizon), with clearly defined boundaries for moving from one bucket to another. It’s useful to have some business justification for why violations are grouped as they are, but this should be in an appendix to the main policy to keep the policy itself clear.

It’s a good idea to tie at least some of the bucket boundaries to any SLO-based alerting you have. For example, you may choose to page SREs to investigate when 10% of the weekly error budget has been burned in the past hour; this is an example of an inflection point tied to a consequence. It forms the boundary between buckets we might informally title “not enough error budget burned to notify anyone immediately” and “someone needs to investigate this right now before the service is out of its long-term SLO.” We’ll examine more concrete examples in our next post, where we look at a policy from an SRE team within Google.
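In the meantime, here’s a minimal sketch of that arithmetic; the SLO, window, thresholds and bucket labels below are illustrative assumptions for this post, not values from any real Google policy:

```python
# Illustrative only: the SLO, window, thresholds and bucket labels are
# assumptions for this sketch, not values from any real escalation policy.

WEEKLY_SLO = 0.999                      # 99.9% weekly availability target
WEEKLY_ERROR_BUDGET = 1 - WEEKLY_SLO    # fraction of requests allowed to fail
HOURS_PER_WEEK = 7 * 24


def budget_burned(error_ratio: float, window_hours: float) -> float:
    """Fraction of the weekly error budget consumed by the observed error
    ratio over the given window."""
    return (error_ratio / WEEKLY_ERROR_BUDGET) * (window_hours / HOURS_PER_WEEK)


def escalation_bucket(error_ratio: float, window_hours: float) -> str:
    """Map a measurement onto escalation buckets with clearly defined boundaries."""
    burned = budget_burned(error_ratio, window_hours)
    if window_hours <= 1 and burned >= 0.10:
        return "page"       # 10% of the weekly budget burned within an hour
    if burned >= 0.05:
        return "ticket"     # elevated burn rate over a longer horizon
    return "no-action"      # not enough budget burned to notify anyone


# A 2% error ratio sustained for an hour burns ~12% of a 99.9% weekly budget.
print(escalation_bucket(error_ratio=0.02, window_hours=1))    # -> page
print(escalation_bucket(error_ratio=0.002, window_hours=24))  # -> ticket
```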

Consequences

The consequences of a violation are the meat of the policy. They describe actions that will be taken to bring the service back into SLO, whether by root-causing and fixing the relevant class of issue, automating any stop-gap mitigation tasks or reducing the near-term risk of further deterioration. Again, the choice of consequence for a given threshold is going to be specific to the organization defining the policy, but there are several broad areas into which these fall. This list is not exhaustive!

Notify someone of potential or actual SLO violation

The most common consequence of any potential or actual SLO violation is that your monitoring system tells a human that they need to investigate and take remedial action. For a mature, SRE-supported service, this will normally be in the form of a page to the oncall engineer when a large quantity of error budget has been burned over a short window, or a ticket when there’s an elevated burn rate over a longer time horizon. It’s not a bad idea for that page to also create a ticket in which you can record debugging details, use as a centralized communication point and reference when escalating a serious violation.

The relevant dev team should also be notified. It’s OK for this to be a manual process; the SRE team can add value by filtering and aggregating violations and providing meaningful context. But ideally a small group of senior people in the dev team should be made aware of actual violations in an automated fashion (e.g., by CCing them on any tickets), so that they’re not surprised by escalations and can chime in if they have pertinent information.

Escalate the violation to the relevant dev team

The key difference between notification and escalation is the expectation of action on the part of the dev team. Many serious SLO violations require close cooperation between SREs and developers to find the root cause and prevent recurrence. Escalation is not an admission of defeat. SREs should escalate as soon as they’re reasonably sure that input from the dev team will meaningfully reduce the time to resolution. The policy should set an upper bound on the length of time an SLO violation (or near miss) can persist without escalation.

Escalation does not signify the end of SRE’s involvement with an SLO violation. The policy should describe the responsibilities of each team and a lower bound on the amount of engineering time they should divert towards investigating the violation and fixing the root cause. It will probably be useful to describe multiple levels of escalation, up to and including getting executive-level support to commandeer the engineering time of the entire dev team until the service is reliable.

Mitigate risk of service changes causing further impact to SLOs

Since a service in violation of its SLO is by definition making users unhappy, day-to-day operations that may increase the rate at which error budget is burned should be slowed or stopped completely. Usually, this means restricting the rate of binary releases and experiments, or stopping them completely until the service is again within SLO. This is where the policy needs to ensure all parties (SRE, development, QA/testing, product and execs) are on the same page. For some engineering organizations, the idea that SLO violations will impact their development and release velocity may be difficult to accept. Reaching a documented agreement on how and when releases will be blocked – and what fraction of engineers will be dedicated to reliability work when this occurs – is a key goal.

Revoke support for the service

If a service is shown to be incapable of meeting its agreed-upon SLOs over an extended time period, and the dev team responsible for that service is unwilling to commit to engineering improvements to its reliability, then SRE teams at Google have the option of handing back the responsibility for running that service in production. This is unlikely to be the consequence of a single SLO violation; rather, it’s the result of multiple serious outages over an extended period of time, where postmortem action items (AIs) have been assigned to the dev team but not prioritized or completed.

This has worked well at Google, because it changes the incentives behind any conversation around engineering for reliability. Any dev team that neglects the reliability of a service knows that they will bear the consequences of that neglect. By definition, revoking SRE support for a service is a last resort, but stating the conditions that must be met for it to happen makes it a matter of policy, not an idle threat. Why should SRE care about service reliability if the dev team doesn’t?

Summary

Hopefully this post has helped you think about the trade-off between engineering for reliability and features, and how responding to SLO violations moves the needle towards reliability. In our next post, we’ll present an escalation policy from one of Google’s SRE teams, to show the choices they made to help the dev teams they partner with maintain a high development velocity.

Quelle: Google Cloud Platform

Introducing Preemptible GPUs: 50% Off

By Chris Kleban and Michael Basilyan, GCE Product Managers

In May 2015, Google Cloud introduced Preemptible VM instances to dramatically change how you think about (and pay for) computational resources for high-throughput batch computing, machine learning, scientific and technical workloads. Then last year, we introduced lower pricing for Local SSDs attached to Preemptible VMs, expanding preemptible cloud resources to high performance storage. Now we’re taking it even further by announcing the beta release of GPUs attached to Preemptible VMs.

You can now attach NVIDIA K80 and NVIDIA P100 GPUs to Preemptible VMs for $0.22 and $0.73 per GPU hour, respectively. This is 50% cheaper than GPUs attached to on-demand instances, whose prices we also recently lowered. Preemptible GPUs will be a particularly good fit for large-scale machine learning and other computational batch workloads, as customers can harness the power of GPUs to run distributed batch workloads at predictably affordable prices.

As a bonus, we’re also glad to announce that our GPUs are now available in our us-central1 region. See our GPU documentation for a full list of available locations.

Resources attached to Preemptible VMs are the same as equivalent on-demand resources with two key differences: Compute Engine may shut them down after providing you a 30-second warning, and you can use them for a maximum of 24 hours. This makes them a great choice for distributed, fault-tolerant workloads that don’t continuously require any single instance, and allows us to offer them at a substantial discount. But just like on-demand pricing, preemptible pricing is fixed. You’ll always get low cost and financial predictability, and we bill on a per-second basis.

Any GPUs attached to a Preemptible VM instance will be considered preemptible and will be billed at the lower rate. To get started, simply append --preemptible to your instance create command in gcloud, set scheduling.preemptible to true in the REST API or set Preemptibility to “On” in the Google Cloud Platform Console, and then attach a GPU as usual. You can use your regular GPU quota to launch Preemptible GPUs or, alternatively, you can request a special Preemptible GPUs quota that only applies to GPUs attached to Preemptible VMs.
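For the REST path, a minimal sketch using the Python API client might look like the following; the project, zone, machine type, boot image and accelerator count are placeholders, so check the preemptible VM and GPU documentation for the authoritative field values:

```python
# Sketch only: project, zone, machine type, image and accelerator count are
# placeholders; consult the preemptible VM and GPU docs for current values.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

body = {
    "name": "preemptible-gpu-worker",
    "machineType": "zones/us-central1-b/machineTypes/n1-standard-8",
    "scheduling": {
        "preemptible": True,         # marks the VM (and its GPUs) preemptible
        "automaticRestart": False,   # required for preemptible instances
        "onHostMaintenance": "TERMINATE",
    },
    "guestAccelerators": [{
        "acceleratorType": "zones/us-central1-b/acceleratorTypes/nvidia-tesla-k80",
        "acceleratorCount": 1,
    }],
    "disks": [{
        "boot": True,
        "autoDelete": True,
        "initializeParams": {
            "sourceImage": "projects/debian-cloud/global/images/family/debian-9",
        },
    }],
    "networkInterfaces": [{"network": "global/networks/default"}],
}

operation = compute.instances().insert(
    project="my-project", zone="us-central1-b", body=body).execute()
print(operation["name"])
```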

For users looking to create dynamic pools of affordable GPU power, Compute Engine’s managed instance groups can be used to automatically re-create your preemptible instances when they’re preempted (if capacity is available). Preemptible VMs are also integrated into cloud products built on top of Compute Engine, such as Kubernetes Engine (GKE’s GPU support is currently in preview. The sign-up form can be found here).

Over the years we’ve seen customers do some very exciting things with preemptible resources: everything from satellite image analysis and financial services to questions in quantum physics, computational mathematics and drug screening.

“Preemptible GPU instances from GCP give us the best combination of affordable pricing, easy access and sufficient scalability. In our drug discovery programs, cheaper computing means we can look at more molecules, thereby increasing our chances of finding promising drug candidates. Preemptible GPU instances have advantages over the other discounted cloud offerings we have explored, such as consistent pricing and transparent terms. This greatly improves our ability to plan large simulations, control costs and ensure we get the throughput needed to make decisions that impact our projects in a timely fashion.” 

— Woody Sherman, CSO, Silicon Therapeutics 

We’re excited to see what you build with GPUs attached to Preemptible VMs. If you want to share stories and demos of the cool things you’ve built with Preemptible VMs, reach out on Twitter, Facebook or G+.

For more details on Preemptible GPU resources, please check out the preemptible documentation, GPU documentation and best practices. For more pricing information, take a look at our Compute Engine pricing page or try out our pricing calculator. If you have questions or feedback, please visit our Getting Help page.

To get started using Preemptible GPUs today, sign up for Google Cloud Platform and get $300 in credits to try out Preemptible GPUs.

Quelle: Google Cloud Platform

Simplify Cloud VPC firewall management with service accounts

By Daniel Merino, Technical Program Manager and Srinath Padmanabhan, Product Marketing Manager 

Firewalls provide the first line of network defense for any infrastructure. On Google Cloud Platform (GCP), Google Cloud VPC firewalls do just that—controlling network access to and between all the instances in your VPC. Firewall rules determine who’s allowed to talk to whom and more importantly who isn’t. Today, configuring and maintaining IP-based firewall rules is a complex and manual process that can lead to unauthorized access if done incorrectly. That’s why we’re excited to announce a powerful new management feature for Cloud VPC firewall management: support for service accounts.

If you run a complex application on GCP, you’re probably already familiar with service accounts in Cloud Identity and Access Management (IAM) that provide an identity to applications running on virtual machine instances. Service accounts simplify the application management lifecycle by providing mechanisms to manage authentication and authorization of applications. They provide a flexible yet secure mechanism to group virtual machine instances with similar applications and functions with a common identity. Security and access control can subsequently be enforced at the service account level.

Using service accounts, when a cloud-based application scales up or down, new VMs are automatically created from an instance template and assigned the correct service account identity. This way, when the VM boots up, it gets the right set of permissions and is placed within the relevant subnet, so firewall rules are automatically configured and applied.

Further, the ability to use Cloud IAM ACLs with service accounts allows application managers to express their firewall rules in the form of intent, for example, allow my “application x” servers to access my “database y.” This removes the need to manually manage server IP address lists while simultaneously reducing the likelihood of human error.
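As a rough sketch of what that intent looks like in practice (the project, service account emails and port here are hypothetical), a single firewall rule can reference the source and target service accounts directly:

```python
# Hypothetical example: project, service account emails and port are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

firewall_body = {
    "name": "allow-app-x-to-db-y",
    "network": "global/networks/default",
    "direction": "INGRESS",
    # Intent: VMs running as the "application x" service account may reach
    # VMs running as the "database y" service account on the database port.
    "sourceServiceAccounts": ["app-x@my-project.iam.gserviceaccount.com"],
    "targetServiceAccounts": ["db-y@my-project.iam.gserviceaccount.com"],
    "allowed": [{"IPProtocol": "tcp", "ports": ["3306"]}],
}

operation = compute.firewalls().insert(
    project="my-project", body=firewall_body).execute()
print(operation["name"])
```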

This process is leaps-and-bounds simpler and more manageable than maintaining IP address-based firewall rules, which can neither be automated nor templated for transient VMs with any semblance of ease.

Here at Google Cloud, we want you to deploy applications with the right access controls and permissions, right out of the gate. Click here to learn how to enable service accounts. And to learn more about Cloud IAM and service accounts, visit our documentation for using service accounts with firewalls.
Quelle: Google Cloud Platform

Asynchronous refresh with the REST API for Azure Analysis Services

Azure Analysis Services unlocks datasets with potentially billions of rows for non-technical business users to perform interactive analysis. Such large datasets can benefit from features such as asynchronous refresh.

We are pleased to introduce the REST API for Azure Analysis Services. Using any programming language that supports REST calls, you can now perform asynchronous data-refresh operations. This includes synchronization of read-only replicas for query scale out. Please see the blog post Introducing query replica scale-out for Azure Analysis Services for more information on query scale out.

Data-refresh operations can take some time depending on various factors, including data volume and level of optimization using partitions. These operations have traditionally been invoked with existing methods such as using TOM (Tabular Object Model), PowerShell cmdlets for Analysis Services, or TMSL (Tabular Model Scripting Language). The traditional methods may require long-running HTTP connections. A lot of work has been done to ensure the stability of these methods, but given the nature of HTTP, it may be more reliable to avoid long-running HTTP connections from client applications.

The REST API for Azure Analysis Services enables data-refresh operations to be carried out asynchronously. It therefore does not require long-running HTTP connections from client applications. Additionally, there are other built-in features for reliability such as auto retries and batched commits.
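As a rough illustration (the region, server name, model name and Azure AD bearer token below are placeholders; see the documentation for the full set of request options), an asynchronous refresh from Python might look like this:

```python
# Placeholders throughout: region, server, model and the Azure AD bearer token.
import requests

BASE = "https://westus.asazure.windows.net/servers/myserver/models/AdventureWorks"
HEADERS = {"Authorization": "Bearer <azure-ad-access-token>"}

# Start an asynchronous refresh; the call returns immediately with a refresh ID,
# so no long-running HTTP connection is needed.
resp = requests.post(f"{BASE}/refreshes", headers=HEADERS, json={
    "Type": "Full",
    "CommitMode": "transactional",
    "MaxParallelism": 2,
    "RetryCount": 2,
})
resp.raise_for_status()
refresh_id = resp.headers["Location"].rstrip("/").rsplit("/", 1)[-1]

# Poll the refresh status later, from any client.
status = requests.get(f"{BASE}/refreshes/{refresh_id}", headers=HEADERS).json()
print(status.get("status"))

# Cancel the refresh if necessary.
# requests.delete(f"{BASE}/refreshes/{refresh_id}", headers=HEADERS)
```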

Please visit our documentation page for details on how to use the REST API for Azure Analysis Services. It covers how to perform asynchronous refreshes, check their status, and cancel them if necessary. Similar information is provided for query-replica synchronization. Additionally, the RestApiSample C# code sample is provided on GitHub.
Quelle: Azure

Whitepaper: Selecting the right secure hardware for your IoT deployment

How do you go about answering perplexing questions such as: What secure hardware should I use? How do I gauge the level of security? How much security do I really need, and hence how much of a premium should I place on secure hardware? We’ve published a new whitepaper to shed light on this subject.

In our relentless commitment to securing IoT deployments worldwide, we continue to raise awareness of the true nature of security—that it is a journey, never an endpoint. Challenges emerge, vulnerabilities evolve, and solutions age, triggering the need for renewal if you are to maintain a desired level of security.

Securing your deployment comprises three main phases: planning, architecture and execution. For IoT, these are further broken down into sub-phases that include design assessment, risk assessment, model assessment, development and deployment, as shown in Figure 1. The decision process at each phase is equally important, and it must take all other phases into consideration for optimal efficacy. This is especially true when choosing the right secure hardware, also known as secure silicon or a Hardware Security Module (HSM), to secure an IoT deployment.
 

Figure 1: The IoT Security Lifecycle

Choosing the right secure hardware for securing an IoT deployment requires that you understand what you are protecting against (risk assessment), which drives part of the requirements for the choice. The other part of the requirements entails logistical considerations like provisioning, deployment and retirement, as well as tactical considerations like maintainability. These requirements in turn drive architecture and development strategies, which then allow you to make the optimal choice of secure hardware. While this prescription is not an absolute guarantee of security, following these guidelines allows you to comfortably claim due diligence in a holistic consideration of the right secure hardware, and hence gives you the greatest chance of achieving your security goals.

The choice itself requires knowledge of available secure hardware options as well as corresponding attributes such as protocol and standards compliance. We’ve developed a whitepaper, The Right Secure Hardware for Your IoT Deployment, to highlight the secure hardware decision process. This whitepaper covers the architecture decision phase of the IoT security lifecycle. It is the second whitepaper in the IoT security lifecycle decision-making series, following the previously published whitepaper, Evaluating Your IoT Security, which covers the planning phase.
 
Download the IoT Security Lifecycle whitepaper series:

Evaluating Your IoT Security.
The Right Secure Hardware for Your IoT Deployment.

What strategies do you use in selecting the right hardware to secure your IoT devices and deployment? We invite you to share your thoughts in comments below.
Quelle: Azure

Using Qubole Data Service on Azure to analyze retail customer feedback

It has been a busy season for many retailers. During this time, retailers are using Azure to analyze various types of data to help accelerate purchasing decisions. The Azure cloud not only gives retailers the compute capacity to handle peak times, but also the data analytic tools to better understand their customers.

Many retailers have a treasure trove of information in the thousands, or millions, of product reviews provided by their customers. Often, it takes time for particular reviews to show their value, because customers “vote” reviews as helpful or not helpful over time. Using machine learning, retailers can automate identifying useful reviews in near real-time and leverage that insight quickly to build additional business value.

But how might a retailer without deep big data and machine learning expertise even begin to conduct this type of advanced analytics on such a large quantity of unstructured data? We will be holding a workshop in January to show you how easy that can be through the use of Azure and Qubole’s big data service.

Using these technologies, anyone can quickly spin up a data platform and train a machine learning model utilizing Natural Language Processing (NLP) to identify the most useful reviews. Moving forward, a retailer can then identify the value of reviews as they are generated by the user base and gain insights that can impact many aspects of their business.
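To give a flavor of what that pipeline can look like, here’s a minimal Spark sketch; the storage path, column names and helpfulness label are hypothetical, and the workshop builds out a much fuller NLP pipeline on Qubole:

```python
# Illustrative sketch: the dataset path, column names and label definition are
# hypothetical; the workshop covers a fuller NLP pipeline on Qubole.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("review-helpfulness").getOrCreate()

# Reviews with a binary "helpful" label derived from historical customer votes.
reviews = spark.read.json("wasbs://reviews@mystorageaccount.blob.core.windows.net/")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="reviewText", outputCol="words"),
    HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18),
    IDF(inputCol="rawFeatures", outputCol="features"),
    LogisticRegression(labelCol="helpful", featuresCol="features"),
])

model = pipeline.fit(reviews)

# Score reviews (e.g., a micro-batch from Spark Streaming) as they arrive.
scored = model.transform(reviews).select("reviewText", "prediction")
scored.show(5, truncate=80)
```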

Join Microsoft, Qubole, and Precocity for a half-day, hands-on lab experience where we will show how to:

Leverage Azure cloud-based services and Qubole Data Service to increase the velocity of managing advanced analytics for retail
Ingest a large retail review data set from Azure and leverage Qubole notebooks to explore data in a retail context
Demonstrate the autoscaling capability of a Qubole Spark cluster during a Natural Language Processing (NLP) pipeline
Train a machine learning model at scale using Open Source technologies like Apache Spark and score new customer reviews in real-time
Demonstrate use of Azure’s Event Hub and CosmosDB coupled with Spark Streaming to predict helpfulness of customer reviews in real-time

This workshop can be the basis of creating business value from reviews for other purposes including:

Detecting fake-review fraud
Identifying positive product characteristics
Identifying influencers
Uncovering new feature attributes for a product to inform merchandising

Register today for our event in Dallas, Texas on January 30th, 2018.

Space is limited, so register early!
Quelle: Azure

Maximize your VM’s Performance with Accelerated Networking – now generally available for both Windows and Linux

We are happy to announce that Accelerated Networking (AN) is generally available (GA) and widely available for Windows and the latest distributions of Linux, providing up to 30 Gbps of networking throughput, free of charge!

AN provides consistent ultra-low network latency via Azure's in-house programmable hardware and technologies such as SR-IOV. By moving much of Azure's software-defined networking stack off the CPUs and into FPGA-based SmartNICs, compute cycles are reclaimed by end user applications, putting less load on the VM, decreasing jitter and inconsistency in latency.

With the GA of AN, region limitations have been removed, making the feature widely available around the world. Supported VM series include D/DSv2, D/DSv3, E/ESv3, F/FS, FSv2, and Ms/Mms.

The deployment experience for AN has also been improved since public preview. Many of the latest Linux images available in the Azure Marketplace, including Ubuntu 16.04, Red Hat Enterprise Linux 7.4, CentOS 7.4 (distributed by Rogue Wave Software), and SUSE Linux Enterprise Server 12 SP3, work out of the box with no further setup steps needed. Windows Server 2016 and Windows Server 2012R2 also work out of the box.
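For instance, with the azure-mgmt-network Python SDK the key step is simply to enable the flag on the network interface before attaching it to a VM of a supported size; the subscription ID, resource group, region, subnet ID and names below are placeholders:

```python
# Sketch only: subscription ID, resource group, region, subnet ID and names
# are placeholders; AN also requires a supported VM size (e.g., DSv2/DSv3).
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

nic_poller = network_client.network_interfaces.begin_create_or_update(
    "my-resource-group",
    "my-an-nic",
    {
        "location": "westus2",
        "enable_accelerated_networking": True,   # the Accelerated Networking switch
        "ip_configurations": [{
            "name": "ipconfig1",
            "subnet": {"id": "<subnet-resource-id>"},
        }],
    },
)
nic = nic_poller.result()
print(nic.enable_accelerated_networking)
```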

All the information needed to deploy a VM with AN can be found in the Windows AN VM and Linux AN VM documentation.
Quelle: Azure