Using HLL++ to speed up count-distinct in massive datasets

If you’re working in modern data analytics, you likely have some go-to tools. One of the most useful tools in a data analyst’s toolbox is the count-distinct function, which lets you count the number of unique values in a column of a data set. (Think of this as counting distinct elements in a data stream that contains repeated elements.) This can come in handy when gathering all types of business data—for example:

- How many unique visitors came to our website last year
- How many users got to a certain level of our online game today
- How many different IP addresses did a slice of network traffic come from
- How many unique visitors searched for a particular news event in a single day
- How many unique IoT devices started to report error codes after the code rollout

However, as the number of unique items grows, computing this number exactly requires memory in proportion to the number of unique items. Another problem is that computing unique visitors per day doesn’t mean you can simply add seven days of results together to get the unique count for the week, because that would overcount visitors seen on multiple days.

A simple way to perform an exact count-distinct is to create a set structure. Then, as you process elements of the input, add each one to the set if it isn’t there yet, incrementing a set-size counter whenever you do. That looks something like the short Java illustration a few paragraphs below. The set grows in direct proportion to the cardinality of the dataset (the number of unique elements), using large amounts of memory for large input cardinalities. You also have to consider the memory space needed for the set, as well as how to deal with this in distributed processing, where data is spread across many machines. Count-distinct results from several machines cannot simply be summed, because the sets from different machines will overlap (unless we repartitioned the data, such as with a hash partition). To deduplicate the sets and obtain an exact count, each leaf machine’s entire set needs to be sent to the root machine and merged, requiring huge amounts of I/O.

Performing count-distinct, faster and cheaper

So how do you avoid massive memory costs for count-distinct, or make the computation feasible in the first place? Data analysts often require only an approximate count, with a small error margin being acceptable. Allowing for a small error in the result opens the door to massively cheaper computations: approximate algorithms can compute count-distinct with a fraction of the memory and I/O of an exact solution, at the price of results that are, for example, within 0.5% of the exact answer.

A widely used approximate algorithm is HyperLogLog. Google’s implementation of and improvements to this algorithm are discussed in detail in the paper “HyperLogLog in practice.” This implementation is known as HyperLogLog++ (we’ll refer to it as HLL++ for the rest of this post).

The HLL++ algorithm makes it possible to store the intermediate state of an aggregation in a very compact form, called a sketch. These sketches are constant in size, as opposed to growing linearly like our earlier set objects, and they lend themselves well to distributed processing frameworks, since you can efficiently transfer aggregation states over the wire. A further important benefit of sketches is that you can reuse and merge results for different time periods without needing to go back to the raw data.
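For contrast with the sketch-based approach, here is a minimal Java illustration of the exact, set-based count described at the start of this post. It is an illustrative sketch only; the class and variable names are assumptions, not the post's original snippet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExactCountDistinct {

  // Returns the exact number of distinct elements by remembering every
  // element seen so far. Memory grows linearly with the cardinality.
  static long countDistinct(List<String> visitorIds) {
    Set<String> seen = new HashSet<>();
    long distinct = 0;
    for (String id : visitorIds) {
      if (seen.add(id)) { // add() returns true only for elements not seen before
        distinct++;
      }
    }
    return distinct;
  }

  public static void main(String[] args) {
    // Prints 3: "alice" and "bob" are deduplicated.
    System.out.println(countDistinct(Arrays.asList("alice", "bob", "alice", "carol", "bob")));
  }
}
```

Because the seen set must hold every distinct element, memory grows with cardinality, which is exactly the cost the sketches described above avoid.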
In our example counting unique visitors, a sketch can be produced and stored for each day, summarizing that day’s data. To compute sliding-window stats like seven-day active users, you can just reuse and merge the sketches for the relevant seven days instead of computing everything from scratch!

The Google implementation of HyperLogLog includes several improvements to the original algorithm: a compact and standardized sketch format, and a special higher-accuracy mode for small cardinalities. This implementation was added to BigQuery in 2017 and has recently been open sourced and made directly available in Apache Beam as of version 2.16. That means it’s available for use in Cloud Dataflow, our fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness.

Let’s explore the use of HLL++ across several pipelines, using data coming from a streaming source:

1. Outputting the approximate count-distinct directly
2. Outputting sketches to BigQuery, allowing for interoperability between BigQuery and Cloud Dataflow
3. Using BigQuery to run analytics queries against the sketches stored in step 2
4. Outputting the sketches and metadata to Cloud Storage
5. Merging and extracting results from files stored on Cloud Storage

We’ll show you how to use the transform directly to output results, as well as how to store the aggregated sketches in BigQuery and, along with metadata, as Avro files in Cloud Storage.

Building HLL++-enabled pipelines

Apache Beam lets you process and analyze data as a stream, so making use of the Beam API with real-time data is simply a matter of adding it to a pipeline. The following operations, added in Beam 2.16, reappear throughout the examples below:

- HllCount.Init—aggregates data into a sketch
- HllCount.MergePartial—combines multiple sketches into one
- HllCount.Extract—extracts the estimated count of distinct elements from a sketch

Check out more details in the HllCount transform catalog entry. With these building blocks, we can now go ahead and explore a few use cases.

1. Compute an approximate count of unique visitors to a website

Note: When testing, you can easily create a stream of values using the GenerateSequence utility class in Beam, which generates a sequence of data for you in stream mode. A pipeline for this use case processes the IDs of everyone visiting our website and computes the approximate count based on the durationOfWindow.

2. Generate a unique-visitors-per-page count

The example above used a PCollection of anonymized IDs visiting our website, but what if we wanted a count based on specific pages? For that we can make use of the transforms’ ability to work with key-value pairs, with the webpage identifier as the key and visitor IDs as the values. (A short Java sketch covering both of these use cases appears a little further below, just before the rollup discussion.)

3. Storing the sketches in BigQuery

BigQuery supports HLL++ via the HLL_COUNT functions, and BigQuery’s sketches are fully compatible with Beam’s, so it’s easy to interoperate with sketch objects across both systems. This use case mirrors how HLL++ is used within Google with Flume and BigQuery. Even when using approximate algorithms, Google-order-of-magnitude data sets can still be too large for analysts to query at interactive (“online”) speeds, especially for expensive statistics like count-distinct. The remedy is to pre-aggregate data into cubes using Google Flume pipelines (Flume is the internal version of Cloud Dataflow). End-user queries then just filter and merge (roll up) over dozens of rows in the cube.
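As promised above, here is a minimal Java sketch covering use cases 1 and 2: a pipeline that windows page-view events and computes an approximate unique-visitor count per page with HllCount. The synthetic GenerateSequence source, the window duration, and the key/value naming are illustrative assumptions, not the post's original snippets.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.zetasketch.HllCount;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class ApproximateUniqueVisitors {

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Synthetic page views for testing, in the spirit of the GenerateSequence tip above.
    // A real pipeline would read KV<pageId, visitorId> pairs from a streaming source.
    PCollection<KV<String, String>> pageViews =
        pipeline
            .apply(GenerateSequence.from(0).to(10_000))
            .apply(MapElements
                .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                .via((Long n) -> KV.of("page-" + (n % 5), "visitor-" + (n % 937))));

    PCollection<KV<String, Long>> uniqueVisitorsPerPage =
        pageViews
            // The window size here stands in for the post's durationOfWindow.
            .apply(Window.<KV<String, String>>into(FixedWindows.of(Duration.standardHours(1))))
            // Use case 2: build one HLL++ sketch per page (key).
            .apply(HllCount.Init.forStrings().perKey())
            // Turn each sketch into an approximate distinct count.
            .apply(HllCount.Extract.perKey());

    // Write uniqueVisitorsPerPage to the sink of your choice (not shown).
    pipeline.run().waitUntilFinish();
  }
}
```

Dropping the keys (HllCount.Init.forStrings().globally() followed by HllCount.Extract.globally()) yields the single overall count of use case 1.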
For functions like sum or count, the rollup is trivial, but what about count-distinct? As you’ll recall from earlier, it is not possible to simply sum up count-distinct values for a small time interval (a day) to get larger time intervals (a week or a month). With sketches this becomes possible, because the values can simply be merged. In this example we will:

1. Pre-aggregate data into sketches in Beam.
2. Store the sketches in BigQuery as byte[] columns, along with some metadata about the time interval.
3. Run a rollup query in BigQuery, which can extract the results at interactive speed thanks to the sketches that were pre-computed in Beam.

With the bytes stored in BigQuery, we can use the BigQuery HLL_COUNT.* functions in a SQL query to merge the sketches and extract the counts. Note: If you have sketches with different intervals in the same table, you will need to filter on the Window_Start and Window_End times in the WHERE clause.

4. Storing the sketches externally

Now, let’s output the sketches to a file system. This is useful when generating feature files for machine learning, where Cloud Dataflow is being used to enrich multiple sources of data before the modeling phase. Because sketches can be merged, there’s no need to reprocess all of the data again—you just gather the files that correspond to the relevant time interval.

There are many Apache Beam sink IOs that can be used to create and store files. We will consider three common ones: TextIO, FileIO, and AvroIO.

TextIO

TextIO writes a PCollection<String> to a text file, with each element delimited by a newline character. Since a sketch is a byte array (byte[]), you need to Base64-encode the bytes before writing them with TextIO. This is a nice, straightforward approach with a small processing overhead.

FileIO

FileIO lets you create a file, but you then need to write a custom writer/reader to output the byte[] into the file, including a mechanism (such as value separators) to deal with multiple sketch objects in a single file.

AvroIO

Using a format like Avro lets you store the sketches (as bytes) together with metadata. Avro is also a widely used serialization format, with connectors available for many systems. Due to its convenience, we’ll use AvroIO for this example.

First, create an object type that carries some extra metadata about the sketch. The @DefaultCoder annotation lets the pipeline use the AvroCoder coder for it. Using this class, you can write a pipeline that turns our stream of key-value sketches into custom HLLAvroContainer objects, ready for output to a file. As an extra optimization, let’s add the window start and end times as metadata in the filename as well. This lets you use a glob pattern when reading the files to process only the items you’re interested in, rather than having to read all files and apply a filter. Writing with windowed AvroIO generates files in GS_FOLDER with a file name format such as “2019-10-22T08:33:00.000Z-2019-10-22T16:33:00.000Z-pane-0-last-00000-of-00001”. You can use AvroIO’s more advanced options to write your own filenames to act as additional metadata (not shown).

It’s also possible to push sketch objects from many windows into a single file. To do this, apply a larger window after the creation of the HLLAvroContainer objects but before the AvroIO write, such as Window.into(FixedWindows.of(<1 week>)). A minimal illustration of such a container class follows below.
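To make that container concrete, here is a minimal sketch of such a class. The field names and metadata layout are assumptions for illustration; the post's actual HLLAvroContainer may differ.

```java
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// Pairs an HLL++ sketch with metadata about what it summarizes.
// Field names are assumptions, not the original post's definition.
@DefaultCoder(AvroCoder.class)
public class HLLAvroContainer {
  // Key the sketch was built for, e.g. a webpage identifier.
  public String key;
  // Window bounds covered by this sketch, as epoch milliseconds.
  public long windowStartMillis;
  public long windowEndMillis;
  // The serialized HLL++ sketch produced by HllCount.Init.
  public byte[] sketch;

  // AvroCoder requires a no-argument constructor.
  public HLLAvroContainer() {}

  public HLLAvroContainer(String key, long windowStartMillis, long windowEndMillis, byte[] sketch) {
    this.key = key;
    this.windowStartMillis = windowStartMillis;
    this.windowEndMillis = windowEndMillis;
    this.sketch = sketch;
  }
}
```

A DoFn can map each windowed KV<String, byte[]> sketch into one of these containers, and AvroIO.write(HLLAvroContainer.class).to(GS_FOLDER).withWindowedWrites().withNumShards(1) can then write them out with the per-window filenames discussed above.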
5. Reading the sketches from external storage

With the output files in a folder, you can now read all of the individual files from, say, November 2019 and merge them together for the approximate count-distinct of that month. The result is one output line per key, since all the per-file results have been merged into a single value.

HLL++ efficiently solves the count-distinct problem for large data sets with a small error margin, covering the large majority of data analysts’ needs around count-distinct. Through its incorporation into Apache Beam, Cloud Dataflow offers you this algorithm for both streaming and batch processing pipelines. You can switch between the two easily, depending on your needs, and also move the intermediate aggregation state seamlessly back and forth between Beam and BigQuery.

Note: As of Beam 2.16, there are several implementations of approximate count algorithms. We recommend HllCount.java, especially if you need sketches and/or compatibility with Google Cloud BigQuery. Among the other implementations, ApproximateUnique.java does not expose its intermediate aggregation state in the form of sketches and has lower accuracy; ApproximateDistinct.java is a reimplementation of the algorithm described in the HLL++ paper, but its sketch format is not compatible with BigQuery.
Source: Google Cloud Platform

Introducing Storage Transfer Service for on-premises data

There’s an enormous amount of data in the world today, and your company likely operates its own storage infrastructure to store this data. Running your business in the cloud can generate more value from your data and facilitate collaboration across your organization, all while optimizing for infrastructure costs. Before you can take advantage of all that cloud offers, though, you have to actually get your data to the cloud. That’s why we’ve developed a new software service that helps you accomplish large-scale, online data transfers: Transfer Service for on-premises data. This can help take the complexity out of data transfers and move data faster than existing online tools like gsutil.

Transfer Service for on-premises data is a managed solution that lets you move your data without needing to engineer your own custom software or invest in an off-the-shelf solution. Now, you can complete large-scale data transfers online, which scale to high-speed network connections—up to billions of files, multiple PB of data, and tens of Gbps. Transfer Service for on-premises data validates data integrity, so you can transfer with confidence. It’s also designed to be reliable and secure, so that if agent failures occur, in-progress transfers will not be impacted. And with performance optimizations included from the application to the transport layer, the service can use your available bandwidth to minimize transfer times. Plus, this service requires no code or maintenance, allowing your organization to focus on innovation, not operations.

On-premises data transfers can be complex. “I see enterprises default to making their own custom solutions, which is a slippery slope as they can’t anticipate the costs and long-term resourcing,” says Scott Sinclair, senior analyst at ESG. “With Transfer Service for on-premises data (beta), enterprises can optimize for TCO and reduce the friction that often comes with data transfers. This solution is a great fit for enterprises moving data for business-critical use cases like archive and disaster recovery, lift and shift, and analytics and machine learning.”

Getting started with Transfer Service for on-premises data

Here’s how it works. First, install and start the on-premises software (the agent), then go to the Cloud Console and submit directories to transfer to Cloud Storage. When transferring data, the service will parallelize your transfer across many agents, and then coordinate these agents to transfer your data over a secure internet connection to Cloud Storage. Transfer Service for on-premises data also features a fully self-service GUI with detailed transfer logs, so that you can create, monitor, and manage transfer jobs with confidence.

Transfer Service for on-premises data is now available in beta for you to try. Learn more about how the service works and how to get started today.
Source: Google Cloud Platform

Discover insights from text with AutoML Natural Language, now generally available

Organizations are managing and processing greater volumes of text-heavy, unstructured data than ever before. To manage this information more efficiently, organizations are looking to machine learning to help with the complex sorting, processing, and analysis this content needs. In particular, natural language processing is a valuable tool for revealing the structure and meaning of text, and today we’re excited to announce that AutoML Natural Language is generally available.

AutoML Natural Language has many features that make it a great match for these data processing challenges. It includes common machine learning tasks like classification, sentiment analysis, and entity extraction, which have a wide variety of applications, such as:

- Categorizing digital content, including news, blogs, and tweets, in real time to allow content creators to see patterns and insights—a great example is Meredith, which is categorizing text content across its entire portfolio of media properties in months instead of years
- Identifying sentiment in customer feedback
- Turning dark, unstructured scanned data into classified and searchable content

We’re also introducing support for PDFs, including native PDFs and PDFs of scanned images. To further unlock the most complex and challenging use cases—such as understanding legal documents or document classification for organizations with large and complex content taxonomies—AutoML Natural Language now supports 5,000 classification labels, training on up to 1 million documents, and documents up to 10 MB in size.

One customer using this new functionality is Chicory, which develops custom digital shopping and marketing solutions for the grocery industry. “AutoML Natural Language allows us to solve complex classification problems at scale. We are using AutoML to classify and translate recipe ingredient data across a network of 1,300 recipe websites into actual grocery products that consumers can purchase seamlessly through our partnerships with dozens of leading grocery retailers like Kroger, Amazon, and Instacart,” explains Asaf Klibansky, Director of Engineering at Chicory. “With the expansion of the max classification label size to the thousands, we can expand our label/ingredient taxonomy to be more detailed than ever, providing our shoppers with better matches during their grocery shopping experience—a business challenge we have been trying to perfect since Chicory began.”

“Also, we see better model performance than we were able to achieve using open source libraries, and we have increased visibility into individual label performance that we did not have before,” Klibansky continues. “This has allowed us to quickly identify insufficient or poor-quality training data per label and reduce the time and cost between model iterations.”

We’re continuously improving the quality of our models in partnership with Google AI research, through better fine-tuning techniques and larger model search spaces. We’re also introducing more advanced features to help AutoML Natural Language understand documents better. For example, AutoML Text & Document Entity Extraction now looks at more than just text, incorporating the spatial structure and layout information of a document for model training and prediction.
This spatial awareness leads to better understanding of the entire document, and is especially valuable in cases where both the text and its location on the “page” are important, such as invoices, receipts, resumes, and contracts.

Identifying applicant skills by location on the document.

We also launched preferences for enterprise data residency for AutoML Natural Language customers in Europe and across the globe, to better serve organizations in regulated industries. Many customers are already taking advantage of this functionality, which allows you to create a dataset, train a model, and make predictions while keeping your data and related machine learning processing within the EU or any other applicable region. Finally, AutoML Natural Language is FedRAMP-authorized at the Moderate level, making it easier for federal agencies to benefit from Google AI technology.

To learn more about AutoML Natural Language and the Natural Language API, check out our website. We can’t wait to hear what you discover with your data.
Source: Google Cloud Platform

What’s new in Cloud Run for Anthos

Earlier this year, we announced Cloud Run, our newest compute platform for serverless containers, which runs either on Google’s fully managed infrastructure or on your Google Kubernetes Engine (GKE) clusters with Cloud Run for Anthos. This portability is achieved with the Knative open-source APIs.

With Cloud Run for Anthos, we deliver and manage everything you need to support serverless containers on your GKE clusters—for example, Knative and a service mesh—and it’s now generally available. Let’s take a look at some of the new features and improvements, including better networking and autoscaling capabilities, that make it easier to deploy and operate microservices in a serverless way on your Anthos GKE clusters. (Also make sure to check out what’s new in Cloud Run fully managed.)

Traffic management

Cloud Run for Anthos can now route each request or RPC randomly between multiple revisions of a service, with the traffic percentages you configure. You can use this feature to perform canary deployments of a newer version of your application: send it a small percentage of the traffic, validate that it is performing correctly, and then gradually increase the traffic. Similarly, these new traffic management capabilities make it possible to quickly roll back to an older version of your application. You can manage traffic to your service in the Cloud Console, as well as with the gcloud command-line tool.

Bringing Cloud Run to your on-prem clusters

The beta of Cloud Run for Anthos only supported GKE clusters running on Google Cloud Platform (GCP). With the general availability of Cloud Run for Anthos, you can now deploy Cloud Run to your on-premises clusters deployed on VMware. With this, you get the same serverless developer and operator experience across both environments. You can use either gcloud or the Cloud Console to deploy to your Anthos clusters and monitor them, regardless of whether they are running on-prem or on GCP.

Support for Kubernetes Secrets and ConfigMaps

You can now mount existing Kubernetes Secret and ConfigMap resources in your cluster into Cloud Run services using the Cloud Console or gcloud. This lets you deploy services without crafting lengthy Kubernetes manifest files.

Fine-tune networking and autoscaling parameters

With the new improvements in the Cloud Console and gcloud, you can now further tune the autoscaling and networking parameters of your application at the per-revision level:

- --min-instances / --max-instances let you specify the scaling boundaries of your application. For example, setting --min-instances to greater than 0 keeps your service from scaling down to zero during inactivity, preventing cold starts.
- --timeout lets you specify a custom request timeout for your service.
- --port lets you customize which port number your containerized application listens on, so you can bring your apps to Cloud Run without having to modify the application server’s port number.

Some of these options are currently in beta; soon these and many other configuration options will be available to you in the command line and the Cloud Console.

Making your services more observable

Out of the box, Cloud Run for Anthos integrates with Stackdriver Monitoring to expose metrics from the services you have deployed.
With general availability, you can find these metrics in Stackdriver Metrics, or directly on the “Metrics” tab of the service’s Cloud Run page. These metrics include some golden signals: request latencies, error rates, CPU and memory usage, and container instance count. You can further drill down into these metrics by revision of your application, and create alerts and SLOs using Stackdriver Monitoring. It’s also worth noting that you get the same set of metrics even if your Anthos cluster is running on-premises, for example on VMware.

Smaller cluster footprint

The Istio add-on is now optional for Cloud Run for Anthos, as Cloud Run now includes select components of Istio. The full version of Istio is still a great complement to Cloud Run if you want cluster-wide traffic policies and single-pane-of-glass visibility into the services in your cluster with the Anthos Service Mesh dashboard. The Istio community has recently made several improvements to Istio, including reductions in its resource footprint.

Partner ecosystem

We’re working with a wide variety of ISVs in the areas of CI/CD, security, and observability so that you can continue to use your favorite tools with applications running on Cloud Run for Anthos. Click here for a recent list of Cloud Run partners and integrations.

Try it today!

Both Cloud Run fully managed and Cloud Run for Anthos are available for you to try with your applications today. You can try Cloud Run on your Anthos GKE clusters by following the quickstart guide for free until May 2020. You can also use the 12-month GCP free trial to get $300 in credit and create a cluster with Cloud Run for Anthos. And if you’re already running Anthos on-premises, try this quickstart guide to deploy Cloud Run to your VMware environment.
Source: Google Cloud Platform

Exploring container security: Performing forensics on your GKE environment

Running workloads in containers can be much easier to manage and more flexible for developers than running them in VMs, but what happens if a container gets attacked? It can be bad news. We recently published some guidance on how to collect and analyze forensic data in Google Kubernetes Engine (GKE), and how best to investigate and respond to an incident.

When performing forensics on your workload, you need to carry out a structured investigation and keep a documented chain of evidence, so you know exactly what happened in your environment and who was responsible for it. In that respect, performing forensics and mounting an incident response is the same for containers as it is for other environments—have an incident response plan, collect data ahead of time, and know when to call in the experts. What’s different with containers is (1) what data you can collect and how, and (2) how to react.

Get planning

Even before an incident occurs, make the time to put together an incident response plan. This typically includes: who to contact, what actions to take, how to start collecting information, and how to communicate what’s going on, both internally and externally. Incident response plans are critical; if panic does start to set in, you’ll know what steps to follow.

Other information that’s helpful to decide ahead of time, and list in your response plan, is external contacts or resources, and how your response changes based on the severity of the incident. Severity levels and planned actions should be business-specific and dependent on your risks—for example, a data leak is likely more severe than an abuse of resources, and you may have different parties that need to be involved. This way, you’re not hunting around for—or debating—this information during an incident. If you don’t get the severity levels right the first time, in terms of categorization, speed of response, speed of communications, or something else, surface this in an incident post-mortem and adjust as needed.

Collect logs now, you’ll be thankful later

To put yourself in the best possible position for responding to an incident, you want data! Artifacts such as logs, disks, and live recorded info are how you’re going to figure out what’s happening in your environment. Most of these you can get in the heat of the moment, but you need to set up logs ahead of time.

There are several kinds of logs in a containerized environment that you can set up to capture: Cloud Audit Logs for GKE and Compute Engine nodes, including Kubernetes audit logs; OS-specific logs; and your own application logs.

There are several kinds of logs you can collect from a containerized environment.

You should begin collecting logs as soon as you deploy an app or set up a GCP project, to ensure they’re available for analysis in case of an incident. For more guidance on which logs to collect for your containers, see our new solution, Security controls and forensic analysis for GKE apps.

Stay cool

What should you do if you suspect an incident in your environment? Don’t panic! You may be tempted to terminate your pods or restart the nodes, but try to resist the urge. Sure, that will stop the problem at hand, but it also alerts a potential attacker that you know they’re there, depriving you of the ability to do forensics!

So, what should you do? Put your incident response plan into action. Of course, what this means depends on the severity of the incident, and your certainty that you have correctly identified the issue.
Your first step might be to ask your security team to further investigate the incident. The next step might be to snapshot the disk of the node that was running the container. You might then move other workloads off and quarantine the node to run additional analysis. For more ideas, check out the new documentation on mitigation options for container incidents next time you’re in such a situation (hopefully never!).

To learn more about container forensics and incident response, check out our talk from KubeCon EU 2019, Container forensics: what to do when your cluster is a cluster (slides). But as always, the most important thing you can do is prevention and preparation—be sure to follow the GKE hardening guide, and set up those logs for later!
Source: Google Cloud Platform

Performance-driven dynamic resource management in E2 VMs

Editor’s note: This is the second post in a two-post series. Click here for part 1: E2 introduction.

As one of the most avid users of compute in the world, Google has invested heavily in making compute infrastructure that is cost-effective, reliable, and performant. The new E2 VMs are the result of innovations Google developed to run its latency-sensitive, user-facing services efficiently. In this post, we dive into the technologies that enable E2 VMs to meet rigorous performance, security, and reliability requirements while also reducing costs.

In particular, the consistent performance delivered by E2 VMs is enabled by:

- An evolution toward large, efficient physical servers
- Intelligent VM placement
- Performance-aware live migration
- A new hypervisor CPU scheduler

Together, we call these technologies dynamic resource management. Just as Google’s Search, Ads, YouTube, and Maps services benefited from earlier versions of this technology, we believe Google Cloud customers will find that the value, performance, and flexibility offered by E2 VMs improve the vast majority of their workloads.

Introducing dynamic resource management

Behind the scenes, Google’s hypervisor dynamically maps E2 virtual CPU and memory to physical CPU and memory on demand. This dynamic management drives cost efficiency in E2 VMs by making better use of the physical resources.

Concretely, virtual CPUs (vCPUs) are implemented as threads that are scheduled to run on demand like any other thread on the host—when a vCPU has work to do, it is assigned an available physical CPU on which to run until it goes to sleep again. Similarly, virtual RAM is mapped to physical host pages via page tables that are populated when a guest-physical page is first accessed. This mapping remains fixed until the VM indicates that a guest-physical page is no longer needed.

The image below shows vCPU work coming and going over the span of a single millisecond. Empty space indicates a given CPU is free to run any vCPU that needs it.

A trace of 1 millisecond of CPU scheduler execution. Each row represents a CPU over time, and each blue bar represents a vCPU running for a time span. Empty regions indicate the CPU is available to run the next vCPU that needs it.

Notice two things: there is a lot of empty space, but few physical CPUs are continuously empty. Our goal is to better utilize this empty space by scheduling VMs to machines, and vCPU threads to physical CPUs, such that wait time is minimized. In most cases, we are able to do this extremely well. As a result, we can run more VMs on fewer servers, allowing us to offer E2 VMs for significantly less than other VM types.

For most workloads, the majority of which are only moderately performance-sensitive, E2 performance is almost indistinguishable from that of traditional VMs. Where dynamic resource management can differ in performance is in the long tail—the worst 1% or 0.1% of events. For example, a web-serving application might see marginally increased response times once per 1,000 requests. For the vast majority of applications, including Google’s own latency-sensitive services, this difference is lost in the noise of other performance variations such as Java garbage collection events, I/O latencies, and thread synchronization.

The reason behind the difference in tail performance is statistical. Under dynamic resource management, virtual resources only consume physical resources when they are in use, enabling the host to accommodate more virtual resources than it could otherwise.
However, occasionally, resource assignment needs to wait several microseconds for a physical resource to become free. This wait time can be monitored in Stackdriver and in guest programs like vmstat and top. We closely track this metric and optimize it in the four ways detailed below.

1. An evolution toward large, efficient physical servers

Over the past decade, core count and RAM density have steadily increased, such that our servers now have far more resources than any individual E2 VM. For example, Google Cloud servers can have over 200 hardware threads available to serve vCPUs, yet an E2 VM has at most 16 vCPUs. This ensures that a single VM cannot cause an unmanageable increase in load.

We continually benchmark new hardware and look for platforms that are cost-effective and perform well for the widest variety of cloud workloads and services. The best ones become the “machines of the day,” and we deploy them broadly. E2 VMs automatically take advantage of these continual improvements by flexibly scheduling across the zone’s available CPU platforms. As hardware upgrades land, we live-migrate E2 VMs to newer and faster hardware, allowing you to automatically take advantage of these new resources.

2. Intelligent VM placement

Google’s cluster management system, Borg, has a decade of experience scheduling billions of diverse compute tasks across diverse hardware, from TensorFlow training jobs to Search front- and back-ends. Scheduling a VM begins with understanding the resource requirements of the VM based on static creation-time characteristics. By observing the CPU, RAM, memory bandwidth, and other resource demands of VMs running on a physical server, Borg is able to predict how a newly added VM will perform on that server. It then searches across thousands of servers to find the best location to add a VM. These observations ensure that when a new VM is placed, it is compatible with its neighbors and unlikely to experience interference from those instances.

3. Performance-aware live migration

After VMs are placed on a host, we continuously monitor VM performance and wait times, so that if the resource demands of the VMs increase, we can use live migration to transparently shift E2 load to other hosts in the data center. The policy is guided by a predictive approach that gives us time to shift load, often before any wait time is encountered. VM live migration is a tried-and-true part of Compute Engine that we introduced six years ago. Over time, its performance has continually improved to the point where its impact on most workloads is negligible.

4. A new hypervisor CPU scheduler

In order to meet E2 VMs’ performance goals, we built a custom CPU scheduler with significantly better latency guarantees and co-scheduling behavior than Linux’s default scheduler. It was purpose-built not just to improve scheduling latency, but also to handle hyperthreading vulnerabilities such as L1TF, which we disclosed last year, and to eliminate much of the overhead associated with other vulnerability mitigations. The graph below shows how TCP-RR benchmark performance improves under the new scheduler.

The new scheduler provides sub-microsecond average wake-up latencies and extremely fast context switching. This means that, with the exception of microsecond-sensitive workloads like high-frequency trading or gaming, the overhead of dynamic resource management is negligible for nearly all workloads.

Get started

E2 VMs were designed to provide sustained performance and the lowest TCO of any VM family in Google Cloud.
Together, our unique approach to fleet management, live migration at scale, and E2’s custom CPU scheduler work behind the scenes to help you maximize your infrastructure investments and lower costs.

E2 complements the other VM families we announced earlier this year—general-purpose (N2) and compute-optimized (C2) VMs. If your applications require high CPU performance for use cases like gaming, HPC, or single-threaded applications, those VM types offer great per-core performance and larger machine sizes.

Delivering performant and cost-efficient compute is our bread and butter. The E2 machine types are now in beta. If you’re ready to get started, check out the E2 docs page and try them out for yourself!
Source: Google Cloud Platform

Introducing E2, new cost-optimized general purpose VMs for Google Compute Engine

General-purpose virtual machines are the workhorses of cloud applications. Today, we’re excited to announce our E2 family of VMs for Google Compute Engine, featuring dynamic resource management to deliver reliable and sustained performance, flexible configurations, and the best total cost of ownership of any of our VMs.

Now in beta, E2 VMs offer similar performance to comparable N1 configurations, providing:

- Lower TCO: 31% savings compared to N1, offering the lowest total cost of ownership of any VM in Google Cloud.
- Consistent performance: Your VMs get reliable and sustained performance at a consistent low price point. Unlike comparable options from other cloud providers, E2 VMs can sustain high CPU load without artificial throttling or complicated pricing.
- Flexibility: You can tailor your E2 instance with up to 16 vCPUs and 128 GB of memory. At the same time, you only pay for the resources that you need, with 15 new predefined configurations or the ability to use custom machine types.

Since E2 VMs are based on industry-standard x86 chips from Intel and AMD, you don’t need to change your code or recompile to take advantage of this price-performance. E2 VMs are a great fit for a broad range of workloads, including web servers, business-critical applications, small-to-medium-sized databases, and development environments. If you have workloads that run well on N1 but don’t require large instance sizes, GPUs, or local SSD, consider moving them to E2. For all but the most demanding workloads, we expect E2 to deliver performance similar to N1 at a significantly lower cost.

Dynamic resource management

Using resource-balancing technologies developed for Google’s own latency-critical services, E2 VMs make better use of hardware resources to drive costs down and pass the savings on to you. E2 VMs place an emphasis on performance and protect your workloads from the types of issues associated with resource sharing, thanks to our custom-built CPU scheduler and performance-aware live migration. You can learn more about how dynamic resource management works by reading the technical blog on E2 VMs.

E2 machine types

At launch, we’re offering E2 machine types as custom VM shapes or predefined configurations. We’re also introducing new shared-core instances, similar to our popular f1-micro and g1-small machine types. These are a great fit for smaller workloads like microservices or development environments that don’t require a full vCPU.

E2 VMs can be launched on demand or as preemptible VMs. They are also eligible for committed use discounts, bringing additional savings of up to 55% for 3-year commitments. E2 VMs are powered by Intel Xeon and AMD EPYC processors, which are selected automatically based on availability.

Get started

E2 VMs are rolling out this week to eight regions: Iowa, South Carolina, Oregon, Northern Virginia, Belgium, Netherlands, Taiwan, and Singapore, with more regions in the works. To learn more about E2 VMs or other GCE VM options, check out our machine types page and our pricing page.
Source: Google Cloud Platform

Packet Mirroring: Visualize and protect your cloud network

As networks grow in complexity, network and security administrators need to be able to analyze and monitor network traffic to respond to security breaches and attacks. However, in public cloud environments, getting access to network traffic can be challenging. Many customers use advanced security and traffic inspection tools on-prem, and need the same tools to be available in the cloud for certain applications. Our new Packet Mirroring service, now in beta, lets you troubleshoot your existing Virtual Private Clouds (VPCs). With this service, you can use third-party tools to collect and inspect network traffic at scale, providing intrusion detection, application performance monitoring, and better security controls, helping you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE). For more, watch this video.

For instance, Packet Mirroring lets you identify network anomalies within and across VPCs, internal traffic from VMs to VMs, traffic between end locations on the internet and VMs, and traffic between VMs and Google services in production. Packet Mirroring is available in all Google Cloud Platform (GCP) regions, for all machine types, for both Compute Engine instances and GKE clusters.

In short, Packet Mirroring allows you to:

- Help ensure advanced network security by proactively detecting threats. Respond to intrusions with signature-based detection of predetermined attack patterns, and also identify previously unknown attacks with anomaly-based detection.
- Improve application availability and performance with the capability to diagnose and analyze what’s going on over the wire, instead of relying only on application logs.
- Support regulatory and compliance requirements by logging and monitoring transactions for auditing purposes.

“Google Cloud’s new Packet Mirroring service accelerates our cloud adoption by giving us the visibility we need to secure our applications and protect our most precious asset, our customers.” – Diane Brown, Senior Director IT Risk Management, Ulta Beauty

Packet Mirroring is important for enterprise users from both a security and a networking perspective. You can use Packet Mirroring in a variety of deployment setups and network topologies, such as VPC Network Peering and Shared VPC. In Shared VPC environments, for instance, an organization may have packet mirroring policies and collector backends set up by the networking or security team in the host project, while the packet mirroring policy is enabled in the service projects where the developer teams run their applications. This centralized deployment model makes Packet Mirroring easier to use for security and networking teams, while keeping it transparent to the development teams.

Packet Mirroring is natively integrated with Google’s Andromeda SDN fabric. This approach keeps Packet Mirroring performance and management overhead low, as the receiving software appliances running on the collector backends don’t need to perform any decapsulation to parse the received mirrored packet data.

Partnering for network security

We’ve been working with several partners to help us test and develop Packet Mirroring, and have received valuable feedback along the way.
Here are our Packet Mirroring partners, and how they work with the tool:

- Awake Security – Awake Delivers Security with Network Traffic Analysis in Google Cloud
- Check Point – CloudGuard IaaS Now Integrates with Google Cloud Packet Mirroring
- Cisco – Cisco Stealthwatch Cloud and Google Cloud continue partnership to secure customers
- Corelight – Finding Truth in the Cloud: Google Cloud Packet Mirroring & Corelight Network Traffic Analysis
- cPacket Networks – Googling Packets Inside Google Cloud
- ExtraHop Networks – ExtraHop Reveal(x) + Google Cloud Packet Mirroring
- Flowmon – Enhancing Network Visibility & Security in Google Cloud
- Ixia by Keysight Technologies – Improving cloud visibility with CloudLens and Packet Mirroring
- Netscout – NETSCOUT Extends Its Visibility Without Borders into Google Cloud with Packet Mirroring
- Palo Alto Networks – Announcing the new VM-Series Integration with Google Cloud Packet Mirroring Service

Help ensure security and compliance in the cloud

Our goal is to give you the right advanced security solutions for connecting your business to Google Cloud. With Packet Mirroring, you can reduce risk, diagnose issues to ensure the availability of your mission-critical applications and services, and meet compliance requirements. Click here to learn more about GCP’s cloud networking portfolio, and reach out to us with feedback at gcp-networking@google.com.
Source: Google Cloud Platform

8 production-ready features you’ll find in Cloud Run fully managed

Since we launched Cloud Run at Google Cloud Next in April, developers have discovered that “serverless” and “containers” run well together. With Cloud Run, not only do you benefit from fully managed infrastructure, up-and-down autoscaling, and pay-as-you-go pricing, but you’re also able to package your workload however you like, inside a stateless container listening for incoming requests, with any language, runtime, or library of your choice. And you get all this without compromising portability, thanks to its Knative open-source underpinnings. Many Google Cloud customers already use Cloud Run in production, for example to deploy public websites or APIs, or as a way to perform fast and lightweight data transformations or background operations.

“Cloud Run promises to dramatically reduce the operational complexity of deploying containerized software. The ability to put an automatically scaling service in production with one command is very attractive.” – Jamie Talbot, Principal Engineer at Mailchimp

Cloud Run recently became generally available, both as a fully managed platform and on Anthos, and offers a number of new features. What are those new capabilities? Today, let’s take a look at what’s new in the fully managed Cloud Run platform.

1. Service level agreement

With general availability, Cloud Run now comes with a Service Level Agreement (SLA). In addition, it now offers data location commitments that allow you to store customer data in a specific region or multi-region.

2. Available in 9 GCP regions

In addition to South Carolina, Iowa, Tokyo, and Belgium, in the coming weeks you’ll also be able to deploy containers to Cloud Run in Northern Virginia, Oregon, Netherlands, Finland, and Taiwan, for a total of nine cloud regions.

3. Max instances

Autoscaling can be magic, but there are times when you want to limit the maximum number of instances of your Cloud Run service, for example to limit costs. Or imagine a backend service like a database is limited to a certain number of connections—you might want to limit the number of instances that can connect to that service. With the max instances feature, you can now set such a limit. Use the Cloud Console or the Cloud SDK to set it:

gcloud run services update SERVICE-NAME --max-instances 42

4. More secure: HTTPS only

All fully managed Cloud Run services receive a stable and secure URL. Cloud Run now only accepts secure HTTPS connections and redirects any HTTP connection to the HTTPS endpoint. Having an HTTPS endpoint does not mean that your service is publicly accessible, though—you are in control and can opt into allowing public access to your service. Alternatively, you can require authentication by leveraging the “Cloud Run Invoker” IAM role.

5. Unary gRPC protocol support

Cloud Run now lets you deploy and run unary (non-streaming) gRPC services, allowing your microservices to leverage this RPC framework. To learn more, read Peter Malinas’ tutorial on Serverless gRPC with Cloud Run using Go, as well as Ahmet Alp Balkan’s article on gRPC authentication on Cloud Run.

6. New metrics to track your instances

Out of the box, Cloud Run integrates with Stackdriver Monitoring.
From within the Google Cloud Console, the Cloud Run page now includes a new “Metrics” tab that shows charts of key performance indicators for your Cloud Run service: requests per second, request latency, used instance time, and CPU and memory usage. A new built-in Stackdriver metric called container/billable_instance_time gives you insight into the number of container instances for a service, with the billable time aggregated across all container instances.

7. Labels

Like the bibs that identify the runners in a race, GCP labels can help you easily identify a set of services, break down costs, or distinguish different environments. You can set labels from the Cloud Run service list page in the Cloud Console, or update labels with this command and flag:

gcloud run services update SERVICE-NAME --update-labels KEY=VALUE

8. Terraform support

Finally, if you practice Infrastructure as Code, you’ll be glad to know that Terraform now supports Cloud Run, allowing you to provision Cloud Run services from a Terraform configuration.

Ready, set, go!

The baton is now in your hands. To start deploying your container images to Cloud Run, head over to our quickstart guides on building and deploying your images. With the always-free tier and the $300 credit for new GCP accounts, you’re ready to take Cloud Run for a spin. To learn more, there’s the documentation, of course, as well as numerous samples with different language runtimes (don’t miss the “Run on Google Cloud” button to automatically deploy your code). In addition, be sure to check out the community-contributed resources in the Awesome Cloud Run GitHub project. We’re looking forward to seeing what you build and deploy!
Source: Google Cloud Platform

New climate model data now in Google Public Datasets

Exploring public datasets is an important aspect of modern data analytics, and all this gathered data can help us understand our world. At Google Cloud, we maintain a collection of public datasets, and we’re pleased to collaborate with the Lamont-Doherty Earth Observatory (LDEO) of Columbia University and the Pangeo Project to host the latest climate simulation data in the cloud. The World Climate Research Programme (WCRP) recently began releasing the Coupled Model Intercomparison Project Phase 6 (CMIP6) data archive, aggregating the climate models created across approximately 30 working groups and 1,000 researchers investigating the urgent environmental problem of climate change. The CMIP6 climate model datasets include rich details on many aspects of the climate system, including historical and future simulations. The data are now accessible in Cloud Storage and will be in BigQuery soon. Along with making CMIP6 available on Google Cloud, the Pangeo Project develops software and infrastructure to make it easier to analyze and visualize climate data using cloud computing.

On Google Cloud, this dataset will be continuously updated and available to researchers around the globe to use for their own projects—without the constraints of downloading terabytes or even petabytes of data. The entire archive may eventually contain 20 PB of data, of which about 100 TB are currently available in the cloud. You can request data from Pangeo’s CMIP6 Google Cloud Collection in this form.

“It’s a very live data set. It’s going to be updated over the next year as the data come online and as people’s needs arise,” says Ryan Abernathey, associate professor of Earth and environmental sciences at Columbia University and LDEO. He emphasizes the practical impact of this project. “What people actually care about most is not the global mean temperature because no one lives in the ‘global mean world.’ People care about the local impacts of drought or extreme rainfall, which can cause severe hardship for society. With these high-resolution simulations of rare events, we get much better information for planning in response to expected changes in the climate.”

What you’ll find in the CMIP6 data

The models in CMIP6’s data range from high-resolution simulations based on historical data from 1850 onward to hypothetical scenarios that manipulate key variables. For example, Abernathey asks, “What if carbon dioxide (CO2) were to instantaneously quadruple its concentration overnight? That’s a very useful experiment, not because it helps us make a detailed projection about the future, but because it helps us probe our physical understanding of how the climate system responds to CO2.” Each of the CMIP6 models includes dozens of variables, ensemble members, and scenarios, leading to large, unwieldy datasets. But Pangeo, an ensemble of open-source Python tools for big data analysis, makes it easier to perform large-scale computations on CMIP6 and other similar large datasets.

To help researchers work with the multidimensional datasets of climate research, Abernathey and his colleagues at LDEO and the National Center for Atmospheric Research (NCAR) drew on funding from the National Science Foundation (NSF) and computing support from Google Cloud to develop Pangeo, which is an open-source platform aimed at accelerating geoscience data analysis.
Pangeo can be run on nearly any high-performance computing system, including Google Kubernetes Engine (GKE), which supports easy deployment with autoscaling (both up and down) and integration with other Google Cloud tools such as Cloud Storage and BigQuery. The Pangeo community shares expertise, such as use cases for different domain-specific applications, and contributes to the development of open-source tools, like a cloud-optimized data storage format called Zarr.

“The CMIP project has grown since its early days, and now is seeing tremendous growth beyond the U.S. and E.U. into the developing world,” says V. Balaji, a computational climate scientist on leave from Princeton University. Currently at the Institut Pierre-Simon Laplace in Paris, Balaji has been involved with all aspects of CMIP, from defining the experiments and running the simulations to analyzing the output and designing the Earth System Grid Federation (ESGF), a network of services that underpin the global data infrastructure enabling this critical research enterprise. “For new entrants, and for academic researchers worldwide, Pangeo in the cloud represents an exciting new opportunity to broaden the user base of very large-scale climate data, without the need to acquire supercomputer-scale storage and analysis facilities,” says Balaji. “It bridges what I call the gap between ‘inspiration-driven’ and ‘industrial strength’ science, enabling a scientist to explore the data and design their own analysis, and immediately apply their findings at very large scale. The progress of Pangeo in the cloud will inform our own architectural choices in designing the future of the global climate data infrastructure.”

The Pangeo team at LDEO and NCAR recently hosted a hackathon to jumpstart the analysis of the CMIP6 data on Google Cloud for pressing scientific questions. One participant—Henri Drake, a Ph.D. candidate in MIT’s Program in Atmospheres, Oceans, and Climate—created a tutorial for analyzing simulations of global warming in state-of-the-art CMIP6 models, under the worst-case scenario of uncontrolled greenhouse gas emissions. These CMIP6 model projections “reflect millions of lines of model code and represent everything from forest transpiration in the Amazon rainforest and thunderstorms in the U.S. Midwest to the formation of meltwater ponds on Arctic sea ice,” says Drake. “We would need a huge supercomputer to run the simulations from the model source code ourselves. Thankfully, the climate modeling community does this for us by making their output publicly available.”

Drake used these tutorials as a teaching assistant for the Climate Change course at MIT to demonstrate the ease of cloud computing for data-intensive climate science research, and also the value of open-source tools like the Pangeo software stack on Google Cloud. “The CMIP6 dataset was already technically publicly available, it just was not very accessible,” says Drake. “The cloud-based data and computation, when combined with the Pangeo software stack, enabled me to make calculations in just a few hours that could have taken weeks using more conventional methods. Using the Pangeo binder, it was easy to make these calculations available to the rest of the world.”

The CMIP6 data join many other weather and climate-related datasets available through Google’s Public Dataset program at no charge.
By making data more accessible and usable with BigQuery and Cloud Storage, we support academic research by accelerating discoveries and promoting innovative solutions to complex problems. For Abernathey, the benefits of cloud computing are a particularly good match for the needs of scientific research: “With Google Cloud, you’ve essentially got a supercomputer just sitting right there, so you can directly process the data at a very high speed.”

Get started with your own project by requesting data here.
Source: Google Cloud Platform