Tips to get the most out of Google Cloud Documentation

As a Google Cloud practitioner, you typically spend a lot of time in the documentation reading guides, command references, tutorials, and more. The documentation team has introduced several features over the years to make it easier to be productive while working with Google Cloud documentation. A few of these tips will already be familiar to some of you, but I hope there is at least one that you take away and find helpful. In no particular order, here is my personal list of tips that I have found useful.

Interactive Tutorials or Walkthroughs

This is an excellent feature of the documentation: an interactive tutorial opens up right in the Google Cloud console and you can complete it as a sequence of steps. Several tutorials are available from the Google Cloud console via the Support icon in the top action bar.

Search

The Search bar at the top of the Google Cloud console is an efficient way to search for product services, documentation pages, tutorials, and even Google Cloud resources (for example, Compute Engine VM names). While you can locate a specific product page from the hamburger menu on the top left and the subsequent left-navigation bar, the Search bar is probably the quickest way to get to a product. (Extra points to the power users who have used the "pin" feature to lock frequently used products at the top of the list in the left-navigation bar.) Here is a screencast demonstrating how to search for a specific product. You will notice that it is not just about going to a specific product; the results also surface different sections (Tutorials, Google Cloud Resources, and so on).

If you would like to look at all the products and their related documentation straight away, check out the View all Products link in the left-navigation bar. The screencast below demonstrates that.

Need more tutorials, Quickstarts and reference guides?

You have probably noticed that as you navigate the documentation, there is a list of tutorials, quickstarts, and reference guides available for each product. There are a couple of ways I use to get more information on a specific product. First, you will notice that some product pages have a Learn icon. Here is a sample of the Compute Engine product home page. Click the Learn button to get access to a collection of related documentation for the product.

At times, I want to try out a few more interactive tutorials (walkthroughs). As we saw earlier, the Support icon in the top action bar gives you access to some interactive tutorials via the Start a tutorial link. This list is limited, but there are other interactive tutorials available, and you can find them as follows. Let's say you are interested in learning more about IAM and want to check out the interactive tutorials available for this service. Go to the main Search bar at the top and enter IAM. This presents a list of search results, as we saw earlier, with a few results under the Documentation and Tutorials sections. The keyword here is Interactive Tutorial. If you click See more results, this leads to a search results page where you can filter down to interactive tutorials only.

Saving your favorite documentation pages

At the top of each documentation page, you will see a Bookmark icon that you can click to save the page to your collection of documentation pages, which you can then reference easily from your Google profile. For example, here is a documentation page on how to create and start a VM instance in Compute Engine. I wish to bookmark this document; all I need to do is click the Bookmark icon as shown below. You can choose to save it to My saved pages or create a new collection and save it there. In my case, I created a new collection named Compute Engine and chose to bookmark this page under it.

How do you access all your bookmarked pages? On the top bar, next to your Google profile picture, you will see a set of three dots; click on that. This provides a way to visit the Google Developer Profile associated with that account. One of the options, as you can see below, is Saved pages. When you visit that page, you will see your saved pages as shown below. You can tap on any of the collections you have created, and all your bookmarks will be available under it.

Providing Feedback

Your feedback is valuable, and Google Cloud documentation makes it easy to submit. Notice the Send feedback button on the documentation pages. Click it to give us feedback on the specific page or on the product documentation in general.

Interactive Code samples

This one continues to be one of my favorites, and it boosts developer productivity considerably, especially when you are trying out the various gcloud commands. The feature lets you set placeholder variables in the commands (for example, project ID, region, and so on) that repeat across a series of commands. The feature is more than two years old and is well documented in the following blog post. I reproduce a screencast of it here, along with the text from that blog post pertaining to this feature:

"If a page has multiple code samples with the same placeholder variable, you only need to replace the variable once. For example, when you replace a PROJECT_ID variable with your own Google Cloud project ID, all instances of the PROJECT_ID variable (including in any other command line samples on the page) will use the same Google Cloud project ID."

I hope this set of tips was useful to you. If you would like to try out an interactive tutorial, try the Compute Engine quickstart. I am sure you have a list of your own tips that you have found useful while working with Google Cloud documentation. Do reach out on Twitter (@iRomin) with them; I'd love to hear about them.
Source: Google Cloud Platform

Run faster and more cost-effective Dataproc jobs

Dataproc is a fully managed service for hosting open-source distributed processing platforms such as Apache Hive, Apache Spark, Presto, Apache Flink, and Apache Hadoop on Google Cloud. Dataproc provides the flexibility to provision and configure clusters of varying sizes on demand. In addition, Dataproc has powerful features that enable your organization to lower costs, increase performance, and streamline operational management of workloads running on the cloud.

Dataproc is an important service in any data lake modernization effort. Many customers begin their journey to the cloud by migrating their Hadoop workloads to Dataproc and continue to modernize their solutions by incorporating the full suite of Google Cloud's data offerings.

This guide demonstrates how you can optimize Dataproc job stability, performance, and cost-effectiveness. You can achieve this by using a workflow template to deploy a configured ephemeral cluster that runs a Dataproc job with calculated application-specific properties.

Before you begin

Prerequisites:
- A Google Cloud project
- A 100-level understanding of Dataproc (FAQ)
- Experience with shell scripting, YAML templates, and the Hadoop ecosystem
- An existing Dataproc application, referred to as "the job" or "the application"
- Sufficient project quotas (CPUs, disks, etc.) to create clusters

Consider Dataproc Serverless or BigQuery

Before getting started with Dataproc, determine whether your application is suitable for (or portable to) Dataproc Serverless or BigQuery. These managed services will save you time spent on maintenance and configuration. This blog assumes the user has identified Dataproc as the best choice for their scenario. For more information about other solutions, please check out some of our other guides, such as Migrating Apache Hive to BigQuery and Running an Apache Spark Batch Workload on Dataproc Serverless.

Separate data from computation

Consider the advantages of using Cloud Storage. Using this persistent storage for your workflows has the following advantages:
- It's a Hadoop Compatible File System (HCFS), so it's easy to use with your existing jobs.
- Cloud Storage can be faster than HDFS. In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take a few seconds to many minutes, depending on the size and state of your data.
- It requires less maintenance than HDFS.
- It enables you to easily use your data with the whole range of Google Cloud products.
- It's considerably less expensive than keeping your data in replicated (3x) HDFS on a persistent Dataproc cluster.

Pricing comparison examples (North America, as of 11/2022):
- Cloud Storage: $0.004 – $0.02 per GB, depending on the tier
- Persistent Disk: $0.04 – $0.34 per GB, plus compute VM costs

Here are some guides on Migrating On-Premises Hadoop Infrastructure to Google Cloud and HDFS vs. Cloud Storage: Pros, cons, and migration tips. Google Cloud has also developed an open-source tool for performing HDFS to Cloud Storage migrations.

Optimize your Cloud Storage

When using Dataproc, you can create external tables in Hive, HBase, and so on, where the schema resides in Dataproc but the data resides in Google Cloud Storage. Separating compute and storage enables you to scale your data independently of compute power.

In on-premises HDFS/Hive setups, compute and storage were closely tied together, either on the same machine or on a nearby machine. When using Google Cloud Storage instead of HDFS, you separate compute and storage at the expense of latency: it takes time for Dataproc to retrieve files from Google Cloud Storage. Many small files (for example, millions of files smaller than 1 MB) can negatively affect query performance, and file type and compression also affect query performance. When performing data analytics on Google Cloud, it is important to be deliberate in choosing your Cloud Storage file strategy.

Monitoring Dataproc Jobs

As you navigate through the following guide, you'll submit Dataproc jobs and continue to optimize runtime and cost for your use case. Monitor the Dataproc Jobs console during and after job submissions to get in-depth information on Dataproc cluster performance. Here you will find specific metrics that help identify opportunities for optimization, notably YARN Pending Memory, YARN NodeManagers, CPU Utilization, HDFS Capacity, and Disk Operations. Throughout this guide you will see how these metrics influence changes in cluster configurations.
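If you prefer to pull these signals programmatically rather than reading them in the console, the sketch below queries one such metric through the Cloud Monitoring API. This is an illustrative sketch and not part of the original guide; the metric type and resource label names are assumptions based on Dataproc's published Cloud Monitoring metrics, so verify them against the metrics list for your environment.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"      # assumption: replace with your project ID
CLUSTER_NAME = "my-cluster"    # assumption: replace with your cluster name

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now - 3600)},  # look at the last hour
    }
)

# YARN pending memory is a key right-sizing signal: a well-sized cluster keeps
# it near zero without holding large amounts of idle compute.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "dataproc.googleapis.com/cluster/yarn/pending_memory_size" '
            f'AND resource.labels.cluster_name = "{CLUSTER_NAME}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        # point.value is a TypedValue proto; print it raw for inspection.
        print(point.interval.end_time, point.value)
```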
Guide: Run Faster and Cost-Effective Dataproc Jobs

1. Getting started

This guide demonstrates how to optimize the performance and cost of applications running on Dataproc clusters. Because Dataproc supports many big data technologies, each with its own intricacies, this guide is intended as trial-and-error experimentation. Initially you begin with a generic Dataproc cluster with defaults set. As you proceed through the guide, you'll increasingly customize Dataproc cluster configurations to fit your specific workload.

Plan to separate Dataproc jobs into different clusters; each data processing platform uses resources differently, and they can impact each other's performance when run simultaneously. Even better, isolating single jobs to single clusters sets you up for ephemeral clusters, where jobs can run in parallel on their own dedicated resources.

Once your job is running successfully, you can safely iterate on the configuration to improve runtime and cost, falling back to the last successful run whenever experimental changes have a negative impact.

You can export an existing cluster's configuration to a file during experimentation, and use that configuration to create new clusters through the import command:

```
gcloud dataproc clusters export my-cluster \
    --region=region \
    --destination=my-cluster.yaml

gcloud dataproc clusters import my-new-cluster \
    --region=us-central1 \
    --source=my-cluster.yaml
```

Keep these files as a reference to the last successful configuration in case drift occurs.

2. Calculate Dataproc cluster size

a. Via an on-prem workload (if applicable)

View the YARN UI. If you've been running this job on-premises, you can identify the resources (vCores and memory) used for the job in the YARN UI, for example from a Spark job that ran successfully on-prem. From those key performance indicators you can calculate the total vCores and memory the job needs. Now that you have the cluster sizing on-prem, the next step is to identify the initial cluster size on Google Cloud.

Calculate the initial Dataproc cluster size. For this exercise assume you are using n2-standard-8, but a different machine type might be more appropriate depending on the type of workload. n2-standard-8 has 8 vCPUs and 32 GiB of memory. View other Dataproc-supported machine types here. Calculate the number of machines required based on the number of vCores the job requires, and take note of the calculations for your own job or workload.
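To make that arithmetic concrete, here is a minimal sizing sketch. The workload numbers are hypothetical placeholders rather than figures from an actual job; substitute the totals you read from your own YARN UI.

```python
import math

# Hypothetical on-prem YARN totals for the job (replace with your own numbers).
required_vcores = 120        # peak vCores allocated to the job
required_memory_gib = 480    # peak memory allocated to the job, in GiB

# n2-standard-8: 8 vCPUs and 32 GiB of memory per worker.
vcores_per_worker = 8
memory_gib_per_worker = 32

workers_by_cpu = math.ceil(required_vcores / vcores_per_worker)
workers_by_memory = math.ceil(required_memory_gib / memory_gib_per_worker)

# Size the cluster for the more demanding of the two dimensions.
initial_workers = max(workers_by_cpu, workers_by_memory)
print(f"Initial cluster size: {initial_workers} x n2-standard-8 primary workers")
```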
b. Via an autoscaling cluster

Alternatively, an autoscaling cluster can help determine the right number of workers for your application. This cluster has an autoscaling policy attached. Set the autoscaling policy min/max values to whatever your project or organization allows, then run your jobs on this cluster. Autoscaling will continue to add nodes until the YARN pending memory metric is zero. A perfectly sized cluster minimizes the amount of YARN pending memory while also minimizing excess compute resources.

Deploying a sizing Dataproc cluster

Example:
- 2 primary workers (n2-standard-8)
- 0 secondary workers (n2-standard-8)
- pd-standard 1000GB
- Autoscaling policy: 0 min, 100 max
- No application properties set

sample-autoscaling-policy.yml:

```
workerConfig:
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
basicAlgorithm:
  cooldownPeriod: 5m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
```

```
gcloud dataproc autoscaling-policies import policy-name \
    --source=sample-autoscaling-policy.yml \
    --region=region

gcloud dataproc clusters create cluster-name \
    --master-machine-type=n2-standard-8 \
    --worker-machine-type=n2-standard-8 \
    --master-boot-disk-type=pd-standard \
    --master-boot-disk-size=1000GB \
    --autoscaling-policy=policy-name \
    --region=region
```

Submitting Jobs to the Dataproc Cluster

```
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --jar=<your-spark-jar-path> \
    --properties='spark.executor.cores=5,spark.executor.memory=4608mb' \
    -- arg1 arg2
```

Monitoring Worker Count / YARN NodeManagers

Observe the peak number of workers (YARN NodeManagers) required to complete your job. To calculate the number of required cores, multiply the machine size (2, 8, 16, etc.) by the number of NodeManagers.

3. Optimize Dataproc cluster configuration

Using a non-autoscaling cluster during this experimentation phase can lead to the discovery of more accurate machine types, persistent disks, application properties, and so on. For now, build an isolated non-autoscaling cluster for your job that has the optimized number of primary workers.

Example:
- N primary workers (n2-standard-8)
- 0 secondary workers (n2-standard-8)
- pd-standard 1000GB
- No autoscaling policy
- No application properties set

Deploying a non-autoscaling Dataproc cluster

```
gcloud dataproc clusters create cluster-name \
    --master-machine-type=n2-standard-8 \
    --worker-machine-type=n2-standard-8 \
    --master-boot-disk-type=pd-standard \
    --master-boot-disk-size=1000GB \
    --region=region \
    --num-workers=x
```

Choose the right machine type and machine size

Run your job on this appropriately sized non-autoscaling cluster. If the CPU is maxing out, consider using the C2 machine type. If memory is maxing out, consider using N2D high-memory machine types. Prefer smaller machine types (for example, switch n2-highmem-32 to n2-highmem-8); it's okay to have clusters with hundreds of small machines. For Dataproc clusters, choose the smallest machine with maximum network bandwidth (32 Gbps).
Typically these machines are n2-standard-8 or n2d-standard-16. On rare occasions you may need to increase the machine size to 32 or 64 cores, which can be necessary if your organization is running low on IP addresses or you have heavy ML or processing workloads. Refer to the machine families resource and comparison guide in the Compute Engine documentation for more information.

Submitting Jobs to the Dataproc Cluster

```
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --jar=<your-spark-jar-path> \
    -- arg1 arg2
```

Monitoring Cluster Metrics

Monitor memory and CPU utilization to determine the right machine type.

Choose the right persistent disk

If you're still observing performance issues, consider moving from pd-standard to pd-balanced or pd-ssd:
- Standard persistent disks (pd-standard) are best for large data processing workloads that primarily use sequential I/O. For pd-standard without local SSDs, we strongly recommend provisioning 1 TB (1000GB) or larger to ensure consistently high I/O performance.
- Balanced persistent disks (pd-balanced) are an alternative to SSD persistent disks that balance performance and cost. With the same maximum IOPS as SSD persistent disks and lower IOPS per GB, a balanced persistent disk offers performance levels suitable for most general-purpose applications at a price point between that of standard and SSD persistent disks.
- SSD persistent disks (pd-ssd) are best for enterprise applications and high-performance database needs that require lower latency and more IOPS than standard persistent disks provide.

For similar costs, pd-standard 1000GB == pd-balanced 500GB == pd-ssd 250GB. Be certain to review the performance impact when configuring disks. See Configure Disks to Meet Performance Requirements for information on disk I/O performance, and Machine Type Disk Limits for information on the relationships between machine types and persistent disks. If you are using machines with 32 cores or more, consider switching to multiple local SSDs per node to get enough performance for your workload.

You can monitor HDFS Capacity to determine disk size; if HDFS Capacity ever drops to zero, you'll need to increase the persistent disk size. If you observe any throttling of disk bytes or disk operations, you may need to consider changing your cluster's persistent disk to balanced or SSD.

Choose the right ratio of primary workers vs. secondary workers

Your cluster must have primary workers. If you create a cluster and do not specify the number of primary workers, Dataproc adds two primary workers to the cluster. Then you must determine whether you prioritize performance or cost optimization. If you prioritize performance, use 100% primary workers; if you prioritize cost optimization, specify the remaining workers to be secondary workers.

Primary worker machines are dedicated to your cluster and provide HDFS capacity. Secondary worker machines, on the other hand, come in three types: spot VMs, standard preemptible VMs, and non-preemptible VMs. By default, secondary workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and does not run HDFS. Be aware that secondary workers may not be dedicated to your cluster and may be removed at any time.
Ensure that your application is fault-tolerant when using secondary workers.

Consider attaching Local SSDs

Some applications may require higher throughput than persistent disks provide. In these scenarios, experiment with local SSDs. Local SSDs are physically attached to the cluster and provide higher throughput than persistent disks (see the performance table). Local SSDs are available at a fixed size of 375 gigabytes, but you can add multiple SSDs to increase performance.

Local SSDs do not persist data after a cluster is shut down. If persistent storage is desired, you can use SSD persistent disks, which provide higher throughput for their size than standard persistent disks. SSD persistent disks are also a good choice if the partition size will be smaller than 8 KB (however, avoid small partitions). As with persistent disks, continue to monitor any throttling of disk bytes or disk operations to determine whether local SSDs are appropriate.

Consider attaching GPUs

For even more processing power, consider attaching GPUs to your cluster. Dataproc provides the ability to attach graphics processing units (GPUs) to the master and worker Compute Engine nodes in a Dataproc cluster. You can use these GPUs to accelerate specific workloads on your instances, such as machine learning and data processing. GPU drivers are required to utilize any GPUs attached to Dataproc nodes; you can install GPU drivers by following the instructions for this initialization action.

Creating a cluster with GPUs:

```
gcloud dataproc clusters create cluster-name \
    --region=region \
    --master-accelerator type=nvidia-tesla-k80 \
    --worker-accelerator type=nvidia-tesla-k80,count=4 \
    --secondary-worker-accelerator type=nvidia-tesla-k80,count=4 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh
```

Sample cluster for a compute-heavy workload:

```
gcloud dataproc clusters create cluster-name \
    --master-machine-type=c2-standard-30 \
    --worker-machine-type=c2-standard-30 \
    --master-boot-disk-type=pd-balanced \
    --master-boot-disk-size=500GB \
    --region=region \
    --num-workers=10
```

4. Optimize application-specific properties

If you're still observing performance issues, you can begin to adjust application properties. Ideally these properties are set on the job submission, isolating properties to their respective jobs. View the best practices for your application below:
- Spark Job Tuning
- Hive Performance Tuning
- Tez Memory Tuning
- Performance and Efficiency in Apache Pig

Submitting Dataproc jobs with properties:

```
gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --jar=my_jar.jar \
    --properties='spark.executor.cores=5,spark.executor.memory=4608mb' \
    -- arg1 arg2
```

5. Handle edge-case workload spikes via an autoscaling policy

Now that you have an optimally sized, configured, and tuned cluster, you can choose to introduce autoscaling. Autoscaling should not be viewed as a cost-optimization technique, because aggressive up/down scaling can lead to Dataproc job instability.
However, conservative autoscaling can improve Dataproc cluster performance during edge cases that require more worker nodes. A few guidelines:
- Use ephemeral clusters (see the next step) to allow clusters to scale up, and delete them when the job or workflow is complete.
- Ensure primary workers make up >50% of your cluster.
- Avoid autoscaling primary workers. Primary workers run HDFS DataNodes, while secondary workers are compute-only workers. HDFS's NameNode has multiple race conditions that can leave HDFS in a corrupted state where decommissioning gets stuck forever. Primary workers are more expensive but provide job stability and better performance. The ratio of primary workers to secondary workers is a tradeoff you can make: stability versus cost. Note: having too many secondary workers can create job instability; best practice is to avoid having a majority of secondary workers.
- Prefer ephemeral, non-autoscaled clusters where possible. Allow these to scale up, and delete them when jobs are complete.
- As stated earlier, avoid scaling down workers because it can lead to job instability. Set scaleDownFactor to 0.0 for ephemeral clusters.

Creating and attaching autoscaling policies

sample-autoscaling-policy.yml:

```
workerConfig:
  minInstances: 10
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 50
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 0.0
    gracefulDecommissionTimeout: 0s
```

```
gcloud dataproc autoscaling-policies import policy-name \
    --source=sample-autoscaling-policy.yml \
    --region=region

gcloud dataproc clusters update cluster-name \
    --autoscaling-policy=policy-name \
    --region=region
```

6. Optimize cost and reusability via ephemeral Dataproc clusters

There are several key advantages of using ephemeral clusters:
- You can use different cluster configurations for individual jobs, eliminating the administrative burden of managing tools across jobs.
- You can scale clusters to suit individual jobs or groups of jobs.
- You only pay for resources when your jobs are using them.
- You don't need to maintain clusters over time, because they are freshly configured every time you use them.
- You don't need to maintain separate infrastructure for development, testing, and production. You can use the same definitions to create as many different versions of a cluster as you need, when you need them.

Build a custom image

Once you have satisfactory cluster performance, you can begin to transition from a non-autoscaling cluster to an ephemeral cluster. Does your cluster have init scripts that install various software? Use Dataproc custom images. This allows you to create ephemeral clusters with faster startup times.
Google Cloud provides an open-source tool to generate custom images.

Generate a custom image:

```
git clone https://github.com/GoogleCloudDataproc/custom-images.git

cd custom-images || exit

python generate_custom_image.py \
    --image-name "<image-name>" \
    --dataproc-version 2.0-debian10 \
    --customization-script ../scripts/customize.sh \
    --zone zone \
    --gcs-bucket gs://"<gcs-bucket-name>" \
    --disk-size 50 \
    --no-smoke-test
```

Using custom images:

```
gcloud dataproc clusters create cluster-name \
    --image=projects/<PROJECT_ID>/global/images/<IMAGE_NAME> \
    --region=region

gcloud dataproc workflow-templates instantiate-from-file \
    --file ../templates/pyspark-workflow-template.yaml \
    --region region
```

Create a Workflow Template

To create an ephemeral cluster, you'll need to set up a Dataproc workflow template. A workflow template is a reusable workflow configuration. It defines a graph of jobs with information on where to run those jobs. Use the gcloud dataproc clusters export command to generate YAML for your cluster config:

```
gcloud dataproc clusters export my-cluster \
    --region=region \
    --destination=my-cluster.yaml
```

Use this cluster config in your workflow template. Point to your newly created custom image and your application, and add your job-specific properties.

Sample workflow template (with custom image):

```
---
jobs:
  - pysparkJob:
      properties:
        spark.pyspark.driver.python: '/usr/bin/python3'
      args:
        - "arg1"
      mainPythonFileUri: gs://<path-to-python-script>
    stepId: step1
placement:
  managedCluster:
    clusterName: cluster-name
    config:
      gceClusterConfig:
        zoneUri: zone
      masterConfig:
        diskConfig:
          bootDiskSizeGb: 500
        machineTypeUri: n1-standard-4
        imageUri: projects/<project-id>/global/images/<image-name>
      workerConfig:
        diskConfig:
          bootDiskSizeGb: 500
        machineTypeUri: n1-standard-4
        numInstances: 2
        imageUri: projects/<project-id>/global/images/<image-name>
      initializationActions:
        - executableFile: gs://<path-to-init-script>
          executionTimeout: '3600s'
```

Deploying an ephemeral cluster via a workflow template:

```
gcloud dataproc workflow-templates instantiate-from-file \
    --file ../templates/pyspark-workflow-template.yaml \
    --region region
```

Dataproc workflow templates provide a Dataproc orchestration solution for use cases such as:
- Automation of repetitive tasks
- Transactional fire-and-forget API interaction model
- Support for ephemeral and long-lived clusters
- Granular IAM security

For broader data orchestration strategies, consider a more comprehensive data orchestration service like Cloud Composer.

Next steps

This post demonstrates how you can optimize Dataproc job stability, performance, and cost-effectiveness. Use workflow templates to deploy a configured ephemeral cluster that runs a Dataproc job with calculated application-specific properties.
Finally, there are many ways you can continue striving for optimal performance. Please review and consider the guidance laid out in the Google Cloud Blog. For general best practices, check out Dataproc best practices | Google Cloud Blog. For guidance on running in production, check out 7 best practices for running Cloud Dataproc in production | Google Cloud Blog.
Source: Google Cloud Platform

Opinary generates recommendations faster on Cloud Run

Editor's note: Berlin-based startup Opinary migrated their machine learning pipeline from Google Kubernetes Engine (GKE) to Cloud Run. After making a few architectural changes, their pipeline is now faster and more cost-efficient. They reduced the time to generate a recommendation from 20 seconds to a second, and realized a remarkable 50% cost reduction. In this post, Doreen Sacker and Héctor Otero Mediero share with us a detailed and transparent technical report of the migration.

Opinary asks the right questions to increase reader engagement

We're Opinary, and our reader polls appear in news articles globally. The polls let users share their opinion with one click and see how they compare to other readers. We automatically add the most relevant reader polls using machine learning. We've found that the polls help publishers increase reader retention, boost subscriptions, and improve other article success metrics. Advertisers benefit from access to their target groups contextually on premium publishers' sites, and from high-performing interaction with their audiences.

Let's look at an example of one of our polls. Imagine reading an article on your favorite news site about whether or not to introduce a speed limit on the highway. As you might know, long stretches of German Autobahn still don't have a legal speed limit, and this is a topic of intense debate. Critics of speeding point out the environmental impact and casualty toll. Opinary adds this poll to the article.

Diving into the architecture of our recommendation system

Here's how we originally architected our system on GKE. Our pipeline starts with an article URL and delivers a recommended poll to add to the article. Let's take a more detailed look at the various components that make this happen.

First, we push a message with the article URL to a Pub/Sub topic (a message queue). The recommender service pulls the message from the queue in order to process it. Before this service can recommend a poll, it needs to complete a few steps, which we've separated out into individual services. The recommender service sends a request to these services one by one and stores the results in a Redis store. These are the steps:
- The article scraper service scrapes (downloads and parses) the article text from the URL.
- The encoder service encodes the text into text embeddings (we use the universal sentence encoder).
- The brand safety service detects whether the article text includes descriptions of tragic events, such as death, murder, or accidents, because we don't want to add our polls to these articles.

With these three steps completed, the recommendation service can recommend a poll from our database of pre-existing polls and submit it to an internal database we call Rec Store. This is how we end up recommending a poll about introducing a speed limit on the German Autobahn.

Why we decided to move to Cloud Run

Cloud Run looked attractive to us for two reasons. First, because it automatically scales down all the way to zero container instances if there are no requests, we expected we would save costs (and we did!). Second, we liked the idea of running our code on a fully managed platform without having to worry about the underlying infrastructure, especially since our team doesn't have a dedicated data engineer (we're both data scientists).

As a fully managed platform, Cloud Run has been designed to make developers more productive.
It's a serverless platform that lets you run your code in containers, directly on top of Google's infrastructure. Deployments are fast and automated: fill in your container image URL and seconds later your code is serving requests. Cloud Run automatically adds more container instances to handle all incoming requests or events, and removes them when they're no longer needed. That's cost-efficient, and on top of that, Cloud Run doesn't charge you for the resources a container uses if it's not serving requests. The pay-for-use cost model was the main motivation for us to migrate away from GKE. We only want to pay for the resources we use, and not for a large idle cluster during the night.

Enabling the migration to Cloud Run with a few changes

To move our services from GKE to Cloud Run, we had to make a few changes:
- Change the Pub/Sub subscriptions from pull to push.
- Migrate our self-managed Redis database in the cluster to a fully managed Cloud Memorystore instance.

Changing Pub/Sub subscriptions from pull to push

Since Cloud Run services scale with incoming web requests, your container must have an endpoint to handle requests. Our recommender service originally didn't have an endpoint to serve requests, because we used the Pub/Sub client library to pull messages. Google recommends using push subscriptions instead of pull subscriptions to trigger Cloud Run from Pub/Sub. With a push subscription, Pub/Sub delivers messages as requests to an HTTPS endpoint. Note that this doesn't need to be Cloud Run; it can be any HTTPS URL. Pub/Sub guarantees delivery of a message by retrying requests that return an error or are too slow to respond (using a configurable deadline).
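To make the push model concrete, here is a minimal sketch of what such an HTTP endpoint can look like on a Cloud Run service. This is not Opinary's actual code; the payload shape and the recommend_poll helper are illustrative assumptions. Pub/Sub push delivers a JSON envelope whose message.data field is base64-encoded.

```python
import base64
import json

from flask import Flask, request

app = Flask(__name__)


def recommend_poll(article_url):
    """Hypothetical entry point into the recommendation pipeline."""
    ...


@app.route("/", methods=["POST"])
def handle_push():
    envelope = request.get_json()
    if not envelope or "message" not in envelope:
        return "Bad Request: not a Pub/Sub push message", 400

    # Pub/Sub wraps the publisher's payload in message.data as base64.
    payload = base64.b64decode(envelope["message"]["data"]).decode("utf-8")
    article = json.loads(payload)  # assumed shape: {"url": "https://..."}

    recommend_poll(article["url"])

    # Any 2xx response acknowledges the message; an error or a slow response
    # makes Pub/Sub retry the delivery within the configured deadline.
    return "", 204
```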
Introducing a Cloud Memorystore Redis instance

Cloud Run adds and removes container instances to handle all incoming requests. Redis doesn't serve HTTP requests, and it likes to have one or a few stateful container instances attached to a persistent volume, instead of disposable containers that start on demand. We created a Memorystore Redis instance to replace the in-cluster Redis instance. Memorystore instances have an internal IP address on the project's VPC network, and containers on Cloud Run operate outside of the VPC. That means you have to add a connector to reach internal IP addresses on the VPC. Read the docs to learn more about Serverless VPC Access.

Making it faster using Cloud Trace

This first part of our migration went smoothly, but while we were hopeful that our system would perform better, we would still regularly spend almost 20 seconds generating a recommendation. We used Cloud Trace to figure out where requests were spending time. This is what we found:
- To handle a single request, our code made roughly 2,000 requests to Redis. Batching all these requests into one request was a big improvement.
- The VPC connector has a default maximum limit on network throughput that was too low for our workload. Once we changed it to use larger instances, response times improved.

When we rolled out these changes, we realized a noticeable performance benefit.

Waiting for responses is expensive

The changes described above led to scalable and fast recommendations. We reduced the average recommendation time from 10 seconds to under 1 second. However, the recommendation service was getting very expensive, because it spent a lot of time doing nothing, waiting for other services to return their responses. The recommender service would receive a request and wait for other services to return a response. As a result, many container instances in the recommender service were running but were essentially doing nothing except waiting. Therefore, the pay-per-use cost model of Cloud Run led to high costs for this service: our costs went up by a factor of 4 compared with the original setup on Kubernetes.

Rethinking the architecture

To reduce costs, we needed to rethink our architecture. The recommendation service was sending requests to all other services and waiting for their responses. This is called an orchestration pattern. To have the services work independently, we changed to a choreography pattern. We needed the services to execute their tasks one after the other, but without a single service waiting for other services to complete. This is what we ended up doing:
- We changed the initial entry point to be the article scraping service, rather than the recommender service. Instead of returning the article text, the scraping service now stores the text in a Cloud Storage bucket.
- The next step in our pipeline is to run the encoder service, and we invoke it using an Eventarc trigger. Eventarc lets you asynchronously deliver events from Google services, including those from Cloud Storage. We've set an Eventarc trigger to fire an event as soon as the article scraper service adds the file to the Cloud Storage bucket. The trigger sends the object information to the encoder service using an HTTP request, as sketched below.
- The encoder service does its processing and saves the results in a Cloud Storage bucket again. One service after the other can now process and save the intermediate results in Cloud Storage for the next service to use.

Now that we asynchronously invoke all services using Eventarc triggers, no single service is actively waiting for another service to return results. Compared with the original setup on GKE, our costs are now 50% lower.
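As a rough illustration of one stage in this choreography (not Opinary's actual code; the bucket name and the encode helper are assumptions), a Cloud Run service behind an Eventarc Cloud Storage trigger receives the new object's metadata in the HTTP request body and can pick up the intermediate result from there:

```python
import json

from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)
storage_client = storage.Client()


def encode(text):
    """Hypothetical processing step for this stage (e.g. text embeddings)."""
    ...


@app.route("/", methods=["POST"])
def handle_storage_event():
    # Eventarc delivers the Cloud Storage object metadata as the request body.
    event = request.get_json()
    bucket_name = event["bucket"]
    object_name = event["name"]

    # Read the previous stage's output (e.g. the scraped article text).
    text = storage_client.bucket(bucket_name).blob(object_name).download_as_text()

    result = encode(text)

    # Write this stage's output; the upload triggers the next service in the chain.
    output_bucket = storage_client.bucket("encoded-articles")  # illustrative name
    output_bucket.blob(object_name).upload_from_string(json.dumps(result))
    return "", 204
```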
Advice and conclusions

- Our recommendations are now fast and scalable, and our costs are half as much as with the original cluster setup.
- Migrating from GKE to Cloud Run is easy for container-based applications.
- Cloud Trace was useful for identifying where requests were spending time.
- Sending a request from one Cloud Run service to another and synchronously waiting for the result turned out to be expensive for us. Asynchronously invoking our services using Eventarc triggers was a better solution.
- Cloud Run is under active development and new features are being added frequently, which makes for a nice developer experience overall.

Source: Google Cloud Platform

Accelerate integrated Salesforce insights with Google Cloud Cortex Framework

Enterprises across the globe rely on a number of strategic independent software vendors such as Salesforce, SAP, and others to help them run their operations and business processes. Now more than ever, the need to sense and respond to new and changing business demands has increased, and the availability of data from these platforms is integral to business decision making. Many companies today are looking for accelerated ways to link their enterprise data with surrounding data sets and sources to gain more meaningful insights and business outcomes. Getting there faster, given the complexity and scale of managing and tying this data together, can be an expensive and challenging proposition.

To embark on this journey, many companies choose Google's Data Cloud to integrate, accelerate, and augment business insights through a cloud-first data platform approach with BigQuery to power data-driven innovation at scale. Next, they take advantage of best practices and accelerator content delivered with Google Cloud Cortex Framework to establish an open, scalable data foundation that can enable connected insights across a variety of use cases. Today, we are excited to announce the next set of accelerators, which expand Cortex Data Foundation to include new packaged analytics solution templates and content for Salesforce.

New analytics content for Salesforce

Salesforce provides a powerful Customer Relationship Management (CRM) solution that is widely recognized and adopted across many industries and enterprises. With increased focus on engaging customers better and improving insights on relationships, this data is highly valuable and relevant, as it spans many business activities and processes including sales, marketing, and customer service. With Cortex Framework, Salesforce data can now be more easily integrated into a single, scalable data foundation in BigQuery to unlock new insights and value.

With this release, we take the guesswork out of the time, effort, and cost to establish a Salesforce data foundation in BigQuery. You can deploy Cortex Framework for Salesforce content to kickstart customer-centric data analytics and gain broader insights across key areas including accounts, contacts, leads, opportunities, and cases. Take advantage of the predefined data models for Salesforce, along with analytics examples in Looker, for immediate customer-relationship-focused insights, or easily join Salesforce data with other delivered data sets, such as Google Trends, Weather, or SAP, to enable richer, connected insights. The choice is yours, and the sky's the limit with the flexibility of Cortex to enable your specific use cases. By bringing Salesforce data together with other public, community, and private data sources, Google Cloud Cortex Framework helps accelerate the ability to optimize and innovate your business with connected insights.

What's next

This release extends prior content releases for SAP and other data sources to further enhance the value of Cortex Data Foundation across private, public, and community data sources. Google Cloud Cortex Framework continues to expand content to help better meet the needs of customers on data analytics transformation journeys. Stay tuned for more announcements coming soon.

To learn more about Google Cloud Cortex Framework, visit our solution page, and try out Cortex Data Foundation today to discover what's possible.
Source: Google Cloud Platform

Hierarchical Firewall Policy Automation with Terraform

Firewall rules are an essential component of network security in Google Cloud. Firewalls in Google Cloud can broadly be categorized into two types: Network Firewall Policies and Hierarchical Firewall Policies. While network firewalls are directly associated with a VPC to allow or deny traffic, hierarchical firewalls can be thought of as the policy engine that uses the resource hierarchy for creating and enforcing policies across the organization. Hierarchical policies can be enforced at the organization level or at the folder(s) level. Like network firewall rules, hierarchical firewall policy rules can allow or deny traffic, and can also delegate the evaluation to lower-level policies or to the network firewall rules themselves (with a go_next). Lower-level rules cannot override a rule from a higher place in the resource hierarchy. This lets organization-wide admins manage critical firewall rules in one place.

So, now let's think of a few scenarios where hierarchical firewall policies will be useful.

1. Reduce the number of network firewall rules

Example: say xyz.com has 6 Shared VPCs based upon their business segments. It is a security policy to refuse SSH access to any VMs in the company, i.e. to deny TCP port 22 traffic. With network firewalls, this rule needs to be enforced in 6 places (each Shared VPC). A growing number of granular network firewall rules for each network segment means more touch points, which means more chances of drift and accidents. Security admins get busy with hand-holding and almost always become a bottleneck for even simple firewall changes. With hierarchical firewall policies, security admins can create a single, common policy to deny TCP port 22 traffic and enforce it on the xyz.com org, or explicitly target one or many Shared VPCs from the policy. This way a single policy can define the broader traffic control posture.

2. Manage critical firewall rules using centralized policies AND safely delegate non-critical controls at the VPC level

Example: at xyz.com, SSH to GCE instances is strictly prohibited and non-negotiable; auditors need this. Whether to allow or deny TCP traffic to port 443, however, depends on which Shared VPC the traffic is going to. In this case security admins can create a policy to deny TCP port 22 traffic and enforce it on xyz.com. Another policy is created for TCP port 443 traffic to say "go_next", i.e. decide at the next lower level whether this traffic is allowed. Then, have a network firewall rule to allow or deny 443 traffic at the Shared VPC level. This way the security admin has broader control at a higher level to enforce traffic control policies and can delegate where possible. The ability to manage the most critical firewall rules in one place also frees project-level administrators (e.g., project owners, editors, or security admins) from having to keep up with changing organization-wide policies. With hierarchical firewall policies, security admins can centrally enforce, manage, and observe the traffic control patterns.

Create, Configure and Enforce Hierarchical Firewall Policies

There are 3 major components of hierarchical firewall policies: Rules, Policy, and Association. Broadly speaking, a "Rule" is a decision-making construct that declares whether the traffic should be allowed, denied, or delegated to the next level for a decision. A "Policy" is a collection of rules, i.e. one or more rules can be associated with a policy. An "Association" defines the enforcement point of the policy in the Google Cloud resource hierarchy.
These concepts are explained extensively on the product page. A simple visualization shows a policy containing one or more rules, with an association that attaches the policy to a node in the resource hierarchy.

Infrastructure as Code (Terraform) for Hierarchical Firewall Policies

There are 3 Terraform resources that need to be stitched together to build and enforce hierarchical firewall policies.

#1 Policy Terraform Resource – google_compute_firewall_policy

In this module the most important parameter is the "parent" parameter. Hierarchical firewall policies, like projects, are parented by a folder or organization resource. Remember, this is NOT the folder where the policy is enforced or associated; it is just the folder which owns the policy(s) that you are creating. Using a folder to own the hierarchical firewall policies also simplifies the IAM needed to manage who can create or modify these policies, i.e. just assign the IAM on this folder. For a scaled environment it is recommended to create a separate "firewall-policy" folder to host all of your hierarchical firewall policies.

Sample:

```
/*
  Create a Policy
*/
resource "google_compute_firewall_policy" "base-fw-policy" {
  parent      = "folders/<folder-id>"
  short_name  = "base-fw-policy"
  description = "A Firewall Policy Example"
}
```

You can get the folder ID of the "firewall-policy" folder using the command below:

```
gcloud resource-manager folders list --organization=<your organization ID> --filter='<name of the folder>'
```

For example, if your firewall policy folder is called 'firewall-policy', then use:

```
gcloud resource-manager folders list --organization=<your organization ID> --filter='firewall-policy'
```

#2 Rules Terraform Resource – google_compute_firewall_policy_rule

Most of the parameters in this resource definition are fairly obvious, but there are a couple that need special consideration:
- disabled – Denotes whether the firewall policy rule is disabled. When set to true, the firewall policy rule is not enforced and traffic behaves as if it did not exist. If this is unspecified, the firewall policy rule will be enabled.
- enable_logging – Enabling firewall logging is highly recommended for its many operational advantages. To enable it, pass true to this parameter.
- target_resources – This parameter comes in handy when you want to target certain Shared VPC(s) for this rule. You need to pass the URI path for the Shared VPC.
To get the URI for the VPC, use this command:

```
gcloud config set project <Host Project ID>
gcloud compute networks list --uri
```

Sample: here is some sample Terraform code to create a firewall policy rule with priority 9000 that denies TCP port 22 traffic from the 35.235.240.0/20 CIDR block (used for Identity-Aware Proxy):

```
/*
  Create a Firewall rule #1
*/
resource "google_compute_firewall_policy_rule" "base-fw-rule-1" {
  firewall_policy = google_compute_firewall_policy.base-fw-policy.id
  description     = "Firewall Rule #1 in base firewall policy"
  priority        = 9000
  enable_logging  = true
  action          = "deny"
  direction       = "INGRESS"
  disabled        = false
  match {
    layer4_configs {
      ip_protocol = "tcp"
      ports       = [22]
    }
    src_ip_ranges = ["35.235.240.0/20"]
  }
  target_resources = ["https://www.googleapis.com/compute/v1/projects/<PROJECT-ID>/global/networks/<VPC-NAME>"]
}
```

#3 Association Terraform Resource – google_compute_firewall_policy_association

In the attachment_target, pass the folder ID where you want to enforce this policy, i.e. everything under this folder (all projects) will get this policy. In the case of Shared VPCs, the target folder should be the parent of your host project.

Sample:

```
/*
  Associate the policy
*/
resource "google_compute_firewall_policy_association" "associate-base-fw-policy" {
  firewall_policy   = google_compute_firewall_policy.base-fw-policy.id
  attachment_target = "folders/<Folder ID>"
  name              = "Associate Base Firewall Policy with dummy-folder"
}
```

Once these policies are enforced, you can see them in the console under VPC Network -> Firewall. In the firewall policy folder, the created hierarchical firewall policy will show up. Remember, there are 4 default firewall rules that come with each policy, so even when you create a single rule in your policy, the rule count will be 5. Go into the policy to see the rules you created and the association of the policy.

Summary

Hierarchical firewall policies simplify the complex process of enforcing consistent traffic control policies across your Google Cloud environment. With the Terraform modules and automation shown in this article, security admins gain the ability to build guardrails using a policy engine and a familiar Infrastructure as Code platform. Check out the hierarchical firewall policy documentation to learn more about how to use them.
Source: Google Cloud Platform

CISO Survival Guide: Vital questions to help guide transformation success

Part of being a security leader whose organization is taking on a digital transformation is preparing for hard questions, and complex answers, on how to implement a transformation strategy. In our previous CISO Survival Guide blog, we discussed how financial services organizations can more securely move to the cloud. We examined how to organize and think about the digital transformation challenges facing the highly regulated financial services industry, including the benefits of the Organization, Operation, and Technology (OOT) approach, as well as embracing new processes like continuous delivery and required cultural shifts.

As part of Google Cloud's commitment to shared fate, today we offer tips on how to ask the right questions that can help create the conversations that lead to better transformation outcomes for your organization. While there often is more than one right answer, a thoughtful, methodical approach to asking targeted questions and maintaining an open mind about the answers you hear back can help achieve your desired result. These questions are designed to help you figure out where to start and where to end your organization's security transformation. By asking the following questions, CISOs and business leaders can develop a constructive, focused dialogue which can help determine the proper balance between implementing security controls and fine-tuning the risk tolerance set by executive management and the board of directors.

To start the conversation, begin by asking:
- What defines our organization's culture?
- How can we best integrate the culture with our security goals?

CISOs should ask business leaders:
- What makes a successful transformation?
- What are the key goals of the transformation?
- What data is (most) valuable?
- What data can be retired, reclassified, or migrated?
- What losses can we afford to take and still function?
- What is the real risk that the organization is willing to accept?

Business leaders should ask CISOs and the security team:
- What are the best practices for protecting our valuable data?
- What is the business impact of implementing those controls?
- What are the top threats that we need to address?

CISOs and business leaders should ask:
- Which threats are no longer as important?
- Where could we potentially redirect spending to more cost-effective controls such as firewalls and antivirus software?
- What benefits do we get from refactoring our applications?
- Are we really transforming, or lifting and shifting?
- How should we perform identity and access management to meet our business objectives?
- What are the core controls needed to ensure enterprise-level performance for the first workloads?

CISOs and risk teams should ask:
- How can we use the restructuring of an existing body of code to streamline security functions?
- How should we monitor our security posture to ensure we are aligned with our risk appetite?

Business and technical teams should ask:
- What's our backup plan?
- What do we do if that fails?

Practical advice and the realities of operational transformation

Some organizations have been working in the cloud for more than a decade and have already addressed many operational procedures, sometimes with painful lessons learned along the way. If you've been operating in the cloud securely for that long, we recognize that there's a lot to be gained from understanding your approaches to culture, operational expertise, and technology. However, there are still many organizations that have not thought through how they will operate in a cloud environment until it's almost ready, and at that point, it might be too late.

If you can't detail how a cloud environment will operate before its launch, how will you know who should be responsible for maintaining it? Who are the critical stakeholders, along with those responsible for engineering and maintaining specific systems, who should be identified at the start of the transformation? There are likely several groups of stakeholders, such as those aligned with operations for transformation, and those focused on control design for cloud aligned with operations. If you don't have the operators involved in the design phase, you're destined to create clever security controls with very little practical value, because those tasked with day-to-day maintenance most likely won't have the expertise or training to operate these controls effectively.

This is complicated by the fact that many organizations are struggling to recruit and retain resources with the right skills to operate in the cloud. We believe that training current employees to learn new cloud skills, and giving them time away from other responsibilities to do so, can help build skilled, diverse cloud security teams. If your organization continually experiences high turnover in security leadership and skilled staff, it's up to you to navigate your culture to ensure greater consistency. You can, of course, choose to supplement internal knowledge with trusted partners; however, that's an expensive strategy for ongoing operational cost.

We met recently with a security organization that turns over skilled staff and leadership every two to three years. This rate of churn results in a continual resetting of security goals. This particular team joked that it's like "Groundhog Day" as they constantly re-evaluate their best security approaches yet make no meaningful progress. This is not a model to emulate.

Many security controls fail not because they are improperly engineered, but because the people who use them, your security team, are improperly trained and insufficiently motivated. This is especially true for teams with high turnover rates and other organizational misalignments. A security control that blocks 100% of attacks might be engineered correctly, but if you can't efficiently operate it, the effectiveness of the control will plummet to zero over time. Worse, it then becomes a liability because you incorrectly assume you have a functioning control.

In our next blog, we will highlight several proven approaches that we believe can help guide your security team through your organization's digital transformation.
To learn more now, check out:
- Previous blog: CISO Survival Guide: How financial services organizations can more securely move to the cloud
- Podcast: CISO walks into the cloud: Frustrations, successes, lessons… and does the risk change?
- Report: CISO's Guide to Cloud Security Transformation
Source: Google Cloud Platform

Announcing the GA of BigQuery multi-statement transactions

Transactions are mission-critical for modern enterprises supporting payments, logistics, and a multitude of business operations. In today's analytics-first, data-driven era, the need for reliable processing of complex transactions extends beyond the traditional OLTP database; businesses also have to trust that their analytics environments process transactional data in an atomic, consistent, isolated, and durable (ACID) manner. BigQuery therefore set out to support DML statements spanning large numbers of tables in a single transaction, committing the associated changes atomically (all at once) on success or rolling them back atomically on failure. Today, we'd like to highlight the recent general availability launch of multi-statement transactions within BigQuery and the new business capabilities they unlock.

While in preview, BigQuery multi-statement transactions proved tremendously effective for customer use cases such as keeping BigQuery synchronized with data stored in OLTP environments, complex post-processing of events pre-ingested into BigQuery, and complying with GDPR's right to be forgotten. One of our customers, PLAID, leverages multi-statement transactions within their customer experience platform KARTE to analyze the behavior and emotions of website visitors and application users, enabling businesses to deliver relevant communications in real time and furthering PLAID's mission to Maximize the Value of People with the Power of Data.

"We see multi-statement transactions as a valuable feature for achieving expressive and fast analytics capabilities. For developers, it keeps queries simple and less hassle in error handling, and for users, it always gives reliable results."—Takuya Ogawa, Lead Product Engineer

The general availability of multi-statement transactions not only gives customers a production-ready way to handle their business-critical transactions comprehensively within a single transaction, but also provides far greater scalability than what was offered during the preview. At GA, multi-statement transactions support mutating up to 100,000 table partitions and modifying up to 100 tables per transaction. This 10x increase in the number of table partitions and 2x increase in the number of tables was made possible by a careful re-design of our transaction commit protocol, which optimizes the size of the transactionally committed metadata.

The GA of multi-statement transactions also introduces full compatibility with BigQuery sessions and procedural language scripting. Sessions are useful because they store state and enable the use of temporary tables and variables, which can then be used across multiple queries when combined with multi-statement transactions. Procedural language scripting lets users run multiple statements in sequence with shared state and with complex logic, using programming constructs such as IF … THEN and WHILE loops.

For instance, let's say we wanted to enhance the current multi-statement transaction example, which uses transactions to atomically manage a retail company's existing inventory and supply of new arrivals. Since we're a retailer monitoring our inventory on hand, we would now also like to automatically suggest to our sales team which items we should promote with sales offers when our inventory becomes too large.
To do this, it would be useful to include a simple procedural IF statement that monitors the current inventory and supply of new arrivals and modifies a new PromotionalSales table based on total inventory levels. We'll also validate the results ourselves, using a session, before committing them as one single transaction for our sales team. Let's see how we'd do this in SQL.

First, we'll create our tables using DDL statements:

CREATE OR REPLACE TABLE my_dataset.Inventory
(product string,
quantity int64,
supply_constrained bool);

CREATE OR REPLACE TABLE my_dataset.NewArrivals
(product string,
quantity int64,
warehouse string);

CREATE OR REPLACE TABLE my_dataset.PromotionalSales
(product string,
inventory_on_hand int64,
excess_inventory int64);

Then, we'll insert some values into our Inventory and NewArrivals tables:

INSERT my_dataset.Inventory (product, quantity)
VALUES('top load washer', 10),
      ('front load washer', 20),
      ('dryer', 30),
      ('refrigerator', 10),
      ('microwave', 20),
      ('dishwasher', 30);

INSERT my_dataset.NewArrivals (product, quantity, warehouse)
VALUES('top load washer', 100, 'warehouse #1'),
      ('dryer', 200, 'warehouse #2'),
      ('oven', 300, 'warehouse #1');

Now, we'll use a multi-statement transaction and procedural language scripting to atomically merge our NewArrivals table with the Inventory table, taking excess inventory into account to build out our PromotionalSales table. We'll run this within a session, which lets us validate the tables ourselves before committing the changes for everyone else.

DECLARE average_product_quantity FLOAT64;

BEGIN TRANSACTION;

CREATE TEMP TABLE tmp AS SELECT * FROM my_dataset.NewArrivals WHERE warehouse = 'warehouse #1';
DELETE my_dataset.NewArrivals WHERE warehouse = 'warehouse #1';

# Calculate the average of all product inventories.
SET average_product_quantity = (SELECT AVG(quantity) FROM my_dataset.Inventory);

MERGE my_dataset.Inventory I
USING tmp T
ON I.product = T.product
WHEN NOT MATCHED THEN
  INSERT(product, quantity, supply_constrained)
  VALUES(product, quantity, false)
WHEN MATCHED THEN
  UPDATE SET quantity = I.quantity + T.quantity;

# The procedural script below uses a very simple approach to determine excess_inventory,
# based on current inventory exceeding 120% of the average inventory across all products.
IF EXISTS(SELECT * FROM my_dataset.Inventory
          WHERE quantity > (1.2 * average_product_quantity)) THEN
  INSERT my_dataset.PromotionalSales (product, inventory_on_hand, excess_inventory)
  SELECT
    product,
    quantity AS inventory_on_hand,
    quantity - CAST(ROUND((1.2 * average_product_quantity), 0) AS INT64) AS excess_inventory
  FROM my_dataset.Inventory
  WHERE quantity > (1.2 * average_product_quantity);
END IF;

SELECT * FROM my_dataset.NewArrivals;
SELECT * FROM my_dataset.Inventory ORDER BY product;
SELECT * FROM my_dataset.PromotionalSales ORDER BY excess_inventory DESC;
# Note: the multi-statement SQL temporarily stops here within the session.
# This runs successfully if you've set your SQL to run within a session.

From the results of the SELECT statements, we can see the warehouse #1 arrivals were successfully added to our inventory, and the PromotionalSales table correctly reflects our excess inventory. It looks like these transactions are ready to be committed.

Just in case there were issues with the expected results, though, note that anyone querying the tables outside the session we created would not see the changes. This gives us the ability to validate our results and, if needed, roll them back without impacting others (a minimal rollback sketch is included at the end of this post).

# Run in a different tab outside the current session. The results displayed will be
# consistent with the tables before running the multi-statement transaction.
SELECT * FROM my_dataset.NewArrivals;
SELECT * FROM my_dataset.Inventory ORDER BY product;
SELECT * FROM my_dataset.PromotionalSales ORDER BY excess_inventory DESC;

Going back to our configured session, since we've validated that our Inventory, NewArrivals, and PromotionalSales tables are correct, we can commit the multi-statement transaction within the session, which will propagate the changes outside the session too.

# Now commit the transaction within the same session configured earlier.
# Be sure to delete or comment out the rest of the SQL text run earlier.
COMMIT TRANSACTION;

And now that the PromotionalSales table has been updated for all users, our sales team has some ideas of which products they should promote due to our excess inventory.

# Results now propagated for all users.
SELECT * FROM my_dataset.PromotionalSales ORDER BY excess_inventory DESC;

As you can tell, multi-statement transactions are simple, scalable, and quite powerful, especially when combined with other BigQuery features. Give them a try yourself and see what's possible.
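As a final note, if the validation queries had surfaced a problem, we could have discarded the changes instead of committing them. Here is a minimal sketch, run inside the same session in place of the COMMIT TRANSACTION statement above:

# If validation fails, undo every change made since BEGIN TRANSACTION.
ROLLBACK TRANSACTION;

After the rollback, the Inventory, NewArrivals, and PromotionalSales tables are left exactly as they were before the transaction began, both inside and outside the session.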
Source: Google Cloud Platform

New year, new skills – How to reach your cloud career destination

Cloud is a great place to grow your career in 2023. Opportunity abounds, with cloud roles offering strong salaries and scope for growth in a constantly evolving field.1 Some positions do not require a technical background, like project managers, product owners and business analysts. For others, like solutions architects, developers and administrators, coding and technical expertise are a must. Either way, cloud knowledge and experience are required to land that dream job. But where do you start? And how do you keep up with the fast pace of ever-changing cloud technology? Check out the tips below, along with suggested training opportunities to help support your growth, including no-cost options.

Start by looking at your experience

Your experience can be a great way to get into cloud, even if it seems non-traditional. Think creatively about transferable skills and opportunities. Here are a few scenarios where you might find yourself today:
- You already work in IT, but in legacy systems or the data center. Forrest Brazeal, Head of Content Marketing at Google Cloud, talks about that in detail in this video.
- Use your sales experience to become a sales engineer, or your communications experience to become a developer advocate. Stephanie Wong, Developer Advocate at Google Cloud, discusses that here.
- You don't have the college degree that is included in the job requirements. I've talked about that in a recent video here.
- Your company has a cloud segment, but your focus is in another area. Go talk to people! Reach out to colleagues who do what you want to do and get their advice for skilling up.

Define where you need to fill in gaps

If you are looking at a technical position, you will need to show applicable cloud experience, so learn about the cloud and build a portfolio of work. Here are a few key skills we recommend everyone have to start1:
- Code is non-negotiable. People who come from software development backgrounds typically find it easier to get into and maneuver through the cloud environment because of their coding experience. Automation, basic data manipulation and scaling are daily requirements. If you don't already know a language, Python is a great place to begin.
- Understand Linux. You'll need to know the Linux filesystem, basic Linux commands and the fundamentals of containerization.
- Learn core networking concepts like the IP protocol and the others that layer on top of it, DNS, and subnets.
- Make sure you understand the cloud itself, and in particular the specifics of Google Cloud for a role at Google.
- Get familiar with open source tooling. Terraform for automation and Kubernetes for containers are portable between clouds and are worth taking the time to learn.

Boost your targeted hands-on skills

Check out Google Cloud Skills Boost for a comprehensive collection of training to help you upskill into a cloud role, including hands-on labs that give you real-world experience in Google Cloud. New users can start off with a 30-day no-cost trial.2 Take a look at these recommendations:

No-cost labs and courses
- A Tour of Google Cloud Hands-on Labs – 45 minutes
- A Tour of Google Cloud Sustainability – 60 minutes
- Introduction to SQL for BigQuery and Cloud SQL – 60 minutes
- Infrastructure and Application Modernization with Google Cloud – introductory course with three modules
- Preparing for Google Cloud certification – courses to help you prepare for Google Cloud certification exams

Build hands-on projects

This part is critical for the interview portion. Take the cloud skills you have learned and create something tangible that you can use as a story during an interview. Consider building a project on GitHub so others can see it working live, and document it well. Be sure to include your decision-making process. Here is an example:
- Build an API or a web application
- Develop the code for the application
- Pick the infrastructure to deploy that application in the cloud, choose your storage option, and choose a database with which it will interact

Get valuable cloud knowledge for non-technical roles

For tech-adjacent roles, like those in business, sales or administration, having a solid knowledge of cloud principles is critical. We recommend completing the Cloud Digital Leader training courses, at no cost. Or go the extra mile and consider taking the Google Cloud Digital Leader Certification exam once you complete the training:

No-cost course
- Cloud Digital Leader Learning Path – understand cloud capabilities, products and services and how they benefit organizations

$99 registration fee
- Google Cloud Digital Leader Certification – validate your cloud expertise by earning a certification

Commit to learning in the New Year

Another resource we have is the Google Cloud Innovators Program, which will help you grow on Google Cloud and connect you with other community members. There is no cost to join, and it gives you access to resources that help you build your skills and keep up with the future of cloud. Join today.

Start your new year strong by completing Arcade games each week, whether you are exploring Google Cloud Data, DevOps or Networking certifications. This January, play to win in The Arcade while you learn new skills and earn prizes on Google Cloud Skills Boost. Each week we will feature a new game to help you show and grow your cloud skills, while sampling certification-based learning paths.

Make 2023 the year to build your cloud career and commit to learning all year with our $299/year annual subscription. The subscription includes $500 of Google Cloud credits (and a bonus $500 of Google Cloud credits after you successfully certify), a $200 certification voucher, and an annual subscription to Google Cloud Skills Boost with access to the entire training catalog, live-learning events and quarterly technical briefings with executives.

1. Starting your career in cloud from IT – Forrest Brazeal, Head of Content Marketing, Google Cloud
2. Credit card required to activate a 30-day no-cost trial for new users.
Source: Google Cloud Platform

Optimize and scale your startup – A look into the Build Series

At Google Cloud, we want to give you access to all the tools you need to grow your business. Through the Google Cloud Technical Guides for Startups, you can leverage industry-leading solutions with how-to video guides and resources curated for startups. This multi-part series contains three chapters, Start, Build and Grow, which match your startup's journey:
- The Start Series: Begin by building, deploying and managing new applications on Google Cloud from start to finish.
- The Build Series: Optimize and scale existing deployments to reach your target audiences.
- The Grow Series: Grow and attain scale with deployments on Google Cloud.

Additionally, at Google we have the Google for Startups Cloud Program, which is designed to help your business get off the ground and enable a sustainable growth plan for the future. The start of the Build Series covers the benefits of the program, the application process, and more to help your business get started on Google Cloud.

A quick recap of the Build Series

Once you have applied for the Google for Startups Cloud Program, there's so much to explore and try out on Google Cloud. Figuring out a rapid but solid application development process can be key for many businesses in reducing time to market. Furthermore, learning which database to use to handle application data can be tricky. Deep dive into our Firestore video, which walks through how Firestore can help you unlock application innovation with simplicity and speed.

We then move on to a deep dive into BigQuery and how it can help businesses. BigQuery is designed to support analysis over petabytes of data, regardless of whether it's structured or unstructured. This video is the go-to video for getting started on BigQuery!

If you are looking to run your Spark and Hadoop jobs faster and on the cloud, look to Dataproc. To learn more about Dataproc and how it has helped other customers with their Hadoop clusters, click the video below to learn all things Dataproc related.

Next, we find out what Dataflow can bring to your business: some advantages, sample architectures, demos on the console, and how other customers are using Dataflow. We also talked about machine learning, from selecting the right ML solution to Machine Learning APIs on cloud to exploring Vertex AI. Following that, we look into API management in Google Cloud and how Apigee helps you operate your APIs with enhanced scale, security, and automation.

We ended the series with the last two episodes focusing on a security deep dive and on using Cloud Tasks and Cloud Scheduler.

Coming up next – The Grow Series

Dive into the next chapter of this multi-part series with our upcoming Grow Series, where we will focus on growing and attaining scale with deployments on Google Cloud. Check out our website, join us by watching the video series on the Google Cloud Tech channel, and subscribe to stay up to date. See you in the cloud!
Source: Google Cloud Platform

New control plane connectivity and isolation options for your GKE clusters

Once upon a time, all Google Kubernetes Engine (GKE) clusters used public IP addressing for communication between nodes and the control plane. Subsequently, we heard your security concerns and introduced private clusters enabled by VPC peering. To consolidate the connectivity types, starting in March 2022 we began using Google Cloud's Private Service Connect (PSC) for new public clusters' communication between the GKE cluster control plane and nodes, which has profound implications for how you can configure your GKE environment.

Today, we're presenting a new, consistent PSC-based framework for GKE control plane connectivity from cluster nodes. Additionally, we're excited to announce a new feature set which includes cluster isolation at the control plane and node pool levels to enable more scalable, secure — and cheaper! — GKE clusters.

New architecture

Starting with GKE version 1.23 and later, all new public clusters created on or after March 15th, 2022 began using Google Cloud's PSC infrastructure to communicate between the GKE cluster control plane and nodes. PSC provides a consistent framework that helps connect different networks through a service networking approach, and allows service producers and consumers to communicate using private IP addresses internal to a VPC. The biggest benefit of this change is to set the stage for using PSC-enabled features for GKE clusters.

Figure 1: Simplified diagram of PSC-based architecture for GKE clusters

The new set of cluster isolation capabilities we're presenting here is part of the evolution to a more scalable and secure GKE cluster posture. Previously, private GKE clusters were enabled with VPC peering, introducing specific network architectures. With this feature set, you now have the ability to:
- Update the GKE cluster control plane to only allow access to a private endpoint
- Create or update a GKE cluster node pool with public or private nodes
- Enable or disable GKE cluster control plane access from Google-owned IPs

In addition, the new PSC infrastructure can provide cost savings. Traditionally, control plane communication is treated as normal egress and is charged for public clusters as a normal public IP charge. This is also true if you're running kubectl for provisioning or other operational reasons. With PSC infrastructure, we have eliminated the cost of communication between the control plane and your cluster nodes, resulting in one less network egress charge to worry about.

Now, let's take a look at how this feature set enables these new capabilities.

Allow access to the control plane only via a private endpoint

Private cluster users have long had the ability to create the control plane with both public and private endpoints. We now extend the same flexibility to public GKE clusters based on PSC. With this, if you want private-only access to your GKE control plane but want all your node pools to be public, you can do so. This model provides a tighter security posture for the control plane, while leaving you to choose what kind of cluster node you need, based on your deployment.
To enable access only to a private endpoint on the control plane, use the following gcloud command:

gcloud container clusters update CLUSTER_NAME \
    --enable-private-endpoint

Allow toggling and mixed-mode clusters with public and private node pools

All cloud providers with managed Kubernetes offerings offer both public and private clusters. Whether a cluster is public or private is enforced at the cluster level, and cannot be changed once it is created. Now you have the ability to toggle a node pool to have private or public IP addressing. You may also want a mix of private and public node pools. For example, you may be running a mix of workloads in your cluster, in which some require internet access and some don't. Instead of setting up NAT rules, you can deploy a workload on a node pool with public IP addressing to ensure that only such node pool deployments are publicly accessible.

To enable private-only IP addressing on existing node pools, use the following gcloud command:

gcloud container node-pools update POOL_NAME \
    --cluster CLUSTER_NAME \
    --enable-private-nodes

To enable private-only IP addressing at node pool creation time, use the following gcloud command:

gcloud container node-pools create POOL_NAME \
    --cluster CLUSTER_NAME \
    --enable-private-nodes

Configure access from Google Cloud

In some scenarios, workloads running outside of the GKE cluster (for example, applications running on Cloud Run, or GCP VMs sourced with Google Cloud public IPs) were allowed to reach the cluster control plane. To mitigate potential security concerns, we have introduced a feature that lets you toggle access to your cluster control plane from such sources.

To remove access from Google Cloud public IPs to the control plane, use the following gcloud command:

gcloud container clusters update CLUSTER_NAME \
    --no-enable-google-cloud-access

Similarly, you can use this flag at cluster creation time.

Choose your private endpoint address

Many customers like to map IPs to a stack for easier troubleshooting and to track usage. For example: IP block x for infrastructure, IP block y for services, IP block z for the GKE control plane, and so on. By default, the private IP address for the control plane in PSC-based GKE clusters comes from the node subnet. However, some customers treat node subnets as infrastructure and apply security policies against them. To differentiate between infrastructure and the GKE control plane, you can now create a new custom subnet and assign it to your cluster control plane:

gcloud container clusters create CLUSTER_NAME \
    --private-endpoint-subnetwork=SUBNET_NAME

What can you do with this new GKE architecture?

With this new set of features, you can basically remove all public IP communication for your GKE clusters!
This, in essence, means you can make your GKE clusters completely private. You currently need to create the cluster as public to ensure that it uses PSC, but you can then update your cluster using gcloud with the --enable-private-endpoint flag, or the UI, to configure access via only a private endpoint on the control plane, or create new private node pools. Alternatively, you can control access at cluster creation time with the --master-authorized-networks and --no-enable-google-cloud-access flags to prevent access from public addressing to the control plane (a minimal end-to-end sketch combining these commands appears at the end of this post).

Furthermore, you can use the REST API or Terraform Providers to build a new PSC-based GKE cluster whose default (thus first) node pools have private nodes. This can be done by setting the enablePrivateNodes field to true (instead of leveraging the public GKE cluster defaults and then updating afterwards, as currently required with gcloud and UI operations). Lastly, the aforementioned features extend not only to Standard GKE clusters, but also to GKE Autopilot clusters.

When evaluating whether you're ready to move to these PSC-based GKE cluster types to take advantage of private cluster isolation, keep in mind that the control plane's private endpoint has the following limitations:
- Private addresses in URLs for new or existing webhooks that you configure are not supported. To mitigate this incompatibility and point a webhook at a private address by URL, create a headless service without a selector and a corresponding endpoint for the required destination.
- The control plane private endpoint is not currently accessible from on-premises systems.
- The control plane private endpoint is not currently globally accessible: client VMs in regions other than the cluster region cannot connect to the control plane's private endpoint.
- All public clusters on version 1.25 and later that are not yet PSC-based are currently being migrated to the new PSC infrastructure; therefore, your clusters might already be using PSC to communicate with the control plane.

To learn more about GKE clusters with PSC-based control plane communication, check out these references:
- GKE Concept page for public clusters with PSC
- How-to: Change Cluster Isolation page
- How-to: GKE node pool creation page with isolation feature flag
- How-to: Schedule Pods on GKE Autopilot private nodes
- gcloud reference to create a cluster with a custom private subnet
- Terraform Providers Google: release v4.45.0 page
- Google Cloud Private Services Connect page

Here are the more specific features in the latest Terraform Provider (Terraform Providers Google: release v4.45.0), handy to integrate into your automation pipeline:
- gcp_public_cidrs_access_enabled
- enable_private_endpoint
- private_endpoint_subnetwork
- enable_private_nodes
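To pull these pieces together, here is a minimal, illustrative gcloud sequence that uses only the flags shown above. The cluster, node pool, and subnet names are placeholders, and the sequence assumes the dedicated control plane subnet already exists in the cluster's VPC and region; see the linked how-to pages for the full prerequisites and any additional required flags.

# 1. Create a PSC-based (public) cluster whose control plane private endpoint is drawn
#    from a dedicated subnet, with access from Google Cloud public IPs disabled.
gcloud container clusters create my-cluster \
    --private-endpoint-subnetwork=my-control-plane-subnet \
    --no-enable-google-cloud-access

# 2. Restrict the control plane to its private endpoint only.
gcloud container clusters update my-cluster \
    --enable-private-endpoint

# 3. Switch an existing node pool (here, the default pool) to private-only IP addressing.
gcloud container node-pools update default-pool \
    --cluster my-cluster \
    --enable-private-nodes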
Source: Google Cloud Platform