Streamline data management and governance with the unification of Data Catalog and Dataplex

Today, we are excited to announce that Google Cloud Data Catalog will be unified with Dataplex into a single user interface. With this unification, customers have a single experience to search and discover their data, enrich it with relevant business context, organize it by logical data domains, and centrally govern and monitor their distributed data with built-in data intelligence and automation capabilities. Customers now have access to an integrated metadata platform that connects technical and operational metadata with business metadata, and then uses this augmented and active metadata to drive intelligent data management and governance.

The enterprise data landscape is becoming increasingly diverse and distributed, with data spread across multiple storage systems, each having its own way of handling metadata, security, and governance. This creates a tremendous amount of operational complexity and generates strong market demand for a metadata platform that can power consistent operations across distributed data.

Dataplex provides a data fabric to automate data management, governance, discovery, and exploration across distributed data at scale. With Dataplex, enterprises can easily organize their data into data domains and delegate ownership, usage, and sharing of data to data owners who have the right business context, while still maintaining a single pane of glass to consistently monitor and govern data across the various data domains in their organization.

Prior to this unification, data owners, stewards and governors had to use two different interfaces: Dataplex to organize, manage, and govern their data, and Data Catalog to discover, understand, and enrich their data. Now with this unification, we are creating a single coherent user experience where customers can automatically discover and catalog all the data they own, understand data lineage, check for data quality, augment that metadata with relevant business context, organize data into business domains, and then use that combined metadata to power data management. Together we provide an integrated experience that serves the full spectrum of data governance needs in an organization, enabling data management at scale.

“With Data Catalog now being part of Dataplex, we get a unified, simplified, and streamlined experience to effectively discover and govern our data, which enables team productivity and analytics agility for our organization. We can now use a single experience to search and discover data with relevant business context, organize and govern this data based on business domains, and enable access to trusted data for analytics and data science, all within the same platform,” said Elton Martins, Senior Director of Data Engineering at Loblaw Companies Limited.

Getting started

Existing Data Catalog and Dataplex customers, as well as new customers, can now start using Dataplex for metadata discovery, management and governance. Please note that while the user interface is unified in this release, all existing APIs and features of both products will continue to work as before. To learn more, please refer to the technical documentation or contact the Google Cloud sales team.
Source: Google Cloud Platform

Using Pacemaker for SAP high availability on Google Cloud – Part 1

Problem Statement

Maintaining business continuity of your mission-critical systems usually demands high availability (HA) solutions that fail over without human intervention. If you are running SAP HANA or SAP NetWeaver (SAP NW) on Google Cloud, the OS-native HA cluster capability provided by Red Hat Enterprise Linux (RHEL) for SAP and SUSE Linux Enterprise Server (SLES) for SAP is often adopted as the foundational functionality to provide business continuity for your SAP system. This blog introduces some basic terminology and concepts about the Red Hat and SUSE HA implementations of Pacemaker cluster software for the SAP HANA and NetWeaver platforms.

Pacemaker Terminology

Resource

A resource in Pacemaker is a service made highly available by the cluster. For SAP HANA, there are two resources: HANA and HANA Topology. For SAP NetWeaver Central Services, there are also two resources: one for the Central Services instance that runs the Message Server and Enqueue Server (ASCS in NW ABAP or SCS in NW Java) and another one for the Enqueue Replication Server (ERS). In the Pacemaker cluster, we also configure resources that serve other functions, such as the Virtual IP (VIP) or the Internal Load Balancer (ILB) health check mechanism.

Resource agent

A resource agent manages each resource. It defines the logic for the resource operations that the Pacemaker cluster calls to start, stop or monitor the health of resources. Resource agents are usually Linux bash or Python scripts that implement functions for the resource agent operations.

Resource agents managing SAP resources are co-developed by SAP and OS vendors. They are open sourced on GitHub, and OS vendors package them downstream into the SAP resource agent packages for their Linux distributions.

- For HANA scale-up, the resource agents are “SAPHana” and “SAPHanaTopology”.
- For HANA scale-out, the resource agents are “SAPHanaController” and “SAPHanaTopology”.
- For NetWeaver Central Services, the resource agent is “SAPInstance”.

Why are there two resource agents to manage HANA? “SAPHanaTopology” is responsible for monitoring the HANA topology status on all cluster nodes and updating the HANA-relevant cluster properties. These attributes are read by “SAPHana” as part of the HANA monitoring function.

Resource agents are usually installed in the directory `/usr/lib/ocf/resource.d/`.

Resource operation

A resource can have what is called a resource operation. A resource operation is one of the major types of action: monitor, start, stop, promote, or demote. Each does what its name says; for example, a “promote” operation promotes a resource in the cluster. The actions are built into the respective resource agent scripts.

Properties of an operation:

- interval: if set to a nonzero value, defines how frequently the operation occurs after the first monitor action completes.
- timeout: defines the amount of time the operation has to complete before it is aborted and considered failed.
- on-fail: defines the action to be executed if the operation fails. The default action for the ‘stop’ operation is ‘fence’ and the default for all others is ‘restart’.
- role: runs the operation only on a node that the cluster thinks should be in the specified role. A role can be master or slave, started or stopped. The role provides context for Pacemaker to make resource location and operation decisions.

Resource group

Resources can be grouped into administrative units whose members depend on one another and need to be started sequentially and stopped in the reverse order.

While technically each cluster resource is failed over one at a time, failover is configured logically at the level of resource groups to simplify cluster configuration. For SAP HANA, for example, there is typically one resource group containing both the VIP resource and the ILB health check resource.

Resource constraints

Constraints determine the behavior of a resource in a cluster. The categories of constraints are location, order and colocation. The list below describes these constraints, which exist in both SLES and RHEL.

- Location constraint: determines on which nodes a resource can run; e.g., pins each fence device to the other host VM.
- Order constraint: determines the order in which resources run; e.g., first start resource SAPHanaTopology, then start resource SAPHana.
- Colocation constraint: determines that the location of one resource depends on the location of another resource; e.g., the IP address resource group should be on the same host as the primary HANA instance.

Fencing and fence agent

A fence agent is an abstraction that allows a Pacemaker cluster to isolate problematic cluster nodes or cluster resources whose state cannot be determined. Fencing can be performed at either the cluster node level or the cluster resource/resource group level. It is most commonly performed at the cluster node level, by remotely power cycling the problematic cluster node or by disabling its access to the network.

Similar to resource agents, fence agents are usually bash or Python scripts. The two commonly used fence agents within GCP are “gcpstonith” and “fence_gce”, with “fence_gce” being the more robust successor of “gcpstonith”. Fence agents leverage the Compute Engine reset API to fence problematic nodes.

The fencing resource “gcpstonith” is usually downloaded and saved in the directory `/usr/lib64/stonith/plugins/external`. The resource “fence_gce” comes with the RHEL and SLES images that include the HA extension.

Corosync

Corosync is an important piece of a Pacemaker cluster whose effect on the cluster is often undervalued. Corosync enables servers to interact as a cluster, while Pacemaker provides the ability to control how the cluster behaves. Corosync provides messaging and membership functionality, along with other functions:

- It maintains the quorum information.
- It is used by all cluster nodes to communicate and coordinate cluster tasks.
- Its configuration is stored by default in `/etc/corosync/corosync.conf`.

If there is a communication failure or timeout within Corosync, a membership change or fencing action is performed.

Clones and Clone Sets

Clones represent resources that can become active on multiple hosts without requiring the creation of unique resource definitions for them. When resources are grouped across hosts, we call this a clone set. There are different types of cloned resources. The main clone set of interest for SAP configurations is the stateful clone, which represents a resource with a particular role. In the context of the SAP HANA database, the primary and secondary database instances would be contained within the SAPHana clone set.

Conclusion

Now that you have read through the terminology, let’s see how an SAP Pacemaker cluster looks on each OS.

SLES:

There are two nodes in the cluster and both are online:

    * Online: [ node-x node-y ]

The STONITH resource for each node is started on the opposite node and uses the “gcpstonith” fence agent:

    * STONITH-node-x      (stonith:external/gcpstonith):   Started node-y
    * STONITH-node-y      (stonith:external/gcpstonith):   Started node-x

There is a resource group called g-primary that contains both the IPaddr2 resource agent, which adds the ILB forwarding rule IP address to the NIC of the active node, and the anything resource agent, which starts the program ‘socat’ to respond to ILB health check probes:

    * rsc_vip_int-primary       (ocf::heartbeat:IPaddr2):        Started node-y
    * rsc_vip_hc-primary        (ocf::heartbeat:anything):       Started node-y

There is a Clone Set for the SAPHanaTopology resource agent containing the two nodes:

    cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]

There is a Clone Set for the SAPHana resource agent containing a master and a slave node:

    * Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable)

Note: You can see that one of the clone sets is marked as promotable. If a clone is promotable, its instances can perform a special role that Pacemaker manages via the promote and demote operations of the resource agent.

RHEL:

There are two nodes in the cluster and both are online:

    * Online: [ rhel182ilb01 rhel182ilb02 ]

The STONITH resource for each node is started on the opposite node and uses the more robust “fence_gce” fence agent:

    STONITH-rhel182ilb01 (stonith:fence_gce): Started rhel182ilb02
    STONITH-rhel182ilb02 (stonith:fence_gce): Started rhel182ilb01

There is a resource group called g-primary that contains both the IPaddr2 resource agent, which adds the ILB forwarding rule IP address to the NIC of the active node, and the haproxy resource, which starts the program ‘haproxy’ to respond to ILB health check probes:

    * rsc_healthcheck_R82        (service:haproxy):       Started rhel182ilb02
    * rsc_vip_R82_00       (ocf::heartbeat:IPaddr2):        Started rhel182ilb02

There is a Clone Set for the SAPHanaTopology resource agent containing the two nodes:

    * Clone Set: SAPHanaTopology_R82_00-clone [SAPHanaTopology_R82_00]

There is a Clone Set for the SAPHana resource agent containing a master and a slave node:

    * Clone Set: SAPHana_R82_00-clone [SAPHana_TST_HDB00] (promotable)

If you compare the SLES and RHEL clusters above, you can see that, even though they are completely different clusters, they use similar technologies to perform cluster operations.

Congratulations. You should now have a firm grasp of the key areas and terms of an SAP cluster running on Google Cloud Platform.

Where to go from here? Review our other blogs to become an expert in understanding your cluster and its behavior:

- What’s happening in your SAP systems? Find out with Pacemaker Alerts
- Analyze Pacemaker events in Cloud Logging
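The resource status lines shown above share a regular shape: a resource name, the agent in parentheses, a state, and the node. As a minimal sketch, assuming exactly the line format printed in this post (real `crm_mon` output differs between Pacemaker versions, and the names `ResourceStatus` and `parse_status_line` are hypothetical), the following Python helper extracts those fields. This can be handy, for example, when checking that each STONITH resource runs on the opposite node.

```python
import re
from typing import NamedTuple, Optional


class ResourceStatus(NamedTuple):
    name: str
    agent: str
    state: str
    node: str


# Matches lines such as:
#   * rsc_vip_int-primary  (ocf::heartbeat:IPaddr2):  Started node-y
# The leading "*" is optional because some lines in the post omit it.
_STATUS_RE = re.compile(
    r"^\s*\*?\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+"
    r"(?P<state>\S+)\s+(?P<node>\S+)\s*$"
)


def parse_status_line(line: str) -> Optional[ResourceStatus]:
    """Return the parsed fields, or None for lines in another format."""
    match = _STATUS_RE.match(line)
    if match is None:
        return None
    return ResourceStatus(**match.groupdict())


if __name__ == "__main__":
    sample = [
        "* Online: [ node-x node-y ]",
        "  * STONITH-node-x      (stonith:external/gcpstonith):   Started node-y",
        "  * STONITH-node-y      (stonith:external/gcpstonith):   Started node-x",
    ]
    for line in sample:
        print(parse_status_line(line))
```

Lines that do not describe a single resource, such as the `Online:` and `Clone Set:` lines, simply yield `None`, so the helper can be run over the whole status output.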
Source: Google Cloud Platform

Scalable Python on BigQuery using Dask and GPUs

BigQuery is Google Cloud’s fully managed serverless data platform that supports querying using ANSI SQL. BigQuery also has a data lake storage engine that unifies SQL queries with other open source processing frameworks such as Apache Spark, TensorFlow, and Dask. BigQuery storage provides an API layer for OSS engines to process data. This API enables mixing and matching programming in languages like Python with structured SQL in the same data platform.

This post provides an introduction to using BigQuery with one popular distributed Python framework, Dask, an open source library that makes it easy to scale Python tools to BigQuery-sized datasets. We will also show you how to extend Dask with RAPIDS, a suite of open-source libraries and APIs, to execute GPU-accelerated pipelines directly on BigQuery storage.

Integrating Dask and RAPIDS with BigQuery storage

A core component of the BigQuery architecture is the separation of compute and storage. BigQuery storage can be accessed directly over the highly performant Storage Read API, which enables users to consume data in multiple streams and provides both column projection and filtering at the storage level. Coiled, a Google Cloud Partner that provides enterprise-grade Dask in your GCP account, developed an open-source Dask-BigQuery connector (GitHub) that enables Dask processing to take advantage of the Storage Read API and governed access to BigQuery data.

RAPIDS is an open source library from NVIDIA that uses Dask to distribute data and computation over multiple NVIDIA GPUs. The distributed computation can be done on a single machine or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.

To start using Dask with BigQuery data, you can install the dask-bigquery connector from any Python IDE. You simply install `dask-bigquery` with `pip` or `conda`, authenticate with Google Cloud, and then use the few lines of Python code shown below to pull data from a BigQuery table.

```python
import dask_bigquery

# Read a BigQuery table into a lazy Dask dataframe.
ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)
ddf.head()
```

Achieving Python scalability on BigQuery with Dataproc

While Dask and the BigQuery connector can essentially be installed anywhere that Python can run, and scale to the number of cores available on that machine, the real power of scaling comes when you can use an entire cluster of virtual machines. An easy way to do this on Google Cloud is by using Dataproc. Using the initialization actions outlined in this GitHub repo, getting set up with Dask and RAPIDS on a Dataproc cluster with NVIDIA GPUs is fairly straightforward.

Let’s walk through an example using the NYC taxi dataset. As a first step, let’s create a RAPIDS-accelerated Dask YARN cluster object on Dataproc by running the following code:

```python
from dask.distributed import Client
from dask_yarn import YarnCluster

# Each worker is a CUDA worker with one GPU, 4 vcores and 24 GB of memory.
cluster = YarnCluster(
    worker_class="dask_cuda.CUDAWorker",
    worker_gpus=1,
    worker_vcores=4,
    worker_memory="24GB",
    worker_env={"CONDA_PREFIX": "/opt/conda/default/"},
)
cluster.scale(4)

# Connect a client to the cluster; it is used for training further below.
client = Client(cluster)
```

Now that we have a Dask client, we can use it to read the NYC Taxi dataset in a BigQuery table through the Dask BigQuery connector:

```python
import dask_bigquery

d_df = dask_bigquery.read_gbq(
    project_id="k80-exploration",
    dataset_id="spark_rapids",
    table_id="nyc_taxi_0",
)
```

Next, let’s use the RAPIDS Dask cuDF libraries to accelerate the preprocessing with GPUs.

```python
import dask_cudf

# clean, remap, must_haves and query_frags are defined in the complete notebook.
taxi_df = dask_cudf.from_dask_dataframe(d_df)
taxi_df = clean(taxi_df, remap, must_haves)
taxi_df = taxi_df.query(' and '.join(query_frags))
```

Finally, we can use a feature of the Dask dataframe to split the data into two datasets, one for training and one for testing. These datasets can also be converted to XGBoost DMatrix objects and sent into XGBoost for training on GPU.

```python
import xgboost as xgb

# params, dmatrix_train and dmatrix_test are built in the complete notebook.
xgb_clasf = xgb.dask.train(
    client,
    params,
    dmatrix_train,
    num_boost_round=2000,
    evals=[(dmatrix_train, 'train'), (dmatrix_test, 'test')],
)
```

The complete notebook can be accessed at this GitHub link.

Currently, the Dask-BigQuery connector doesn’t support native write back to BigQuery, so users need to work around this through Cloud Storage: with Dask or Dask RAPIDS, first write the data to GCS with `to_parquet("gs://temp_path/")`, then have BigQuery load it from GCS with `bigquery.Client.load_table_from_uri("gs://temp_path/")`.

What’s next

In this blog, we introduced a few key components that allow BigQuery users to scale their favorite Python libraries through Dask to process large datasets. With the broad portfolio of NVIDIA GPUs embedded across Google Cloud data analytics services like BigQuery and Dataproc, and the availability of GPU-accelerated software like RAPIDS, developers can significantly accelerate their analytics and machine learning workflows.

Acknowledgements: Benjamin Zaitlen, Software Engineer Manager, NVIDIA; Jill Milton, Senior Partnership Manager, NVIDIA; Coiled Developer Team.
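The write-back workaround described above (stage the data in Cloud Storage, then load it into BigQuery) can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the helper names `parquet_glob` and `write_back_via_gcs`, the staging path, and the table ID are hypothetical, and it assumes that `gcsfs` and `google-cloud-bigquery` are installed and that Dask names its output files with a `.parquet` suffix, which is its default.

```python
def parquet_glob(gcs_dir: str) -> str:
    """Build a wildcard URI that matches the Parquet part files in a directory."""
    return gcs_dir.rstrip("/") + "/*.parquet"


def write_back_via_gcs(ddf, gcs_dir: str, table_id: str) -> None:
    """Stage a Dask (or dask_cudf) dataframe in GCS, then load it into BigQuery.

    `table_id` is a fully qualified "project.dataset.table" string.
    """
    # Imported lazily so the URI helper above works without the client library.
    from google.cloud import bigquery

    # Step 1: write the dataframe to Cloud Storage as Parquet files.
    ddf.to_parquet(gcs_dir)

    # Step 2: ask BigQuery to load the staged files into the target table.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
    )
    load_job = client.load_table_from_uri(
        parquet_glob(gcs_dir), table_id, job_config=job_config
    )
    load_job.result()  # Block until the load job finishes (raises on failure).
```

A call might look like `write_back_via_gcs(taxi_df, "gs://temp_path/", "your_project_id.your_dataset.your_table")`. The wildcard is used so that only the Parquet part files are loaded, and not any metadata files that may sit next to them.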
Source: Google Cloud Platform

Google Cloud Data Heroes Series: Meet Tomi, a data engineer based in Germany and creator of the ‘Not So BigQuery Newsletter’

Google Cloud Data Heroes is a series where we share stories of the everyday heroes who use our data tools to do incredible things. Like any good superhero tale, we explore our Google Cloud Data Heroes’ origin stories, how they moved from data chaos to a data-driven environment, what projects and challenges they are overcoming now, and how they give back to the community.

In this month’s edition, we’re pleased to introduce Tomi! Tomi grew up in Croatia and now lives in Berlin, Germany, where he works as a freelance Google Cloud data engineer. In this role, he regularly uses BigQuery. Tomi’s familiarity with BigQuery and his passion for Google Cloud led him to create the weekly newsletter Not So BigQuery, where he discusses the latest data-related information from the GCP world. Additionally, he works for one of the largest automotive manufacturers in Germany as an analyst. When not in front of the keyboard, Tomi enjoys walking with his dog and his girlfriend, going to bakeries, or spending a night watching television.

When were you introduced to the cloud, tech, or data field? What made you pursue this in your career?

I always struggled with the question ‘what do you want to do in your life?’. I attended Zagreb University of Applied Sciences for my information technology degree, but I was still unsure if I should become a developer, a data engineer or something completely different.

A couple of years into working as a junior IT consultant, I stumbled upon a job advertisement looking for a Data Analyst/Scientist. Back then, finding out that you can get paid to just work with data all day sounded mind-blowing to me. A dream job.

I immediately applied for the role and started learning about the skills needed. This is also where I gained my first experience with the cloud, as I signed up for a Google Cloud Platform free trial in February 2018. On the platform, there was a blog post describing how to run Jupyter notebooks in the cloud. It interested me, and I went ahead and created my very first Compute Engine instance in Google Cloud Platform.

I didn’t get the job I initially applied for, but this was the trigger that set things in motion and got me to where I am now.

What courses, studies, degrees, or certifications were instrumental to your progression and success in the field? In your opinion, what data skills or competencies should data practitioners be focusing on acquiring to be successful in 2022 and why?

Looking back at my university days, I really enjoyed the course about databases, partially because I had a really great teacher, but also because this was the first time I got to do something that catered to my then still-unknown data-nerdy side.

In 2019, I got my Google Cloud Certified Associate Cloud Engineer certification, which was a challenging and rewarding entry-level certification for Google Cloud. I would recommend considering getting one of these as a way of focusing one’s learning.

One major change I’ve observed since working in the data field is the ongoing transition from on-prem to cloud and serverless. I remember a story from my early consulting days working in an IT operations team, when there was a major incident caused by an on-prem server outage. At some point one frustrated colleague said something like, ‘why do we even have to have servers? Why can’t it just *run* somehow?’ What sounded like a bit of a silly question back then turned out to be quite ‘visionary’ with all the serverless and cloud-based tech we have today.

What drew you to Google Cloud? Tell us about that process, what you’re most proud of in this area, and why you give back to the community?

There is a great newsletter on Google Cloud Platform called GCP Weekly, run by a data community member named Zdenko Hrček, that I really like. However, since the GCP ecosystem is growing at a rapid pace, there is sometimes just too much news and too many blog posts in a single week. I really struggled to catch up with all the new product updates and tutorials. That’s when I had the idea: ‘what if there were a shorter newsletter with only news about BigQuery and other data-related tools?’ Fast forward to today, my Not So BigQuery newsletter has more than 220 subscribers.

I was also inspired by the awesome content created by Priyanka Vergadia, Staff Developer Advocate at Google Cloud, such as her Sketchnotes series. I created the GCP Data Wiki, which is a public Notion page with cards for every database/storage service in GCP, with useful details such as links to official docs, Sketchnotes and more.

What are 1-2 of your favorite projects you’ve done with Google Cloud’s data products?

One of my first projects built with Google Cloud products was an automated data pipeline to get track data from the official Spotify API. I was looking for a data project to add to my portfolio and found out that Spotify lets you query their huge library via a REST API. This later evolved into a fully serverless pipeline running on Google Cloud Functions and BigQuery. I also wrote a blog post about the whole thing, which got 310 claps on Medium.

Additionally, the Not So BigQuery newsletter I created is actually powered by a tool I built using Google Sheets and Firebase (Functions). I have a Google Sheet where I pull in the news feed sections from sources such as the Google Cloud Blog and Medium. Using built-in Sheets formulas such as IMPORTFEED and FILTER, I built a keyword-based article curation algorithm that pre-selects the articles to include in the next issue of the newsletter. Then my tool called crssnt (pronounced like the French pastry) takes the data from the Google Sheet and displays it in the newsletter. If you are curious what the Google Sheet looks like, you can check it out here.

What are your favorite Google Cloud Platform data products within the data analytics, databases, and/or AI/ML categories? What use case(s) do you most focus on in your work? What stands out about GCP’s offerings?

My favorite is BigQuery, but I’m also a huge fan of Firestore. BigQuery is my tool of choice for pretty much all of my data warehouse needs (for both personal and client projects). What really stood out to me is the ease of use when it comes to setting up new databases from scratch and getting first results in the form of, e.g., a Data Studio dashboard built on top of a BigQuery table. Similarly, I always go back to Firestore whenever I have an idea for some new front-end project, since it’s super easy to get started and gives me a lot of flexibility.

Among similar non-Google products, I used Snowflake a while ago but didn’t find its user interface nearly as intuitive and user-friendly as BigQuery’s.

What’s next for you in life?

It’s going to be mostly ‘more of the same’ for me: as a data nerd, there is always something new to discover and learn. My overall message to readers would be to try not to worry too much about fitting into predefined career paths, job titles and so on, and just do your thing. There is always more than one way of doing things and reaching your goals.

Want to join the Data Engineer Community?

Register for the Data Engineer Spotlight on July 20th, where attendees have the chance to learn from four technical how-to sessions and hear from Google Cloud experts on the latest product innovations that can help you manage your growing data.

Begin your own Data Hero journey

Ready to embark on your Google Cloud data adventure? Begin your own hero’s journey with GCP’s recommended learning path, where you can achieve badges and certifications along the way. Join the Cloud Innovators program today to stay up to date on more data practitioner tips, tricks, and events.

If you think you have a good Data Hero story worth sharing, please let us know! We’d love to feature you in our series as well.
Source: Google Cloud Platform

Using Google Kubernetes Engine’s GPU sharing to search for neutrinos

Editor’s note: Today we hear from the San Diego Supercomputer Center (SDSC) and University of Wisconsin-Madison about how GPU sharing in Google Kubernetes Engine is helping them detect neutrinos at the South Pole with the gigaton-scale IceCube Neutrino Observatory.

IceCube Neutrino Observatory is a detector at the South Pole designed to search for nearly massless subatomic particles called neutrinos. These high-energy astronomical messengers provide information to probe events like exploding stars, gamma-ray bursts, and cataclysmic phenomena involving black holes and neutron stars. Scientific computer simulations are run on the sensory data that IceCube collects on neutrinos to pinpoint the direction of detected cosmic events and improve their resolution.

The most computationally intensive part of the IceCube simulation workflow is the photon propagation code, a.k.a. ray-tracing, and that code can greatly benefit from running on GPUs. The application is high-throughput in nature, with each photon simulation being independent of the others. Apart from the core data acquisition system at the South Pole, most of IceCube’s compute needs are served by an aggregation of compute resources from various research institutions all over the world, most of which use the Open Science Grid (OSG) infrastructure as their unifying glue.

GPU resources are relatively scarce in the scientific resource provider community. In 2021, OSG had only 6M GPU hours versus 1800M CPU core hours in its infrastructure, a ratio of 1 to 300. The ability to expand the available resource pool with cloud resources is thus highly desirable.

The SDSC team recently extended the OSG infrastructure to effectively use Kubernetes-managed resources to support IceCube compute workloads on the Pacific Research Platform (PRP). The service manages dynamic provisioning in a completely autonomous fashion by implementing horizontal pilot pod autoscaling based on the queue depth of the IceCube batch system.
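To make the idea of autoscaling on queue depth concrete, here is a minimal Python sketch. It is not the SDSC implementation: the function name, its parameters, and the rule of one pilot pod per queued job are assumptions chosen purely for illustration.

```python
def desired_pilot_pods(queued_jobs: int, busy_pods: int, max_pods: int) -> int:
    """Return how many pilot pods should exist for the current queue depth.

    Assumed rule: keep every busy pod, add one pod per queued job,
    and never exceed the configured maximum.
    """
    if min(queued_jobs, busy_pods, max_pods) < 0:
        raise ValueError("all inputs must be non-negative")
    return min(max_pods, busy_pods + queued_jobs)


if __name__ == "__main__":
    # 3 pods are running jobs and 10 more jobs wait in the queue, capped at 8 pods.
    print(desired_pilot_pods(queued_jobs=10, busy_pods=3, max_pods=8))  # prints 8
    # An empty queue with no busy pods scales the pool down to zero.
    print(desired_pilot_pods(queued_jobs=0, busy_pods=0, max_pods=8))  # prints 0
```

A control loop would call such a function periodically with the current batch queue state and adjust the number of pilot pods to the returned value.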
Unlike on-premises systems, Google Cloud offers the benefits of elasticity (on-demand scaling) and cost efficiency (only pay for what gets used). We needed a flexible platform that can avail these benefits to our community. We found Google Kubernetes Engine (GKE) to be a great match for our needs due to its support for auto-provisioning, auto-scaling, dynamic scheduling, orchestrated maintenance, job API and fault tolerance, as well as support for co-mingling of various machine types (e.g. CPU + GPU and on-demand + Spot) in the same cluster and up to 15,000 nodes per cluster.While IceCube’s ray-tracing simulation greatly benefits from computing on the GKE GPUs, it still relies on CPU compute for feeding the data to the GPU portion of the code. And GPUs have been getting faster at a much higher rate than CPUs have! With the advent of the NVIDIA V100 and A100 GPUs, the IceCube code is now CPU-bound in many configurations. By sharing a large GPU between multiple IceCube applications, the IceCube ray-tracing simulation again becomes GPU-bound, and therefore we get significantly more simulation results from the same hardware. GKE has native support for both simple GPU time-sharing and the more advanced A100 Multi-Instance GPU (MIG) partitioning, making it incredibly easy for IceCube — and OSG at large — to use.To leverage the elasticity of the Google Cloud, we fully relied on GKE horizontal node auto-scaling for provisioning and de-provisioning GKE compute resources. Whenever there were worker pods that could not be started, the auto-scaler provisioned more GKE nodes, up to a set maximum. Whenever a GKE node was unused, the auto-scaler de-provisioned it to save costs.Performance resultsUsing Google Cloud GPU resources was very simple through GKE. 
We used the same setup we were already using on the on-prem PRP Kubernetes cluster, simply pointing our setup to the new cluster.After the initial setup, IceCube was able to efficiently use Google Cloud resources, without any manual intervention by the supporting SDSC team beyond setting the auto-scaling limits. This was a very welcome change from other cloud activities the SDSC team has performed on behalf of IceCube and others, that required active management of provisioned resources.AutoscalingThe GKE auto-scaling for autonomous provisioning and de-provisioning of cloud resources worked as advertised, closely matching the demand from IceCube users, as seen in Fig. 1. We were particularly impressed by GKE’s performance in conjunction with GPU sharing; the test run shown used seven A100 MIG partitions per GPU.Fig. 1: Monitoring snapshot of the unconstrained GKE auto-scaling test run.GPU sharingBoth full-GPU and shared-GPU Kubernetes nodes with A100, V100 and T4 GPUs were provisioned, but IceCube jobs did not differentiate between them, since all provisioned resources met the jobs’ minimum requirements.We assumed that GPU sharing benefits would vary based on the CPU-to-GPU ratio of the chosen workflow, so during this exercise we picked one workflow from each extreme. IceCube users can choose to speed up the GPU-based ray-tracing compute of some problems by, roughly speaking, increasing the size of the target for the photons by some factor. For example, setting oversize=1 gives the most precise simulation, and oversize=4 gives the fastest. Faster compute (of course) results in a higher CPU-to-GPU ratio. The fastest oversize=4 workload benefitted the most from GPU sharing. As can be seen from Fig. 2, IceCube oversize=4 jobs cannot make good use of anything faster than a NVIDIA T4. Indeed, even for the low-end T4 GPU, sharing increases the job throughput by about 40%! For the A100 GPU, GPU sharing gets us a 4.5x throughput increase, which is truly transformational. 
Note that MIG and “plain” GPU sharing provide comparable throughput improvements, but MIG comes with much stronger isolation guarantees, which would be very valuable in a multi-user setup.

Fig. 2: Number of IceCube oversize=4 jobs per hour, grouped by GPU setup.

The more demanding oversize=1 workload makes much better use of the GPUs, so we observe no job throughput improvement for the older T4 and V100 GPUs. The A100 GPU, however, is still too powerful to be used as a whole, and GPU sharing gives us almost a 2x throughput improvement here, as illustrated in Fig. 3.

Fig. 3: Number of IceCube oversize=1 jobs per day, grouped by GPU setup.

GPU sharing of course increases the wall-clock time needed by any single job to run to completion. This is however not a limiting factor for IceCube, since the main objective is to produce the output of thousands of independent jobs, and the expected timeline is measured in days, not minutes. Job throughput and cost effectiveness are therefore much more important than compute latency.

Finally, we would like to stress that most of the used resources were provisioned on top of Spot VMs, making them significantly cheaper than their on-demand equivalents. GKE gracefully handled any preemption, making this mode of operation very cost effective.

Lessons learned

GKE with GPU sharing has proven to be very simple to use, given that our workloads were already Kubernetes-ready. From a user point of view, there were virtually no differences from the on-prem Kubernetes cluster they were accustomed to.

The benefits of GPU sharing obviously depend on the chosen workloads, but at least for IceCube it seems to be a necessary feature for the latest GPUs, i.e. the NVIDIA A100. Additionally, a significant fraction of IceCube jobs can benefit from GPU sharing even for lower-end T4 GPUs.

When choosing the GPU-sharing methodology, we definitely prefer MIG partitioning.
While less flexible than time-shared GPU sharing, MIG’s strong isolation properties make management of multi-workload setups much more predictable. That said, “plain” GPU sharing was still more than acceptable, and was especially welcome on GPUs that lack MIG support.

In summary, the GKE shared-GPU experience was very positive. The observed benefits of GPU sharing in Kubernetes were an eye-opener and we plan to make use of it whenever possible.

Want to learn more about sharing GPUs on GKE? Check out this user guide.
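The CPU-bound versus GPU-bound reasoning in this article can be made concrete with a toy throughput model. The sketch below is not IceCube code. It assumes each job spends a fixed `cpu_s` seconds preparing data on its own CPU core and `gpu_s` seconds on the GPU, that GPU work from different jobs never overlaps, and that sharing adds no overhead. The function name and all numbers are hypothetical, chosen only to show why a workload with a high CPU-to-GPU ratio gains far more from sharing than a GPU-heavy one.

```python
def jobs_per_hour(cpu_s: float, gpu_s: float, jobs_sharing: int) -> float:
    """Toy estimate of job throughput for one GPU shared by several jobs.

    Assumptions (hypothetical, for illustration only):
    - every job needs cpu_s seconds of CPU work, then gpu_s seconds of GPU work;
    - each job has its own CPU core, so CPU phases of different jobs overlap;
    - the GPU serves one job at a time and sharing itself costs nothing.
    """
    if jobs_sharing < 1:
        raise ValueError("jobs_sharing must be at least 1")
    # The GPU cannot finish more than one job every gpu_s seconds.
    gpu_limit = 3600.0 / gpu_s
    # Each sharing job completes one CPU+GPU cycle every cpu_s + gpu_s seconds.
    cpu_limit = jobs_sharing * 3600.0 / (cpu_s + gpu_s)
    return min(gpu_limit, cpu_limit)


if __name__ == "__main__":
    # Hypothetical workloads: one dominated by CPU work, one by GPU work.
    for label, cpu_s, gpu_s in [("CPU-heavy", 9.0, 1.0), ("GPU-heavy", 1.0, 9.0)]:
        alone = jobs_per_hour(cpu_s, gpu_s, 1)
        shared = jobs_per_hour(cpu_s, gpu_s, 7)
        print(f"{label}: {alone:.0f} jobs/h alone, {shared:.0f} jobs/h with 7-way sharing")
```

In this model the CPU-heavy job keeps gaining from each extra sharing client until the GPU itself saturates, while the GPU-heavy job saturates almost immediately. Real measurements, such as those reported in the figures, also reflect scheduling and memory effects that the model ignores.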
Source: Google Cloud Platform

Deploying high-throughput workloads on GKE Autopilot with the Scale-Out compute class

GKE Autopilot is a full-featured, fully managed Kubernetes platform that combines the full power of the Kubernetes API with a hands-off approach to cluster management and operations. Since launching Autopilot last year, we’ve continued to innovate, adding capabilities to meet the demands of your workloads. We’re excited to introduce the concept of compute classes in Autopilot, together with the Scale-Out compute class, which offers high performance x86 and Arm compute, now available in Preview.

Autopilot compute classes are a curated set of hardware configurations on which you can deploy your workloads. In this initial release, we are introducing the Scale-Out compute class, which is designed for workloads that are optimized for a single thread per core and scale horizontally. The Scale-Out compute class currently supports two hardware architectures, x86 and Arm, allowing you to choose whichever one offers the best price-performance for your specific workload. The Scale-Out compute class joins our original, general-purpose compute option and is designed for running workloads that benefit from the fastest CPU platforms available on Google Cloud, and with greater cost-efficiency for applications that have high CPU utilization.

We also heard from you that some workloads would benefit from higher-performance compute. To serve this need, x86 workloads running on the Scale-Out compute class are currently served by 3rd Gen AMD EPYC™ processors, with Simultaneous Multithreading (SMT) disabled, achieving the highest per-core benchmark among x86 platforms in Google Cloud.

And for the first time, Autopilot supports Arm workloads. Currently utilizing the new Tau T2A VMs running on Ampere® Altra® Arm-based processors, the Scale-Out compute class gives your Arm workloads price-performance benefits combined with a thriving, open, end-to-end platform-independent ecosystem.
Autopilot Arm Pods are currently available in us-central1, europe-west4, and asia-southeast1.

Deploying Arm workloads using the Scale-Out compute class

To deploy your Pods on a specific compute class and CPU, simply add a Kubernetes nodeSelector or node affinity rule with the following labels in your deployment specification:

- cloud.google.com/compute-class
- kubernetes.io/arch

To run an Arm workload on Autopilot, you need a cluster running version 1.24.1-gke.1400 or later and in one of the supported regions. You can create a new cluster at this version, or upgrade an existing one. To create a new Arm-supported cluster on the CLI, use the following:

    CLUSTER_NAME=autopilot-arm
    REGION=us-central1
    VERSION=1.24.1-gke.1400
    gcloud container clusters create-auto $CLUSTER_NAME \
      --release-channel "rapid" --region $REGION \
      --cluster-version $VERSION

For example, the following Deployment specification will deploy the official Nginx image on the Arm architecture:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-arm64
    spec:
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          nodeSelector:
            cloud.google.com/compute-class: Scale-Out
            kubernetes.io/arch: arm64
          containers:
          - name: nginx
            image: nginx:latest

Deploying x86 workloads on the Scale-Out compute class

The Scale-Out compute class also supports the x86 architecture by simply adding a selector for the `Scale-Out` compute class.
You can either explicitly set the architecture with kubernetes.io/arch: amd64 or omit that label from the selector, as x86 is the default.

To run an x86 Scale-Out workload on Autopilot, you need a cluster running version 1.24.1-gke.1400 or later and in one of the supported regions. The same CLI command from the example above will get you an x86 Scale-Out-capable GKE Autopilot cluster.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-arm64
    spec:
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          nodeSelector:
            cloud.google.com/compute-class: Scale-Out
          containers:
          - name: nginx
            image: nginx:latest

Deploying Spot Pods using the Scale-Out compute class

You can also combine compute classes with Spot Pods by adding the label cloud.google.com/gke-spot: "true" to the nodeSelector:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-arm64
    spec:
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          nodeSelector:
            cloud.google.com/gke-spot: "true"
            cloud.google.com/compute-class: Scale-Out
            kubernetes.io/arch: arm64
          containers:
          - name: nginx
            image: nginx:latest

Spot Pods are supported for both the x86 and Arm architectures when using the Scale-Out compute class.

Try the Scale-Out compute class on GKE Autopilot today!

To help you get started, check out our guides on creating an Autopilot cluster, getting started with compute classes, building images for Arm workloads, and deploying Arm workloads on GKE Autopilot.
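The three Deployment variants in this article differ only in their nodeSelector, so the selection rules can be summarized in a few lines. The following is a minimal sketch, assuming you generate manifests programmatically. The helper name `scale_out_node_selector` is hypothetical, and only the label keys and values that appear in the article are used.

```python
import json


def scale_out_node_selector(arch: str = "amd64", spot: bool = False) -> dict:
    """Build a nodeSelector mapping for the Scale-Out compute class.

    Hypothetical helper for illustration; it only combines the labels
    shown in the article.
    """
    if arch not in ("amd64", "arm64"):
        raise ValueError(f"unsupported architecture: {arch!r}")
    selector = {"cloud.google.com/compute-class": "Scale-Out"}
    # x86 is the default, so the arch label is only required for Arm.
    if arch == "arm64":
        selector["kubernetes.io/arch"] = "arm64"
    # Spot Pods are requested with a string-valued "true" label.
    if spot:
        selector["cloud.google.com/gke-spot"] = "true"
    return selector


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be pasted under spec.template.spec.
    print(json.dumps({"nodeSelector": scale_out_node_selector("arm64", spot=True)}, indent=2))
```

Omitting the architecture label for x86 mirrors the article's second manifest. Setting kubernetes.io/arch: amd64 explicitly would be equally valid.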

How to overcome 5 common SecOps challenges

Editor’s note: This blog was originally published by Siemplify on April 12, 2022.

The success of the modern security operations center, despite the infusion of automation, machine learning, and artificial intelligence, remains heavily dependent on people. This is largely due to the vast amounts of data a security operations center must ingest, a product of an ever-expanding attack surface and the borderless enterprise brought on by the rapid rise of cloud adoption. All those alerts coming in mean proactive and reactive human decision making remains critical.

Perhaps it should come as no surprise that the information security analyst now ranks as No. 1 in U.S. News’ 100 Best Jobs Rankings, “determined by identifying careers with the largest projected number and percentage of openings through 2030, according to the U.S. Bureau of Labor Statistics.” Security, and specifically detection and response, is not only a business imperative; it is arguably the top worry on the minds of CEOs.

However, the security analyst is also one of the most likely professionals to want to leave their jobs, according to a newly released “Voice of the SOC Analyst” study conducted by Tines. What gives? Turnover woes are attributable to several key SecOps challenges that never seem to budge.

1) Alert fatigue and false positives: Have you ever received so much spam or junk mail that you end up ignoring your new messages entirely, leading you to miss an important one? The same can happen for alerts. Too much noise is unsustainable and can lead to the real threats being missed, especially as perimeters expand and cloud adoption increases.

2) Disparate tools: Already in the company of too many point-detection tools, security operations professionals are saying hello to a few more in the era of remote work and increased cloud demands.
The latest count is north of 75 security tools that need to be managed by the average enterprise.

3) Manual processes: Use case procedures that result in inconsistent, unrepeatable processes can bottleneck response times and frustrate SecOps teams. Not everything in the SOC needs to be, or should be, automated, but much can be, which then frees up analysts and engineers to concentrate on higher-order tasks and to train new employees more easily.

4) Talent shortage: Death, taxes, and the cybersecurity skills shortage. As sure as the sun will rise tomorrow, so will the need for skilled individuals to wage the cybersecurity fight. But what happens when not enough talent is filling the seats? Teams must compensate to fill the gap.

5) Lack of visibility: Security operations metrics are critical for improving productivity and attracting executive buy-in and support, but SecOps success can be difficult to track, as reports can require a significant amount of work to pull together.

The caveat of course is that it would be rare to find a SecOps team working without the above challenges. As such, some of the immediate steps you can take to push back against these constraints focus on people-powered processes and technologies to remedy the issues.

According to a recent paper co-authored by Google Cloud and Deloitte:

Humans are, and will be, needed to both perform final triage on the most obtuse security signals (similar to conventional SOC Level 3+) and to conduct a form of threat hunting (i.e. looking for what didn’t trigger that alert).

Machines will be needed to deliver better data to humans, both in a more organized form (stories made of alerts) and in improved quality detections using rules and algorithms, all while covering more emerging IT environments.

Both humans and machines will need to work together on mixed manual and automated workflows.

So, what does this ultimately mean you must do to improve your security operations?
Here are five practical suggestions:

Detect Threats More Efficiently

Efficiencies within the SOC can be realized from a SIEM solution that automatically detects threats in real-time and at scale. The right platform will support massive data ingestion and storage, relieve traditional cost and scaling limitations, and broaden the lens for anomaly and machine learning/AI-based detection. With data stored and analyzed in one place, security teams can investigate and detect threats more effectively.

Respond to Threats Automatically

SOAR can be a game-changer in terms of caseload reduction and faster (and smarter, especially when integrated with threat intelligence) response times. But before rushing headfirst into automation, you should consider your processes, review the outcomes you are trying to achieve (such as reduced MTTD), and then decide exactly what you want to automate (which can be a lot with SOAR). Once clear processes are determined where automation can contribute, SOC personnel are freed up to be more creative in their work.

Prioritize Logs

Many teams lack a strategy for collecting, analyzing and prioritizing logs, despite the fact that these sources of insight often hold the clues of an ongoing attack. To help, here are two cheat sheets featuring essential logs to monitor.

Outsource What You Can’t Do Yourself

Process improvements may help you compensate for perceived personnel shortages (for example, perhaps fixing a misconfigured monitoring tool will reduce alert noise). Of course, many organizations need additional human hands to help them perform tasks like round-the-clock monitoring and more specialized functions like threat hunting. Here is where a managed security services provider or managed detection provider can be helpful. Be realistic about your budget, however, as you may be able to introduce some solutions in-house.
Institute Career Models

Lack of management support is cited as the fourth-biggest obstacle to a fully functioning SOC model, according to the 2022 SANS Security Operations Center Survey. To overcome this, SecOps leaders must help improve workflow processes, protect innovation, keep teams absorbed in inspiring and impactful work versus mundane tasks, remain flexible with staff, and endorse training and career development. Because at the end of the day, the SOC is still distinctly human, and that is who will be the difference maker between success and failure.

Flock Freight builds a more efficient, resilient and environmentally sustainable shipping supply chain on Google Cloud

Commercial trucks often travel partially empty because many shippers don’t have enough cargo to fill an entire container or trailer. Although offering available space to other shippers helps minimize carbon emissions and reduce operating costs, most trucking companies can’t efficiently schedule, track, or deliver multiple freight loads.

Companies have always struggled to ship over-the-road freight efficiently. However, recent economic events have created an unprecedented logistics and transportation crisis that continues to disrupt supply chains, delay deliveries, and significantly raise the price of basic goods. Since some stores can’t keep their shelves fully stocked, many people across the country are finding it more difficult than ever to buy the things they need at an affordable price. Although exacerbated by the pandemic, many of these supply chain issues have existed for decades. That’s why, in 2015, Flock Freight was started with the mission of reducing waste and inefficiency from the supply chain by reimagining the way freight moves. First to market with advanced algorithms that enable pooling shipments at scale, we create a new standard of service for shippers, increase revenue for carriers and reduce the impact of carbon emissions through shared truckload (STL) service. Our technology helps lower prices compared to full truckload (FTL) by enabling shippers to pay only for the space they need while maintaining full control over pickup and delivery dates. Flock Freight also optimizes travel routes to speed up deliveries compared to traditional less than truckload (LTL), while eliminating unnecessary shipping hub transfers to minimize damage to cargo.

Today, thousands of shippers and trucking companies across the U.S. use Flock Freight to schedule shared truckloads, lower shipping costs, quickly deliver and track goods, and reduce their carbon footprint by up to 40%.
Flock Freight further offsets carbon emissions by buying carbon credits for every FlockDirect™ guaranteed shared truckload shipment, at no extra cost to shippers.

Moving Flock Freight to Google Cloud

We founded Flock Freight with a small team based in southern California. We soon realized we needed a more scalable and affordable technology stack to support our rapidly growing platform and team. After joining the Google for Startups Cloud Program and consulting with dedicated Google startup experts, we decided to move all our data and applications to Google Cloud.

The highly secure-by-design infrastructure of Google Cloud now enables thousands of Flock Freight customers to move their freight faster, cheaper, and with less damage than traditional shipping methods. Specifically, we rely on Google Kubernetes Engine (GKE) to support the combinatorial optimization and machine learning (ML) algorithms and services that identify, pool, and schedule shared truckloads. We also leverage GKE to rapidly develop, deploy, and manage new applications and services.

In addition, we leverage Cloud SQL to automate database provisioning, storage capacity management, and other time-consuming tasks. Cloud SQL easily integrates with existing apps and Google Cloud services such as GKE and Pub/Sub. Lastly, we use Compute Engine to create and run virtual machines, optimize resource utilization, and lower computing costs by up to 91%. These cost savings allow us to shift more resources to R&D and rapidly develop new solutions and services for our customers.

Building a greener, more resilient, and responsive supply chain

The Google for Startups Cloud Program and dedicated Google startup experts were instrumental in helping us manage cloud infrastructure costs and maintain very high SLAs, helping Flock Freight to focus on developing a comprehensive shipping platform that powers shared truckloads and drives positive industry change.
We especially want to highlight the Google Cloud research credits we relied on to launch Flock Freight and make rapid progress toward transforming the shipping industry. To this day, we continue to work with Google Cloud Managed Services partner DoiT International to further scale and optimize operations on Google Cloud.

We’re proud of the results we’re delivering for our customers. For example, a home improvement importer now enjoys faster, safer, and easier shipping with 99.9% damage-free service and a 97.5% on-time delivery rate. A packaging supplier continues to maintain a 99% on-time delivery streak and decrease carbon emissions by 37%, while a mineral water company consistently reduces delivery expenses upwards of 50%. Nationwide demand for shared truckloads continues to increase as the shipping industry works to lower costs and alleviate supply chain disruptions. With the Flock Freight platform, companies are building a more sustainable and resilient supply chain by efficiently combining multiple shipments into shared truckloads.

If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.

Performance considerations for loading data into BigQuery

It is not unusual for customers to load very large data sets into their enterprise data warehouse. Whether you are doing an initial data ingestion with hundreds of TB of data or incrementally loading from your systems of record, performance of bulk inserts is key to quicker insights from the data. The most common architecture for batch data loads uses Google Cloud Storage (object storage) as the staging area for all bulk loads. All the different file formats are converted into an optimized columnar format called ‘Capacitor’ inside BigQuery.

This blog will focus on various file types and data loading tools for best performance. Data files that are uploaded to BigQuery typically come in Comma Separated Values (CSV), Avro, Parquet, JSON, or ORC formats. We are going to use a large dataset to compare and contrast each of these file formats. We will explore loading efficiencies of compressed vs. uncompressed data for each of these file formats. Data can be loaded into BigQuery using multiple tools in the GCP ecosystem. You can use the Google Cloud console, the bq load command, the BigQuery API, or the client libraries. We will also compare and contrast each loading mechanism for the same dataset. This blog attempts to elucidate the various options for bulk data loading into BigQuery and also provides data on the performance for each file type and loading mechanism.

Introduction

There are various factors you need to consider when loading data into BigQuery:

- Data file format
- Data compression
- Tool used to load data
- Level of parallelization of data load
- Schema autodetect ‘ON’ or ‘OFF’

Data file format

Bulk insert into BigQuery is the fastest way to insert data for speed and cost efficiency. Streaming inserts are however more efficient when you need to report on the data immediately. Today data files come in many different file types including comma separated (CSV), JSON, Parquet, and Avro, to name a few.
We are often asked how the file format matters and whether there are any advantages in choosing one file format over the other.

CSV files (comma-separated values) contain tabular data with a header row naming the columns. When loading from CSV files, you can use the header row for schema autodetect to pick up the column names. With schema autodetect set to off, you can skip the header row and create a schema manually, using the column names in the header. CSV files can also use other characters as the field separator or newline, since many data outputs already have a comma in the data. You cannot store nested or repeated data in CSV file format.

JSON (JavaScript Object Notation) data is stored as key-value pairs in a semi-structured format. JSON is preferred as a file type because it can store data in a hierarchical format. The schemaless nature of JSON data rows gives the flexibility to evolve the schema and thus change the payload. JSON and XML formats are user-readable, but JSON documents are typically much smaller than XML. REST-based web services use JSON over other file types.

Parquet is a column-oriented data file format designed for efficient storage and retrieval of data. Parquet compression and encoding is very efficient and provides improved performance to handle complex data in bulk.

Avro: The data is stored in a binary format and the schema is stored in JSON format. This helps in minimizing the file size and maximizes efficiency. Avro has reliable support for schema evolution by managing added, missing, and changed fields.

From a data loading perspective we did various tests with millions to hundreds of billions of rows with narrow to wide column data. We have done this test with a public dataset named `bigquery-public-data:worldpop.population_grid_1km`.
We used 4000 flex slots for the test. The number of loading slots is limited to the number of slots you have allocated for your environment, though load jobs do not use all of the slots you throw at them. Schema autodetection was set to ‘NO’. For parallelization, each data file should typically be less than 256MB for faster throughput. Here is a summary of our findings:

Do I compress the data?

Sometimes batch files are compressed for faster network transfers to the cloud. Especially for large data files that are being transferred, it is faster to compress the data before sending over the cloud Interconnect or VPN connection. In such cases is it better to uncompress the data before loading into BigQuery? Here are the tests we did for various file types with different compression algorithms. Shown results are the average of five runs:

How do I load the data?

There are various ways to load the data into BigQuery. You can use the Google Cloud Console, the command line, a client library (Python shown here), or a direct API call. We compared these data loading techniques and the efficacy of each method. Here is a comparison of the timings for each method. You can also see that schema autodetect works very well where there are no datatype quality issues in the source data and you are consistently getting the same columns from a data source.

Conclusion

There is no advantage in loading time when the source file is in compressed format. In fact for the most part uncompressed data loads in the same or faster time than compressed data. We noticed that for CSV and Avro file types you do not need to uncompress for faster load times. For other file types including Parquet and JSON it takes longer to load the data when the file is compressed. Decompression is a CPU-bound activity and your mileage varies based on the number of load slots assigned to your load job. Data loading slots are different from the data querying slots.
For compressed files, you should parallelize the load operation to make sure that data loads are efficient. Split the data files to 256MB or less to avoid spilling the decompression task over to disk.

From a performance perspective Avro, CSV and Parquet files have similar load times. Use the command line to load larger volumes of data for the most efficient data loading. Fixing your schema does load the data faster than with schema autodetect set to ‘ON’. Regarding ETL jobs, it is faster and simpler to do your transformation inside BigQuery using SQL, but if you have complex transformation needs that cannot be done with SQL, use Dataflow for unified batch and streaming, Dataproc for open source based pipelines, or Cloud Data Fusion for no-code / low-code transformation needs.

To learn more about how Google BigQuery can help your enterprise, try out the Quickstarts page here.

Disclaimer: These tests were done with limited resources for BigQuery in a test environment during different times of the day with noisy neighbors, so the actual timings and the number of rows might not be reflective of your test results. The numbers provided here are for comparison's sake only, so that you can choose the right file types, compression and loading technique for your workload.
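The advice to split data files into pieces of 256MB or less can be sketched for the CSV case as follows. This is an illustrative helper, not part of any BigQuery tooling. The function name `split_csv` is hypothetical, and it assumes every record fits on one line (no quoted newlines inside fields) and that each output chunk should repeat the header row so it can be loaded independently.

```python
from typing import Iterable, Iterator, List

MAX_CHUNK_BYTES = 256 * 1024 * 1024  # the 256MB guideline from the article


def split_csv(lines: Iterable[str], max_bytes: int = MAX_CHUNK_BYTES) -> Iterator[List[str]]:
    """Yield chunks of CSV lines, each starting with the header row.

    Assumes one record per line. A chunk may exceed max_bytes only when a
    single row plus the header is already larger than the limit.
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input: nothing to yield
    header_size = len(header.encode("utf-8"))
    chunk, size = [header], header_size
    for line in it:
        line_size = len(line.encode("utf-8"))
        # Start a new chunk when adding this row would exceed the limit.
        if len(chunk) > 1 and size + line_size > max_bytes:
            yield chunk
            chunk, size = [header], header_size
        chunk.append(line)
        size += line_size
    if len(chunk) > 1:
        yield chunk


if __name__ == "__main__":
    rows = ["id,name\n", "1,a\n", "2,b\n", "3,c\n"]
    # A tiny limit is used here only to make the splitting visible.
    for index, part in enumerate(split_csv(rows, max_bytes=16)):
        print(index, part)
```

In practice each chunk would be written to its own object in Cloud Storage and the load job pointed at all of them, so that the load can proceed in parallel.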

Moving data from the mainframe to the cloud made easy

IBM mainframes have been around since the 1950s and are still vital for many organizations. In recent years many companies that rely on mainframes have been working towards migrating to the cloud. This is motivated by the need to stay relevant, the increasing shortage of mainframe experts and the cost savings offered by cloud solutions. One of the main challenges in migrating from the mainframe has always been moving data to the cloud. The good thing is that Google has open sourced a bigquery-zos-mainframe connector that makes this task almost effortless.

What is the Mainframe Connector for BigQuery and Cloud Storage?

The Mainframe Connector enables Google Cloud users to upload data to Cloud Storage and submit BigQuery jobs from mainframe-based batch jobs defined by job control language (JCL). The included shell interpreter and JVM-based implementations of gsutil and bq command-line utilities make it possible to manage a complete ELT pipeline entirely from z/OS. This tool moves data located on a mainframe in and out of Cloud Storage and BigQuery; it also transcodes datasets directly to ORC (a BigQuery supported format). Furthermore, it allows users to execute BigQuery jobs from JCL, therefore enabling mainframe jobs to leverage some of Google Cloud’s most powerful services.

The connector has been tested with flat files created by IBM DB2 EXPORT that contain binary-integer, packed-decimal and EBCDIC character fields that can be easily represented by a copybook. Customers with VSAM files may use IDCAMS REPRO to export to flat files, which can then be uploaded using this tool. Note that transcoding to ORC requires a copybook and all records must have the same layout.
If there is a variable layout, transcoding won’t work, but it is still possible to upload a simple binary copy of the dataset.

Using the bigquery-zos-mainframe-connector

A typical flow for Mainframe Connector involves the following steps:

1. Reading the mainframe dataset
2. Transcoding the dataset to ORC
3. Uploading ORC to Cloud Storage
4. Registering it as an external table
5. Running a MERGE DML statement to load new incremental data into the target table

Note that if the dataset does not require further modifications after loading, then loading into a native table is a better option than loading into an external table.

In regards to step 2, it is important to mention that DB2 exports are written to sequential datasets on the mainframe and the connector uses the dataset’s copybook to transcode it to ORC.

The following simplified example shows how to read a dataset on a mainframe, transcode it to ORC format, copy the ORC file to Cloud Storage, load it to a BigQuery-native table and run SQL that is executed against that table.

1. Check out and compile:

    git clone https://github.com/GoogleCloudPlatform/professional-services
    cd ./professional-services/tools/bigquery-zos-mainframe-connector/

    # compile util library and publish to local maven/ivy cache
    cd mainframe-util
    sbt publishLocal

    # build jar with all dependencies included
    cd ../gszutil
    sbt assembly

2. Upload the assembly jar that was just created in target/scala-2.13 to a path on your mainframe’s unix filesystem.

3. Install the BQSH JCL Procedure to any mainframe-partitioned data set you want to use as a PROCLIB. Edit the procedure to update the Java classpath with the unix filesystem path where you uploaded the assembly jar. You can edit the procedure to set any site-specific environment variables.

4. Create a job

STEP 1:

    //STEP01 EXEC BQSH
    //INFILE DD DSN=PATH.TO.FILENAME,DISP=SHR
    //COPYBOOK DD DISP=SHR,DSN=PATH.TO.COPYBOOK
    //STDIN DD *
    gsutil cp --replace gs://bucket/my_table.orc
    /*

This step reads the dataset from the INFILE DD and reads the record layout from the COPYBOOK DD. The input dataset could be a flat file exported from IBM DB2 or from a VSAM file. Records read from the input dataset are written to the ORC file at gs://bucket/my_table.orc with the number of partitions determined by the amount of data.

STEP 2:

    //STEP02 EXEC BQSH
    //STDIN DD *
    bq load --project_id=myproject \
      myproject:MY_DATASET.MY_TABLE \
      gs://bucket/my_table.orc/*
    /*

This step submits a BigQuery load job that will load ORC file partitions from my_table.orc into MY_DATASET.MY_TABLE. Note this is the path that was written to in the previous step.

STEP 3:

    //STEP03 EXEC BQSH
    //QUERY DD DSN=PATH.TO.QUERY,DISP=SHR
    //STDIN DD *
    bq query --project_id=myproject
    /*

This step submits a BigQuery Query Job to execute SQL DML read from the QUERY DD (a format FB file with LRECL 80). Typically the query will be a MERGE or SELECT INTO DML statement that results in transformation of a BigQuery table. Note: the connector will log job metrics but will not write query results to a file.

Running outside of the mainframe to save MIPS

When scheduling production-level load with many large transfers, processor usage may become a concern.
The Mainframe Connector executes within a JVM process and thus should utilize zIIP processors by default, but if capacity is exhausted, usage may spill over to general purpose processors. Because transcoding z/OS records and writing ORC file partitions requires a non-negligible amount of processing, the Mainframe Connector includes a gRPC server designed to handle compute-intensive operations on a cloud server; the process running on z/OS only needs to upload the dataset to Cloud Storage and make an RPC call. Transitioning between local and remote execution requires only an environment variable change. Detailed information on this functionality can be found here.

Acknowledgements

Thanks to those who tested, debugged, maintained and enhanced the tool: Timothy Manuel, Suresh Balakrishnan, Viktor Fedinchuk, Pavlo Kravets
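To give a feel for what transcoding the binary-integer, packed-decimal and EBCDIC character fields mentioned earlier involves, here is a small standalone sketch. It is not the connector's implementation (the connector is JVM-based and driven by a copybook). The record layout, the function names, and the choice of the `cp037` EBCDIC code page are all assumptions made for illustration.

```python
from decimal import Decimal
from typing import Tuple


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field: two digits per byte,
    with the sign stored in the low nibble of the last byte."""
    if not raw:
        raise ValueError("empty packed-decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # last nibble holds the sign
    if any(digit > 9 for digit in nibbles) or sign < 0x0A:
        raise ValueError("invalid packed-decimal data")
    value = int("".join(str(digit) for digit in nibbles))
    if sign in (0x0B, 0x0D):  # 0xB and 0xD mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)


def decode_record(record: bytes) -> Tuple[int, Decimal, str]:
    """Decode one fixed-layout record.

    Hypothetical layout: 4-byte big-endian signed integer,
    3-byte packed decimal with 2 decimal places, 5 EBCDIC characters.
    """
    number = int.from_bytes(record[0:4], "big", signed=True)
    amount = unpack_comp3(record[4:7], scale=2)
    text = record[7:12].decode("cp037")  # assumed code page
    return number, amount, text


if __name__ == "__main__":
    sample = (42).to_bytes(4, "big", signed=True) + bytes([0x12, 0x34, 0x5D]) + "HELLO".encode("cp037")
    print(decode_record(sample))
```

A real copybook describes many more field types and layouts than this, which is one reason the connector requires all records in a dataset to share the same layout before it can transcode them to ORC.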