Supporting federal agency compliance during the pandemic (and beyond)

If digital transformation was only a trend a few years ago, it’s now quickly becoming a reality for many federal government agencies. The COVID-19 pandemic has pushed all kinds of agencies to reconsider the timelines and impact of their digital initiatives, whether this means moving core technology infrastructure to the cloud, rolling out more modern productivity tools for government employees, or using artificial intelligence to better deliver citizen services.

At Google Cloud, we continue to help federal agencies of all sizes tackle their trickiest problems as they rapidly transform and digitize. At the same time—building on our FedRAMP High authorization announcement from last year—we’re committed to pursuing the latest government certifications, such as the Department of Defense’s (DoD) upcoming Cybersecurity Maturity Model Certification (CMMC), to ensure federal agencies and the vendors that work with them are fully compliant.

Applying intelligent automation to assist the U.S. Patent and Trademark Office

Recently, Accenture Federal Services (AFS) was awarded a position on the U.S. Patent and Trademark Office (USPTO) Intelligent Automation and Innovation Support Services (IAISS) blanket purchase agreement (BPA), a multi-contract vehicle. The five-year BPA includes piloting, testing, and implementing advanced technologies, including intelligent automation, artificial intelligence (AI), microservices, machine learning, natural language processing, robotic process automation, and blockchain. The goal of IAISS is to transform business processes and enhance mission delivery, and it’s expected to be a model for the federal government nationwide. AFS and Google Cloud previously worked with the USPTO to help the agency’s more than 9,000 patent examiners rapidly perform more thorough searches by augmenting their on-premises search tools with Google’s AI. The new solution—created by merging Google’s machine learning models with Accenture’s design, prototyping, and data science capabilities—helps extend examiners’ expertise during the patent search process.

Supporting secure cloud management at the Defense Innovation Unit

We also recently announced that Google Cloud was chosen by the Defense Innovation Unit (DIU)—an organization within the DoD focused on scaling commercial technology across the department—to build a secure cloud management solution to detect, protect against, and respond to cyber threats worldwide. The multi-cloud solution will be built on Anthos, Google Cloud’s app modernization platform, allowing DIU to prototype web services and applications across Google Cloud, Amazon Web Services, and Microsoft Azure—while being centrally managed from the Google Cloud Console. The solution will provide real-time network monitoring, access control, and full audit trails, enabling DIU to maintain its strict cloud security posture without compromising speed and reliability. As a pioneer in zero-trust security and in deploying innovative approaches to protect and secure networks, we look forward to partnering with DIU on this critical initiative.

Supporting Cybersecurity Maturity Model Certification (CMMC) readiness

Finally, while COVID-19 has driven much of how federal agencies work day-to-day, the need for strong cybersecurity protections is as important as ever. At Google Cloud, meeting the highest standards for cybersecurity in an ever-evolving threat and regulatory landscape is one of our primary goals.
In January of this year, the DoD published the Cybersecurity Maturity Model Certification (CMMC), a new standard designed to ensure cyber hygiene throughout the DoD supply chain. While the CMMC standard is not yet operational, the CMMC Advisory Board has advised cloud providers to conduct gap analysis against NIST SP 800-53, NIST SP 800-171, and preliminary versions of CMMC requirements. We’ve contracted with a third-party assessor to perform preliminary analyses of Google Cloud against the underlying CMMC controls, and we’re confident we’ll be able to meet the currently proposed controls—and to provide our customers with the right guidance to empower them in their CMMC journeys. For questions about Google’s existing compliance offerings, FedRAMP, or the CMMC, please contact Google Cloud sales. You can also visit our Compliance Resource Center and Government and Public Sector Compliance page to learn more about how we support your specific compliance needs. And to read more about our work with the public sector, including how we’re helping support agencies through the pandemic, visit our website.
Source: Google Cloud Platform

Grow your cloud career with high-growth jobs and skill badges

Cloud computing and data skills are especially in demand, as organizations increasingly turn to digital solutions to transform the way they work and do business. The World Economic Forum predicts there will be close to a 30 percent increase in demand for data, AI, engineering, and cloud computing roles by 2022. Since April, Google Cloud learners have more than doubled year over year.¹ Of those who have started learning with us in 2020, many are looking to upskill or reskill into stable, well-paying career paths.

To help our expanding community of learners ramp up quickly in their cloud careers, Google Cloud is unveiling a new Grow your cloud career webpage where you can find information on in-demand cloud career paths and free upskilling and reskilling resources. You can earn your first Google Cloud skill badges for your resume, which signify to employers that you have hands-on Google Cloud experience. We also have a special no-cost learning section for small business leaders to help you build your first website and transform your business with data and AI.

If you’re not sure which cloud role is right for you, we recommend exploring these three high-growth career paths.

Data Analyst

By 2025, an estimated 463 exabytes of data is expected to be generated every day. From online purchases, to personal health trackers, to smart factories, and more, the world generates massive amounts of data, but without Data Analysts this data is meaningless. Data Analysts interpret and gather insights from data, enabling better decision making. Their work is instrumental across several industries and for many business functions, including product development, supply chain management, and customer experience. You don’t need a technical background to get started in this role, but you will need to develop foundational skills in SQL (Structured Query Language), data visualization, and data warehousing.

Cloud Engineer

With more than 88 percent of organizations now using cloud and planning to increase their usage, it’s no wonder that the Cloud Engineer role was one of the top in-demand job roles in the U.S. in 2019. Cloud Engineers play a critical role in setting up their company’s infrastructure, deploying applications, and monitoring cloud systems and operations. If you have education or experience in IT, the Cloud Engineer role may be the most natural path for you. It will give you a broad foundation in cloud and expose you to several different functions. Although working in cloud will require a shift in mindset for most people with a traditional IT background, particularly in terms of automated infrastructure, scale, and agile workflows, there are several transferable IT skills that will continue to serve you well in this role.

Cloud Application Developer

For those with a software development background, expanding your skills into cloud development is a must. Cloud offers developers several benefits, including scalability, better security, cost efficiencies, and ease of deployment. As a Cloud Developer, you are responsible for designing, building, testing, deploying, and monitoring highly scalable and reliable cloud-native applications. To upskill into this role, you will need to gain a deep understanding of cloud platforms, databases, and systems integration.

If you’re ready to jumpstart your cloud career, visit our Grow your cloud career page, where you can start upskilling and earning Google Cloud-recognized skill badges for the Data Analyst, Cloud Engineer, or Cloud Developer job roles—get started at no cost here.

1. According to internal data.
Source: Google Cloud Platform

Genomics analysis with Hail, BigQuery, and Dataproc

At Google Cloud, we work with organizations performing large-scale research projects. There are a few solutions we recommend for this type of work, so that researchers can focus on what they do best—powering novel treatments, personalized medicine, and advancements in pharmaceuticals. (Find more details about creating a genomics data analysis architecture in this post.)

Hail is an open source, general-purpose, Python-based data analysis library with additional data types and methods for working with genomic data on top of Apache Spark. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). The Hail team has made their software available to the community under the MIT license, which makes Hail a perfect augmentation to the Google Cloud Life Sciences suite of tools for processing genomics data. Dataproc makes open source data and analytics processing fast, easy, and more secure in the cloud, and offers fully managed Apache Spark, which can accelerate data science with purpose-built clusters.

What makes Google Cloud really stand out from other cloud computing platforms is our healthcare-specific tooling that makes it easy to merge genomic data with datasets from the rest of the healthcare system. When genotype data is harmonized with phenotype data from electronic health records, device data, medical notes, and medical images, the hypothesis space becomes boundless. In addition, with Google Cloud-based analysis platforms like AI Platform Notebooks and Dataproc Hub, researchers can easily work together using state-of-the-art ML tools and combine datasets in a safe and compliant manner.

Getting started with Hail and Dataproc

As of Hail version 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl, which has a submodule called dataproc for working with Dataproc clusters configured for Hail. This means getting started with Dataproc and Hail is as easy as going to the Google Cloud Console and clicking the Cloud Shell icon at the top of the console window. Cloud Shell provides command-line access to your cloud resources directly from your browser, without having to install tools on your system. From this shell, you can install Hail with a single pip command and then create a Dataproc cluster fully configured for Apache Spark and Hail with a single hailctl command (a consolidated sketch of these steps appears at the end of this section).

Once the Dataproc cluster is created, click the Open Editor button in Cloud Shell, which takes you to a built-in editor for creating and modifying code. From this editor, choose New File and call the file my-first-hail-job.py. In the editor, copy and paste a short Hail script (an example is included in the sketch below). By default, the Cloud Shell Editor should save this file, but you can also explicitly save it from the menu where the file was created. Once you have verified the file is saved, return to the command-line terminal by clicking the Open Terminal button and submit the job to the cluster with hailctl.

Once the job starts, find the Dataproc section of the Google Cloud Console and review the output of your genomics job from the Jobs tab. Congratulations, you just ran your first Hail job on Dataproc! For more information, see Using Hail on Google Cloud Platform. Now, we’ll pull Dataproc and Hail into the rest of the clinical data warehouse.
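Before moving on, here’s a consolidated sketch of the getting-started steps described above. It assumes you’re working in Cloud Shell with a project already selected; the cluster name and the contents of my-first-hail-job.py are illustrative rather than taken from the original post.

```bash
# Install Hail (this also installs the hailctl command-line tool)
pip install hail

# Create a Dataproc cluster preconfigured for Apache Spark and Hail
# (uses your gcloud project defaults; the cluster name is illustrative)
hailctl dataproc start my-hail-cluster

# Create a small example job. The simulated-genotype code below is a
# stand-in for whatever Hail analysis you want to run.
cat > my-first-hail-job.py <<'EOF'
import hail as hl

hl.init()

# Simulate a small matrix table of genotypes and print its dimensions
mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
print(mt.count())  # (number of variants, number of samples)
EOF

# Submit the job, then monitor it from the Dataproc Jobs tab in the console
hailctl dataproc submit my-hail-cluster my-first-hail-job.py

# Tear the cluster down when you're done to stop billing
hailctl dataproc stop my-hail-cluster
```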
Create a Dataproc Hub environment for Hail

As mentioned earlier, Hail version 0.2.15 pip installations come bundled with hailctl, a command-line tool that has a submodule called dataproc for working with Dataproc clusters. This includes a fully configured notebook environment that can be opened with a single hailctl command. However, to take advantage of notebook features specific to Dataproc, including the use of Dataproc Hub, you will need to use a Dataproc initialization action that provides a standalone version of Hail without the Hail-provided notebook. To create a Dataproc cluster that provides Hail from within Dataproc’s JupyterLab environment, run a gcloud dataproc clusters create command that includes this Hail initialization action.

Once the cluster has been created (as indicated by the green check mark), click the cluster name, choose the Web Interfaces tab, and click the component gateway link for JupyterLab. From within the Jupyter IDE, you should have a kernel and console for Hail. This running cluster can then easily be translated into a Dataproc Hub configuration by running the Dataproc clusters export command.

Use Dataproc Hub and BigQuery to analyze genomics data

Now that the Jupyter notebook environment is configured with Hail for Dataproc, let’s take a quick survey of the ways we can interact with the BigQuery genotype and phenotype data stored in the insights zone.

Using BigQuery magic to query data into Pandas

It is possible to run a GWAS study directly in BigQuery by using SQL logic to push the processing down into BigQuery. Then, you can bring just the query results back into a Pandas dataframe that can be visualized and presented in a notebook. From a Dataproc Jupyter notebook, you can run BigQuery SQL and get the results back as a Pandas dataframe simply by adding the %%bigquery cell magic to the start of the notebook cell. Find an example of a GWAS analysis performed in BigQuery with a notebook in this tutorial. A feature of BigQuery, BigQuery ML provides the ability to run basic regression techniques and K-means clustering using standard SQL queries.

More commonly, BigQuery is used for preliminary steps in GWAS/PheWAS: feature engineering, defining cohorts of data, and running descriptive analysis to understand the data. Let’s look at some descriptive statistics using the 1000 Genomes variant data hosted by BigQuery public datasets. Let’s say you want to understand what SNP data is available in chromosome 12 from the 1000 Genomes Project. In a Jupyter cell, you can run a %%bigquery query against the variants table that filters on that chromosome. The query will populate a Pandas dataframe with very basic information about the 1000 Genomes Project samples in the cohort, and you can then run standard Python and Pandas functions to review, plot, and understand the data available for that cohort. For more on writing SQL queries against tables that use the variant format, see the Advanced guide to analyzing variants using BigQuery.

Using the Spark to BigQuery connector to work with BigQuery storage directly in Apache Spark

When you need to process large volumes of genomic data for population studies and want to use generic classification and regression algorithms like random forests, naive Bayes, or gradient-boosted trees, or you need help extracting or transforming features with algorithms like PCA or one-hot encoding, Apache Spark offers these ML capabilities, among many others. Using the Apache Spark BigQuery connector from Dataproc, you can now treat BigQuery as another source to read and write data from Apache Spark.
This is achieved in nearly the same way as any other Spark dataframe setup (a sketch appears at the end of this post). Learn more here about the Apache Spark to BigQuery storage integration and how to get started.

Run Variant Transforms to convert BigQuery data into VCF for genomics tools like Hail

When you want to do genomics-specific tasks, Hail can provide a layer on top of Spark that can be used to:

- Generate variant and sample annotations
- Understand Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples, and compute sample scores and variant loadings using PCA
- Perform variant, gene-burden, and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability

Hail expects the data to start in VCF, BGEN, or PLINK format. Luckily, BigQuery genomics data can easily be converted from the BigQuery variant format into a VCF file using Variant Transforms. Once you create the VCF on Cloud Storage, call Hail’s import_vcf function, which transforms the file into Hail’s matrix table. To learn more about scalable genomics analysis with Hail, check out this YouTube series for Hail from the Broad Institute.
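To make those last two steps concrete, here is a minimal sketch of reading a BigQuery table into a Spark dataframe with the Spark BigQuery connector, and of loading a VCF file from Cloud Storage into a Hail matrix table. The table, bucket, and file names are placeholders, and the options shown are the commonly documented ones rather than values from the original post.

```python
# Read a BigQuery variants table into a Spark dataframe using the
# Spark BigQuery connector (available on Dataproc clusters).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-genomics-example").getOrCreate()

# Placeholder table name: replace with your project.dataset.table
variants_df = (
    spark.read.format("bigquery")
    .option("table", "my-project.genomics_insights.variants")
    .load()
)
variants_df.printSchema()

# Load a VCF (for example, one exported by Variant Transforms) into Hail.
import hail as hl

hl.init(sc=spark.sparkContext)  # reuse the existing Spark context

# Placeholder Cloud Storage path; block-gzipped VCFs need force_bgz=True
mt = hl.import_vcf(
    "gs://my-bucket/exports/cohort.vcf.bgz",
    force_bgz=True,
    reference_genome="GRCh38",
)
print(mt.count())  # (number of variants, number of samples)
```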
Source: Google Cloud Platform

11 best practices for operational efficiency and cost reduction with Google Cloud

As businesses consider the road ahead, many are finding they need to make tough decisions about which projects to prioritize and how to allocate resources. For many, the impact of COVID-19 has brought the benefits and limitations of their IT environment into focus. As these businesses plan their way forward, many will need to consider how to meet the needs of their new business realities with limited resources.

This is a challenge ideally suited for IT—particularly for any business overly reliant on legacy infrastructure. A recent McKinsey study found that legacy systems account for 74% of a company’s IT spend while hampering agility at the same time. Making fundamental IT changes like migrating on-premises workloads to the cloud can reduce costs, increase agility, and pay ROI dividends down the line.

All of this is covered in our new eGuide, Solving for operational efficiency with Google Cloud. While modernization will look different for businesses of varying sizes and in varying industries, the benefits of moving to the cloud are broad and universal. These include:

- Increasing agility and reducing IT costs by adopting hybrid and multi-cloud strategies.
- Driving a higher return on ERP investments by migrating SAP systems to the cloud.
- Avoiding costly hardware refreshes and reducing on-premises infrastructure costs by migrating VMs to the cloud.
- Increasing scalability and gaining access to advanced analytics through data warehouse modernization.
- Making cluster management easier and more cost-effective by migrating on-premises Apache Hadoop clusters to the cloud.
- Gaining cost efficiencies by running specialized workloads in the cloud with a scalable Bare Metal Solution.
- Increasing flexibility and decreasing on-premises investments by migrating Windows workloads to the cloud.
- Embracing a modern architecture for scalability and cost efficiencies by offloading a mainframe environment.
- Leveraging AI to rapidly respond to customer needs and improve customer experience.
- Gaining more visibility and control to lower costs with billing and cost management tools.
- Improving productivity by transforming the way teams work together with cloud-native collaboration tools.

With current business conditions, organizations need facts, knowledge, and best practices so they can prioritize investments and optimize costs. Our eGuide provides an overview of the key areas where we see our customers prioritizing their investments and creating operational efficiencies, and highlights the many ways Google Cloud can support you in your journey. Read the eGuide. If you want more customized recommendations, take advantage of our IT Cost Assessment program, which will analyze your individual IT spend against industry benchmark data and provide you with a view of cost optimization opportunities. Learn more here.
Source: Google Cloud Platform

Building a genomics analysis architecture with Hail, BigQuery, and Dataproc

We hear from our users in the scientific community that having the right technology foundation is essential. The ability to very quickly create entire clusters for genomics processing, where billing can be stopped once you have the results you need, is a powerful tool. It empowers the scientific community to spend more time doing their research and less time fighting for on-prem cluster time and configuring software.

At Google Cloud, we’ve developed healthcare-specific tooling that makes it easy for researchers to look at healthcare and genomic data holistically. Combining genotype data with phenotype data from electronic health records (EHRs), device data, medical notes, and medical images makes scientific hypotheses limitless. And our analysis platforms like AI Platform Notebooks and Dataproc Hub let researchers easily work together using state-of-the-art ML tools and combine datasets in a safe and compliant manner.

Building an analytics architecture for genomic association studies

Genome-wide association studies (GWAS) are one of the most prevalent ways to study which genetic variants are associated with a human trait, otherwise known as a phenotype. Understanding the relationships between our genetic differences and phenotypes such as diseases and immunity is key to unlocking medical understanding and treatment options. Historically, GWAS studies were limited to phenotypes gathered during a research study. These studies were typically siloed, separate from day-to-day clinical data. However, the increased use of EHRs for data collection, coupled with natural language processing (NLP) advances that unlock the data in medical notes, has created an explosion of phenotype data available for research. In fact, phenome-wide association studies (PheWAS) are gaining traction as a complementary way to study the same associations that GWAS provides, but starting from the EHR data.

In addition, the amount of genomics data now being created is causing storage bottlenecks. This is especially relevant as clinical trials move toward the idea of basket trials, where patients are sequenced for hundreds of genes up front, then matched to a clinical trial for a gene variant. While all of this data is a boon for researchers, most organizations are struggling to provide their scientists with a unified platform for analyzing this data in a way that balances respecting patient privacy with sharing data appropriately with other collaborators.

Google Cloud’s data lake empowers researchers to securely and cost-effectively ingest, store, and analyze large volumes of data across both genotypes and phenotypes. When this data lake infrastructure is combined with healthcare-specific tooling, it’s easy to store and translate a variety of healthcare formats, as well as reduce toil and complexity. Researchers can move at the speed of science instead of the speed of legacy IT. A recent epidemiology study cited BigQuery as a “cloud-based tool to perform GWAS,” and suggests that a future direction for PheWAS “would be to extend existing [cloud platform] tools to perform large-scale PheWAS in a more efficient and less time-consuming manner.” The architecture we’ll describe here offers one possible solution to doing just that.
GWAS/PheWAS architecture on Google Cloud

The goal of the GWAS/PheWAS architecture described below is to provide a modern data analytics architecture that will:

- Safely and cost-effectively store a variety of large-scale raw data types, which can be interpreted or feature-engineered differently by scientists depending on their research tasks
- Offer flexibility in analysis tools and technology, so researchers can choose the right tool for the job, across both Google Cloud and open source software
- Accelerate the number of questions asked and increase the amount of scientific research that can be done by:
  - Reducing the time scientists and researchers spend implementing and configuring IT environments for their various tools
  - Increasing access to compute resources that can be provisioned as needed
- Make it easy to share and collaborate with outside institutions while maintaining control over data security and compliance requirements

Check out full details on our healthcare analytics platform, including a reference architecture. The architecture depicted here represents one of many ways to build a data infrastructure on Google Cloud. The zones noted in the diagram are logical areas of the platform that make it easier to explain the purpose of each area. These logical zones are not to be confused with Google Cloud’s zones, which are physical definitions of where resources are located. This particular architecture is designed to enable data scientists to perform GWAS and PheWAS analysis using Hail, Dataproc Hub, and BigQuery.

Here’s more detail on each of the components.

Landing zone

The landing zone, also referred to by some customers as their “raw zone,” is where data is ingested in its native format, without transformations or making any assumptions about what questions might be asked of it later. For the most part, Cloud Storage is well suited to serve as the central repository for the landing zone. It is easy to bring genomic data stored in raw variant call format (VCF) or SAM/BAM/CRAM files into this durable and cost-effective storage. A variety of other sources, such as medical device data, cost analysis, medical billing, registry databases, finance, and clinical application logs, are also well suited for this zone, with the potential to be turned into phenotypes later. Take advantage of storage classes to get low-cost, highly durable storage for infrequently accessed data.

For clinical applications that use the standard healthcare formats of HL7v2, DICOM, and FHIR, the Cloud Healthcare API makes it easy to ingest the data in its native format and tap into additional functionality, such as:

- Automated de-identification
- Direct exposure to the AI Platform for machine learning
- Easy export into BigQuery, our serverless cloud data warehouse

Transformation and harmonization

The goal of this particular architecture is to prepare our data for use in BigQuery. Cloud Data Fusion has a wide range of prebuilt plugins for parsing, formatting, compressing, and converting data. Cloud Data Fusion also includes Wrangler, a visualization tool that interactively filters, cleans, formats, and projects the data based on a small sample (1,000 rows) of the dataset. Cloud Data Fusion generates pipelines that run on Dataproc, making it easy to extend Data Fusion pipelines with additional capabilities from the Apache Spark ecosystem. Data Fusion can also help track lineage between the landing and refined zones.
For a more complete discussion of preparing health data for BigQuery, check out Transforming and harmonizing healthcare data for BigQuery.

Direct export to BigQuery

BigQuery is used as the centerpiece of our refined and insights zones, so many healthcare and life science formats can be directly exported into BigQuery. For example, a FHIR store can be converted to a BigQuery dataset with a single command-line call of gcloud beta healthcare fhir-stores export bq. See this tutorial for more information on ingesting FHIR data into BigQuery. When it comes to VCF files, the Variant Transforms tool can load VCF files from Cloud Storage into BigQuery. Under the hood, this tool uses Dataflow, a processing engine that can scale to loading and transforming hundreds of thousands of samples and billions of records. Later in this post, we’ll discuss using the same Variant Transforms tool to convert data back from BigQuery into VCF.

Refined zone

The refined zone in this genomics analysis architecture contains our structured, yet somewhat disconnected, data. Datasets tend to be associated with specific subject areas but standardized by Cloud Data Fusion to use specific structures (for example, aligned on SNOMED, a single VCF format, unified patient identity, and so on). The idea is to make this zone the source of truth for your tertiary analysis. Since the data is structured, BigQuery can store this data in the refined zone, but also start to expose analysis capabilities, so that:

- Subject matter experts can be given controlled access to the datasets in their area of expertise
- ETL/ELT writers can use standard SQL to join and further normalize tables that combine various subject areas
- Data scientists can run ML and advanced data processing on these refined datasets using Apache Spark on Dataproc via the BigQuery connector with Spark

Insights zone

The insights zone is optimized for analytics and will include the datasets, tables, and views designed for specific GWAS/PheWAS studies. BigQuery authorized views let you share information with specified users and groups without giving them access to the underlying tables (which may be stored in the refined zone). Authorized views are often an ideal way to share data in the insights zone with external collaborators. Keep in mind that BigQuery (in both the insights and refined zones) separates storage from compute, so you only pay for the processing needed for your study. At the same time, BigQuery still provides many of the data warehouse capabilities that are often needed for a collaborative insights zone, such as managed metadata, ACID operations, snapshot isolation, mutations, and integrated security. For more on how BigQuery storage provides a data warehouse without the limitations associated with traditional data warehouse storage, check out Data warehouse storage or a data lake? Why not both?

Research and analysis

For the actual scientific research, our architecture uses managed JupyterLab notebook instances from AI Platform Notebooks. This enterprise notebook experience unifies the model training and deployment offered by AI Platform with the ingestion, preprocessing, and exploration capabilities of Dataproc and BigQuery. This architecture uses Dataproc Hub, a notebook framework that lets data scientists select the Spark-based predefined environment they need without having to understand all the possible configurations and required operations.
Data scientists can combine this added simplicity with genomics packages like Hail to quickly create isolated sandbox environments for running genomic association studies with Apache Spark on Dataproc. To get started with genomics analysis using Hail and Dataproc, check out part two of this post.
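To make the “Direct export to BigQuery” step above concrete, here is a minimal example of the gcloud call mentioned in that section. The project, dataset, and store names are placeholders, and the flag names follow the Cloud Healthcare API documentation rather than the original post, so verify them against the current gcloud reference before use.

```bash
# Export a FHIR store to a BigQuery dataset in the refined zone
# (all resource names below are placeholders).
gcloud beta healthcare fhir-stores export bq my-fhir-store \
    --dataset=my-healthcare-dataset \
    --location=us-central1 \
    --bq-dataset=bq://my-project.refined_zone
```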
Source: Google Cloud Platform

Presto optional component now available on Dataproc

Presto is an open source, distributed SQL query engine for running interactive analytics queries against data sources of many types. We are pleased to announce the GA release of the Presto optional component for Dataproc, our fully managed cloud service for running data processing software from the open source ecosystem. This new optional component brings the full suite of support from Google Cloud, including fast cluster startup times and integration testing with the rest of Dataproc. The Presto release for Dataproc comes with several new features that improve the experience of using Presto, including out-of-the-box BigQuery integration, Presto UI support in Component Gateway, JMX and logging integrations with Cloud Monitoring, Presto job submission for automating SQL commands, and improvements to the Presto JVM configuration.

Why use Presto on Dataproc

Presto provides a fast and easy way to process and perform ad hoc analysis of data from multiple sources, across both on-premises systems and other clouds. You can seamlessly run federated queries across large-scale Dataproc instances and other sources, including BigQuery, HDFS, Cloud Storage, MySQL, Cassandra, or even Kafka. Presto can also help you plan out your next BigQuery extract, transform, and load (ETL) job. You can use Presto queries to better understand how to link the datasets, determine what data is needed, and design a wide, denormalized BigQuery table that encapsulates information from multiple underlying source systems. Check out a complete tutorial of this approach.

With Presto on Dataproc, you can accelerate data analysis because the Presto optional component takes care of much of the overhead required to get started with Presto. Presto coordinators and workers are managed for you, and you can use an external metastore such as Hive to manage your Presto catalogs. You also have access to Dataproc features like initialization actions and Component Gateway, which now includes the Presto UI. Here are additional details about the benefits Presto on Dataproc offers:

Better JVM tuning

We’ve configured the Presto component to have better garbage collection and memory allocation properties based on the established recommendations of the Presto community. To learn more about configuring your cluster, check out the Presto docs.

Integrations with BigQuery

BigQuery is Google Cloud’s serverless, highly scalable, and cost-effective cloud data warehouse. With the Presto optional component, the BigQuery connector is available by default to run Presto queries on data in BigQuery by making use of the BigQuery Storage API. To help you get started out of the box, the Presto optional component also comes with two BigQuery catalogs installed by default: bigquery, for accessing data in the same project as your Dataproc cluster, and bigquery_public_data, for accessing BigQuery’s public datasets project. You can also add your own catalog when creating a cluster via cluster properties.
Adding the appropriate catalog properties to your cluster creation command will create a catalog named bigquery_my_other_project for access to another project called my-other-project. Note: This is currently supported only on Dataproc image version 1.5 or preview image version 2.0, as Presto version 331 or above is required for the BigQuery connector.

Use an external metastore to keep track of your catalogs

While catalogs can be added to your Presto cluster at creation time, you can also keep track of your Presto catalogs by using an external metastore such as Hive, added to your cluster configuration through additional cluster properties at creation time. The Dataproc Metastore, now accepting alpha customers, provides a completely managed and serverless option for keeping your Presto metadata accessible from multiple Dataproc clusters, and lets you share tables with other processing engines like Apache Spark and Apache Hive.

Create a Dataproc cluster with the Presto optional component

You can create a Dataproc cluster by selecting a region and enabling the Presto, Anaconda, and Jupyter optional components along with Component Gateway. Including the Jupyter optional component and the necessary Python dependencies also lets you run Presto commands from a Jupyter notebook. (A consolidated sketch of the commands in this section appears at the end of this post.)

Submit Presto jobs with the gcloud command

You can use Dataproc’s Presto Jobs API to submit Presto commands to your Dataproc cluster. For example, executing the “SHOW CATALOGS;” Presto command returns the list of catalogs available to you.

Query BigQuery public datasets

BigQuery datasets are known as schemas in Presto. To view the full list of datasets, use the SHOW SCHEMAS command. Then, run the SHOW TABLES command to see which tables are in a dataset. For this example, we’ll use the chicago_taxi_trips dataset: you can submit a Presto SQL query against its taxi_trips table directly, or save your Presto SQL queries in a file, such as taxi_trips.sql, and submit that file to the cluster.

Submit Presto SQL queries using Jupyter notebooks

Using Dataproc Hub or the Jupyter optional component, together with ipython-sql, you can execute Presto SQL queries from a Jupyter notebook: load the ipython-sql extension in the first cell of your notebook, then run ad hoc Presto SQL queries in subsequent cells.

Access the Presto UI directly from the Cloud Console

You can now access the Presto UI without needing to SSH into the cluster, thanks to Component Gateway, which creates a link that you can access from the cluster page in the Cloud Console. With the Presto UI, you can monitor the status of your coordinators and workers, as well as the status of your Presto jobs.

Logging, monitoring, and diagnostic tarball integrations

Presto jobs are now integrated into Cloud Monitoring and Cloud Logging to better track their status. By default, Presto job information is not shown in the main cluster monitoring page for Dataproc clusters. However, you can easily create a new dashboard using Cloud Monitoring and the Metrics Explorer. To create a chart for all Presto jobs on your cluster, select the resource type Cloud Dataproc Cluster and the metric Job duration.
Then apply a filter to show only job_type = PRESTO_JOB and use the mean aggregator. In addition to Cloud Monitoring, Presto server and job logs are available in Cloud Logging. Finally, Presto config and log information now also comes bundled in your Dataproc diagnostic tarball, which you can download with the gcloud dataproc clusters diagnose command (included in the sketch below).

To get started with Presto on Cloud Dataproc, check out this tutorial on using Presto with Cloud Dataproc. And use the Presto optional component to create your first Presto on Dataproc cluster.
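For reference, here is a consolidated sketch of the commands described in this post. The cluster name and region are illustrative, and the flags shown are the standard gcloud options for these operations rather than an exact reproduction of the original snippets.

```bash
# Create a Dataproc cluster with the Presto, Anaconda, and Jupyter optional
# components and Component Gateway enabled (names and region are illustrative).
gcloud beta dataproc clusters create presto-demo-cluster \
    --region=us-central1 \
    --image-version=1.5 \
    --optional-components=PRESTO,ANACONDA,JUPYTER \
    --enable-component-gateway

# Submit a Presto command with the Presto Jobs API to list available catalogs.
gcloud dataproc jobs submit presto \
    --cluster=presto-demo-cluster \
    --region=us-central1 \
    --execute="SHOW CATALOGS;"

# Submit a Presto SQL query saved in a file (this example queries the
# chicago_taxi_trips public dataset through the bigquery_public_data catalog).
echo "SELECT count(*) FROM bigquery_public_data.chicago_taxi_trips.taxi_trips;" > taxi_trips.sql
gcloud dataproc jobs submit presto \
    --cluster=presto-demo-cluster \
    --region=us-central1 \
    --file=taxi_trips.sql

# Download the diagnostic tarball, which now bundles Presto config and logs.
gcloud dataproc clusters diagnose presto-demo-cluster --region=us-central1
```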
Source: Google Cloud Platform

Not just compliance: reimagining DLP for today’s cloud-centric world

As the name suggests, data loss prevention (DLP) technology is designed to help organizations monitor, detect, and ultimately prevent attacks and other events that can result in data exfiltration and loss. The DLP technology ecosystem—covering network DLP, endpoint DLP, and data discovery DLP—has a long history going back nearly 20 years, and with data losses and leaks continuing to impact organizations, it remains an important security control. In this blog, we’ll look back at the history of DLP before discussing how DLP is useful in today’s environment, including compliance, security, and privacy use cases.

DLP history

Historically, however, DLP technologies have presented some issues that organizations have found difficult to overcome, including:

- Disconnects between business and IT
- Mismatched expectations
- Deployment headwinds
- DLP alert triage difficulties

DLP solutions were also born in the era when security technologies were typically hardware appliances or deployable software—while the cloud barely existed as a concept—and most organizations were focused on perimeter security. This meant that DLP was focused largely on blocking or detecting data as it crossed the network perimeter. With the cloud and other advances, this is not the reality today, and often neither the users nor the applications live within the perimeter. This new reality means we have to ask new questions:

- How do you reinvent DLP for today’s world, where containers, microservices, mobile phones, and scalable cloud storage coexist with traditional PCs and even mainframes?
- How does DLP apply in a world where legacy compliance mandates coexist with modern threats and evolving privacy requirements?
- How does DLP evolve away from some of the issues that have hurt its reputation among security professionals?

DLP today

Let’s start with where some of the confusion around DLP use cases comes from. While DLP technology is rarely cited as a control in regulations today (here’s an example), for a few years it was widely considered primarily a compliance solution. Despite that compliance focus, some organizations used DLP technologies to support their threat detection mission, using it to detect intentional data theft and risky data negligence. Today, DLP is employed to support privacy initiatives and is used to monitor (and minimize the risk to) personal data in storage and in use. Paradoxically, at some organizations these DLP domains sometimes conflict with each other. For example, if granular monitoring of employees for insider threat detection is implemented incorrectly, it may conflict with privacy policies.

The best uses for DLP today live under a triple umbrella of security, privacy, and compliance. It should cover use cases from all three domains, and do so without overburdening the teams operating it. Modern DLP is also a natural candidate for cloud migration due to its performance profile. In fact, DLP needs to move to the cloud simply because so much enterprise data is quickly moving there. To demonstrate how DLP can work for compliance, security, and privacy in this new cloud world, let’s break down a Cloud DLP use case from each domain to illustrate some tips and best practices.

Compliance

Many regulations focus on protecting one particular type of data—payment data, personal health information, and so on. This can lead to challenges like how to find that particular type of data so that you can protect it in the first place. Of course, every organization strives to have well-governed data that can be easily located.
We also know that in today’s world, where large volumes of data are stored across multiple repositories, this is easier said than done. Let’s look at the example of the Payment Card Industry Data Security Standard (PCI DSS), an industry mandate that covers payment card data. (Learn more about PCI DSS on Google Cloud here.) In many cases going back 10 years or more, the data that was in scope for PCI DSS—i.e., payment card numbers—was often found outside of what was considered to be the Cardholder Data Environment (CDE). This pushed data discovery to the forefront, even before cloud environments became popular.

Today, the need to discover “toxic” data—i.e., data that can lead to potentially painful compliance efforts, like payment card numbers—is even stronger, and data discovery DLP is a common method for finding this “itinerant” payment data. When moving to the cloud, the same logic applies: you need to scan your cloud resources for card data to assure that there is no regulated data outside the systems or components designated to handle it. This use case should become part of what PCI DSS now calls “BAU,” or business as usual, rather than an assessment-time activity. A good practice is to conduct a periodic broad scan of many locations, followed by a deep scan of “high-risk” locations where such data has been known to accidentally appear. This may also be combined with a deep and broad scan before each audit or assessment, whether quarterly or annually. For specific advice on how to optimally configure Google Cloud DLP for this use case, review these pages.

Security

DLP technologies are also useful in security risk reduction projects. With data discovery, for example, some obvious security use cases include detecting sensitive data that’s accessible to the public when it should not be, and detecting access credentials in exposed code. DLP equipped with data transformation capabilities can also address a long list of use cases focused on making sensitive data less sensitive, with the goal of making it less risky to keep and thus less appealing to cybercriminals. These use cases range from the mundane, like tokenization of bank account numbers, to the esoteric, like protecting AI training data pipelines from intentionally corrupt data. This approach of rendering valuable, “theft-worthy” data harmless is underused in modern data security practice, in part because of a lack of tools that make it easy and straightforward compared to, say, merely using data access controls. Where specifically can you apply this method? Account numbers, access credentials, other secrets, and even data that you don‘t want a particular employee to see, such as customer data, are great candidates. Note that in some cases the focus is not on making the data less attractive to external attackers, but on reducing the temptation for internal attackers looking for low-hanging fruit.

Privacy

Using DLP for privacy presented a challenge when it was first discussed. This is because some types of DLP—such as agent-based endpoint DLP—collect a lot of information about the person using the system where the agent is installed. In fact, DLP was often considered to be a privacy risk, not a privacy protection technology. Google Cloud DLP, however, was born as a privacy protection technology even before it became a security technology. Types of DLP that can discover, transform, and anonymize data—whether in storage or in motion (as a stream)—present clear value for privacy-focused projects.
The range of use cases that involve transforming data that poses a privacy risk is broad, and includes names, addresses, ages (yes, even age can reveal a person’s identity when small groups are analyzed), phone numbers, and so on. For example, consider the case where data is used for marketing purposes (such as trend analysis), but the production datastores are queried. It would be prudent in this case to transform the data in a way that retains its value for the task at hand (it still lets you see the right trend) but removes the risk of it being misused (such as by removing the bits that can lead to identification of a person). There are also valuable privacy DLP use cases where two datasets with lesser privacy risk are combined, creating a dataset with dramatically higher risk. This may come, for example, from a retailer merging a customer’s shopping history with their location history (such as visits to the store). It makes sense to measure the re-identification risks and transform the datasets either before or after merging to reduce the risk of unintentional exposure.

What’s next

We hope these examples help show that modern cloud-native DLP can be a powerful solution for some of today’s data challenges. If you’d like to learn more about Google Cloud DLP and how it can help your organization, here are some things to try:

- First, adopt DLP as an integral part of your data security, compliance, or privacy program, not a standalone product to be purchased and used in isolation.
- Second, review your needs and use cases, for example the types of sensitive data you need to secure.
- Third, review Google Cloud DLP materials, including this video and these blogs. For privacy projects, specifically review our guidance on de-identification of personal data.
- Fourth, implement one or a very small number of use cases to learn the specific lessons of applying DLP in your particular environment. For example, for many organizations the starting use case is likely to be scanning to discover one type of data in a particular repository.

We built Google Cloud DLP for this new era, its particular use cases, and its cloud-native technology. Check out our Cloud Data Loss Prevention page for more resources on getting started.
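To ground the discovery and de-identification use cases above, here’s a minimal sketch using the Cloud DLP Python client: it inspects a string for card numbers and then masks them. The project ID and sample text are placeholders, and the request shapes follow the public google-cloud-dlp samples, so check them against the client library version you are using.

```python
# Minimal Cloud DLP sketch: inspect text for card numbers, then mask them.
# The project ID and sample text are placeholders.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # placeholder project ID

item = {"value": "Customer card: 4111 1111 1111 1111"}
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}],
    "include_quote": True,
}

# 1) Discovery: report any card numbers found in the text.
inspect_response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in inspect_response.result.findings:
    print(finding.info_type.name, finding.likelihood)

# 2) Transformation: mask the card number so the data is less sensitive.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }
        ]
    }
}
deidentify_response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item,
    }
)
# Prints the input with the detected card number replaced by '#' characters.
print(deidentify_response.item.value)
```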
Source: Google Cloud Platform

Reinforcing our commitment to privacy with accredited ISO/IEC 27701 certification

For decades, there has been a growing focus on privacy in technology, with laws such as the EU’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act, and the Australian Privacy Principles providing guidance on how to protect and maintain user privacy. Privacy has always been a priority at Google, and we’re continuously evolving to help our customers directly address global privacy and data protection requirements. Today, we’re pleased to announce that Google Cloud is the first major cloud provider to receive an accredited ISO/IEC 27701 certification as a data processor.

Published in 2019, ISO/IEC 27701 is a global standard designed to help organizations align with international privacy frameworks and laws. It provides guidance for implementing, maintaining, and continuously improving a Privacy Information Management System (PIMS), and can be used by both data controllers and processors—a key consideration for organizations that must align with the GDPR. ISO/IEC 27701 is an extension of the security industry best practices that are codified in ISO/IEC 27001, which outlines the requirements for an information security management system (ISMS).

Unlocking the benefits of ISO 27701

Coalfire ISO, an independent third party, issued an accredited certificate of registration for ISO/IEC 27701 to Google Cloud Platform (GCP). This accredited certificate shows that Google’s PIMS for GCP (as shown in the certificate’s scope) conforms to the ISO/IEC 27701 requirements, and that the body conducting the audit and issuing the certificate did so in accordance with the International Accreditation Forum (IAF)/ANSI National Accreditation Board (ANAB) requirements. This means that the certificate will be recognized by other IAF-accredited audit and certification bodies under the IAF Multilateral Recognition Agreement (MLA). Our accredited certification demonstrates Google Cloud’s long-standing commitment to privacy and to providing the most trusted experience for our customers. By meeting the rigorous standards outlined in ISO/IEC 27701, Google Cloud customers can leverage the many benefits of our certification, including:

- A universal set of privacy controls, verified by a trusted third party in accordance with the requirements of their accreditation body, that can serve as a solid foundation for the implementation of a privacy program
- The ability to rely on Google Cloud Platform’s accredited ISO/IEC 27701 certification in your own compliance efforts
- Reduced time and expense for both internal and third-party auditors, who can now demonstrate compliance with several privacy objectives within a single audit cycle
- Greater clarity on privacy-related roles and responsibilities, which can facilitate efforts to comply with privacy regulations such as the GDPR

Our commitment to customers

Certifications provide independent validation of our ongoing commitment to world-class security and privacy, while also helping customers with their own compliance efforts. You can find more information on Google Cloud’s compliance efforts and our commitment to privacy in our compliance resource center.
Source: Google Cloud Platform

Dataproc Metastore: Fully managed Hive metastore now available for alpha testing

Google Cloud is announcing a new data lake building block for our smart analytics platform: Dataproc Metastore, a fully managed, highly available, auto-healing, open source Apache Hive metastore service that simplifies technical metadata management for customers building data lakes on Google Cloud. With Dataproc Metastore, you now have a completely serverless option for several use cases:

- A centralized metadata repository that can be shared among various ephemeral Dataproc clusters running different open source engines, such as Apache Spark, Apache Hive, and Presto
- A metadata bridge between open source tables and code-free ETL/ELT with Data Fusion
- A unified view of your open source tables across Google Cloud, providing interoperability between cloud-native services like Dataproc and various other open source-based partner offerings on Google Cloud

To get started with Dataproc Metastore today, join our alpha program by reaching out by email: join-dataproc-metastore-alpha@google.com.

Why Hive Metastore?

A core benefit of Dataproc is that it lets you create a fully configured, autoscaling Hadoop and Spark cluster in around 90 seconds. This rapid creation and flexible compute platform makes it possible to treat cluster creation and job processing as a single entity. When the job completes, the cluster can terminate, and you pay only for the Dataproc resources required to run your jobs. However, information about tables—the metadata—created during those jobs is not always something that you want to be thrown out with the cluster. You often want to keep that table information between jobs or make the metadata available to other clusters and other processing engines.

If you use open source technologies in your data lakes, you likely already use the Hive metastore as the trusted metastore for big data processing. The Hive metastore has become the standard mechanism that open source data systems use to share data structures, and a broad ecosystem is already built around its capabilities.

However, this same Hive metastore can be a friction point for customers who need to run their data lakes on Google Cloud. Today, Dataproc customers often use Cloud SQL to persist Hive metadata off-cluster, but we’ve heard about some challenges with this approach:

- You must self-manage and troubleshoot the RDBMS Cloud SQL instance.
- Hive servers are managed independently of the RDBMS, which can create both scalability issues for incoming connections and locking issues in the database.
- The Cloud SQL instance is a single point of failure that requires a maintenance window with downtime, making it impossible to use with data lakes that need always-on processing.
- This architecture requires that direct JDBC access be provided to each cluster, which can introduce security risks when used with sensitive data.

To trust that the Hive metastore can serve in the critical path for all your data processing jobs, your other option is to move beyond the Cloud SQL workaround and spend significant time architecting a highly available IaaS layer that includes load balancing, autoscaling, installations and updates, testing, and backups. The Dataproc Metastore abstracts all of this toil and provides these capabilities as features of a managed service. Enterprise customers have told us they want a managed Hive metastore that they can rely on for running business-critical data workloads in Google Cloud data lakes.
In addition, customers have expressed a desire for the full, open source-based Hive metastore catalog that maintains their integration points with numerous applications, can provide table statistics for query optimization, and supports Kerberos authentication so that existing security models based on tools like Apache Ranger and Apache Atlas continue to function. We also hear that customers want to avoid a new client library that would require a rewrite of existing software, or a “compatible” API that offers only limited Hive metastore functionality. Enterprise customers want to use the full features of the open source Hive metastore.

The Dataproc Metastore team has accepted this challenge and now provides a fully serverless Hive metastore service. Dataproc Metastore complements Google Cloud Data Catalog, a fully managed and highly scalable data discovery and metadata management service. Data Catalog empowers organizations to quickly discover, understand, and manage all their data with simple, easy-to-use search interfaces, while Dataproc Metastore offers technical metadata interoperability among open source big data processing engines.

Common use cases for Dataproc Metastore

Flexible analysis of your data lake with a centralized metadata repository

When German wholesale giant METRO moved their ecommerce data lake to Google Cloud, they were able to match daily events to compute processing and reduce infrastructure costs by 30% to 50%. The key to these types of gains when it comes to data lakes is severing the ties between storage and compute. By disconnecting the storage layer from compute clusters, your data lake gains flexibility. Not only can clusters come up and down as needed, but cluster specifications like vCPUs, GPUs, and RAM can be tailored to the specific needs of the jobs at hand. Dataproc already offers several features that help you achieve this flexibility:

- The Cloud Storage connector lets you take data off your cluster by providing Cloud Storage as a Hadoop Compatible File System (HCFS). Jobs based on data in the Hadoop Distributed File System (HDFS) can typically be converted to Cloud Storage with a simple file prefix change (more on HDFS vs. Cloud Storage here).
- Workflow Templates provide an easy-to-use mechanism for managing and executing workflows. You can specify a set of jobs to run on a managed cluster that gets created on demand and deleted when the jobs are finished.
- Dataproc Hub makes it easy to give data scientists, analysts, and engineers preconfigured Spark working environments in JupyterLab that automatically spawn and destroy Dataproc clusters without an administrator.

Now, with Dataproc Metastore, achieving flexible clusters is even easier for clusters that need to share tables and schemas. Clusters of various shapes, sizes, and processing engines can safely and efficiently share the same tables and metadata simply by pointing a Dataproc cluster at a serverless Dataproc Metastore endpoint (a minimal sketch appears at the end of this post).

Serverless and code-free ETL/ELT with Dataproc Metastore and Data Fusion

We’ve heard from customers that they’re able to use real-time data to improve customer service, network optimization, and more to save time and reach customers effectively. Companies building data pipelines can use Data Fusion, our fully managed, code-free, cloud-native data integration service that lets you easily ingest and integrate data from various sources. Data Fusion is built with an open source core (CDAP), which offers a Hive source plugin.
Serverless and code-free ETL/ELT with Dataproc Metastore and Data Fusion

We’ve heard from customers that they use real-time data to improve customer service, network optimization, and more, helping them save time and reach customers effectively. Companies building data pipelines can use Data Fusion, our fully managed, code-free, cloud-native data integration service that lets you easily ingest and integrate data from various sources. Data Fusion is built with an open source core (CDAP), which offers a Hive source plugin. With this plugin, data scientists and other users of the data lake can share the structured results of their analysis through Dataproc Metastore, giving ETL/ELT developers a shared repository they can use to manage and productionize pipelines in the data lake. Below is one example of a workflow that uses Dataproc Metastore with Data Fusion to manage data pipelines, so you can go from unstructured raw data to a structured data warehouse without having to worry about running servers:

1. Data scientists, data analysts, and data engineers log in to Dataproc Hub, which they use to spawn a personalized Dataproc cluster running a JupyterLab interface backed by Apache Spark processing.
2. Unstructured raw data on Cloud Storage is analyzed, interpreted, and structured.
3. Metadata about how to interpret the Cloud Storage objects as structured tables is stored in Dataproc Metastore, allowing the personalized Dataproc cluster to be terminated without losing the metadata.
4. Data Fusion’s Hive connector uses the table created in the notebook as a data source via the thrift URL provided by Dataproc Metastore.
5. Data Fusion reads the Cloud Storage data according to the structure provided by Dataproc Metastore. The data is harmonized with other data sources into a data warehouse table.
6. The refined data table is written to BigQuery, Google Cloud’s serverless data warehouse.
7. BigQuery tables are made available to Apache Spark on Jupyter notebooks for further data lake queries and analysis with the Apache Spark BigQuery connector.
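The notebook side of steps 1 through 3 and step 7 might look like the following sketch. All resource names (bucket, database, table, and BigQuery dataset) are hypothetical, and the final read assumes the open source Apache Spark BigQuery connector is available on the cluster’s classpath.

```python
# Sketch of the notebook side of the workflow above (hypothetical names).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("datalake-notebook")
    .enableHiveSupport()   # persist table metadata in the attached metastore
    .getOrCreate()
)

# Step 2: read raw objects from Cloud Storage and give them structure.
raw = spark.read.json("gs://example-datalake/raw/events/")
events = raw.selectExpr(
    "userId AS user_id",
    "page AS url",
    "CAST(ts AS TIMESTAMP) AS event_ts",
)

# Step 3: register that structure in Dataproc Metastore. Specifying a Cloud
# Storage path keeps the data off-cluster, so the personalized cluster can be
# deleted without losing either the data or the table definition. Data
# Fusion's Hive plugin can then discover the table via the metastore.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
(events.write
    .mode("overwrite")
    .option("path", "gs://example-datalake/curated/events/")
    .saveAsTable("analytics.events"))

# Step 7: after Data Fusion lands the refined table in BigQuery, read it back
# with the Spark BigQuery connector for further analysis.
refined = (
    spark.read.format("bigquery")
    .option("table", "my-project.warehouse.events_refined")
    .load()
)
refined.groupBy("url").count().show()
```

Steps 4 through 6 happen entirely inside Data Fusion’s code-free pipeline designer; the notebook and the pipeline share nothing but the metastore and the warehouse.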
Partner ecosystem accelerates Dataproc Metastore deployments across multi-cloud and hybrid data lakes

At Google, we believe in an open cloud, and Dataproc Metastore is built with our leading open source-centric partners in mind. Because Dataproc Metastore is compatible with the open source Apache Hive Metastore, you can integrate Google Cloud partner services into your hybrid data lake architectures without giving up metadata interoperability. Google Cloud-native services and open source applications can work in tandem.

Collibra provides hybrid data lake visibility with Dataproc Metastore

Integrating Dataproc Metastore with Collibra Data Catalog provides enterprise-wide visibility across on-prem and cloud data lakes. Since Dataproc Metastore is built on top of the Hive Metastore, Collibra could quickly integrate with the solution without having to worry about proprietary data formats or APIs. “Dataproc Metastore provides a fully managed Hive metastore, and Collibra layers on data set discovery and governance, which is critical for any business looking to meet the strictest internal and external compliance standards,” says Chandra Papudesu, VP product management, Catalog and Lineage for Collibra.

Qubole provides a single view of metadata across data lakes

Qubole’s open data lake platform provides end-to-end data lake services, such as continuous data engineering, financial governance, analytics, and machine learning with near-zero administration on any cloud. As enterprises continue to execute a multi-cloud strategy with Qubole, it’s critical to have one centralized view of your metadata for data discovery and governance. “Qubole’s co-founders led the Apache Hive project, which has spawned many impactful projects and contributors globally,” said Anita Thomas, director of product management at Qubole. “Qubole’s platform has used a Hive metastore since its inception, and now with Google’s launch of an open metastore service, our joint customers have multiple options to deploy a fully managed, central metadata catalog for their machine learning, ad hoc, or streaming analytics applications.”

Pricing

During the alpha phase, you will not be charged for testing the service. However, under NDA, you can receive a tentative price list to evaluate the value of Dataproc Metastore against the proposed fees. Sign up for the alpha testing program for Dataproc Metastore now.
Source: Google Cloud Platform

Google Cloud VMware Engine is now generally available

Let’s face it: bringing workloads to the public cloud isn’t always easy. And if you want to take full advantage of the elasticity, economics, and innovation of the cloud, you usually have to write a new application. But that isn’t always an option, especially for existing applications, which may come from a third party or have been written years ago. Compounding the challenge of rewriting those applications for the cloud is how you manage the application after you rebuild it—how you protect it from failures, monitor it, secure it, and so on. For many existing applications, this is done on a platform such as VMware®. So the question becomes: how can these critical applications take advantage of the cloud when you don’t have a clear path to rearchitecting them outright?

Google Cloud VMware Engine now generally available

Today, we’re happy to announce that Google Cloud VMware Engine is generally available, enabling you to seamlessly migrate your existing VMware-based applications to Google Cloud without refactoring or rewriting them. You can run the service in the us-east4 (Ashburn, Northern Virginia) and us-west2 (Los Angeles, California) regions, and we will expand into other Google Cloud regions around the world in the second half of the year.

Google Cloud VMware Engine provides everything you need to run your VMware environment natively in Google Cloud. The service delivers a fully managed VMware Cloud Foundation hybrid cloud platform, including the VMware technologies vSphere, vCenter, vSAN, NSX-T, and HCX, in a dedicated environment on Google Cloud’s high-performance, reliable infrastructure to support your enterprise production workloads.

With this service, you can extend or bring your on-premises workloads to Google Cloud in minutes—and without changes—by connecting to a dedicated VMware environment. Google Cloud VMware Engine is a first-party offering, fully owned, operated, and supported by Google Cloud, that lets you seamlessly migrate to the cloud without the cost or complexity of refactoring applications, and manage workloads consistently with your on-prem environment. You reduce your operational burden by moving to an on-demand, self-service model while maintaining continuity with your existing tools, processes, and skill sets, and you can take advantage of Google Cloud services to supercharge your VMware environment.

Google Cloud VMware Engine is a unique solution for running VMware environments in the cloud, with four areas that provide a differentiated experience: a) user experience, b) enterprise-grade infrastructure, c) integrated networking, and d) a rich services ecosystem. Let’s take a closer look.

A simple user experience

Launching a fully functional instance of Google Cloud VMware Engine is easy—all it takes is four clicks from the Google Cloud Console. Within a few minutes, you get a new environment, ready to consume. Compare that to the days and weeks it takes to design a new on-prem data center and to order hardware and software, rack, stack, cable, and configure the infrastructure. Not only that, but once the environment is live, you can expand or shrink it at the click of a button. To further simplify the experience, you can provision VMware environments using your existing Google Cloud identities. You also receive integrated support from Google Cloud—a one-stop shop for all support issues, whether in VMware or the rest of Google Cloud. The service is fully VMware certified and verified, and VMware’s support is fully integrated with Google Cloud support for a seamless experience.
Consumption associated with the service is available in the standard billing views in the Google Cloud Console. And when you need to use native VMware tools, simply log in to the familiar vCenter interface and manage and monitor your VMware environment as you normally would.

Dedicated, enterprise-grade infrastructure

Google Cloud VMware Engine is built on high-performance, reliable, high-capacity infrastructure, giving you a fast and highly available VMware experience at a low cost. The environment includes:

- Fully redundant, dedicated 100 Gbps networking, providing 99.99% availability, low latency, and high throughput to meet the needs of your most demanding enterprise workloads.
- Hyperconverged storage via the VMware vSAN stack on high-end, all-flash NVMe devices, enabling blazing-fast performance with the scale, availability, reliability, and redundancy of a distributed storage system.
- Recent-generation CPUs (2nd Generation Intel Xeon Scalable processors) delivering high compute performance (2.6 GHz base, 3.9 GHz burst), with 768 GB of RAM and 19.2 TB of raw data capacity per node. Because VMware allows compute over-provisioning, workloads in existing environments are often memory- or storage-constrained; the larger memory and storage capacity in Google Cloud VMware Engine nodes lets you deploy more workload VMs per node, lowering your overall cost.

The compute and storage infrastructure is single tenant—not shared with any other customer. The networking bandwidth to other hosts in a VMware vSphere cluster is also dedicated. This means you get not only the privacy and security of a dedicated environment, but also highly predictable levels of performance.

Integrated cloud networking

VMware environments in Google Cloud VMware Engine are configured directly on VPC subnets. This means you can use standard mechanisms such as Cloud Interconnect and Cloud VPN to connect to the service, just as you would to any other service in Google Cloud, eliminating the need to establish additional, expensive, bandwidth-limited connectivity.

You also get direct, private, layer 3 networking access to workloads and services running on Google Cloud. You can connect between workloads in VMware and other services in Google Cloud over high-speed, low-latency connections, using private addresses. This provides faster access and higher levels of security for a wide variety of use cases such as hybrid applications, backup, and centralized performance management. By eliminating a lot of networking complexity, you get a seamless, secure experience that is integrated with Google Cloud.

A rich services ecosystem

VMware users value the platform not only for its native capabilities but also for its rich third-party ecosystem covering disaster recovery, backup, monitoring, security, and virtually any other IT need. Because the service provides a native VMware platform, you can continue to use those tools with no changes.

In Google Cloud VMware Engine, we have built unique capabilities to enable ecosystem tools. By elevating system privileges, you can install and configure third-party tools as you would on-prem. Third parties such as Zerto are taking advantage of this integration for mission-critical use cases such as disaster recovery. You can also benefit from native Google Cloud services and our ecosystem partners alongside your VMware-based applications.
For instance, you can use Cloud Storage with a third-party data protection tool from companies such as Veeam, Dell, Cohesity, and Actifio to get a variety of availability and cost options for your backups. You can run third-party KMS tools externally and independently in your Compute Engine VMs to encrypt at-rest storage, making your environment even more secure.

And then there are the native Google Cloud services. With your VMware-based databases and applications running inside Google Cloud VMware Engine, you can now manage them alongside your cloud-native workloads with Google Cloud’s operations suite (formerly Stackdriver). You can interoperate VMware workloads with services such as Google Kubernetes Engine and Cloud Functions. You can use third-party solutions such as NetApp Cloud Volumes for extended VMware storage needs. And you can take advantage of the privacy and performance of Google Cloud VMware Engine to run cloud-native workloads directly next to your VMware workloads, with the help of Anthos deployed directly inside the service. Or supercharge analytics of your VMware data sources with BigQuery and make your applications more intelligent with AI and machine learning services.

Moving to the cloud doesn’t have to be hard. By migrating your VMware platform to Google Cloud, you can keep what you like about your on-prem application environment and tap into next-generation hardware and application services. To learn more about Google Cloud VMware Engine, check out our Getting Started guide, and be sure to watch our upcoming Google Cloud Next ‘20: OnAir session, Introducing Google Cloud VMware Engine, during the week of July 27.
Source: Google Cloud Platform