New Compute Engine A2 VMs—first NVIDIA Ampere A100 GPUs in the cloud

Machine learning and HPC applications can never get too much compute performance at a good price. Today, we’re excited to introduce the Accelerator-Optimized VM (A2) family on Google Compute Engine, based on the NVIDIA Ampere A100 Tensor Core GPU. With up to 16 GPUs in a single VM, A2 VMs are the first A100-based offering in the public cloud, and are available now via our private alpha program, with public availability later this year.

Accelerator-Optimized VMs with NVIDIA Ampere A100 GPUs

The A2 VM family was designed to meet today’s most demanding applications—workloads like CUDA-enabled machine learning (ML) training and inference, and high performance computing (HPC). Each A100 GPU offers up to 20x the compute performance of the previous-generation GPU and comes with 40 GB of high-performance HBM2 GPU memory. To speed up multi-GPU workloads, A2 VMs use NVIDIA’s HGX A100 systems to offer high-speed NVLink GPU-to-GPU bandwidth of up to 600 GB/s. A2 VMs come with up to 96 Intel Cascade Lake vCPUs, optional Local SSD for workloads requiring faster data feeds into the GPUs, and up to 100 Gbps of networking. Additionally, A2 VMs provide full vNUMA transparency into the architecture of the underlying GPU server platforms, enabling advanced performance tuning.

A whopping 16 GPUs per VM

For some demanding workloads, the bigger the machine, the better. For those, we have the a2-megagpu-16g instance with 16 A100 GPUs, offering a total of 640 GB of GPU memory and an effective performance of up to 10 petaFLOPS of FP16 or 20 petaOPS of INT8 in a single VM when using the new sparsity feature. To maximize performance and support the largest datasets, the instance comes with 1.3 TB of system memory and an all-to-all NVLink topology with aggregate NVLink bandwidth of up to 9.6 TB/s.
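As a rough sketch of what provisioning one of these shapes could look like once A2 VMs are broadly available (the instance name, zone, and image family below are placeholders, not confirmed values):

```shell
# Hypothetical example: request a 16-GPU A2 VM with the gcloud CLI.
# Requires an authenticated gcloud setup and access to the A2 alpha.
gcloud compute instances create my-a2-vm \
    --zone us-central1-a \
    --machine-type a2-megagpu-16g \
    --image-family common-cu110 \
    --image-project deeplearning-platform-release
```

The machine type matches the a2-megagpu-16g shape described above; smaller A2 shapes would substitute their own machine-type names.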
We look forward to seeing how you use this infrastructure for your compute-intensive projects.

Compute Engine’s new A2-MegaGPU VM: 16 A100 GPUs with up to 9.6 TB/s NVLink bandwidth

Of course, A2 VMs are available in smaller configurations as well, allowing you to match your application’s needs for GPU compute power. The A2 family of VMs comes in two different CPU- and networking-to-GPU ratios, allowing you to choose the preprocessing and multi-VM networking performance best suited to your application.

Available A2 VM shapes

NVIDIA’s new Ampere architecture

The new Ampere GPU architecture powering our A2 instances features several innovations that immediately benefit many ML and HPC workloads. The A100’s new Tensor Float 32 (TF32) format provides a 10x speed improvement over the FP32 performance of the previous-generation Volta V100. The A100 also has enhanced 16-bit math capabilities, supporting both FP16 and bfloat16 (BF16) at double the rate of TF32. INT8, INT4, and INT1 tensor operations are now supported as well, making the A100 an equally excellent option for inference workloads. Also, the A100’s new Sparse Tensor Core instructions allow skipping compute on entries with zero values, doubling Tensor Core compute throughput for INT8, FP16, BF16, and TF32. Lastly, the Multi-Instance GPU (MIG) feature allows each GPU to be partitioned into as many as seven GPU instances that are fully isolated from a performance and fault-isolation perspective. Altogether, each A100 offers significantly more performance, increased memory, very flexible precision support, and increased process isolation for running multiple workloads on a single GPU.

Getting started

We want to make it easy for you to start using the A2 VM shapes with A100 GPUs. You can get started quickly on Compute Engine with our Deep Learning VM images, which come preconfigured with everything you need to run high-performance workloads.
In addition, A100 support will be coming shortly to Google Kubernetes Engine (GKE), Cloud AI Platform, and other Google Cloud services. To learn more about the A2 VM family and request access to our alpha, either contact your sales team or sign up here. Public availability and pricing information will come later in the year.
Source: Google Cloud Platform

Enhancing multi-cloud data governance on Google Cloud

Data governance is an essential part of managing your cloud infrastructure, particularly if you’re taking advantage of multiple cloud providers. In many industries, you need to show where data has been stored, and how it’s been used, to meet regulations. In addition, using access controls and other data governance tools helps ensure that only those who need to see certain data are able to.

Google’s data governance security primitives are built into our data warehousing service, BigQuery. For example, fine-grained access controls at the column level provide the ability to control data by assigning a policy tag based on the nature of the data itself (e.g., personally identifiable information, or PII) and control it across multiple data containers (tables, datasets, and more). Furthermore, with BigQuery table-level ACLs, it’s possible to assign permissions to a table-sized data container.

We’re also now partnering with Collibra to offer a cloud-agnostic, source-agnostic data governance solution. Collibra’s technology will directly interface with Google Cloud security primitives, allowing your data governance policies to be natively enforced as direct column- and table-security elements at the storage layer. In addition, this serves as an independent control plane to provide visibility into data outside of Google Cloud.

Integrations down to the column level of your data

Collibra is deploying its native SaaS managed service on Google Cloud, to be available within our console by the end of the year. Over the past two years, Google has been bringing more and more of its internal data governance practices to Google Cloud, particularly to its data warehousing solution, BigQuery.
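As an illustrative sketch of table-level access control, the snippet below builds a single IAM binding for a table. The project, dataset, table, and member names are hypothetical, and the commented-out google-cloud-bigquery calls require an authenticated environment:

```python
# Hypothetical sketch: grant one user read access to one BigQuery table.
# The table path and member are placeholders, not real resources.
table_id = "my-project.clinical_data.patient_labs"
binding = {
    "role": "roles/bigquery.dataViewer",
    "members": ["user:analyst@example.com"],
}

# With the google-cloud-bigquery package installed and credentials configured:
# from google.cloud import bigquery
# client = bigquery.Client()
# policy = client.get_iam_policy(table_id)
# policy.bindings.append(binding)
# client.set_iam_policy(table_id, policy)
```

Column-level control works differently: policy tags are defined in Data Catalog taxonomies and attached to column schemas, then access is granted on the tag rather than the table.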
As we introduce additional security primitives, Collibra plans to enable them in the policy provisioning of its own platform for increased data security, governance, and discoverability.

Solving for multi-cloud data governance journeys

With our joint roadmap, you can take advantage of a multi-cloud management plane for data governance, which includes a data acquisition workflow, lineage tracking, and the ability to maintain an enterprise dictionary, as well as access provisioning and enforcement of access through BigQuery. For example, you might request access to a data mart in Collibra’s Data Catalog, identify the relevant data containers in BigQuery, provision secure access to that data mart from BigQuery, then execute high-performing queries with a detailed audit log. You can ensure that data is protected and save time accessing data across multiple clouds.

Customers including ATB Financial have used Google Cloud and Collibra together to enable users to access a single view of data from anywhere in their organizations. Get more details in this recent session: How Google Cloud is bringing decades of Google’s data governance and security practices to the enterprise.

Upcoming virtual events to learn more

Check out this upcoming webinar with our joint customer: How ATB Financial built consistent governance across a hybrid data lake with Google Cloud and Collibra. Finally, join our breakout session at Google Cloud Next ‘20: OnAir, where we’ll explore best practices around data governance.
Source: Google Cloud Platform

Including everyone at Google Cloud Next ’20: OnAir

This year, amidst the ongoing global pandemic, we are reimagining our Google Cloud Next ‘20 event to connect our cloud community digitally. We’re continuing to make our physical events diverse and inclusive, and are infusing these values into our digital events too—critically important to ensuring everyone feels welcome and included as we build together.

Our Diversity, Equity, and Inclusion (DEI) track shares knowledge and creates space within our broader cloud community about the role we can all play in making sure we are building for everyone. You’ll hear about how we’re learning and advancing belonging at Google, which includes our work on equity engineering and product inclusion. We hope this helps you get started or adds to your allyship journey. We all continue to make progress by having these conversations, and we encourage you to check out the sessions below. You can add any to your playlist by viewing our program track, and watch for their weekly release, starting on July 14. In addition to these sessions, you can join us for our interactive I am Remarkable workshops, which empower women and other underrepresented groups to celebrate their achievements in the workplace and beyond.

Sessions include:

Google’s Diversity Strategy and How It Works

Join Google’s Chief Diversity Officer Melonie Parker to see how Google is continuing to build a workforce that reflects all communities, and how the Diversity Annual Report reflects that. You’ll hear about efforts to better understand our global workforce, build a sense of belonging, and tackle challenges to advancing DEI.

G Suite Accessibility Features To Empower Inclusion

When we’re talking about connecting digitally, we need to make sure that everyone can connect. Emails, calendar invites, video conferences, presentations, documents, and spreadsheets are all important digital connection options.
G Suite’s accessibility features are designed so that anyone can use the tools to get more done, inclusive of those who have audio, visual, or motor impairments. Check out this session to get an overview of those features and how to use them on mobile or web.

Equity Engineering—Impact and Opportunity

The lack of diversity in tech is a complex and pervasive challenge. Equity Engineering offers an opportunity to identify greater systemic organizational issues with both people and product development. This session explores what Equity Engineering is, how to build and develop it, and the impact it can have in increasing equity and systemic transformation. Head of Equity Engineering Demma Rosa Rodriguez will take you through how this engineering initiative evaluates how systems can support equitable HR processes and build the best products for a diverse workforce and world, and what parts research, infrastructure, process change, and centers of excellence play.

How Certification Impacted My Career

Becoming Google Cloud-certified has the power to boost careers, and the experience is unique for each person. Solution Engineer Jewel Langevine, who has three certification badges, will share her path to certification and how it plays a role in her career. During this session, Jewel will share insights from her journey from an upbringing in Guyana to her present position as a Solution Engineer at Google Cloud. Along the way, she’ll discuss how she was introduced to cloud computing, her experiences in mentorship, how she maximized networking opportunities, and how she continues to give back to underrepresented communities.

The Case for Product Inclusion 2.0

Historically, diversity, equity, and inclusion has been focused internally, but over the past three years, Googlers have been expanding their DEI practices throughout the product design process to create better products for all users.
Paying attention to the connection between people, process, and product has led to better user outcomes and more business opportunities. In a world where demographics are shifting rapidly and consumers have a myriad of choices, how do companies keep up with diverse users and truly build for all? Annie Jean-Baptiste, Google’s Head of Product Inclusion, will share details about Google’s product inclusion journey and the end-to-end system.

Empowering Inclusion with Employee Resource Groups (ERGs)

The unprecedented shift in our workplaces due to the COVID-19 pandemic has a lot of us searching for connections amidst the new work-from-home culture of our companies. Sherice Torres, Google’s Director of Inclusion, shares how to foster community and belonging through Employee Resource Groups (ERGs). Learn how ERGs uplevel inclusion beyond their own members and how they influence Google’s DEI strategy and accountability. This session will also profile and share learnings from Women@GoogleCloud. Find out how an underrepresented group of women created a global network with allies passionate about cultivating a culture for women to thrive and bring their whole selves to work.

Encoding Gender into Technical Artifacts Such as Emoji

Sometimes at engineering-driven companies, there can be a preconceived notion that there is a right or wrong way of designing. This talk will explore how the emoji program operates in the spectrum between this false binary. After all, if race is not a skin color and gender is not a haircut, how do you communicate the idea of “woman” at emoji sizes? You’ll hear how Google’s emoji team uses a blend of academic research and quantitative data to inform design practices and product decisions, and see how we build technology in an inclusive way that reflects a variety of communication needs.

Thanks for building a more inclusive cloud with us. We look forward to continuing our allyship and advocacy journey with all of you at Next OnAir, starting July 14.
Source: Google Cloud Platform

Supporting federal agency compliance during the pandemic (and beyond)

If digital transformation was only a trend a few years ago, it’s now quickly becoming a reality for many federal government agencies. The COVID-19 pandemic has pushed all kinds of agencies to reconsider the timelines and impact of their digital initiatives, whether this means moving core technology infrastructure to the cloud, rolling out more modern productivity tools for government employees, or using artificial intelligence to better deliver citizen services.

At Google Cloud, we continue to help federal agencies of all sizes tackle their trickiest problems as they rapidly transform and digitize. At the same time—building on our FedRAMP High authorization announcement from last year—we’re committed to pursuing the latest government certifications, such as the Department of Defense’s (DoD) upcoming Cybersecurity Maturity Model Certification (CMMC), to ensure federal agencies and the vendors that work with them are fully compliant.

Applying intelligent automation to assist the U.S. Patent and Trademark Office

Recently, Accenture Federal Services (AFS) was awarded a position on the U.S. Patent and Trademark Office (USPTO) Intelligent Automation and Innovation Support Services (IAISS) blanket purchase agreement (BPA), a multi-contract vehicle. The five-year BPA includes piloting, testing, and implementing advanced technologies, including intelligent automation, artificial intelligence (AI), microservices, machine learning, natural language processing, robotic process automation, and blockchain. The goal of IAISS is to transform business processes and enhance mission delivery, and it’s expected to be a model for the federal government nationwide. AFS and Google Cloud previously worked with the USPTO to help the agency’s more than 9,000 patent examiners rapidly perform more thorough searches by augmenting their on-premises search tools with Google’s AI.
The new solution—created by merging Google’s machine learning models with Accenture’s design, prototyping, and data science capabilities—helps extend examiners’ expertise during the patent search process.

Supporting secure cloud management at the Defense Innovation Unit

We also recently announced that Google Cloud was chosen by the Defense Innovation Unit (DIU)—an organization within the Department of Defense (DoD) focused on scaling commercial technology across the DoD—to build a secure cloud management solution to detect, protect against, and respond to cyber threats worldwide. The multi-cloud solution will be built on Anthos, Google Cloud’s app modernization platform, allowing DIU to prototype web services and applications across Google Cloud, Amazon Web Services, and Microsoft Azure—while being centrally managed from the Google Cloud Console. The solution will provide real-time network monitoring, access control, and full audit trails, enabling DIU to maintain its strict cloud security posture without compromising speed and reliability. As a pioneer in zero-trust security and in deploying innovative approaches to protect and secure networks, we’re looking forward to partnering with DIU on this critical initiative.

Supporting Cybersecurity Maturity Model Certification (CMMC) readiness

Finally, while COVID-19 has driven a lot of how federal agencies are working day-to-day, the need for strong cybersecurity protections is as important as ever. At Google Cloud, meeting the highest standards for cybersecurity in the ever-evolving threat and regulatory landscape is one of our primary goals. In January of this year, the DoD published the Cybersecurity Maturity Model Certification (CMMC), a new standard designed to ensure cyber hygiene throughout the DoD supply chain. While the CMMC standard is not yet operational, the CMMC Advisory Board has advised cloud providers to conduct gap analysis against NIST SP 800-53, NIST SP 800-171, and preliminary versions of the CMMC requirements.
We’ve contracted with a third-party assessor to perform preliminary analyses of Google Cloud against the underlying CMMC controls, and we’re confident we’ll be able to meet the currently proposed controls—and to provide our customers with the right guidance to empower them in their CMMC journeys. For questions about Google’s existing compliance offerings, FedRAMP, or the CMMC, please contact Google Cloud sales. You can also visit our Compliance Resource Center and Government and Public Sector Compliance page to learn more about how we support your specific compliance needs. And to read more about our work with the public sector, including how we’re helping support agencies through the pandemic, visit our website.
Source: Google Cloud Platform

Grow your cloud career with high-growth jobs and skill badges

Cloud computing and data skills are especially in demand, as organizations are increasingly turning to digital solutions to transform the way they work and do business. The World Economic Forum predicts there will be close to a 30 percent increase in demand for data, AI, engineering, and cloud computing roles by 2022. Since April, Google Cloud learners have more than doubled year-over-year1. Of those who have started learning with us in 2020, many are looking to upskill or reskill into stable, well-paying career paths.

To help our expanding community of learners ramp up quickly in their cloud careers, Google Cloud is unveiling a new Grow your cloud career webpage where you can find information on in-demand cloud career paths and free upskilling and reskilling resources. You can earn your first Google Cloud skill badges for your resume, which signify to employers that you have hands-on Google Cloud experience. We also have a special no-cost learning section for small business leaders to help you build your first website and transform your business with data and AI. If you’re not sure which cloud role is right for you, we recommend exploring these three high-growth career paths.

Data Analyst

By 2025, an estimated 463 exabytes of data is expected to be generated every day. From online purchases, to personal health trackers, to smart factories, and more, the world generates massive amounts of data, but without Data Analysts this data is meaningless. Data Analysts interpret and gather insights from data, enabling better decision making. Their work is instrumental across several industries and for many business functions, including product development, supply chain management, and customer experience. You don’t need a technical background to get started in this role, but you will need to develop foundational skills in SQL (Structured Query Language), data visualization, and data warehousing.
Cloud Engineer

With more than 88 percent of organizations now using cloud and planning to increase their usage, it’s no wonder that the Cloud Engineer role was one of the top in-demand job roles in the U.S. in 2019. Cloud Engineers play a critical role in setting up their company’s infrastructure, deploying applications, and monitoring cloud systems and operations. If you have education or experience in IT, the Cloud Engineer role may be the most natural path for you. It will give you a broad foundation in cloud and expose you to several different functions. Although working in cloud will require a shift in mindset for most people with a traditional IT background, particularly in terms of automated infrastructure, scale, and agile workflows, there are several transferable IT skills that will continue to serve you well in this role.

Cloud Application Developer

For those with a software development background, expanding your skills into cloud development is a must. Cloud offers developers several benefits, including scalability, better security, cost efficiencies, and ease of deployment. As a Cloud Developer, you are responsible for designing, building, testing, deploying, and monitoring highly scalable and reliable cloud-native applications. To upskill into this role, you will need to gain a deep understanding of cloud platforms, databases, and systems integration.

If you’re ready to jumpstart your cloud career, visit our Grow your cloud career page, where you can start upskilling and earning Google Cloud-recognized skill badges for the Data Analyst, Cloud Engineer, or Cloud Developer job roles—get started at no cost here.

1. According to internal data.
Source: Google Cloud Platform

Genomics analysis with Hail, BigQuery, and Dataproc

At Google Cloud, we work with organizations performing large-scale research projects. There are a few solutions we recommend for this type of work, so that researchers can focus on what they do best—powering novel treatments, personalized medicine, and advancements in pharmaceuticals. (Find more details about creating a genomics data analysis architecture in this post.)

Hail is an open source, general-purpose, Python-based data analysis library with additional data types and methods for working with genomic data on top of Apache Spark. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). The Hail team has made their software available to the community under the MIT license, which makes Hail a perfect augmentation to the Google Cloud Life Sciences suite of tools for processing genomics data. Dataproc makes open source data and analytics processing fast, easy, and more secure in the cloud, and offers fully managed Apache Spark, which can accelerate data science with purpose-built clusters.

What makes Google Cloud really stand out from other cloud computing platforms is our healthcare-specific tooling, which makes it easy to merge genomic data with datasets from the rest of the healthcare system. When genotype data is harmonized with phenotype data from electronic health records, device data, medical notes, and medical images, the hypothesis space becomes boundless. In addition, with Google Cloud-based analysis platforms like AI Platform Notebooks and Dataproc Hub, researchers can easily work together using state-of-the-art ML tools and combine datasets in a safe and compliant manner.

Getting started with Hail and Dataproc

As of Hail version 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl, which has a submodule called dataproc for working with Dataproc clusters configured for Hail.
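Assuming a project where gcloud is already configured, the basic workflow reduces to a couple of commands; the cluster name and region below are placeholders:

```shell
# Install Hail, which bundles the hailctl command-line tool,
# then start a Dataproc cluster preconfigured for Hail.
pip install hail
hailctl dataproc start my-hail-cluster --region us-central1
```

When you are done, `hailctl dataproc stop my-hail-cluster` tears the cluster down so billing stops.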
This means getting started with Dataproc and Hail is as easy as going to the Google Cloud console and clicking the Cloud Shell icon at the top of the console window. Cloud Shell provides you with command-line access to your cloud resources directly from your browser, without having to install tools on your system. From this shell, you can quickly install Hail with pip. Once Hail downloads and installs, a single hailctl command creates a Dataproc cluster fully configured for Apache Spark and Hail.

Once the Dataproc cluster is created, click the Open Editor button in Cloud Shell, which takes you to a built-in editor for creating and modifying code. From this editor, choose New File, call the file my-first-hail-job.py, and paste in a short Hail script. By default, Cloud Shell Editor should save this file, but you can also explicitly save it from the menu where the file was created. Once you have verified the file is saved, return to the command-line terminal by clicking the Open Terminal button and submit the job with hailctl. Once the job starts, find the Dataproc section of the Google Cloud Console and review the output of your genomics job from the Jobs tab.

Congratulations, you just ran your first Hail job on Dataproc! For more information, see Using Hail on Google Cloud Platform. Now, we’ll pull Dataproc and Hail into the rest of the clinical data warehouse.

Create a Dataproc Hub environment for Hail

As mentioned earlier, Hail version 0.2.15 pip installations come bundled with hailctl, a command-line tool that has a submodule called dataproc for working with Google Dataproc clusters.
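The short script could be almost anything that exercises Hail; a minimal, hypothetical example might simulate a tiny genotype dataset and report its dimensions (the Hail import only succeeds where Hail is installed, such as on a hailctl-created cluster):

```python
# my-first-hail-job.py -- a minimal, hypothetical Hail job for Dataproc.
try:
    import hail as hl
except ImportError:
    hl = None  # Hail is preinstalled on hailctl-created clusters


def run_demo():
    """Simulate a small genotype dataset and return (n_variants, n_samples)."""
    hl.init()
    mt = hl.balding_nichols_model(n_populations=3, n_samples=50, n_variants=100)
    return mt.count()


if hl is not None:
    print(run_demo())
```

From Cloud Shell, a job like this would be submitted with something along the lines of `hailctl dataproc submit <cluster-name> my-first-hail-job.py`.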
This includes a fully configured notebook environment that hailctl can start for you. However, to take advantage of notebook features specific to Dataproc, including the use of Dataproc Hub, you will need to use a Dataproc initialization action that provides a standalone version of Hail without the Hail-provided notebook. Create a Dataproc cluster that provides Hail from within Dataproc’s JupyterLab environment using that initialization action. Once the cluster has been created (as indicated by the green check mark), click on the cluster name, choose the Web Interfaces tab, and click the component gateway link for JupyterLab. From within the Jupyter IDE, you should have a kernel and console for Hail. This running cluster can easily be translated into a Dataproc Hub configuration by running the Dataproc clusters export command.

Use Dataproc Hub and BigQuery to analyze genomics data

Now that the Jupyter notebook environment is configured with Hail for Dataproc, let’s take a quick survey of the ways we can interact with the BigQuery genotype and phenotype data stored in the insights zone.

Using BigQuery magic to query data into Pandas

It is possible to run a GWAS study directly in BigQuery by using SQL logic to push the processing down into BigQuery. Then, you can bring just the query results back into a Pandas dataframe that can be visualized and presented in a notebook. From a Dataproc Jupyter notebook, you can run BigQuery SQL and get the results back in a Pandas dataframe simply by adding the bigquery magic command to the start of the notebook cell. Find an example of a GWAS analysis performed in BigQuery with a notebook in this tutorial. BigQuery ML, a feature of BigQuery, provides the ability to run basic regression techniques and K-means clustering using standard SQL queries.
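A sketch of the pattern follows; the public table path below follows BigQuery's public-dataset naming but may differ from the current name, and running the query requires the google-cloud-bigquery package and credentials:

```python
# Count variants per chromosome in the public 1000 Genomes variant data.
# The table path is an assumption and may differ from the current public name.
query = """
SELECT reference_name, COUNT(1) AS variant_count
FROM `bigquery-public-data.human_genome_variants.1000_genomes_phase_3_variants_20150220`
GROUP BY reference_name
ORDER BY variant_count DESC
"""

# In a notebook cell, prefix the SQL with %%bigquery to get a dataframe back;
# from a script, use the client library instead (requires credentials):
# from google.cloud import bigquery
# df = bigquery.Client().query(query).to_dataframe()
```

Either route ends with an ordinary Pandas dataframe, so the rest of the analysis is standard Python.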
More commonly, BigQuery is used for preliminary steps in GWAS/PheWAS: feature engineering, defining cohorts of data, and running descriptive analysis to understand the data. Let’s look at some descriptive statistics using the 1000 Genomes variant data hosted by BigQuery public datasets. Say you wanted to understand what SNP data is available in chromosome 12 from the 1000 Genomes Project; in a Jupyter cell, you would run a query against the public variant table, filtered to that chromosome. Such a query populates a Pandas dataframe with basic information about the 1000 Genomes samples in the cohort, and you can then run standard Python and Pandas functions to review, plot, and understand the data available for this cohort. For more on writing SQL queries against tables that use the variant format, see the Advanced guide to analyzing variants using BigQuery.

Using the Spark to BigQuery connector to work with BigQuery storage directly in Apache Spark

When you need to process large volumes of genomic data for population studies and want to use generic classification and regression algorithms like Random Forest, Naive Bayes, or Gradient Boosted Trees, or you need help extracting or transforming features with algorithms like PCA or One-Hot Encoding, Apache Spark offers these ML capabilities, among many others. Using the Apache Spark BigQuery connector from Dataproc, you can now treat BigQuery as another source to read and write data from Apache Spark.
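A sketch of reading a BigQuery table from PySpark via the connector, where the table name is a placeholder and the code assumes the spark-bigquery connector is available on the cluster, as it is on Dataproc:

```python
# Load a BigQuery table as a Spark dataframe through the connector.
try:
    from pyspark.sql import SparkSession
except ImportError:
    SparkSession = None  # PySpark is present on the Dataproc cluster

BQ_TABLE = "my-project.genomics.cohort_variants"  # placeholder table name


def read_variants(spark):
    """Read the table via the BigQuery storage connector."""
    return spark.read.format("bigquery").option("table", BQ_TABLE).load()


# On the cluster:
# spark = SparkSession.builder.appName("genomics-bq").getOrCreate()
# read_variants(spark).printSchema()
```

Writes go the same way, with `df.write.format("bigquery")` plus the connector's output options.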
Reading and writing this way works nearly the same as any other Spark dataframe setup. Learn more here about the Apache Spark to BigQuery storage integration and how to get started.

Run Variant Transforms to convert BigQuery data into VCF for genomics tools like Hail

When you want to do genomics-specific tasks, Hail can provide a layer on top of Spark that can be used to:

- Generate variant and sample annotations
- Understand Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples, and compute sample scores and variant loadings using PCA
- Perform variant, gene-burden, and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability

Hail expects the data format to start with either VCF, BGEN, or PLINK. Luckily, BigQuery genomics data can easily be converted from the BigQuery variant format into a VCF file using Variant Transforms. Once you create the VCF on Cloud Storage, call Hail’s import_vcf function, which transforms the file into Hail’s matrix table format.

To learn more about scalable genomics analysis with Hail, check out this YouTube series on Hail from the Broad Institute.
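Once Variant Transforms has written a VCF to Cloud Storage, the Hail side is a short script. A sketch, in which the bucket paths and reference genome are placeholders and the Hail calls assume a Hail-enabled cluster:

```python
try:
    import hail as hl
except ImportError:
    hl = None  # Hail is available on hailctl-created Dataproc clusters

VCF_PATH = "gs://my-bucket/exports/cohort.vcf.bgz"  # placeholder path
MT_PATH = "gs://my-bucket/exports/cohort.mt"        # placeholder path


def vcf_to_matrix_table():
    """Import the exported VCF and persist it as a Hail matrix table."""
    mt = hl.import_vcf(VCF_PATH, reference_genome="GRCh37")
    mt.write(MT_PATH, overwrite=True)
    return mt


if hl is not None:
    vcf_to_matrix_table()
```

Writing the matrix table back to Cloud Storage means later analyses can skip the VCF import step entirely.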
Source: Google Cloud Platform

11 best practices for operational efficiency and cost reduction with Google Cloud

As businesses consider the road ahead, many are finding they need to make tough decisions about what projects to prioritize and how to allocate resources. For many, the impact of COVID-19 has brought the benefits and limitations of their IT environment into focus. As these businesses plan their way forward, many will need to consider how to meet the needs of their new business realities with limited resources.

This is a challenge ideally suited for IT, particularly for any business overly reliant on legacy infrastructure. A recent McKinsey study found that these legacy systems account for 74% of a company’s IT spend while hampering agility at the same time. Making fundamental IT changes like migrating on-premises workloads to the cloud can reduce costs, increase agility, and pay ROI dividends down the line.

All of this is covered in our new eGuide, Solving for operational efficiency with Google Cloud. While modernization will look different for businesses of varying sizes and in varying industries, the benefits of moving to the cloud are broad and universal.
These include:

- Increasing agility and reducing IT costs by adopting hybrid and multi-cloud strategies.
- Driving a higher return on ERP investments by migrating SAP systems to the cloud.
- Avoiding costly hardware refreshes and reducing on-premises infrastructure costs by migrating VMs to the cloud.
- Increasing scalability and gaining access to advanced analytics through data warehouse modernization.
- Making cluster management easier and more cost-effective by migrating on-premises Apache Hadoop clusters to the cloud.
- Gaining cost efficiencies by running specialized workloads in the cloud with a scalable Bare Metal Solution.
- Increasing flexibility and decreasing on-premises investments by migrating Windows workloads to the cloud.
- Embracing a modern architecture for scalability and cost efficiencies by offloading a mainframe environment.
- Leveraging AI to rapidly respond to customer needs and improve customer experience.
- Gaining more visibility and control to lower costs with billing and cost management tools.
- Improving productivity by transforming the way teams work together with cloud-native collaboration tools.

With current business conditions, organizations need facts, knowledge, and best practices so they can prioritize investments and optimize costs. Our eGuide provides an overview of the key areas where we see our customers prioritizing their investments and creating operational efficiencies, and highlights the many ways Google Cloud can support you in your journey. Read the eGuide. If you want more customized recommendations, take advantage of our IT Cost Assessment program, which will analyze your individual IT spend against industry benchmark data and provide you with a view of cost optimization opportunities. Learn more here.
Source: Google Cloud Platform

Building a genomics analysis architecture with Hail, BigQuery, and Dataproc

We hear from our users in the scientific community that having the right technology foundation is essential. The ability to very quickly create entire clusters for genomics processing, where billing can be stopped once you have the results you need, is a powerful tool. It empowers the scientific community to spend more time doing research and less time fighting for on-prem cluster time and configuring software.

At Google Cloud, we've developed healthcare-specific tooling that makes it easy for researchers to look at healthcare and genomic data holistically. Combining genotype data with phenotype data from electronic health records (EHRs), device data, medical notes, and medical images makes scientific hypotheses limitless. And our analysis platforms like AI Platform Notebooks and Dataproc Hub let researchers easily work together using state-of-the-art ML tools and combine datasets in a safe and compliant manner.

Building an analytics architecture for genomic association studies

Genome-wide association studies (GWAS) are one of the most prevalent ways to study which genetic variants are associated with a human trait, otherwise known as a phenotype. Understanding the relationships between our genetic differences and phenotypes such as diseases and immunity is key to unlocking medical understanding and treatment options. Historically, GWAS were limited to phenotypes gathered during a research study. These studies were typically siloed, separate from day-to-day clinical data. However, the increased use of EHRs for data collection, coupled with natural language processing (NLP) advances that unlock the data in medical notes, has created an explosion of phenotype data available for research. In fact, phenome-wide association studies (PheWAS) are gaining traction as a complementary way to study the same associations that GWAS provides, but starting from the EHR data. In addition, the amount of genomics data now being created is causing storage bottlenecks.
This is especially relevant as clinical trials move toward the idea of basket trials, where patients are sequenced for hundreds of genes up front, then matched to a clinical trial for a gene variant. While all of this data is a boon for researchers, most organizations are struggling to provide their scientists with a unified platform for analyzing this data in a way that balances respecting patient privacy with sharing data appropriately with other collaborators. Google Cloud's data lake empowers researchers to securely and cost-effectively ingest, store, and analyze large volumes of data across both genotypes and phenotypes. When this data lake infrastructure is combined with healthcare-specific tooling, it's easy to store and translate a variety of healthcare formats, as well as reduce toil and complexity. Researchers can move at the speed of science instead of the speed of legacy IT.

A recent epidemiology study cited BigQuery as a "cloud-based tool to perform GWAS," and suggests that a future direction for PheWAS "would be to extend existing [cloud platform] tools to perform large-scale PheWAS in a more efficient and less time-consuming manner." The architecture we'll describe here offers one possible solution to doing just that.
GWAS/PheWAS architecture on Google Cloud

The goal of the GWAS/PheWAS architecture below is to provide a modern data analytics architecture that will:

- Safely and cost-effectively store a variety of large-scale raw data types, which can be interpreted or feature-engineered differently by scientists depending on their research tasks
- Offer flexibility in analysis tools and technology, so researchers can choose the right tool for the job, across both Google Cloud and open source software
- Accelerate the number of questions asked and increase the amount of scientific research that can be done, by reducing the time scientists and researchers spend implementing and configuring IT environments for their various tools, and by increasing access to compute resources that can be provisioned as needed
- Make it easy to share and collaborate with outside institutions while maintaining control over data security and compliance requirements

Check out full details on our healthcare analytics platform, including a reference architecture. The architecture depicted below represents one of many ways to build a data infrastructure on Google Cloud. The zones noted in the image are logical areas of the platform that make it easier to explain the purpose of each area. These logical zones are not to be confused with Google Cloud's zones, which are physical definitions of where resources are located. This particular architecture is designed to enable data scientists to perform GWAS and PheWAS analysis using Hail, Dataproc Hub, and BigQuery.

Here's more detail on each of the components.

Landing zone

The landing zone, also referred to by some customers as their "raw zone," is where data is ingested in its native format, without transformations or making any assumptions about what questions might be asked of it later. For the most part, Cloud Storage is well suited to serve as the central repository for the landing zone.
It is easy to bring genomic data stored in raw variant call format (VCF) or SAM/BAM/CRAM files into this durable and cost-effective storage. A variety of other sources, such as medical device data, cost analysis, medical billing, registry databases, finance, and clinical application logs, are also well suited for this zone, with the potential to be turned into phenotypes later. Take advantage of storage classes to get low-cost, highly durable storage for infrequently accessed data.

For clinical applications that use the standard healthcare formats of HL7v2, DICOM, and FHIR, the Cloud Healthcare API makes it easy to ingest the data in its native format and tap into additional functionality, such as:

- Automated de-identification
- Direct exposure to the AI Platform for machine learning
- Easy export into BigQuery, our serverless cloud data warehouse

Transformation and harmonization

The goal of this particular architecture is to prepare our data for use in BigQuery. Cloud Data Fusion has a wide range of prebuilt plugins for parsing, formatting, compressing, and converting data. Cloud Data Fusion also includes Wrangler, a visualization tool that interactively filters, cleans, formats, and projects the data based on a small sample (1,000 rows) of the dataset. Cloud Data Fusion generates pipelines that run on Dataproc, making it easy to extend Data Fusion pipelines with additional capabilities from the Apache Spark ecosystem. Fusion can also help track lineage between the landing and refined zones. For a more complete discussion of preparing health data for BigQuery, check out Transforming and harmonizing healthcare data for BigQuery.

Direct export to BigQuery

BigQuery is used as the centerpiece of our refined and insights zones, so many healthcare and life science formats can be directly exported into BigQuery.
For example, a FHIR store can be converted to a BigQuery dataset with a single command line call of gcloud beta healthcare fhir-stores export bq. See this tutorial for more information on ingesting FHIR to BigQuery. When it comes to VCF files, the Variant Transforms tool can load VCF files from Cloud Storage into BigQuery. Under the hood, this tool uses Dataflow, a processing engine that can scale to loading and transforming hundreds of thousands of samples and billions of records. Later in this post, we'll discuss using the Variant Transforms tool to convert data back from BigQuery into VCF.

Refined zone

The refined zone in this genomics analysis architecture contains our structured, yet somewhat disconnected, data. Datasets tend to be associated with specific subject areas but standardized by Cloud Data Fusion to use specific structures (for example, aligned on SNOMED, a single VCF format, unified patient identity, etc.). The idea is to make this zone the source of truth for your tertiary analysis. Since the data is structured, BigQuery can store this data in the refined zone, but also start to expose analysis capabilities, so that:

- Subject matter experts can be given controlled access to the datasets in their area of expertise
- ETL/ELT writers can use standard SQL to join and further normalize tables that combine various subject areas
- Data scientists can run ML and advanced data processing on these refined datasets using Apache Spark on Dataproc via the BigQuery connector with Spark

Insights zone

The insights zone is optimized for analytics and will include the datasets, tables, and views designed for specific GWAS/PheWAS studies. BigQuery authorized views let you share information with specified users and groups without giving them access to the underlying tables (which may be stored in the refined zone). Authorized views are often an ideal way to share data in the insights zone with external collaborators.
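As a hedged sketch of that FHIR export, the full invocation might look like the following. The store, dataset, project, and location names here are placeholders, and flag spellings can differ across gcloud releases, so treat this as illustrative rather than definitive:

```shell
# Hypothetical sketch: export a FHIR store into a BigQuery dataset.
# All resource names below are placeholders for your own project.
gcloud beta healthcare fhir-stores export bq my-fhir-store \
  --dataset=my-healthcare-dataset \
  --location=us-central1 \
  --bq-dataset=bq://my-project.fhir_refined \
  --schema-type=analytics
```

The analytics schema flattens FHIR resources into a shape that is easier to query with standard SQL, which fits the refined-zone role this architecture assigns to BigQuery.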
Keep in mind that BigQuery (in both the insights and refined zones) offers a separation of storage from compute, so you only pay for the processing needed for your study. However, BigQuery still provides many of the data warehouse capabilities that are often needed for a collaborative insights zone, such as managed metadata, ACID operations, snapshot isolation, mutations, and integrated security. For more on how BigQuery storage provides a data warehouse without the limitations associated with traditional data warehouse storage, check out Data warehouse storage or a data lake? Why not both?

Research and analysis

For the actual scientific research, our architecture uses managed JupyterLab notebook instances from AI Platform Notebooks. This enterprise notebook experience unifies the model training and deployment offered by AI Platform with the ingestion, preprocessing, and exploration capabilities of Dataproc and BigQuery. This architecture uses Dataproc Hub, a notebook framework that lets data scientists select a predefined Spark-based environment without having to understand all the possible configurations and required operations. Data scientists can combine this added simplicity with genomics packages like Hail to quickly create isolated sandbox environments for running genomic association studies with Apache Spark on Dataproc. To get started with genomics analysis using Hail and Dataproc, check out part two of this post.
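One way to get such a Hail-on-Dataproc sandbox is Hail's own cluster utility, hailctl. As a sketch (cluster name and region are placeholders, and exact flags depend on the Hail version installed):

```shell
# Start a Dataproc cluster preconfigured for Hail.
hailctl dataproc start hail-sandbox --region us-central1

# Open a Jupyter notebook connected to the running cluster.
hailctl dataproc connect hail-sandbox notebook

# Tear the cluster down when the analysis is done, stopping billing.
hailctl dataproc stop hail-sandbox --region us-central1
```

The stop step is what enables the "pay only while you compute" pattern described at the top of this post: the cluster exists only for the duration of the study.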
Source: Google Cloud Platform

Presto optional component now available on Dataproc

Presto is an open source, distributed SQL query engine for running interactive analytics queries against data sources of many types. We are pleased to announce the GA release of the Presto optional component for Dataproc, our fully managed cloud service for running data processing software from the open source ecosystem. This new optional component brings the full suite of support from Google Cloud, including fast cluster startup times and integration testing with the rest of Dataproc. The Presto release of Dataproc comes with several new features that improve the experience of using Presto, including out-of-the-box BigQuery integration, Presto UI support in Component Gateway, JMX and logging integrations with Cloud Monitoring, Presto job submission for automating SQL commands, and improvements to the Presto JVM configurations.

Why use Presto on Dataproc

Presto provides a fast and easy way to process and perform ad hoc analysis of data from multiple sources, across both on-premises systems and other clouds. You can seamlessly run federated queries across large-scale Dataproc instances and other sources, including BigQuery, HDFS, Cloud Storage, MySQL, Cassandra, or even Kafka. Presto can also help you plan your next BigQuery extract, transform, and load (ETL) job. You can use Presto queries to better understand how to link the datasets, determine what data is needed, and design a wide, denormalized BigQuery table that encapsulates information from multiple underlying source systems. Check out a complete tutorial of this approach.

With Presto on Dataproc, you can accelerate data analysis because the Presto optional component takes care of much of the overhead required to get started with Presto. Presto coordinators and workers are managed for you, and you can use an external metastore such as Hive to manage your Presto catalogs. You also have access to Dataproc features like initialization actions and Component Gateway, which now includes the Presto UI.
Here are additional details about the benefits Presto on Dataproc offers:

Better JVM tuning

We've configured the Presto component to have better garbage collection and memory allocation properties based on the established recommendations of the Presto community. To learn more about configuring your cluster, check out the Presto docs.

Integrations with BigQuery

BigQuery is Google Cloud's serverless, highly scalable, and cost-effective cloud data warehouse offering. With the Presto optional component, the BigQuery connector is available by default to run Presto queries on data in BigQuery by making use of the BigQuery Storage API. To help you get started out of the box, the Presto optional component also comes with two BigQuery catalogs installed by default: bigquery for accessing data in the same project as your Dataproc cluster, and bigquery_public_data for accessing BigQuery's public datasets project. You can also add your own catalog when creating a cluster via cluster properties. Adding the appropriate properties to your cluster creation command will create a catalog named bigquery_my_other_project for access to another project called my-other-project.

Note: This is currently supported only on Dataproc image version 1.5 or preview image version 2.0, as Presto version 331 or above is required for the BigQuery connector.

Use an external metastore to keep track of your catalogs

While catalogs can be added to your Presto cluster at creation time, you can also keep track of your Presto catalogs by using an external metastore such as Hive and adding this to your cluster configuration.
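The original post showed the exact catalog properties as an image. As a hedged reconstruction (the presto-catalog: property prefix maps entries into Presto's catalog configuration; the property names below are assumptions that may vary by image version), the cluster creation command might look like:

```shell
# Hypothetical sketch: define an extra BigQuery catalog at cluster creation
# that points Presto at a different project (my-other-project).
gcloud dataproc clusters create my-presto-cluster \
  --region=us-central1 \
  --image-version=1.5 \
  --optional-components=PRESTO \
  --enable-component-gateway \
  --properties="presto-catalog:bigquery_my_other_project.connector.name=bigquery,presto-catalog:bigquery_my_other_project.project-id=my-other-project"
```

Once the cluster is up, the new catalog appears alongside the default bigquery and bigquery_public_data catalogs in SHOW CATALOGS output.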
When creating a cluster, you add the corresponding metastore properties to its configuration. The Dataproc Metastore, now accepting alpha customers, provides a completely managed and serverless option for keeping your Presto metadata accessible from multiple Dataproc clusters, and lets you share tables with other processing engines like Apache Spark and Apache Hive.

Create a Dataproc cluster with the Presto optional component

You can create a Dataproc cluster by selecting a region and enabling the Presto, Anaconda, and Jupyter optional components along with Component Gateway. Including the Jupyter optional component and the necessary Python dependencies also lets you run Presto commands from a Jupyter notebook.

Submit Presto jobs with the gcloud command

You can use Dataproc's Presto Jobs API to submit Presto commands to your Dataproc cluster. For example, executing the "SHOW CATALOGS;" Presto command returns the list of catalogs available to you.

Query BigQuery public datasets

BigQuery datasets are known as schemas in Presto. To view the full list of datasets, use the SHOW SCHEMAS command. Then run the SHOW TABLES command to see which tables are in a dataset. For this example, we'll use the chicago_taxi_trips dataset and submit a Presto SQL query against the taxi_trips table. You can also submit jobs using Presto SQL queries saved in a file: create a file called taxi_trips.sql, add your query to it, and submit it to the cluster.

Submit Presto SQL queries using Jupyter notebooks

Using Dataproc Hub or the Jupyter optional component, with ipython-sql, you can execute Presto SQL queries from a Jupyter notebook.
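The commands for the steps above appeared as images in the original post. A hedged reconstruction using standard gcloud flags (cluster name, region, and the sample queries are placeholders) might look like:

```shell
# Create a cluster with the Presto, Anaconda, and Jupyter optional components
# and Component Gateway enabled (image 1.5+ for the BigQuery connector).
gcloud dataproc clusters create presto-demo \
  --region=us-central1 \
  --image-version=1.5 \
  --optional-components=PRESTO,ANACONDA,JUPYTER \
  --enable-component-gateway

# List the catalogs available on the cluster.
gcloud dataproc jobs submit presto \
  --cluster=presto-demo --region=us-central1 \
  --execute="SHOW CATALOGS;"

# List the public datasets (schemas), then the tables in one of them.
gcloud dataproc jobs submit presto \
  --cluster=presto-demo --region=us-central1 \
  --execute="SHOW SCHEMAS IN bigquery_public_data;"
gcloud dataproc jobs submit presto \
  --cluster=presto-demo --region=us-central1 \
  --execute="SHOW TABLES IN bigquery_public_data.chicago_taxi_trips;"

# Run an ad hoc query against the taxi_trips table.
gcloud dataproc jobs submit presto \
  --cluster=presto-demo --region=us-central1 \
  --execute="SELECT COUNT(*) FROM bigquery_public_data.chicago_taxi_trips.taxi_trips;"

# Or submit a query saved in a local file.
cat > taxi_trips.sql <<'EOF'
SELECT payment_type, COUNT(*) AS trips
FROM bigquery_public_data.chicago_taxi_trips.taxi_trips
GROUP BY payment_type;
EOF
gcloud dataproc jobs submit presto \
  --cluster=presto-demo --region=us-central1 \
  --file=taxi_trips.sql
```

Each submission runs as a tracked Dataproc job, so its output and status are visible in the console alongside Spark and Hive jobs.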
In the first cell of your notebook, load the ipython-sql extension and point it at the cluster's Presto endpoint; you can then run ad hoc Presto SQL queries directly from your notebook.

Access the Presto UI directly from the Cloud Console

You can now access the Presto UI without needing to SSH into the cluster, thanks to Component Gateway, which creates a link that you can access from the cluster page in the Cloud Console. With the Presto UI, you can monitor the status of your coordinators and workers, as well as the status of your Presto jobs.

Logging, monitoring, and diagnostic tarball integrations

Presto jobs are now integrated with Cloud Monitoring and Cloud Logging to better track their status. By default, Presto job information is not shown in the main cluster monitoring page for Dataproc clusters. However, you can easily create a new dashboard using Cloud Monitoring and the Metrics Explorer. To create a chart for all Presto jobs on your cluster, select the resource type Cloud Dataproc Cluster and the metric Job duration. Then apply a filter to only show job_type = PRESTO_JOB and use the mean aggregator.

In addition to Cloud Monitoring, Presto server and job logs are available in Cloud Logging. Last, Presto config and log information now also comes bundled in your Dataproc diagnostic tarball, which you can download with the gcloud dataproc clusters diagnose command.

To get started with Presto on Dataproc, check out this tutorial on using Presto with Cloud Dataproc. And use the Presto optional component to create your first Presto on Dataproc cluster.
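Downloading the diagnostic tarball is a single command; the cluster name and region below are placeholders:

```shell
# Download the diagnostic tarball, which now bundles Presto config and logs.
# The command prints a Cloud Storage path where the tarball is written.
gcloud dataproc clusters diagnose presto-demo --region=us-central1
```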
Source: Google Cloud Platform

Not just compliance: reimagining DLP for today’s cloud-centric world

As the name suggests, data loss prevention (DLP) technology is designed to help organizations monitor, detect, and ultimately prevent attacks and other events that can result in data exfiltration and loss. The DLP technology ecosystem—covering network DLP, endpoint DLP, and data discovery DLP—has a long history, going back nearly 20 years, and with data losses and leaks continuing to impact organizations, it remains an important security control.

In this blog, we'll look back at the history of DLP before discussing how DLP is useful in today's environment, including compliance, security, and privacy use cases.

DLP history

Historically, DLP technologies have presented some issues that organizations have found difficult to overcome, including:

- Disconnects between business and IT
- Mismatched expectations
- Deployment headwinds
- DLP alert triage difficulties

DLP solutions were also born in an era when security technologies were typically hardware appliances or deployable software—while the cloud barely existed as a concept—and most organizations were focused on perimeter security. This meant that DLP was focused largely on blocking or detecting data as it crossed the network perimeter. With the cloud and other advances, this is not the reality today, and often neither the users nor the applications live within the perimeter.

This new reality means we have to ask new questions:

- How do you reinvent DLP for today's world, where containers, microservices, mobile phones, and scalable cloud storage coexist with traditional PCs and even mainframes?
- How does DLP apply in a world where legacy compliance mandates coexist with modern threats and evolving privacy requirements?
- How does DLP evolve away from some of the issues that have hurt its reputation among security professionals?

DLP today

Let's start with where some of the confusion around DLP use cases comes from.
While DLP technology is rarely cited as a control in regulations today (here's an example), for a few years it was widely considered primarily a compliance solution. Despite that compliance focus, some organizations used DLP technologies to support their threat detection mission, using it to detect intentional data theft and risky data negligence. Today, DLP is employed to support privacy initiatives and is used to monitor (and minimize the risk to) personal data in storage and in use. Paradoxically, at some organizations these DLP domains can conflict with each other. For example, if the granular monitoring of employees for insider threat detection is implemented incorrectly, it may conflict with privacy policies.

The best uses for DLP today live under a triple umbrella of security, privacy, and compliance. It should cover use cases from all three domains, and do so without overburdening the teams operating it. Modern DLP is also a natural candidate for cloud migration due to its performance profile. In fact, DLP needs to move to the cloud simply because so much enterprise data is quickly moving there.

To demonstrate how DLP can work for compliance, security, and privacy in this new cloud world, let's break down a Cloud DLP use case from each domain to illustrate some tips and best practices.

Compliance

Many regulations focus on protecting one particular type of data—payment data, personal health information, and so on. This can lead to challenges like how to find that particular type of data so that you can protect it in the first place. Of course, every organization strives to have well-governed data that can be easily located. We also know that in today's world, where large volumes of data are stored across multiple repositories, this is easier said than done. Let's look at the example of the Payment Card Industry Data Security Standard (PCI DSS), an industry mandate that covers payment card data. (Learn more about PCI DSS on Google Cloud here.)
In many cases going back 10 years or more, the data that was in scope for PCI DSS—i.e., payment card numbers—was often found outside of what was considered to be a Cardholder Data Environment (CDE). This pushed data discovery to the forefront, even before cloud environments became popular. Today, the need to discover "toxic" data—i.e., data that can lead to possibly painful compliance efforts, like payment card numbers—is even stronger, and data discovery DLP is a common method for finding this "itinerant" payment data. When moving to the cloud, the same logic applies: you need to scan your cloud resources for card data to ensure that there is no regulated data outside the systems or components designated to handle it. This use case is something that should become part of what PCI DSS now calls "BAU," or business as usual, rather than an assessment-time activity. A good practice is to conduct a periodic broad scan of many locations, followed by a deep scan of "high-risk" locations where such data has been known to accidentally appear. This may also be combined with a deep and broad scan before each audit or assessment, whether quarterly or annually. For specific advice on how to optimally configure Google Cloud DLP for this use case, review these pages.

Security

DLP technologies are also useful in security risk reduction projects. With data discovery, for example, some obvious security use cases include detecting sensitive data that's accessible to the public when it should not be, and detecting access credentials in exposed code. DLP equipped with data transformation capabilities can also address a long list of use cases focused on making sensitive data less sensitive, with the goal of making it less risky to keep and thus less appealing to cyber criminals. These use cases range from the mundane, like tokenization of bank account numbers, to the esoteric, like protecting AI training data pipelines from intentionally corrupt data.
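The discovery use cases above can be sketched against Cloud DLP's content:inspect REST endpoint. The project ID and the sample text here are placeholders, and this is a minimal illustration rather than a production scan configuration:

```shell
# Hypothetical sketch: ask Cloud DLP whether a text sample contains
# payment card numbers. In practice, broad scans target Cloud Storage
# or BigQuery via inspection jobs rather than inline content.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/my-project/content:inspect" \
  -d '{
    "item": {"value": "Card on file: 4111 1111 1111 1111"},
    "inspectConfig": {
      "infoTypes": [{"name": "CREDIT_CARD_NUMBER"}],
      "minLikelihood": "POSSIBLE"
    }
  }'
```

The response lists each finding with its infoType and likelihood, which is the raw material for the periodic broad-then-deep scanning practice described above.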
This approach of rendering valuable, "theft-worthy" data harmless is underused in modern data security practice, in part because of a lack of tools that make it easy and straightforward compared to, say, merely using data access controls. Where specifically can you apply this method? Account numbers, access credentials, other secrets, and even data that you don't want a particular employee to see, such as customer data, are great candidates. Note that in some cases the focus is not on making the data less attractive to external attackers, but on reducing the temptation for internal attackers looking for low-hanging fruit.

Privacy

Using DLP for privacy presented a challenge when it was first discussed. This is because some types of DLP—such as agent-based endpoint DLP—collect a lot of information about the person using the system where the agent is installed. In fact, DLP was often considered to be a privacy risk, not a privacy protection technology. Google Cloud DLP, however, was born as a privacy protection technology even before it became a security technology. Types of DLP that can discover, transform, and anonymize data—whether in storage or in motion (as a stream)—present clear value for privacy-focused projects. The range of use cases that involve transforming data that poses a privacy risk is broad, and includes names, addresses, ages (yes, even age can reveal a person's identity when small groups are analyzed), phone numbers, and so on.

For example, consider the case where data is used for marketing purposes (such as trend analysis), but the production datastores are queried. It would be prudent in this case to transform the data in a way that retains its value for the task at hand (it still lets you see the right trend) but removes the risk of it being misused (such as by stripping the bits that can lead to personal identification).
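To sketch that transformation idea, Cloud DLP's content:deidentify endpoint can replace detected identifiers with their infoType name before the data reaches an analytics environment. The project ID and sample value are placeholders, and real deployments would typically use tokenization or format-preserving encryption instead of simple replacement:

```shell
# Hypothetical sketch: mask email addresses so trend analysis keeps its
# value while the identifying bits are removed.
curl -s -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dlp.googleapis.com/v2/projects/my-project/content:deidentify" \
  -d '{
    "item": {"value": "Contact jane.doe@example.com about order 1234"},
    "inspectConfig": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
    "deidentifyConfig": {
      "infoTypeTransformations": {
        "transformations": [{
          "primitiveTransformation": {"replaceWithInfoTypeConfig": {}}
        }]
      }
    }
  }'
```

The returned item reads something like "Contact [EMAIL_ADDRESS] about order 1234": the trend-relevant structure survives while the personal identifier does not.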
There are also valuable privacy DLP use cases where two datasets with lesser privacy risk are combined, creating a dataset with dramatically higher risk. This may come, for example, from a retailer merging a customer's shopping history with their location history (such as visits to the store). It makes sense to measure the re-identification risks and transform the datasets either before or after merging to reduce the risk of unintentional exposure.

What's next

We hope these examples help show that modern cloud-native DLP can be a powerful solution for some of today's data challenges. If you'd like to learn more about Google Cloud DLP and how it can help your organization, here are some things to try:

- First, adopt DLP as an integral part of your data security, compliance, or privacy program, not as something to be purchased and used standalone.
- Second, review your needs and use cases, for example the types of sensitive data you need to secure.
- Third, review Google Cloud DLP materials, including this video and these blogs. For privacy projects specifically, review our guidance on de-identification of personal data.
- Fourth, implement one or a very small number of use cases to learn the specific lessons of applying DLP in your particular environment. For example, for many organizations the starting use case is likely to be scanning to discover one type of data in a particular repository.

We built Google Cloud DLP for this new era, its particular use cases, and its cloud-native technology. Check out our Cloud Data Loss Prevention page for more resources on getting started.
Source: Google Cloud Platform