Catalyst Black: the next-generation mobile multiplayer action game built on Google Cloud

At Super Evil Megacorp our vision is to push the creative and technical boundaries of video games. This was the driving force behind Vainglory, released in 2014, which showed that AAA-quality games can succeed on mobile while filling a gap for multiplayer online battle arena games on the platform. Our vision continues with the development of our next free-to-play game, Catalyst Black, a AAA mobile multiplayer action game hosted on Google Cloud.

Setting the stage for a new kind of real-time mobile game

Our community enjoys developing complex skills and strategies while playing our games; they expect beautiful games that can keep up with their quick dexterity and creative thinking. Catalyst Black has been developed with this in mind. To match player expectations, we partnered with Google Cloud very early in our development process. We carried a number of lessons from the development of Vainglory that Google Cloud was well positioned to address. For starters, we no longer wanted to roll our own deployment process, and we wanted to automate certain aspects of our infrastructure management. Previously, we had five engineers rotating on-call so that if a server went down, someone could bring it back up or manually spin up a new server in case of a sudden spike in players. Now, we use Google Kubernetes Engine to automatically scale our servers up and down as needed.

By leveraging an infrastructure that takes care of itself, we can focus instead on creating an engrossing game with exciting features. This lets our team push the boundaries of how multiplayer games on mobile can foster human connections, with new developments such as a drop-in/drop-out feature that allows players to join a friend's match with minimal delay. The game is also cross-platform, which means that players on both Android and iOS can team up, no matter what device they use.

Tailoring gaming experiences to players with data analytics

To create great games we need to learn from our players, and for that we need a robust analytics pipeline with low latency. Data analytics enables us to recognize individual players and how they like to play, so we can serve them a relevant, tailored gaming experience they'll enjoy. When players log in, for example, we build a profile for them in our analytics pipeline to understand what kind of player they are. Based on their performance during the game tutorial and subsequent matches, we can understand their skill level, whether they're interested in purchasing gear for loadouts in-play, what style they like to play, and so on. With this, we can match a player with others of similar skill and adjust our push notifications depending on how often they join the game.

For the data analytics pipeline that informs all of our product decisions, we rely on BigQuery and Looker to scale our analytics frameworks and understand the player journey through our live operations, statistical modeling, and A/B testing. We also rely heavily on Looker for observability. That includes looking at crash rates, application responsiveness, uptime of all services, and latency across all regions, so that players can interact with each other in near real time, which is a must for combat games. Capturing large amounts of detailed data is critical in helping us understand how players behave and how the game itself is performing. Thanks to Google Cloud, we have gained the scalability and compute power needed to support the kind of big data analysis that we rely on daily.
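To make the analytics pipeline described above a little more concrete, here is a minimal, hypothetical sketch of how gameplay events landed in BigQuery might be queried from Python. The project, dataset, table, and column names are illustrative assumptions, not Super Evil Megacorp's actual schema.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumption: gameplay events are streamed into a BigQuery table with
# player_id, event_type, and event_time columns. All names are hypothetical.
client = bigquery.Client()

query = """
    SELECT
      player_id,
      COUNTIF(event_type = 'match_start') AS matches_played,
      COUNTIF(event_type = 'purchase') AS purchases
    FROM `my-project.analytics.player_events`
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY player_id
"""

# Aggregates per-player activity over the last week; results like these could
# feed skill matching, notification tuning, or Looker dashboards.
for row in client.query(query).result():
    print(row.player_id, row.matches_played, row.purchases)
```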
With Catalyst Black in beta, with a small user base, we generated millions of events per day. That will go up several orders of magnitude once we launch. That's a big operation to handle, but relying on Google Cloud means we have no concerns about whether the technical side of things will keep up; we trust Google Cloud to scale.

We also need data to make efficient decisions around marketing and products to sustain our business, so that we can continue to offer free-to-play games. We use Google Analytics to maximize ROI on our paid media spend through traditional social media and advertising channels. We also use BigQuery ML models to estimate the lifetime of a user and predict ROS and ROI so we can optimize our ad spend.

What we've noticed so far is that Google Cloud offers outstanding performance and speed for analyzing data and addressing potential issues, because its analytics and infrastructure solutions are tightly integrated within the Google Cloud environment. For example, by monitoring player activity we can notice if there's a sudden spike in latency in any region around the world, and simply spin up new servers with Compute Engine to better serve the population there. By migrating to Google Cloud, we're now spending only a fifth on a per-DAU basis compared to what we spent with our previous provider while working on Vainglory.

Democratizing great gaming experiences

The launch of a new game is a big day for us, but it's also the start of a long journey. We expect to see a significant spike in the number of players following the launch of Catalyst Black, and we also expect those numbers to eventually settle at hundreds of thousands of daily players. By then, we'll have a better understanding of who Catalyst Black players are and what their vision is for the future of the game, which we can then collaborate on together.

Additionally, with our in-house E.V.I.L. game engine, we can perform well on much older devices. We can run on the Samsung Galaxy S7, a device originally released in 2016, at the same 30 frames per second that the latest devices achieve. The idea is for anyone to be able to access great games, no matter where they are and what device they have.

On 25 May 2022, we launched Catalyst Black. Collaborating with Google Cloud means that we're prepared for that global scalability. Our games are live operations that organically change and grow, and by relying on GCP we can anticipate the infrastructure needed to support players anywhere in the world throughout that journey, so we're excited to see where we'll go from here.
Source: Google Cloud Platform

How Redbox improved service resilience and availability across clusters with GKE and Anthos multicluster ingress

Americans are used to seeing movie-stuffed red kiosks at stores around the country. The company that offers these movie kiosks, along with On Demand entertainment, is Redbox. Redbox started its cloud-native journey with microservices deployed in a single region with another cloud provider, primarily on a single compute cluster in that region. As business demand grew, the team saw the need to move to a multi-region deployment (east and west), keeping the same application architecture for their microservices but deploying the applications across the two regions. Most of the traffic originated from the east, with a growing volume from the west, so a multi-region solution became increasingly important for the business to serve customer traffic efficiently. Lin Meyer at Redbox shares how they solved this problem.

What were the challenges that traditional multi-cluster solutions couldn't solve?

Redbox was generally able to keep its applications running at two nines of availability with the single-region architecture, and was looking to reach three nines with a multi-regional approach. This was primarily driven by availability issues, mostly in the evenings and on weekends when demand for streaming services increased. They started exploring multi-region/multi-cluster solutions but quickly noticed a couple of key challenges with the existing toolsets.

Traffic management challenges. While there were approaches to shift traffic from one cluster to another, much of the ownership of implementing that was left to individual operators. Operators of the environment were expected to rely on telemetry to configure traffic management rules in the event of an application or infrastructure failure, routing to the available cluster.

Complexity. The options available with that cloud provider relied on a lot of custom scripting and engineering, as well as configuration across various cloud subsystems (including networking, security, and compute), to achieve the multi-cluster topology the business required.

Why GKE, Anthos and multi-cluster Ingress?

Redbox started exploring managed Kubernetes services specifically to address the availability issue back in late 2019. Redbox turned to Kubernetes to see if there was a built-in solution that addressed this multi-region need. They started by looking at other cloud managed services to determine if there was a more elegant way to achieve their multi-cluster requirement. Based on their assessment, they determined that the existing solutions were not going to work, for a couple of reasons.

A platform to build other platforms. Through their research, they determined that other managed Kubernetes services were platforms onto which organizations had to build additional capabilities. An example is the node autoscaling feature. While there were ways to deal with it, the cluster operator was expected to configure the base cluster with these services. Redbox was looking for a managed service where these infrastructure-level add-ons were available or easily enabled.

Lack of a dedicated multi-cluster/multi-region solution. They determined that they could use a DNS service to achieve this capability, but it was far more DIY and not a dedicated multi-cluster solution, which would have led to far more engineering effort and a potentially more brittle solution.
They were ideally looking for a more sophisticated multi-cluster solution. They started looking at GKE as a possibility, quickly came across multi-cluster Services (MCS) and multi-cluster Ingress (MCI), and saw real potential there for their multi-region requirements. They were definitely impressed with GKE, but MCI and MCS were the key drivers that made them consider it.

What did multi-cluster Ingress and multi-cluster Services get you?

There were several reasons why Redbox decided to go down the MCS path.

Dedicated service. Unlike the other approaches, which required a lot of engineering effort, this service is fully managed and removed much of the complexity and engineering effort from the operator's point of view. The DevOps team could focus on the service they wanted to enable this capability for, and the MCS and MCI controllers took care of the underlying details from a networking and load-balancing perspective. This level of abstraction was exactly what the Redbox team was looking for.

Declarative configuration. The fact that MCS supports YAML-based configuration worked very nicely with the rest of their Kubernetes artifacts. There was no need to click around the console and make updates, which suited Redbox's preferred approach. It also fit very nicely with their CI/CD toolchain, as they could version-control the configuration very easily.

The Redbox team was able to move forward with this service very quickly by enabling a few APIs at the project level, and was subsequently able to get their MCS service stood up and accepting traffic in a matter of days. Within the next week, they were able to complete all their failover load tests, and within two weeks they had everything stood up and deployed in production.

Figure 1: Redbox high-level architecture showing a multi-region/multi-cluster topology with ingress controlled via an external multi-cluster Ingress and multi-cluster Services backed by nginx backends.
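To make the declarative configuration described above concrete, here is a minimal, hypothetical sketch of a MultiClusterService and MultiClusterIngress pair applied with the Kubernetes Python client. The resource layout follows the GKE multi-cluster Ingress documentation as we understand it; all names, selectors, ports, and resource plurals are illustrative assumptions rather than Redbox's actual configuration.

```python
from kubernetes import client, config  # pip install kubernetes

# Hypothetical multi-cluster Service: selects an nginx-backed app in every
# member cluster of the fleet so per-cluster backends can be derived from it.
multi_cluster_service = {
    "apiVersion": "networking.gke.io/v1",
    "kind": "MultiClusterService",
    "metadata": {"name": "storefront-mcs", "namespace": "default"},
    "spec": {
        "template": {
            "spec": {
                "selector": {"app": "storefront"},
                "ports": [{"name": "web", "protocol": "TCP",
                           "port": 80, "targetPort": 8080}],
            }
        }
    },
}

# Hypothetical multi-cluster Ingress: one external VIP that routes traffic to
# the closest healthy cluster via the MultiClusterService above.
multi_cluster_ingress = {
    "apiVersion": "networking.gke.io/v1",
    "kind": "MultiClusterIngress",
    "metadata": {"name": "storefront-mci", "namespace": "default"},
    "spec": {
        "template": {
            "spec": {
                "backend": {"serviceName": "storefront-mcs", "servicePort": 80}
            }
        }
    },
}

config.load_kube_config()  # points at the fleet's config cluster
api = client.CustomObjectsApi()
for body, plural in [(multi_cluster_service, "multiclusterservices"),
                     (multi_cluster_ingress, "multiclusteringresses")]:
    api.create_namespaced_custom_object(
        group="networking.gke.io", version="v1", namespace="default",
        plural=plural, body=body)
```

The same manifests, expressed as YAML and committed to version control, are what a CI/CD pipeline like the one Redbox describes would typically apply.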
What benefits are you seeing from this deployment?

The Redbox team has now been using this service for about two years in production. To date, here are some of the key benefits they are seeing.

Availability. The service has significantly improved application availability and uptime. They are now able to achieve four nines of availability for their services simply by leveraging MCS. MCI has seamlessly handled failover from one cluster to another in the event of an issue, with virtually no disruption to end-user applications.

Simplified deployment. Because MCS services are supported as native Kubernetes objects, the DevOps team can now include the declarative configuration of services for multi-region deployment as part of their standard configuration deployment process.

Regular maintenance. An added benefit of MCS is that the DevOps team can now perform scheduled maintenance on the regional clusters without taking any downtime. For example, they currently run Istio in each cluster, and an Istio upgrade typically requires a cluster upgrade and application restarts. With MCS, they can now perform these maintenance activities without downtime, as MCS continues to guarantee application availability. This has contributed to much higher uptime.

Inter-service communication. MCS has also dramatically improved the data path for inter-service communication. Redbox currently runs multiple environments that are segregated by data category (PCI and non-PCI). By deploying a single GKE fleet for the PCI and non-PCI clusters, and then leveraging MCS to expose the services in a multi-regional manner, PCI services can now talk to non-PCI services through their MCS endpoints. This allows MCS to function as a service registry for multi-cluster services, with service endpoint discovery and invocation handled seamlessly. It also provides a more efficient data path, connecting one service to another without having to traverse an internal or external load balancer.

Summary

At Redbox, we knew we needed to modernize our infrastructure and deployment platform to meet the needs of our DVD kiosk rental business and our rapidly growing digital streaming services. When looking at options for faster, safer deployments, we found Google Kubernetes Engine and opted to use multi-cluster Ingress and multi-cluster Services to host our customer-facing applications across multiple Google Cloud regions. With GKE and MCI we have been able to continue our digital transformation to the cloud, getting new features and products to our customers faster than ever. MCI has enabled us to do this with excellent reliability and response times by routing traffic to the closest available cluster at a moment's notice.

To learn more about Anthos and MCI, please visit https://cloud.google.com/anthos
Source: Google Cloud Platform

Getting started with ML: 25+ resources recommended by role and task

Wondering how to get started with Vertex AI? Below, we've collected a list of resources to help you build and hone your skills across data science, machine learning, and artificial intelligence on Google Cloud. We've broken down the resources by what we think a Data Analyst, Data Scientist, ML Engineer, or Software Engineer might be most interested in. But we also recognize there's a lot of overlap between these roles, so even if you identify as a Data Scientist, for example, you might find some of the resources for ML Engineers or Developers just as useful!

Data Analyst

From data to insights, and perhaps some modeling, data analysts look for ways to help their stakeholders understand the value of their data.

Data exploration and feature engineering
- [Guide] Exploratory Data Analysis for Feature Selection in Machine Learning
- [Documentation] Feature preprocessing in BigQuery

Data visualization
- [Guide] Visualizing BigQuery data using Data Studio
- [Blog] Go from Database to Dashboard with BigQuery and Looker

Data Scientist

As a data scientist, you might be interested in generating insights from data, primarily through extensive exploratory data analysis, visualization, feature engineering, and modeling. If you'd like one place to start, check out Best practices for implementing machine learning on Google Cloud.

Model registry
- [Video] AI/ML Notebooks how to with Apache Spark, BigQuery ML and Vertex AI Model Registry

Model training
- [Codelab] Train models with the Vertex AI Workbench notebook executor
- [Codelab] Use autopackaging to fine tune Bert with Hugging Face on Vertex AI Training
- [Blog] How to train and tune PyTorch models on Vertex AI

Large-scale model training
- [Codelab] Multi-Worker Training and Transfer Learning with TensorFlow
- [Blog] Optimize training performance with Reduction Server on Vertex AI
- [Video] Distributed training on Vertex AI Workbench

Model tuning
- [Codelab] Hyperparameter tuning
- [Video] Faster model training and experimentation with Vertex AI

Model serving
- [Blog] How to deploy PyTorch models on Vertex AI
- [Blog] 5 steps to go from a notebook to a deployed model

ML Engineer

Below are resources for an ML Engineer, someone whose focus area is MLOps, or the operationalization of feature management, model serving and monitoring, and CI/CD with ML pipelines.

Feature management
- [Blog] Kickstart your organization's ML application development flywheel with the Vertex Feature Store
- [Video] Introduction to Vertex AI Feature Store

Model monitoring
- [Blog] Monitoring feature attributions: How Google saved one of the largest ML services in trouble

ML pipelines
- [Blog] Orchestrating PyTorch ML Workflows on Vertex AI Pipelines
- [Codelab] Intro to Vertex Pipelines
- [Codelab] Using Vertex ML Metadata with Pipelines

Machine learning operations
- [Guide] MLOps: Continuous delivery and automation pipelines in machine learning

Software Engineer with ML applications

Here are some resources if you work as a more traditional software engineer who spends more time using ML in applications and less time on data wrangling, model building, or MLOps.
- [Blog] Find anything blazingly fast with Google's vector search technology
- [Blog] Using Vertex AI for rapid model prototyping and deployment
- [Video] Machine Learning for developers in a hurry

Looking for more resources?

Are you looking for more information but can't seem to find it? Let us know! Reach out to us on LinkedIn: Nikita Namjoshi, Polong Lin
Source: Google Cloud Platform

Reimagining AutoML with Google research: announcing Vertex AI Tabular Workflows

Earlier this year, we shared details about our collaboration with USAA, a leading provider of insurance and financial services to U.S. military members and veterans, which leveraged AutoML models to accelerate the claims process. Boasting a peak 28% improvement relative to baseline models, the automated solution USAA and Google Cloud produced can predict labor costs and car-part repair/replace decisions based on photos of damaged vehicles, potentially redefining how claims are assessed and handled.

This use case combines a variety of technologies that extend well beyond the insurance industry, among them a particularly sophisticated approach to tabular data, or data structured into tables with columns and rows (e.g., vehicle make/model and points of damage, in the case of USAA). Applying machine learning (ML) to tabular data can unlock tremendous value for businesses of all kinds, but few tools have been both user-friendly and appropriate for enterprise-scale jobs. Vertex AI Tabular Workflows, announced at the Google Cloud Applied ML Summit, aims to change this.

Applying Google AI research to solving customer problems

Google's investment in rigorous artificial intelligence (AI) and ML research makes cutting-edge technologies not only more widely available, but also easier to use, faster to deploy, and more efficient to manage. Our researchers publish over 800 papers per year, generating hundreds of academic citations. Google Cloud has successfully turned the results of this research into a number of award-winning, enterprise-grade products and solutions.

For example, Neural Architecture Search (NAS) was first described in a November 2016 research paper and later became Vertex AI NAS, which lets data science teams train models with higher accuracy, lower latency, and lower power requirements. Similarly, Matching Engine was first described in an August 2019 paper before becoming an open-sourced TensorFlow implementation called ScaNN in 2020, and then Vertex AI Matching Engine in 2021, which helps data teams address the "nearest neighbor search" problem. Other recent research-based releases include the ability to run AlphaFold, DeepMind's revolutionary protein-folding system, on Vertex AI. In tabular data, research into evolutionary and "learning-to-learn" methods led to the creation of AutoML Tables and AutoML Forecast in Vertex AI.

Data scientists and analysts have enjoyed using AutoML for its ability to abstract the inherent complexity of ML into simpler processes and interfaces without sacrificing scalability or accuracy. They can train models with fewer lines of code, harness advanced algorithms and tools, and deploy models with a single click. A number of high-profile customers have already reaped the benefits of our AutoML products. For example, Amaresh Siva, senior vice president for Innovation, Data and Supply Chain Technology at Lowe's, said, "Using Vertex AI Forecast, Lowe's has been able to create accurate hierarchical models that balance between SKU and store-level forecasts. These models take into account our store-level, SKU-level, and region-level inventory, promotions data and multiple other signals, and are yielding more accurate forecasts." These and many other success stories helped Vertex AI AutoML become the leading automated machine learning framework in the market, according to the Kaggle "State of Data Science and Machine Learning 2021" report.
Expanding AutoML with Vertex AI Tabular Workflows

While we have been thrilled by the adoption of our AI platforms, we are also well aware of requests for more control, flexibility, and transparency in AutoML for tabular data. Historically, the only answer to these requests was Vertex AI Custom Training. While it provided the necessary flexibility, it also required engineering the entire ML pipeline from scratch using various open source tools, which would often need to be maintained by a dedicated team. It was clear that we needed to provide options "in the middle" between AutoML and custom training: something that is powerful and leverages Google's research, yet is flexible enough to allow many customizations.

This is why we are excited to announce Vertex AI Tabular Workflows: integrated, fully managed, scalable pipelines for end-to-end ML with tabular data. These include AutoML products and new algorithms from Google Research teams and open source projects. Tabular Workflows are fully managed by the Vertex AI team, so users don't need to worry about updates, dependencies, and conflicts. They scale easily to large datasets, so teams don't need to re-engineer infrastructure as workloads grow. Each workflow is paired with an optimal hardware configuration for best performance. Lastly, each workflow is deeply integrated with the rest of the Vertex AI MLOps suite, such as Vertex AI Pipelines and experiment tracking, allowing teams to run many more experiments in less time.

The AutoML Tables workflow is now available on Vertex AI Pipelines, bringing many powerful improvements, such as support for 1 TB datasets with 1,000 columns, the ability to control which model architectures the search algorithm evaluates, and the ability to change the hardware used in the pipeline to improve training time. Most importantly, each AutoML component can be inspected in a powerful pipeline graph interface that lets customers see the transformed data tables, evaluated model architectures, and many more details. Every component also gains extended flexibility and transparency, such as the ability to customize parameters and hardware, and to view process status, logs, and more. Customers move from a world with controls for the whole pipeline to a world with controls for every step in the pipeline.

Google's investment in tabular data ML research has also led to the creation of multiple novel architectures such as TabNet, Temporal Fusion Transformers, and Wide & Deep. These models have been well received by the research community, resulting in hundreds of academic citations. We are excited to offer fully managed, optimized pipelines for TabNet and Wide & Deep in Tabular Workflows. Our customers can experience the unique features of these models, like built-in explainability tools, without worrying about implementation details or selecting the right hardware.

New workflows have also been added to help improve and scale feature engineering work. For example, our Feature Selection workflow can quickly rank the most important features in datasets with over 10,000 columns. Customers can use it to explore their data, or combine it with TabNet or AutoML pipelines to enable training on very large datasets. We hope to see many more interesting stories of customers using multiple Tabular Workflows together.
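For orientation, here is a minimal, hypothetical sketch of how a managed tabular workflow might be launched as a Vertex AI pipeline run from Python. The template path, parameter names, project, and bucket are illustrative assumptions; the actual prebuilt templates and their parameters are documented with each Tabular Workflow.

```python
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

# Hypothetical example: submit a prebuilt tabular training pipeline template.
# The template URI and parameter names below are placeholders, not the
# official Tabular Workflows interface.
job = aiplatform.PipelineJob(
    display_name="tabular-workflow-demo",
    template_path="gs://my-templates/tabular_workflow_pipeline.json",
    pipeline_root="gs://my-staging-bucket/pipeline-root",
    parameter_values={
        "data_source_bigquery_table_path": "bq://my-project.sales.training_data",
        "target_column": "label",
        "prediction_type": "classification",
    },
)

job.run(sync=False)  # submit and return; each step can then be inspected
                     # in the Vertex AI Pipelines graph UI
```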
Vertex AI Tabular Workflows makes all of this collaboration and research available to our customers as an enterprise-grade solution, helping accelerate the deployment of ML in production. It packages the ease of AutoML with the ability to interpret each step in the workflow and to choose what is handled by AutoML versus by custom engineering. The managed AutoML pipeline is a glass box, letting data scientists and engineers see and interpret each step in the model building and deployment process, including the ability to flexibly tune model parameters and more easily refine and audit models. Elements of Vertex AI Tabular Workflows can also be integrated into existing Vertex AI pipelines. We've added new managed algorithms, including advanced research models like TabNet, new algorithms for feature selection, model distillation, and much more. Future noteworthy components will include implementations of advanced Google models such as Temporal Fusion Transformers, and popular open source models like XGBoost.

Today's research projects are tomorrow's enterprise ML catalysts

We look forward to seeing Tabular Workflows improve ML operations across multiple industries and domains. Marketing budget allocation can be improved because feature ranking can identify well-performing features from a large variety of internal datasets; these new features can boost the accuracy of user churn prediction models and campaign attribution. Risk and fraud operations can benefit from models like TabNet, whose built-in explainability features allow for better model accuracy while satisfying regulatory requirements. In manufacturing, being able to train models on hundreds of gigabytes of full, unsampled sensor data can significantly improve the accuracy of equipment breakdown predictions; a better preventative maintenance schedule means more cost-effective care with fewer breakdowns. There is a tabular data use case in virtually every business, and we are excited to see what our customers achieve.

As our history of AI and ML product development and new product launches demonstrates, we're dedicated to research collaborations that help us productize the best of Google and Alphabet AI technologies for enterprise-scale tasks and workflows. We look forward to continuing this journey, and invite you to check out the keynote from our Applied ML Summit to learn more.
Source: Google Cloud Platform

Accelerating ML with Vertex AI: From retail and finance to manufacturing and automotive

Artificial intelligence (AI) and machine learning (ML) are transforming industries around the world, from trailblazing new frontiers in conversational human-computer interactions and speech-based analysis, to improving product discovery in retail, to unlocking medical research with advancements like AlphaFold. But underpinning all ML advancements is a common challenge: fast-tracking the building and deployment of ML models into production, and abstracting the most technically complex processes into unified platforms that open ML to more users.

Our mission is to remove every barrier in the way of deploying useful and predictable ML at scale. This is why, in May 2021, we announced the general availability of Vertex AI, a managed ML platform designed specifically to accelerate the deployment and maintenance of ML models. Leveraging Vertex AI, data scientists can speed up ML development and experimentation by 5x, with 80% fewer lines of code required.

In the year since the launch, customers across diverse industries have successfully accelerated the deployment of machine learning models in production with Vertex AI. In fact, through Vertex AI and BigQuery, we have seen 2.5 times more machine learning predictions generated in 2021 compared to the previous year. Additionally, customers are seeing great value in Vertex AI's unified data and AI story. This is best represented by the 25x growth in active customers we have seen for Vertex AI Workbench over the last six months. Let's take a look at how some of these organizations are using Vertex AI today.

Accelerating ML in retail: ML at Wayfair, Etsy, Lowe's and Magalu

Our research of over 100 global retail executives identified that AI- and ML-powered applications have the potential to drive $230-515 billion in business value. Whether the use cases involve optimizing inventory or improving customer experience, retail is among the industries where ML adoption has been strongest.

For example, online furniture and home goods retailer Wayfair has been able to run large model training jobs 5-10x faster by leveraging Vertex AI. "We're doing ML at a massive scale, and we want to make that easy. That means accelerating time-to-value for new models, increasing reliability and speed of very large regular re-training jobs, and reducing the friction to build and deploy models at scale," said Matt Ferrari, Head of Ad Tech, Customer Intelligence, and Machine Learning at Wayfair, in a Forbes article. Vertex AI helps the company to "weave ML into the fabric of how we make decisions," he added.

Elsewhere, Etsy estimates it has reduced the time it takes to go from ideation to a live ML experiment by about 50%. "Our training and prototyping platform largely relies on Google Cloud services like Vertex AI and Dataflow, where customers can experiment freely with the ML framework of their choice," the company notes in a blog post. "These services let customers easily leverage complex ML infrastructure (such as GPUs) through comfortable interfaces like Jupyter Notebooks. Massive extract transform load (ETL) jobs can be run through Dataflow while complex training jobs of any form can be submitted to Vertex AI for optimization."

Forecasting in particular is a major retail use case that can be significantly improved with the power of ML.
Vertex AI Forecast is already helping Lowe's with a range of models at the company's more than 1,700 stores, according to Amaresh Siva, senior vice president for Innovation, Data and Supply Chain Technology at Lowe's. "Using Vertex AI Forecast, Lowe's has been able to create accurate hierarchical models that balance between SKU and store-level forecasts. These models take into account our store-level, SKU-level, and region-level inventory, promotions data and multiple other signals, and are yielding more accurate forecasts," said Siva.

Brazilian retailer Magalu has similarly deployed Vertex AI to reduce inventory prediction errors. With Vertex AI, "four-week live forecasting showed significant improvements in error (WAPE) compared to our previous models," said Fernando Nagano, director of Analytics and Strategic Planning at Magalu. "This high accuracy insight has helped us to plan our inventory allocation and replenishment more efficiently to ensure that the right items are in the right locations at the right time to meet customer demand and manage costs appropriately."

From memory to manufacturing to mobile payments: ML at Seagate, Coca-Cola Bottlers Japan, and Cash App

Retail is not the only industry leveraging the power of AI and ML. According to our research, 66% of manufacturers who use AI in their day-to-day operations report that their reliance on AI is increasing.

Google joined forces with Seagate, our HDD original equipment manufacturer (OEM) partner for Google's data centers, to leverage ML for improved prediction of frequent HDD problems, such as disk failure. The Vertex AI AutoML model generated for the effort achieved a precision of 98% with a recall of 35%, compared to a precision of 70-80% and a recall of 20-25% for the competing custom ML model.

Coca-Cola Bottlers Japan (CCBJ) is also ramping up its ML efforts, using Vertex AI and BigQuery to process billions of data records from 700,000 vending machines, helping the company make strategic decisions about when and where to locate products. "We have created a prediction model of where to place vending machines, what products are lined up in the machines and at what price, how much they will sell, and implemented a mechanism that can be analyzed on a map," said Minori Matsuda, Data Science Manager and Google Developer Expert at CCBJ, in a blog post. "We were able to realize it in a short period of time with a sense of speed, from platform examination to introduction, prediction model training, on-site proof of concept to rollout."

Turning to finance, Cash App, a platform from the U.S.-based financial services company Square, is leveraging products from Google Cloud and NVIDIA to achieve a roughly 66% improvement in completion time for core ML processing workflows. "Google Cloud gave us critical control over our processes," said Kyle De Freitas, a senior software engineer at Dessa, which was acquired by Cash App in 2020. "We recognized that Compute Engine A2 VMs, powered by the NVIDIA A100 Tensor Core GPUs, could dramatically reduce processing times and allow us to experiment much faster. Running NVIDIA A100 GPUs on Google Cloud's Vertex AI gives us the foundation we need to continue innovating and turning ideas into impactful realities for our customers."

Driving toward an ML-fueled future: ML at Cruise and SUBARU

In the automotive space, manufacturers throughout the world have invested billions to digitize operations and invest in AI, both to optimize design and to enable new features.
For instance, self-driving car service Cruise has millions of miles of autonomous travel under its belt, with Vertex AI helping the company quickly train and update the ML models that power crucial functions like image recognition and scene understanding. "After we ingest and analyze that data, it's fed back into our dynamic ML Brain, a continuous learning machine that actively mines the collected data to automatically train new models that exceed the performance of the older models," explained Mo Elshenawy, Executive Vice President of Engineering at Cruise, in a blog post. "This is done with the help of Vertex AI, where we are able to train hundreds of models simultaneously, using hundreds of GPU years every month!"

Meanwhile, SUBARU is turning to ML to help eliminate fatal accidents caused by its cars. SUBARU Lab uses Google Cloud to analyze images from the company's EyeSight stereo cameras, for example. The team uses a combination of NVIDIA A100 GPUs and Compute Engine for processing muscle, with data scientists and data engineers using Vertex AI to build models. "I chose Google Cloud from many platforms because it had multiple managed services such as Vertex AI, the managed notebooks option, and Vertex AI Training that were useful for AI development. It was also fascinating to have high-performance hardware that could handle large-scale machine learning operations," said Thossimi Okubo, Senior Engineer of AI R&D at SUBARU.

Working together to accelerate ML deployment

We are very encouraged by the adoption of Vertex AI, and we are excited to continue working with key customers and partners to expand our thinking around the challenges data scientists face in accelerating deployment of ML models in production. Watch our Google Cloud Applied ML Summit session with Smitha Shyam, Director of Engineering for Uber AI, and Bryan Goodman, Director of AI and Cloud at Ford, to get a sense of how we're working with partners and customers on this journey. To learn more, check out additional expert commentary at our Applied ML Summit, peruse our latest Vertex AI updates, or visit our Data Science on Google Cloud page to learn more about our unified data and AI story.
Source: Google Cloud Platform

Accelerate the deployment of ML in production with Vertex AI

As part of today's Google Cloud Applied ML Summit, we're announcing a variety of product features and technology partnerships to help you more quickly and efficiently build, deploy, manage, and maintain machine learning (ML) models in production. Our performance tests found a 2.5x increase in the number of ML predictions generated through Vertex AI and BigQuery in 2021, and a 25x increase in active customers for Vertex AI Workbench in just the last six months.

Customers have made clear that managed and integrated ML platforms are crucial to accelerating the deployment of ML in production. For example, Wayfair accelerated large model training jobs by 5-10x with Vertex AI, enabling increased experimentation, reduced coding, and more models making it to production. Likewise, Seagate used AutoML to build an ML model with 98% precision, compared to only 70-80% from their earlier custom models. Bryan Goodman, Director of AI and Cloud at Ford, said, "Vertex AI is an integral part of the Ford machine learning development platform, including accelerating our efforts to scale AI for non-software experts."

This momentum is tremendous, but we know there is more work to be done to help enterprises across the globe fast-track the digitization of operations with AI. According to Gartner*, "Only 10% of organizations have 50% or more of their software engineers trained on machine learning skills." [Source: Gartner: Survey Analysis: AI Adoption Spans Software Engineering and Organizational Boundaries – Van Baker, Benoit Lheureux – November 25, 2021] Similarly, Gartner states that "on average, 53% of [ML] projects make it to production." [Source: Gartner: 4 Machine Learning Best Practices to Achieve Project Success – Afraz Jaffri, Carlie Idoine, Erick Brethenoux – December 7, 2021]

These findings speak to the primary challenge: not only gaining ML skills, or abstracting technology dependencies so more people can participate in the process of ML deployment, but also applying those skills to deploy models in production, continuously monitor them, and drive business impact. Let's take a look at how our announcements will help you remove the barriers to deploying useful and predictable ML at scale.

Four pillars for accelerating ML deployment in production

The features we're announcing today fit into the following four-part framework that we've developed in discussions with customers, partners, and other industry thought leaders.

Providing freedom of choice

Data scientists work most effectively when they have the freedom to choose the ML frameworks, deployment instances, and compute processors they'll work with. To this end, we partnered with NVIDIA earlier this year to launch One Click Deploy of NVIDIA AI software solutions to Vertex AI Workbench. NVIDIA's NGC catalog lets data scientists start their model development on Google Cloud, speeding the path to building and deploying state-of-the-art AI. The feature simplifies the deployment of Jupyter Notebooks from over 12 complex steps to a single click, abstracting away routine tasks to help data science teams focus on accelerating ML deployment in production.

We also believe this power to choose should not come at a cost. With this in mind, we are thrilled to announce the availability of Vertex AI Training Reduction Server, which supports both TensorFlow and PyTorch. Training Reduction Server is built to optimize the bandwidth and latency of multi-node distributed training on NVIDIA GPUs. This significantly reduces the training time required for large language workloads, like BERT, and further enables cost parity across different approaches. In many mission-critical business scenarios, a shortened training cycle allows data scientists to train a model with higher predictive performance within the constraints of a deployment window.
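The following is a minimal, hypothetical sketch of launching a distributed training job that attaches Reduction Server replicas via the Vertex AI Python SDK. The container image, bucket, machine types, and replica counts are illustrative assumptions, and the reduction-server arguments are our best understanding of the SDK surface; check the current google-cloud-aiplatform documentation before relying on them.

```python
from google.cloud import aiplatform  # pip install google-cloud-aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

# Hypothetical distributed training job: a custom container holding a
# TensorFlow or PyTorch training loop that performs NCCL all-reduce.
job = aiplatform.CustomContainerTrainingJob(
    display_name="bert-distributed-training",
    container_uri="us-docker.pkg.dev/my-project/training/bert-trainer:latest",
)

job.run(
    replica_count=4,                      # GPU worker replicas
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=2,
    # Assumed Reduction Server arguments; depending on SDK version, a
    # reduction_server_container_uri may also be required (see the docs).
    reduction_server_replica_count=4,
    reduction_server_machine_type="n1-highcpu-16",
)
```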
Meeting users where they are

Whether ML tasks involve pre-trained APIs, AutoML, or custom models built from the ground up, skills proficiency should not be the gating criterion for participation in an enterprise-wide strategy. This is the only way to get your data engineers, data analysts, ML researchers, MLOps engineers, and data scientists to participate in the process of ML acceleration across the organization. To this end, we're announcing the preview of Vertex AI Tabular Workflows, which includes a glassbox, managed AutoML pipeline that lets you see and interpret each step in the model building and deployment process. Now, you can comfortably train datasets of over a terabyte, without sacrificing accuracy, by picking and choosing which parts of the process you want AutoML to handle versus which parts you want to engineer yourself. Elements of Tabular Workflows can also be integrated into your existing Vertex AI pipelines. We've added new managed algorithms, including advanced research models like TabNet, new algorithms for feature selection, model distillation, and much more. Future noteworthy components will include implementations of Google proprietary models such as Temporal Fusion Transformers, and open source models like XGBoost and Wide & Deep.

Uniting data and AI

To fast-track the deployment of ML models into production, your organization needs a unified data and AI strategy. To further integrate data engineering capabilities directly into the data science environment, we're announcing features to address all data types: structured data, graph data, and unstructured data.

First up, for structured data, we are announcing the preview of Serverless Spark on Vertex AI Workbench. This allows data scientists to launch a serverless Spark session within their notebooks and interactively develop code.

In the space of graph data, we are excited to introduce a data partnership with Neo4j that unlocks the power of graph-based ML models, letting data scientists explore, analyze, and engineer features from connected data in Neo4j and then deploy models with Vertex AI, all within a single unified platform. With Neo4j Graph Data Science and Vertex AI, data scientists can extract more predictive power from models using graph-based inputs, and get to production faster across use cases such as fraud and anomaly detection, recommendation engines, customer 360, logistics, and more.

In the space of unstructured data, our partnership with Labelbox is all about helping data scientists leverage the power of unstructured data to build more effective ML models on Vertex AI. Labelbox's native integration with Vertex AI reduces the time required to label unstructured image, text, audio, and video data, which helps accelerate model development for image classification, object detection, entity recognition, and various other tasks. With the integration only available on Google Cloud, Labelbox and Vertex AI create a flywheel for accelerated model development.

Managing and maintaining ML models

Finally, our customers demand tools to easily manage and maintain ML models.
Data scientists shouldn't need to be infrastructure engineers or operations engineers to keep models accurate, explainable, scaled, disaster resistant, and secure, all in an ever-changing environment. To address this need, we're announcing the preview of Vertex AI Example-based Explanations. This novel Explainable AI technique helps data scientists identify mislabeled examples in their training data and discover what data to collect to improve model accuracy. By using example-based explanations to quickly diagnose and treat issues, data scientists can now maintain a high bar on model quality.

Ford and Vertex AI

As mentioned, we've seen our customers achieve great results with our AI and ML solutions. Ford, for example, is leveraging Vertex AI across many use cases and user types. "We're using Vertex AI Pipelines to build generic and reusable modular machine learning workflows. These are useful as people build on the work of others and to accelerate their own work," explained Goodman. "For low-code and no-code users, AutoML models are useful for transcribing speech and basic object detection, and we like that there is integrated deployment for trained models. It really helps people get things into use, which is important. For power users, we are extensively leveraging Vertex AI's custom model deployment for our in-house models. It's ideal for data scientists and data engineers not to have to master skills in infrastructure and software. This is critical for growing the community of AI builders at Ford, and we're seeing really good success."

Customer stories and enthusiasm propel our efforts to continue creating better products that make AI and ML more accessible, sustainable, and powerful. We're thrilled to have been on this journey with you so far, and we can't wait to see what you do with our new announcements. To learn more, check out additional expert commentary at our Applied ML Summit, and visit our Data Science on Google Cloud page to learn more about how Google Cloud is helping you fast-track the deployment of ML in production.

*GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
Source: Google Cloud Platform

TELUS: Solving for workers’ safety with edge computing and 5G

Editor's note: In February 2021, Google Cloud and TELUS announced a 10-year strategic alliance to drive innovation of new services and solutions across data analytics, machine learning, and go-to-market strategies that support digital transformation within key industries, including communications technology, healthcare, agriculture, and connected home. By December 2021, TELUS had completed a pilot for a use case that leveraged Google Cloud AI and Machine Learning solutions and Telco Edge Anthos to increase safety in the workplace and save lives in manufacturing facilities. The use case leverages Multi-Access Edge Computing (MEC) to move the processing and management of traffic from a centralized cloud to the edge of TELUS' 5G network, making it possible to deploy applications and process content closer to its customers, and thus yielding several benefits including better performance, security, and customization. Today, we invite Samer Geissah, Head of Technology Strategy and Architecture at TELUS, to share how the company is delivering on its promise to use this technology to drive meaningful change, starting with workers' well-being.

Whenever a new technology buzzword comes along I think: what problems does this solve, and for whom is this going to make a real difference? That's because at TELUS, we see innovation as a means to act on our social purpose to drive meaningful change, from modernizing healthcare and making our food supply more sustainable, to reducing our environmental footprint and connecting Canadians in need. Multi-Access Edge Computing (MEC) is a buzzword that offers an opportunity to do just this. That's why we want to leverage cloud capabilities and optimize our network's edge computing potential, tapping into our award-winning high-speed 5G connectivity to help solve some of industry's most complex challenges.

The reason why this presents such a great opportunity is that companies across industries still rely on maintenance-heavy on-premises systems to manage core computing tasks. But with cloud capabilities delivered at the edge of our 5G network, we open a new world of possibilities for them. For example, manufacturers who currently rely on IoT-enabled equipment in their facilities can deliver new experiences by running advanced AI-based visual inspections directly from 5G-enabled devices, all without the need for local processing power or extra on-site space. In fact, it's this example that inspired our new use case, where our Connected Worker Safety solution can be applied across a range of business verticals to help improve safety, prevent injury, and save lives, demonstrating how the perfect combination of skilled people and digital technology can make the world a safer place.

Empowering intelligent decision making at the edge

Be it a farm, manufacturing facility, hospital, or a factory floor, workers should be able to work in environments where their health and safety are held as the highest priority. But how can employers ensure that their remote, frontline, and in-office employees are safe and healthy at all times?
We've found the answer by combining Google Cloud AI/ML capabilities and Anthos, as a platform for delivering workloads, with our network's infrastructure. Together with Google Cloud, we have been leveraging the power of MEC and 5G to develop a worker-safety application in our Edmonton data center that enables on-premises video analytics cameras to screen manufacturing facilities and ensure compliance with the safety requirements for operating heavy-duty machinery. The CCTV (closed-circuit television) cameras we used are cost-effective and easier to deploy than RTLS (real-time location services) solutions that detect worker proximity and avoid collisions. This is a positive, proactive step to steadily improve workplace safety. For example, if a worker's hand is close to a drill, that drill press will not bore holes in any surface until the video analytics camera detects that the worker's hand has been removed from the safety zone.

A few milliseconds could make all the difference when you are operating heavy equipment without guards in place. So, to power the solution's predetermined actions with immediate response times, we worked with Accenture and hosted the application on an Anthos on bare metal Google Cloud environment running on TELUS multi-access edge computing. Because all the conditions in our model are programmable, this solution can be replicated at scale across a variety of practical scenarios beyond factory floors. The actions taken in response to the analysis are also programmable, which means companies can use this technology to look at workers' conditions and decide the best course of action to educate, assist, and protect them. All of this is done through a single-pane-of-glass ecosystem, making it easy to customize the solution to meet various business needs.

Meanwhile, leveraging our existing global networks to process data and compute cycles at the edge eliminates the need to transport data to a central location for real-time computation. This means that we can offer this solution to partners while optimizing latency and lowering costs.

Powering blink-of-an-eye communication with Anthos

To put the importance of low latency into perspective, consider that the average duration of a blink of an eye is about 300 milliseconds. From a safety point of view, preventative processes need to be much faster than that. For this use case, our machine learning models running at the edge currently process data in a tenth of the time it takes to blink, and we're aiming to lower that latency further to help build even safer systems.

Our plan is to deploy Anthos clusters on bare metal to our customers across Canada to take advantage of our existing enterprise infrastructure, making it possible for us to run our solution closer to partners and eventually enable just one millisecond of latency. At that point, we'll be able to power new use cases that require near real-time feedback and leave absolutely no room for error. This could include remote surgery, platooning of fleets of autonomous vehicles, and many other cellular vehicle-to-everything (V2X) solutions that require high-speed communication for platform operators to manage remote edge fleets in far-away places.
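Before closing, here is a deliberately simplified, hypothetical sketch of the kind of gating loop described above: frames come from a camera, an edge-hosted model returns detections, and the machine interlock is enabled only when no hand overlaps the configured safety zone. Every name, threshold, and interface here is an illustrative assumption; a real deployment would use hardened interlock hardware and a production vision model, not this loop.

```python
import time

# Hypothetical values for illustration only.
SAFETY_ZONE = (200, 120, 460, 380)   # x_min, y_min, x_max, y_max in pixels
LATENCY_BUDGET_MS = 30               # assumed end-to-end camera-to-decision budget

def hands_in_zone(detections, zone):
    """Return True if any detected hand overlaps the configured safety zone."""
    x_min, y_min, x_max, y_max = zone
    for d in detections:
        if d["label"] == "hand" and not (
            d["x_max"] < x_min or d["x_min"] > x_max or
            d["y_max"] < y_min or d["y_min"] > y_max
        ):
            return True
    return False

def control_loop(camera, model, machine):
    """Poll frames, run edge inference, and gate the machine interlock.

    `camera`, `model`, and `machine` are assumed client objects exposing
    read_frame(), predict(), and set_enabled() respectively.
    """
    while True:
        start = time.monotonic()
        frame = camera.read_frame()
        detections = model.predict(frame)
        machine.set_enabled(not hands_in_zone(detections, SAFETY_ZONE))
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > LATENCY_BUDGET_MS:
            print(f"warning: decision took {elapsed_ms:.1f} ms")
```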
Improving workers' safety while enabling new sources of revenue

Although edge computing and 5G have been around for a while, we believe that use cases like this are only just starting to demonstrate the incredible speed of change and high potential that these models provide. The next step for us is to develop our worker-safety solution and get it to market, making TELUS an early adopter of new 5G solutions at the edge that can help our business and industry partners make workplaces safer. It's a great win to be able to combine efforts with Google Cloud and reduce latency in a context where timing can impact and save lives, and I'm confident that workers' safety is just the beginning of a series of industry challenges that we'll address together.
Source: Google Cloud Platform

Infrastructure Security in Google Cloud

The security of the infrastructure that runs your applications is one of the most important considerations in choosing a cloud vendor. Google Cloud's approach to infrastructure security is unique. Google doesn't rely on any single technology to secure its infrastructure. Rather, it has built security through progressive layers that deliver defense in depth.

Defense in depth at scale

Data center physical security – Google data centers feature layered security with custom-designed electronic access cards, alarms, vehicle access barriers, perimeter fencing, metal detectors, biometrics, and laser beam intrusion detection. They are monitored 24/7 by high-resolution cameras that can detect and track intruders. Only approved employees with specific roles may enter.

Hardware infrastructure – From the physical premises to the purpose-built servers, networking equipment, and custom security chips, to the low-level software stack running on every machine, the entire hardware infrastructure is controlled, secured, and hardened by Google.

Service deployment – Any application binary that runs on Google infrastructure is deployed securely. No trust is assumed between services, and multiple mechanisms are used to establish and maintain trust. Google infrastructure was designed from the start to be multitenant.

Storage services – Data stored on Google's infrastructure is automatically encrypted at rest and distributed for availability and reliability. This helps guard against unauthorized access and service interruptions.

User identity – Identities, users, and services are strongly authenticated. Access to sensitive data is protected by advanced tools like phishing-resistant security keys.

Internet communications – Communications over the internet to Google Cloud services are encrypted in transit. The scale of the infrastructure enables it to absorb many denial-of-service (DoS) attacks, and multiple layers of protection further reduce the risk of any DoS impact.

Operational and device security – Google operations teams develop and deploy infrastructure software using rigorous security practices. They work to detect threats and respond to incidents 24x7x365. Because Google runs on the same infrastructure that is made available to Google Cloud customers, all customers directly benefit from this security operations expertise.

End-to-end provenance and attestation

Google's hardware infrastructure is custom-designed "from chip to chiller" to precisely meet specific requirements, including security. Google servers and software are designed for the sole purpose of providing Google services. These servers are custom built and don't include unnecessary components like video cards or peripheral interconnects that can introduce vulnerabilities. The same goes for software, including low-level software and the server OS, which is a stripped-down, hardened version of Linux. Further, Google designed and included hardware specifically for security. Titan, for example, is a purpose-built chip that establishes a hardware root of trust for both machines and peripherals in cloud infrastructure. Google also built custom network hardware and software to improve performance and security. This all rolls up to Google's custom data center designs, which include multiple layers of physical and logical protection.

Tracking provenance from the bottom of this hardware stack to the top enables Google to control the underpinnings of its security posture.
This helps Google greatly reduce the “vendor in the middle” problem: if a vulnerability is found, steps can be taken immediately to develop and roll out a fix. This level of control results in greatly reduced exposure for both Google Cloud and its customers.

That was a bird’s-eye view of Google Cloud infrastructure security and some of the services that help protect your infrastructure in Google Cloud. For a more in-depth look into this topic, check out the whitepaper. For more #GCPSketchnote, follow the GitHub repo, and for similar cloud content follow me on Twitter @pvergadia and keep an eye on thecloudgirl.dev.
Source: Google Cloud Platform

Even more pi in the sky: Calculating 100 trillion digits of pi on Google Cloud

Records are made to be broken. In 2019, we calculated 31.4 trillion digits of π, a world record at the time. Then, in 2021, scientists at the University of Applied Sciences of the Grisons calculated another 31.4 trillion digits of the constant, bringing the total up to 62.8 trillion decimal places. Today we’re announcing yet another record: 100 trillion digits of π.

This is the second time we’ve used Google Cloud to calculate a record number[1] of digits for the mathematical constant, tripling the number of digits in just three years. This achievement is a testament to how much faster Google Cloud infrastructure gets, year in, year out. The underlying technology that made this possible is Compute Engine, Google Cloud’s secure and customizable compute service, and several of its recent additions and improvements: the Compute Engine N2 machine family, 100 Gbps egress bandwidth, Google Virtual NIC, and balanced Persistent Disks. It’s a long list, but we’ll explain each feature one by one.

Before we dive into the tech, here’s an overview of the job we ran to calculate our 100 trillion digits of π:

Program: y-cruncher v0.7.8, by Alexander J. Yee
Algorithm: Chudnovsky algorithm
Compute node: n2-highmem-128 with 128 vCPUs and 864 GB RAM
Start time: Thu Oct 14 04:45:44 2021 UTC
End time: Mon Mar 21 04:16:52 2022 UTC
Total elapsed time: 157 days, 23 hours, 31 minutes and 7.651 seconds
Total storage size: 663 TB available, 515 TB used
Total I/O: 43.5 PB read, 38.5 PB written, 82 PB total

(Figure: the history of π computation from ancient times through today. The number of known digits grows exponentially, thanks to computers getting exponentially faster.)

Architecture overview

Calculating π is compute-, storage-, and network-intensive. Here’s how we configured our Compute Engine environment for the challenge.

For storage, we estimated the temporary storage required for the calculation to be around 554 TB. The maximum persistent disk capacity that you can attach to a single virtual machine is 257 TB, which is often enough for traditional single-node applications, but not in this case. We therefore designed a cluster of one compute node and 32 storage nodes, for a total of 64 iSCSI block storage targets.

The main compute node is an n2-highmem-128 machine running Debian Linux 11, with 128 vCPUs, 864 GB of memory, and 100 Gbps egress bandwidth support. The higher bandwidth support is a critical requirement for the system because we adopted a network-based shared storage architecture.

Each storage server is an n2-highcpu-16 machine configured with two 10,359 GB zonal balanced persistent disks. The N2 machine series provides balanced price/performance; when configured with 16 vCPUs it provides 32 Gbps of network bandwidth, with an option to use the latest Intel Ice Lake CPU platform, which makes it a good choice for high-performance storage servers.
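As a rough illustration of what provisioning one of these storage nodes involves, here is a minimal gcloud sketch. The zone, the resource names, and the image family are illustrative assumptions; only the machine type, disk type, and disk size come from the description above, and the real cluster was built with Terraform rather than one-off commands.

# Hypothetical sketch: provision one of the 32 storage nodes (names, zone, and image are assumptions)
ZONE=us-central1-a
for i in 1 2; do
  gcloud compute disks create "pi-storage-01-disk${i}" \
      --zone="$ZONE" --type=pd-balanced --size=10359GB
done
gcloud compute instances create pi-storage-01 \
    --zone="$ZONE" --machine-type=n2-highcpu-16 \
    --image-family=debian-11 --image-project=debian-cloud
for i in 1 2; do
  gcloud compute instances attach-disk pi-storage-01 \
      --zone="$ZONE" --disk="pi-storage-01-disk${i}"
done
# Each node then exports its two disks as iSCSI targets to the compute node;
# the iSCSI target configuration itself is omitted here.

Repeating this for all 32 nodes, which is exactly the kind of repetition Terraform handles well, yields the 64 block storage targets described above.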
Automating the solution

We used Terraform to set up and manage the cluster. We also wrote a couple of shell scripts to automate critical tasks such as deleting old snapshots and restarting from snapshots (we didn’t end up needing the latter). The Terraform scripts created OS guest policies to help ensure that the required software packages were installed automatically, and part of the guest OS setup was handled by startup scripts. In this way, we were able to recreate the entire cluster with just a few commands.

We knew the calculation would run for several months, and even a small performance difference could change the runtime by days or possibly weeks. There are also many possible combinations of parameters across the operating system, the infrastructure, and the application itself. Terraform helped us test dozens of different infrastructure options in a short time. We also developed a small program that runs y-cruncher with different parameters and automated a significant portion of the measurement. Overall, the final design for this calculation was about twice as fast as our first design; in other words, the calculation could have taken 300 days instead of 157. The scripts we used are available on GitHub if you want to look at the actual code we used to calculate the 100 trillion digits.

Choosing the right machine type for the job

Compute Engine offers machine types that support compute- and I/O-intensive workloads. The amount of available memory and network bandwidth were the two most important factors, so we selected n2-highmem-128 (Intel Xeon, 128 vCPUs and 864 GB RAM). It satisfied our requirements: a high-performance CPU, large memory, and 100 Gbps egress bandwidth. This VM shape is part of the most popular general-purpose VM family in Google Cloud.

100 Gbps networking

The n2-highmem-128 machine type’s support for up to 100 Gbps of egress throughput was also critical. Back in 2019, when we did our 31.4-trillion-digit calculation, egress throughput was only 16 Gbps, meaning that bandwidth has increased more than sixfold in just three years. This increase was a big factor in making the 100-trillion-digit experiment possible, allowing us to move 82.0 PB of data for the calculation, up from 19.1 PB in 2019. We also changed the network driver from virtio to the new Google Virtual NIC (gVNIC). gVNIC is a device driver that integrates tightly with Google’s Andromeda virtual network stack to help achieve higher throughput and lower latency. It is also a requirement for 100 Gbps egress bandwidth.

Storage design

Our choice of storage was crucial to the success of this cluster in terms of capacity, performance, reliability, cost, and more. Because the dataset doesn’t fit into main memory, the speed of the storage system was the bottleneck of the calculation. We needed a robust, durable storage system that could handle petabytes of data without any loss or corruption while fully utilizing the 100 Gbps bandwidth.

Persistent Disk (PD) is a durable, high-performance storage option for Compute Engine virtual machines. For this job we decided to use balanced PD, a newer type of persistent disk that offers up to 1,200 MB/s read and write throughput and 15-80k IOPS, for about 60% of the cost of SSD PD. This profile is a sweet spot for y-cruncher, which needs high throughput and moderate IOPS. Using Terraform, we tested different combinations of storage node counts, iSCSI targets per node, machine types, and disk sizes. From those tests, we determined that 32 nodes and 64 disks would likely achieve the best performance for this particular workload.

We scheduled backups automatically every two days using a shell script that checks the time since the last snapshots, runs the fstrim command to discard all unused blocks, and runs the gcloud compute disks snapshot command to create PD snapshots. The gcloud command returns quickly, and y-cruncher resumes calculations after a few seconds while the Compute Engine infrastructure copies the data blocks asynchronously in the background, minimizing downtime for the backups.
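The backup routine described above can be sketched as a short shell script. This is a simplified illustration rather than the actual script from our GitHub repository: the disk names and zone are assumptions, the real script also checks how long ago the last snapshots were taken and deletes old ones, and in practice all 64 data disks are snapshotted.

# Hypothetical sketch of the two-day backup step (disk names and zone are assumptions)
ZONE=us-central1-a
DISKS="pi-storage-01-disk1 pi-storage-01-disk2"   # the real script covers all 64 disks

# Discard unused blocks first so the snapshots stay small; this runs on the
# machine where the filesystems are mounted (the compute node in this setup).
sudo fstrim --all --verbose

# Create the snapshots. The command returns quickly and Compute Engine copies
# the blocks asynchronously in the background, so y-cruncher resumes within seconds.
for disk in $DISKS; do
  gcloud compute disks snapshot "$disk" \
      --zone="$ZONE" \
      --snapshot-names="${disk}-$(date +%Y%m%d)"
done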
To store the final results, we attached two 50 TB disks directly to the compute node. Those disks weren’t used until the very last moment, so we didn’t allocate the full capacity until y-cruncher reached the final steps of the calculation, saving four months’ worth of storage costs for 100 TB.

Results

All this fine-tuning and benchmarking got us to the one-hundred-trillionth digit of π: 0. We verified the final numbers with another algorithm (the Bailey–Borwein–Plouffe formula) when the calculation was completed. This verification was the scariest moment of the entire process, because there is no sure way of knowing whether the calculation was successful until it finishes, five months after it began. Happily, the Bailey–Borwein–Plouffe check found that our results were valid. Woo-hoo! Here are the last 100 digits of the result:

4658718895 1242883556 4671544483 9873493812 1206904813
2656719174 5255431487 2142102057 7077336434 3095295560

You can also access the entire sequence of numbers on our demo site.

So what?

You may not need to calculate trillions of decimals of π, but this massive calculation demonstrates how Google Cloud’s flexible infrastructure lets teams around the world push the boundaries of scientific experimentation. It’s also an example of the reliability of our products: the program ran for more than five months without node failures and handled every bit of the 82 PB of disk I/O correctly. The improvements to our infrastructure and products over the last three years made this calculation possible. Running this calculation was great fun, and we hope this blog post has given you some ideas about how to use Google Cloud’s scalable compute, networking, and storage infrastructure for your own high-performance computing workloads. To get started, we’ve created a codelab that walks you through creating a Compute Engine virtual machine and calculating pi on it, step by step. And for more on the history of calculating pi, check out this post on The Keyword. Here’s to breaking the next record!

[1] We are actively working with Guinness World Records to secure their official validation of this feat as a “World Record”, but we couldn’t wait to share it with the world. This record has been reviewed and validated by Alexander J. Yee, the author of y-cruncher.
Source: Google Cloud Platform

Google Cloud supports higher education with Cloud Digital Leader program

College and university faculty can now easily teach cloud literacy and digital transformation with the Cloud Digital Leader track, part of the Google Cloud career readiness program. The new track is available for eligible faculty who are preparing their students for a cloud-first workforce. As part of the track, students will build their cloud literacy and learn the value of Google Cloud in driving digital transformation, while also preparing for the Cloud Digital Leader certification exam. Apply today!

Cloud Digital Leader career readiness track

The Cloud Digital Leader career readiness track is designed to equip eligible faculty with the resources needed to prepare their students for the Cloud Digital Leader certification. This Google Cloud certification requires no previous cloud computing knowledge or hands-on experience. The training path enables students to build cloud literacy and learn how to evaluate the capabilities of Google Cloud in preparation for future job roles.

The curriculum

Faculty members can access this curriculum as part of the Google Cloud career readiness program. Faculty from eligible institutions can apply to lead students through the no-cost program, which provides access to the four-course on-demand training, hands-on practice to supplement the learning, and additional exam prep resources. Students who complete the entire program are eligible to apply for a certification exam discount. The Cloud Digital Leader track is the third program available for classroom use, joining the Associate Cloud Engineer and Data Analyst tracks.

Cloud resources for your classroom

Ready to get started? Apply today to access the Cloud Digital Leader career readiness track for your classroom. Read the eligibility criteria for faculty. You can preview the course content at no cost.
Source: Google Cloud Platform