Mai 2021 - Seite 40 von 53 - Cloud Computing Köln

As you’ve probably noticed by now, our team is all about our customers. Earlier in the year, the New York Times shared how their data analytics team went from staying up until three in the morning trying to keep their legacy system running to relaxing while eating ice cream after theirmigration over to Google Cloud. We also explored the details of why Verizon Media picked BigQuery for scale, performance and cost. And who could forget the awesome story of how the Golden State Warriors transform on-court data into competitive advantage, pulling raw data from AWS into Google Cloud for fast analytics.Leaders in industry show us the way!In April, we highlighted the best practices two incredible organizations are using to turn data into value at an incredible pace. Carrefour, a leading global retailer with over 12,000 stores in over 30 countries, published an outstanding set of best practices on their corporate blogdescribing how Google Cloud fueled the company’s digital transformation. Yann Barraud, the company’s Head of Data Platforms, revealed how they managed to migrate a 700TB data lake to Google Cloud in just a few months without any service interruption—and it’s already scaling again with more than 2TB of new data each day. In addition, the total cost of ownership is lower than before despite serving more than 80 applications and executing 100 million API calls per month.You might also enjoy hearing how the team at Broadcom modernized their data lake with Dataproc, Cloud SQL and Bigtable, migrating around 80 applications with a data pipeline that receives telemetry data from millions of devices around the world. The move increased the company’s enterprise agility and translated to a reduction of 25% in monthly support calls.Watch a quick interview below with the team that made it happen:Broadcom rethinks their cybersecurity data lake with Google CloudBroadcom got rid of their “noisy neighbor” problem for their security analytics team by moving to Google Cloud. This move reduced their data lake support issues by 25%.If you like hearing data analytics success stories, you should check out how online food delivery network Delivery Hero turned to Google BigQuery to improve data accessibility and sharing across 174 datasets and 2.7 petabytes of data.And you’ll love reading this Forbes piece about how Zulily established a set of data-driven basics to guide them to business success. These principles help remind data science and engineering teams that ultimately technology is meant to serve customer needs. If it’s failing to do that—it’s time to question why you’ve got it. One of our final favorite stories from this past month kicked off with the opening day of Major League Baseball’s 2021 season. MLB’s data cloud does more than provide insights that increase viewership and sell jerseys—it’s about bringing fans a richer appreciation for the game with applications like their baseball metrics platform Statcast, which is built on Google Cloud. Statcast uses cameras to collect data on everything from pitch speed to ball trajectories to player poses. This data then gets fed into the Statcast data pipeline in real time and turned into on-screen analytics that announcers use as part of their in-game commentary. Want to get a taste for what that looks like? Check out the video below:Funny baseball moments of 2020 (Statcast style!)Check out some of the funniest moments of the 2020 season, analyzed by Statcast and presented by @Google Cloud!And that’s just a few of the many incredible journeys we witness every month. Join us on May 26th, 2021 for the Data Cloud Summit to hear more about how leading companies like Equifax, PayPal, Rackspace, Keybank, Deutsche Bank, and many more are using Google Cloud to transform their organizations. You’ll also hear the latest updates (and a few surprises) from our data management, data analytics, and business intelligence product teams about where we’re headed in the future. Be sure to save your seat for freenow! The need for speed, intelligence, and engagementIn case you missed it, we also had a great webinar with FaceIT last month. As the world’s biggest independent competitive gaming platform, FaceIT has more than 18 million users that compete in over 20 million game sessions each month. During the webinar, Director of Data & Analytics Maria Laura Scuri talked with us about how her team leveraged BigQuery BI Engine to create better gaming experiences. Here are the main takeaways from our conversation, along with some of the latest innovations from Google Cloud and Looker that customers are using to build better data experiences:Speed is key for succeeding with data.High throughput is critical when it comes to streaming data in real time. We introduced a new streaming API for ingesting data into BigQuery. The BigQuery Storage Write API not only includes stream-level transactions and automatic schema update detection but it also comes with a very cost-effective pricing model of $0.025 per GB with the first 2 TB per month free.Engagement drives rich customer experiences. According to the Mobile Gaming Analysis in 2019, most mobile games only see a 25% retention rate for users after the first day. Machine learning is a game changer for understanding the likelihood of specific users returning to applications or websites. This developer tutorial takes you through how to run propensity models for churn prediction using a BigQuery ML, Firebase, and Google Analytics. Intelligent data services deliver new avenues for enriching data experiences.Enabling business users to easily transform data based on their needs not only reduces load on IT teams, it puts powerful insights right where they need to be to deliver the most value. Our newest solution uses Google Cloud Dataprep to help teams enrich survey data, find new insights, and visualize results with Looker, Data Studio, or another BI tool. BigQuery Pushdown for Trifacta data prep flows allows teams to use intelligent technology to execute transforms natively inside BigQuery, yielding up to 20X faster job executions and significant cost savings. Another exciting announcement from April was our new support for choropleth maps of BigQuery GEOGRAPHY polygons. Now, you can use Data Studio to visualize BigQuery GIS data in a Google Maps-based interface. You can play with it today for free using our BigQuery free trialand any of our public datasets. This quick tutorial will show you how to visualize the affordability of rental properties in Washington state on a map. Give it a spin and let us know what you think!More for your SAP dataWe know that many of you want to do more with SAP data. That’s why we created the SAP Table Batch Source for Cloud Data Fusion, our fully managed, cloud-native data integration service. This new capability allows you to seamlessly integrate data from SAP Business Suite, SAP ERP and S/4HANA with the Google data platform, including BigQuery, Cloud SQL, and Spanner. With the SAP Table Batch Source, you can leverage best-in-class machine learning capabilities and combine SAP data with other datasets. One awesome example is running machine learning on IoT data joined with ERP transactional data to do predictive maintenance, run application to application integration with SAP and Cloud SQL-based applications, fraud detection, spend analytics, demand forecasting, and more.For more details about the benefits of the SAP Table Batch Source in Cloud Data Fusion, I highly recommend reading the introduction blog post. At Google Cloud, we’re always striving to enable you to do more with data, regardless of where the data is stored and how you’d like to visualize it. And expect more to come in the future—our work is far from done. If you want to hear more about what’s coming next, don’t forget to join us on May 26th, 2021 for the Data Cloud Summit to hear from leading companies about how Google Cloud is helping transform their organizations. I hope to see you there!Related ArticleRead Article
Quelle: Google Cloud Platform

8. Mai 2021

da Agency

Google Cloud and Seagate: Transforming hard-disk drive maintenance with predictive ML

Data centers may be in the midst of a flash revolution, but managing hard disk drives (HDDs) is still paramount. According to IDC, stored data will increase 17.8% by 2024 with HDD as the main storage technology. At Google Cloud, we know first-hand how critical it is to manage HDDs in operations and preemptively identify potential failures. We are responsible for running some of the largest data centers in the world—any misses in identifying these failures at the right time can potentially cause serious outages across our many products and services. In the past, When a disk was flagged for a problem, the main option was to repair the problem on site using software. But this procedure was expensive and time-consuming. It required draining the data from the drive, isolating the drive, running diagnostics, and then re-introducing it to traffic.That’s why we teamed up with Seagate, our HDD original equipment manufacturer (OEM) partner for Google’s data centers, to find a way to predict frequent HDD problems. Together, we developed a machine learning (ML) system, built on top of Google Cloud, to forecast the probability of a recurring failing disk—a disk that fails or has experienced three or more problems in 30 days.Let’s take a peek. Managing disks by the millions is hard workThere are millions of disks deployed in operation that generate terabytes (TBs) of raw telemetry data. This includes billions of rows of hourly SMART(Self-Monitoring, Analysis and Reporting Technology) data and host metadata, such as repair logs, Online Vendor Diagnostics (OVD) or Field Accessible Reliability Metrics (FARM) logs, and manufacturing data about each disk drive.That’s hundreds of parameters and factors that must be tracked and monitored across every single HDD. When you consider the number of drives in an enterprise data center today, it’s practically impossible to monitor all these devices based on human power alone. To help solve this issue, we created a machine learning system to predict HDD health in our data centers.Reducing risk and costs with a predictive maintenance systemOur Google Cloud AI Services team (Professional Services), along with Accenture, helped Seagate build a proof of concept based on the two most common drive types. The ML system was built on the following Google Cloud products and services:Terraform helped us configure our infrastructure and manage resources on Google Cloud.Google internal technologies enabled us to migrate data files to Google Cloud. BigQuery andDataflow allowed us to build highly scalable data pipelines to ingest, load, transform, and store TB of data, including raw HDD health data, features (used for training and prediction), labels, prediction results, and metadata. We built, trained, and deployed our time-series forecasting ML model using:AI Platform Notebooks for experimentationAutoML Tables for ML model experimentation and developmentCustom Transformer-based Tensorflow model trained on Cloud AI Platform. UI views in Data Studio and BigQuery made it easy to share results for executives, managers, and analysts.Composer, Cloud Functions, and our Cloud operations suite provided end-to-end automation and monitoring.In the past, when we flagged a disk problem, the main fix was to repair the disk on site using software. But this procedure was expensive and time-consuming. It required draining the data from the drive, isolating the drive, running diagnostics, and then re-introducing it to traffic.”End-to-end automated MLOps using Google Cloud products from data ingestion to model training, validation and deployment added significant value to the project.” according to Vamsi Paladugu, Director of Data and Analytics at Seagate. Vamsi also added, “Automated implementation of infrastructure as code using Terraform and DevOps processes, aligning with Seagate security policies and flawless execution of the design and setup of the infrastructure is commendable.”Now, when an HDD is flagged for repair, the model takes any data about that disk before repair (i.e. SMART data and OVD logs) and uses it to predict the probability of recurring failures.Data is critical—build a strong data pipeline Making device data useful through infrastructure and advanced analytics tools is a critical component of any predictive maintenance strategy. Every disk has to continuously measure hundreds of different performance and health characteristics that can be used to monitor and predict its future health. To be successful, we needed to build a data pipeline that was both scalable and reliable for both batch and streaming data processes for a variety of different data sources, including:SMART system indicators from storage devices to detect and anticipate imminent hardware failures.Host data, such as notifications about failures, collected from a host system made up of multiple drives. HDD logs (OVD and FARM data) and disk repair logs. Manufacturing data for each drive, such as model type and batch number. Important note: We do not share user data at any time during this process. With so much raw data, we needed to extract the right features to ensure the accuracy and performance of our ML models. AutoML Tables made this process easy with automatic feature engineering. All we had to do was use our data pipeline to convert the raw data into AutoML input format. BigQuery made it easy to execute simple transformations, such as pivoting rows to columns, joining normalized tables, and defining labels, for petabytes of data in just a few seconds. From there, the data was imported directly into AutoML Tables for training and serving our ML models.Choosing the right approach — two models put to the testOnce we had our pipeline, it was time to build our model. We pursued two approaches to build our time-series forecasting model: an AutoML Tables classifier and a custom deep Transformer-based model.The AutoML model extracted different aggregates of time-series features, such as the minimum, maximum, and average read error rates. These were then concatenated with features that were not time-series, such as drive model type. We used a time-based split to create our training, validation, and testing subsets. AutoML Tables makes it easy to import the data, generate statistics, train different models, tune hyperparameter configurations, and deliver model performance metrics. It also offers an API to easily perform and batch online predictions. For comparison, we created a custom Transformer-based model from scratch using Tensorflow. The Transformer model didn’t require feature engineering or creating feature aggregates. Instead, raw time series data was fed directly into the model and positional encoding was used to track the relative order. Features that were not time-series were fed into a deep neural network (DNN). Outputs from both the model and the DNN were then concatenated and a sigmoid layer was used to predict the label. So, which model worked better? The AutoML model generated better results, outperforming the custom transformer model or statistical model system. After we deployed the model, we stored our forecasts in our database and compared the predictions with actual drive repair logs after 30 days. Our AutoML model achieved a precision of 98% with a recall of 35% compared to precision of 70-80% and recall of 20-25% from custom ML model). We were also able to explain the mode by identifying the top reasons behind the recurring failures and enabling ground teams to take proactive actions to reduce failures in operations before they happened.Our top takeaway: MLOps is the key to successful productionThe final ingredient to ensure you can deploy robust, repeatable machine learning pipelines is MLOps. Google Cloud offers multiple options to help you implement MLOps, using automation to support an end-to-end lifecycle that can add significant value to your projects. For this project, we used Terraform to define and provision our infrastructure and GitLab for source control versioning and CI/CD pipeline implementation.Our repository contains two branches for development and production, which corresponds to an environment in Google Cloud. Here is our high-level system design of the model pipeline for training and serving:Click to enlargeWe used Cloud Composer, our fully managed workflow orchestration service, to orchestrate all the data, training, and serving pipelines we mentioned above. After an ML engineer has evaluated the performance-trained model, they can trigger an activation pipeline that promotes the model to production by simply appending an entry in a metadata table.”Google’s MLOps environment allowed us to create a seamless soup-to-nuts experience, from data ingestion all the way to easy to monitor executive dashboards.” said Elias Glavinas, Seagate’s Director of Quality Data Analytics, Tools & Automation. Elias also noted, “AutoML Tables, specifically, proved to be a substantial time and resource saver on the data science side, offering auto feature engineering and hyperparameter tuning, with model prediction results that matched or exceeded our data scientists’ manual efforts. Add to that the capability for easy and automated model retraining and deployment, and this turned out to be a very successful project.”What’s coming nextThe business case for using an ML-based system to predict HDD failure is only getting stronger. When engineers have a larger window to identify failing disks, not only can they reduce costs but they can also prevent problems before they impact end users. We already have plans to expand the system to support all Seagate drives—and we can’t wait to see how this will benefit our OEMs and our customers! AcknowledgementsWe’d like to give thanks to Anuradha Bajpai, Kingsley Madikaegbu, and Prathap Parvathareddy for implementing the GCP infrastructure and building critical data ingestion segments. We’d like to give special thanks to Chris Donaghue, Karl Smayling, Kaushal Upadhyaya, Michael McElarney, Priya Bajaj, Radha Ramachandran, Rahul Parashar, Sheldon Logan, Timothy Ma and Tony Oliveri for their support and guidance throughout the project. We are grateful to Seagate team (Ed Yasutake, Alan Tsang, John Sosa-Trustham, Kathryn Plath and Michael Renella) and our partner team from Accenture (Aaron Little, Divya Monisha, Karol Stuart, Olufemi Adebiyi, Patrizio Guagliardo, Sneha Soni, Suresh Vadali, Venkatesh Rao and Vivian Li) who partnered with us in delivering this successful project.Related ArticleKey requirements for an MLOps foundationMLOps aims to unify ML system development with ML system operations and these Google Cloud tools help.Read Article
Quelle: Google Cloud Platform

8. Mai 2021

da Agency

SRE fundamentals 2021: SLIs vs. SLAs. vs SLOs

A big part of ensuring the availability of your applications is establishing and monitoring service-level metrics—something that our Site Reliability Engineering (SRE) team does every day here at Google Cloud. The end goal of our SRE principles is to improve services and in turn the user experience.The concept of SRE starts with the idea that metrics should be closely tied to business objectives. In addition to business-level SLAs, we also use SLOs and SLIs in SRE planning and practice. Defining the terms of site reliability engineeringThese tools aren’t just useful abstractions. Without them, you won’t know if your system is reliable, available, or even useful. If the tools don’t tie back to your business objectives, then you’ll be missing data on whether your choices are helping or hurting your business.As a refresher, here’s a look at SLOs, SLAs, and SLIS, as discussed by our Customer Reliability Engineering team in their blog post, SLOs, SLIs, SLAs, oh my – CRE life lessons.1. Service-Level Objective (SLO)SRE begins with the idea that availability is a prerequisite for success. An unavailable system can’t perform its function and will fail by default. Availability, in SRE terms, defines whether a system is able to fulfill its intended function at a point in time. In addition to its use as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future.When we set out to define the terms of SRE, we wanted to set a precise numerical target for system availability. We term this target the availability Service-Level Objective (SLO) of our system. Any future discussion about whether the system is running reliably and if any design or architectural changes to it are needed must be framed in terms of our system continuing to meet this SLO.Keep in mind that the more reliable the service, the more it costs to operate. Define the lowest level of reliability that is acceptable for users of each service, then state that as your SLO. Every service should have an availability SLO—without it, your team and your stakeholders can’t make principled judgments about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development). Excessive availability has become the expectation, which can lead to problems. Don’t make your system overly reliable if the user experience doesn’t necessitate it, and especially if you don’t intend to commit to always reaching that level. You can learn more about this by participating in The Art of SLOs training. Within Google Cloud, we implement periodic downtime in some services to prevent a service from being overly available. You could also try experimenting with occasional planned-downtime exercises with front-end servers, as we did with one of our internal systems. We found that these exercises can uncover services that are using those servers inappropriately. With that information, you can then move workloads to a more suitable place and keep servers at the right availability level.2. Service-Level Agreement (SLA)At Google Cloud, we distinguish between an SLO and a Service-Level Agreement (SLA). An SLA normally involves a promise to a service user that the service availability SLO should meet a certain level over a certain period. Failing to do so then results in some kind of penalty. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. Going out of SLO will hurt the service team, so they will push hard to stay within SLO. If you’re charging your customers money, you’ll probably need an SLA.Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the availability SLO in the SLA is normally a looser objective than the internal availability SLO. This might be expressed in availability numbers: for instance, an availability SLO of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the internal SLO.If you have an SLO in your SLA that is different from your internal SLO (as it almost always is), it’s important for your monitoring to explicitly measure SLO compliance. You want to be able to view your system’s availability over the SLA calendar period, and quickly see if it appears to be in danger of going out of SLO. You’ll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (described in the SLA) to paying customers, we need to measure queries received from them separately from other queries. This is another benefit of establishing an SLA—it’s an unambiguous way to prioritize traffic.When you define your SLA’s availability SLO, be careful about which queries you count as legitimate. For example, if a customer goes over quota because they released a buggy version of their mobile client, you may consider excluding all “out of quota” response codes from your SLA accounting.3. Service-Level Indicator (SLI)Our Service-Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of our system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as by running a second instance of the service in a different city and load-balancing between the two. If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries as your SLIs.If you’re building a system from scratch, make sure that SLIs and SLOs are part of your system requirements. If you already have a production system but don’t have them clearly defined, then that’s your highest priority work.Cloud Monitoring provides predefined dashboards for the Google Cloud services that you use. These dashboards require no setup or configuration effort. Learn how to set SLOs in Cloud Monitoring here.Learn more about these concepts in our practical guide to setting SLOs, and make use of our shared training materials to teach others in your organization.Related ArticleMeeting reliability challenges with SRE principlesFollowing SRE principles can help you build reliable production systems. When getting started, you may encounter three common challenges….Read Article
Quelle: Google Cloud Platform