Developing high-quality ML solutions

When a deployed ML model produces poor predictions, the cause can be any of a wide range of problems. It can be the result of bugs that are typical in any program, but it can also be the result of ML-specific problems: perhaps data skews and anomalies are causing model performance to degrade over time, or the data format is inconsistent between the model's native interface and the serving API. If models aren't monitored, they can fail silently. When a model is embedded into an application, issues like these create poor user experiences. If the model is part of an internal process, they can negatively impact business decision-making.

Software engineering has many processes, tools, and practices to ensure software quality, all of which help make sure that software works in production as intended. These include software testing, verification and validation, and logging and monitoring. In ML systems, the tasks of building, deploying, and operating the systems present additional challenges that require additional processes and practices. Not only are ML systems particularly data-dependent because they inform decision-making from data automatically, but they're also dual training-serving systems, and this duality can result in training-serving skew. ML systems are also prone to staleness in automated decision-making. These additional challenges mean that you need different kinds of testing and monitoring for ML models and systems than you do for other software systems: during development, during deployment, and in production.

Based on our work with customers, we've created a comprehensive collection of guidelines for each process in the MLOps lifecycle. The guidelines cover how to assess, ensure, and control the quality of your ML solutions, and we've published the complete set on the Google Cloud site. To give you an idea of what you can learn, here's a summary of what the guidelines cover:

- Model development: Building an effective ML model for the task at hand by applying relevant data preprocessing, model evaluation, and model testing and debugging techniques.
- Training pipeline deployment: Implementing a CI/CD routine that automates the unit tests for model functions and the integration tests of the training pipeline components, and applying an appropriate progressive delivery strategy for deploying the training pipeline.
- Continuous training: Extending your automated training workflows with steps that validate the new input data for training and the new output model that's produced after training, and tracking the metadata and artifacts generated during the training process (a minimal sketch of such a model-validation gate follows this summary).
- Model deployment: Implementing a CI/CD routine that automates validating the compatibility of the model and its dependencies with the target deployment infrastructure, testing the deployed model service, and applying progressive delivery and online experimentation strategies to decide on a model's effectiveness.
- Model serving: Monitoring the deployed model throughout its prediction-serving lifetime to check for performance degradation and dataset drift, along with monitoring the efficiency of the model service.
- Model governance: Setting model quality standards, implementing procedures and workflows to review and approve models for production deployment, and managing the deployed model in production.

To read the full list of our recommendations, read the document Guidelines for developing high-quality ML solutions.

Acknowledgements: Thanks to Jarek Kazmierczak, Renato Leite, Lak Lakshmanan, and Etsuji Nakai for their valuable contributions to the guide.
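To make the continuous-training guideline concrete, here is a minimal, illustrative model-validation gate of the kind such a workflow might run before promoting a newly trained model. The file names, the metric, and the promotion rule are all hypothetical; a real pipeline would substitute its own evaluation artifacts and thresholds.

# Hypothetical CI step: compare a candidate model's evaluation metric against
# the current production model, and block promotion on regression.
NEW_AUC=$(jq -r '.auc' candidate_model_eval.json)
BASE_AUC=$(jq -r '.auc' production_model_eval.json)

if awk -v n="$NEW_AUC" -v b="$BASE_AUC" 'BEGIN { exit !(n >= b) }'; then
  echo "Candidate AUC ${NEW_AUC} >= production AUC ${BASE_AUC}: promote."
else
  echo "Candidate AUC ${NEW_AUC} < production AUC ${BASE_AUC}: block deployment."
  exit 1
fi

A real gate would typically also validate the input data schema and distributions before training even starts, as the guidelines describe.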
Source: Google Cloud Platform

Build a data mesh on Google Cloud with Dataplex, now generally available

Democratizing data insights and accelerating data-driven decision making is a top priority for most enterprises seeking to build a data cloud. This often requires building a self-serve data platform that can span data silos and enable at-scale usage and application of data to drive meaningful business insights. Organizations today need the ability to distribute ownership of data across the teams that have the most business context, while ensuring that data lifecycle management and governance are consistently applied across their distributed data landscape.

Today we are excited to announce the general availability of Dataplex, an intelligent data fabric that enables you to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, and make this data securely accessible to a variety of analytics and data science tools.

With Dataplex, enterprises can easily delegate ownership, usage, and sharing of data to data owners who have the right business context, while still having a single pane of glass to consistently monitor and govern data across the various data domains in their organization. With built-in data intelligence, Dataplex automates data discovery, data lifecycle management, and data quality, enabling data productivity and accelerating analytics agility.

Here is what some of our customers have to say:

"We have PBs of data stored in GCS and BigQuery in GCP, accessed by 1000s of internal users daily," said Saral Jain, Director of Engineering, Snap Inc. "Dataplex enables us to deliver a business domain specific, self-service data platform across distributed data, with de-centralized data ownership but centralized governance and visibility. It significantly reduces the manual toil involved in data management, and automatically makes this data queryable via both BigQuery and open source applications. We are very excited to adopt Dataplex as a central component for building a unified data mesh across our analytics data."

"As the central data team at Deutsche Bank, we are building a data mesh to standardize data discovery, access control and data quality across the distributed domains," said Balaji Maragalla, Director Big Data Platform at Deutsche Bank. "To help us on this journey, we are excited to use Dataplex to enable centralized governance for our distributed data. Dataplex formalizes our data mesh vision and gives us the right set of controls for cross-domain data organization, data security, and data quality."

"As one of the largest entertainment companies in Japan, we generate TBs of data every day and use it to make business critical decisions," said Iwao-san, Director of Data Analytics at DeNA. "While we manage each product independently as a separate domain, we want to centralize governance of data across our products. Dataplex enables us to effectively manage and standardize data quality, data security, and data privacy for data across these domains. We are looking forward to building trust in our data with Google Cloud's Dataplex."

One of the key use cases that Dataplex enables is a data mesh architecture. Let's take a closer look at how you can use Dataplex as the data fabric that enables a data mesh.

What is a Data Mesh?

With enterprise data becoming more diverse and distributed, and the number of tools and users that need access to this data growing, organizations are moving away from monolithic data architectures that are domain agnostic.
While monolithic, centrally managed architectures create data bottlenecks and impact analytics agility, a completely decentralized architecture, in which business domains maintain their own purpose-built data lakes, has its own pitfalls: it results in data duplication and silos, making governance of the data impossible. Per Gartner, "Through 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance."

The data mesh architecture, first proposed in this paper by Zhamak Dehghani, describes a modern data stack that moves away from a monolithic data lake or data warehouse architecture to a distributed, domain-specific architecture. It enables autonomy of data ownership and provides agility through decentralized, domain-aware data management, while preserving the ability to centrally govern and monitor data across domains. To learn more, refer to the Build a Modern Distributed Data Mesh whitepaper.

How to make Data Mesh real with Google Cloud

Dataplex provides a data management platform to easily build independent data domains within a data mesh that spans your organization, while still maintaining central controls for governing and monitoring data across domains.

"Dataplex is embodying the principles of Data Mesh as we have envisioned in Adeo. Having a first party, cloud-native, product to architect a Data Mesh in GCP is crucial for effective data sharing and data quality amongst teams. Dataplex streamlines productivity, allowing teams to build data domains and orchestrate data curation across the enterprise. I only wish we had Dataplex three years ago." – Alexandre Cote, Product Leader with ADEO

Imagine your organization is organized into several such data domains. With Dataplex, you can logically organize your data and related artifacts, such as code, notebooks, and logs, into a Dataplex Lake, which represents a data domain. You can model all the data in a particular domain as a set of Dataplex Assets within a lake without physically moving data or storing it in a single storage system. Assets can refer to Cloud Storage buckets and BigQuery datasets stored in multiple Google Cloud projects, and can cover both analytics and operational data, structured and unstructured, that logically belongs to a single domain. Dataplex Zones enable you to group assets and add structure that captures key aspects of your data: its readiness, the workloads it is associated with, or the data products it is serving.

The lakes and data zones in Dataplex enable you to unify distributed data and organize it based on business context. This forms the foundation for managing metadata, setting up governance policies, monitoring data quality, and so on, giving you the ability to manage your distributed data at scale. Now let's take a look at one of the domains in a little more detail.

Automatically discover metadata across data sources: Dataplex provides metadata management and cataloging that enables all members of the domain to easily search, browse, and discover tables and filesets, as well as augment them with business and domain-specific semantics. Once data is added as assets, Dataplex automatically extracts the associated metadata and keeps it up to date as the data evolves.
This metadata is made available for search, discovery, and enrichment via integration with Data Catalog.

Enable interoperability of tools: The metadata curated by Dataplex is automatically made available as runtime metadata to power federated open source analytics via Apache Spark SQL, HiveQL, Presto, and so on. Compatible metadata is also automatically published as external tables in BigQuery to enable federated analytics via BigQuery.

Govern data at scale: Dataplex enables data administrators and stewards to consistently and scalably manage the IAM policies that control access to distributed data. It provides the ability to centrally govern data across domains while enabling autonomous, delegated ownership of data, and to manage reader/writer permissions on the domains and the underlying physical storage resources. Dataplex integrates with Stackdriver to provide observability, including audit logs, data metrics, and logs.

Enable access to high quality data: Dataplex provides built-in data quality rules that can automatically surface issues in your data. You can run these rules as data quality tasks across your data in BigQuery and GCS.

One-click data exploration: Dataplex gives data engineers, data scientists, and data analysts a built-in, self-serve, serverless data exploration experience to interactively explore data and metadata, iteratively develop scripts, and deploy and monitor data management workloads. It provides content management across SQL scripts and Jupyter notebooks that makes it easy to create domain-specific code artifacts and share or schedule them from the same interface.

Data management: You can also leverage built-in data management tasks that address common needs such as tiering, archiving, or refining data. Dataplex integrates with Google Cloud's native data tools, such as Dataproc Serverless, Dataflow, Data Fusion, and BigQuery, to provide an integrated data management platform.

Bringing together data, metadata, policies, code, interactive and production analytics infrastructure, and data monitoring, Dataplex delivers on the core value proposition of a data mesh: data as the product.

"Consistent data management and governance of distributed data remains a top priority for most of our clients today. Dataplex enables a business-centric data mesh architecture and significantly lowers the administrative overhead associated with managing, monitoring, and governing distributed data. We are excited to collaborate with the Dataplex team to enable enterprise clients to be more data-driven and accelerate their digital transformation journeys." – Navin Warerkar, Managing Director, Deloitte Consulting LLP, and US Google Cloud Data & Analytics GTM Leader

Next steps

Get started with Dataplex today by using this quickstart guide or this data mesh tutorial, or contact the Google Cloud sales team.
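To give a concrete feel for the lake/zone/asset model described above, here is a minimal command-line sketch. The resource names are illustrative, and the flags reflect our reading of the gcloud dataplex reference at the time of writing, so treat this as a starting point rather than a canonical recipe:

# Create a lake to represent a data domain
$ gcloud dataplex lakes create sales-domain --location=us-central1

# Add a zone that groups assets by readiness (RAW or CURATED)
$ gcloud dataplex zones create sales-raw --lake=sales-domain \
    --location=us-central1 --type=RAW --resource-location-type=SINGLE_REGION

# Attach an existing Cloud Storage bucket as an asset, without moving the data
$ gcloud dataplex assets create sales-events --lake=sales-domain --zone=sales-raw \
    --location=us-central1 --resource-type=STORAGE_BUCKET \
    --resource-name=projects/my-project/buckets/sales-events-bucket

A BigQuery dataset can be attached the same way with --resource-type=BIGQUERY_DATASET, which is what lets a single domain span both object storage and warehouse data.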
Source: Google Cloud Platform

Google data experts share top data practitioner skills needed in 2022

It’s 2022, and nanosatellites, NFTs, and autonomous cars that deliver your pizza are in full force. In a world where people rely on simple technology to untangle complex problems, companies must deliver simple experiences to be successful in today’s landscape. For many cloud providers this means enabling tightly integrated data offerings that simplify the data delivery process without losing sight of the sophisticated needs of the modern data consumer. But while the name of the game is helping companies reach informed decisions from their data simpler and faster, what about the data practitioners (data analysts, data engineers, database administrators, developers, and so on) who use these cloud data tools and technologies every day? To proactively stay ahead of data cloud market trends in 2022, should data practitioners invest their time in specializing their data cloud skill sets (say, going deep in data pipelining) or in generalizing them (growing proficiencies across a mix of data analytics, databases, AI/ML, and other domains)?

Skill deep or wide with data – that is the question

For Abdul Razack, VP, Solutions Engineering, Technology Solutions and Strategy at Google Cloud, the answer is a bit of both. “Data practitioners need to be broad in terms of their technology skills, but specialized with respect to the domain or domains in which they apply them. The reason why is because many things that used to be separate skill sets are now converging – like business analytics, streaming, machine learning, data pipelines, and data warehousing. Data practitioners need to be able to implement end-to-end workflows that solve specific business problems using skills from each category.”

It’s true: thousands of customers are choosing Google’s data cloud because it offers a unified and open approach that enables their practitioners to break down silos, begin and end projects without leaving the data platform, and innovate faster across their organization. Data practitioners who mirror this frame of mind, staying smart and agile across data domains in their skilling and learning, will reap the benefits of solving more nuanced problems (building internet-scale applications, fine-tuning smart processes with analytics and AI, constructing data meshes that make product building simple) at a larger scale than they would if they specialized in just one or two areas alone.

“Of course at the end of the day it depends on what tools a data practitioner is using to complete their workflows. There’s only so much you can learn and skills you can develop when you’re using limited tools. Growing data proficiencies across the board is made a lot easier when you’re using a data platform like BigQuery to address all these needs. BigQuery eliminates the choices you have to make – for instance you don’t have to choose between streaming data and data at rest, batch and realtime, or business intelligence and data science.
This freedom gives data professionals a huge advantage when they’re building their skill sets and taking on more complex projects.” – Abdul Razack, VP, Solutions Engineering, Technology Solutions and Strategy, Google Cloud

Knowing your value is half the battle when upskilling

While some experts think technology is the limiting factor in whether you can even go wide or go deep in the first place, others, like Google Cloud’s Head of Data and Analytics Bruno Aziza, contend that it also depends on who you are, who you wish to be, and what investments your company is making to ensure you can become that person.

“If you wish to set yourself up to be a Chief Data Officer, then you’ll want to understand how technologies fit together across your data estate first,” said Aziza. “Only after you feel like you’re the go-to ‘data person’ can you then decide which part of the technology stack you want to double-down on.”

But technology isn’t everything. Aziza notes, “Make sure you focus on the business impact that your data work provides. You want to spend as much time as you can with your business counterparts to understand their business goals and challenges. The Harvard Business Review provides great guidance on how to succeed as a Chief Data Officer.”

Even if you don’t have your sights set on a C-suite role, both Aziza and Razack contend that the number one skill data practitioners should tackle in 2022 is actually a broad and perhaps abstract one: develop and exercise the curiosity to solve problems with a data-driven strategy. That is, today’s data practitioners should always be interested in educating themselves about the industry and continually upskilling in something. And their employers should be equally invested in helping practitioners develop those interests, whether through exposure to learning materials, career conversations, subsidized courses, or incentives attached to pursuing a new certification or skill.

“Every industry is going through a digital transformation and the ability to identify what data to collect, how to prepare the data, and how to derive insights from it is critical. Therefore, the ability to find business challenges and formulate a data-driven approach to address those problems is the most important skill to have.” – Abdul Razack, VP, Solutions Engineering, Technology Solutions and Strategy, Google Cloud

Whether you’re a data engineer, data analyst, citizen data scientist, or data practitioner by any other name, asking more questions and being curious to learn more should be the thing you gravitate towards in those spare moments…

“Be a constant learner. New concepts pop up all the time and you want to be the person who can learn the fastest so you can advance your company’s mission and contribute back to the community. Take the example of the ‘Data Mesh’ I just wrote about in VentureBeat. You’ll find 3 types of attitudes towards this new concept. There are Disciples who encourage continued learning only from the source – like the author of a new book or the creator of a theory. There are Distractors who tell you that new skills, trends, and technologies are fake news. And there are Distorters like vendors who will sell you one easy-fix solution.
But it’s the data practitioner who needs to proceed with caution when interacting with all three types and forge their own path to discovering the truth when they’re learning and building skills. And for better or worse, this comes with trial and error, experimentation, and an eagerness to grow relative to where they began.”

Ready to start data upskilling? Start here.

For those interested in keeping up their data curiosities, check out our Data Journeys video series. Each week Bruno Aziza investigates an authentic customer data journey – from migrating to the cloud or building a data platform to carrying out new data-for-good initiatives. Learn how they did it, their data dos and don’ts, and what’s next for them on their journey. These videos include a flavor of both specializing and broadening your data competencies.

For those interested in going deep, connect with Google’s data community at our upcoming virtual event: Latest Google Cloud data analytics innovations. Register and save your spot now to get your data questions answered live by GCP’s top data leaders and watch demos of our latest products and features, including BigQuery, Dataproc, Dataplex, Dataflow, and more.

If you have any questions or need support along your learning journey, we’re here for you! Sign up to be a Google Cloud Innovator, and join the Google Cloud Data Analytics Community.
Source: Google Cloud Platform

Scaling to new heights with Cloud Memorystore and Envoy

Modern applications need to process large-scale data at millisecond latency to provide experiences like instant gaming leaderboards, fast analysis of streaming data from millions of IoT sensors, or real-time threat detection of malicious websites. In-memory datastores are a critical component in delivering the scale, performance, and availability these modern applications require. Memorystore makes it easy for developers building applications on Google Cloud to leverage the speed and powerful capabilities of the most loved in-memory store: Redis.

Memorystore for Redis Standard Tier instances are a popular choice for applications requiring a highly available Redis instance. Standard Tier provides a failover replica across zones for redundancy and provides fast failover with a 99.9% SLA. In some cases, however, your applications may need to scale beyond the limitations of a single Standard Tier instance. Read replicas allow you to scale to higher read throughput, but your application may require higher write throughput or a larger keyspace as well. In these scenarios, you can partition your cache across multiple independent Memorystore instances, a strategy known as client-side sharding. In this post, we’ll discuss how you can implement your own client-side sharding strategy to scale out with Cloud Memorystore and Envoy.

Architectural Overview

Let’s start by discussing an architecture of GCP native services alongside open-source software which can scale Cloud Memorystore beyond its usual limits. To do this, we’ll be sharding a cache such that the total keyspace is split among multiple otherwise independent Memorystore instances. Sharding can pose challenges to client applications, which must then be rewritten for awareness of the appropriate place to search for a specific key and must be updated whenever the backend scales. However, client-side sharding can be easier to implement and maintain by encapsulating the sharding logic in a proxy, allowing your application and sharding logic to be updated independently. You’ll find a sample architecture below; let’s briefly detail each of the major components.

Memorystore for Redis

Cloud Memorystore for Redis enables GCP users to quickly deploy a managed Redis instance within a GCP project. A single-node Memorystore instance can support a keyspace as large as 300 GB and a maximum network throughput of 16 Gbps. With Standard Tier you get a highly available Redis instance with built-in health checks and fast automatic failover.

Today, we’ll show you how to deploy multiple Standard Tier Cloud Memorystore instances which can be used together to scale beyond the limits of a single instance for an application with increased scale demands. Each individual Memorystore instance will be deployed as a standalone instance that is unaware of the other instances within its shared host project. In this example, you’ll deploy three Standard Tier instances which will be treated as a single unified backend. By using Standard Tier instances instead of self-managed Redis instances on GCE, you get the benefit of:

- Highly available backends: Standard Tier provides high availability without requiring any additional work from you. Enabling high availability on self-managed Redis instances on GCE can add additional complexities and failure points.
- Integrated monitoring: Memorystore is integrated with Cloud Monitoring, so you can easily monitor the individual shards, compared to having to deploy and manage monitoring agents on self-managed instances.

Memtier Benchmark

Memtier Benchmark is a commonly used command-line utility for load generation and benchmarking of key-value databases. You will deploy and use this utility to demonstrate the ability to easily scale to high query volume. Similar benchmarking tools, or your own Redis client application, could be used instead.

Envoy

Envoy is an open-source network proxy designed for service-oriented architectures. Envoy supports many different filters which allow it to handle network traffic from many different software applications and protocols. For this use case, you will deploy Envoy with the Redis filter configured. Rather than connecting directly to Memorystore instances, the Redis clients will connect to the Envoy proxy. By appropriately configuring Envoy, you can take a collection of independent Memorystore instances and define them as a cluster where inbound traffic is load balanced among the individual instances. By leveraging Envoy, you decrease the likelihood of needing a significant application rewrite to leverage more than one Memorystore instance for higher scale. To ensure compatibility with your application, you’ll want to review the list of the Redis commands which Envoy currently supports.

Let’s get started.

Prerequisites

To follow along with this walkthrough, you’ll need a GCP project with permissions to do the following:

- Deploy Cloud Memorystore for Redis instances (permissions)
- Deploy GCE instances with SSH access (permissions)
- Cloud Monitoring viewer access (permissions)
- Access to Cloud Shell or another gcloud-authenticated environment

Deploying the Memorystore Backend

You’ll start by deploying the backend cache which will serve all of your application traffic. As you’re looking to scale beyond the limits of a single node, you’ll deploy a series of Standard Tier instances. From an authenticated Cloud Shell environment, this can be done as follows:

$ for i in {1..3}; do gcloud redis instances create memorystore${i} --size=1 --region=us-central1 --tier=STANDARD --async; done

If you do not already have the Memorystore for Redis API enabled in your project, the command will ask you to enable the API before proceeding. While your Memorystore instances deploy, which typically takes a few minutes, you can move on to the next steps.

Creating a Client and Proxy VM

Next, you need a VM where you can deploy a Redis client and the Envoy proxy. You’ll create a single GCE instance and deploy these two applications as containers on it. This type of deployment is referred to as a “sidecar architecture,” a common Envoy deployment model. Deploying in this fashion adds nearly no network latency, as there is no additional physical network hop. While you are deploying a single vertically scaled client instance here, in practice you’ll likely deploy many clients and proxies, so the steps outlined in the following sections could be used to create a reusable instance template or repurposed for GKE.
You can start by creating the base VM:

$ gcloud compute instances create envoy-memtier-client --zone=us-central1-a --machine-type=e2-highcpu-32 --image-family cos-stable --image-project cos-cloud

We’ve opted for a Container-Optimized OS instance, as you’ll be deploying Envoy and Memtier Benchmark as containers on this instance.

Configure and Deploy the Envoy Proxy

Before deploying the proxy, you need to gather the information needed to configure the Memorystore endpoints: the host IP addresses of the Memorystore instances you created. You can gather these programmatically:

$ for i in {1..3}; do gcloud redis instances describe memorystore${i} --region us-central1 --format=json | jq -r ".host"; done

Copy these IP addresses somewhere easily accessible, as you’ll use them shortly in your Envoy configuration. Next, connect to your newly created VM instance so that you can deploy the Envoy proxy. You can do this easily via SSH in the Google Cloud Console; more details can be found here.

After you have successfully connected to the instance, you’ll create the Envoy configuration. Start by creating a new file named envoy.yaml on the instance with your text editor of choice, and populate it with the Redis proxy configuration, entering the three IP addresses of the instances you created into the endpoint entries near the bottom of the file (a minimal sketch of such a configuration appears at the end of this section). If you chose to create a different number of Memorystore instances, simply add or remove endpoints from the configuration file. Before you move on, take a look at a few important details of the configuration:

- The Redis Proxy filter is configured to support the Redis traffic which you’ll be forwarding to Cloud Memorystore.
- The Envoy proxy listens for client Redis traffic on port 6379.
- MAGLEV is the load balancing policy for the Memorystore instances which make up the client-side sharded cluster. You can learn more about the various types of load balancing available here. Note that scaling the number of Memorystore backends up or down requires rebalancing data and configuration changes which are not covered in this tutorial.

Once you’ve added your Memorystore instance IP addresses, save the file locally on your Container-Optimized OS VM where it can be easily referenced. Now use Docker to pull the official Envoy proxy image and run it with your own configuration:

$ docker run --rm -d -p 8001:8001 -p 6379:6379 -v $(pwd)/envoy.yaml:/envoy.yaml envoyproxy/envoy:v1.21.0 -c /envoy.yaml

Now that Envoy is deployed, you can test it by visiting the admin interface from the container VM:

$ curl -v localhost:8001/stats

If successful, you should see a printout of the various Envoy admin stats in your terminal. Without any traffic yet, these will not be particularly useful, but they allow you to confirm that your container is running and available on the network. If this command does not succeed, check that the Envoy container is running. Common issues include syntax errors within your envoy.yaml, which can be found by running your Envoy container interactively and reading the terminal output.

Deploy and Run Memtier Benchmark

While you’re still SSH’ed into the Container-Optimized OS VM, you will also deploy the Memtier Benchmark utility, which you’ll use to generate artificial Redis traffic. Because Memtier Benchmark populates the cache for you using a series of set commands, you do not need to provide your own dataset.
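The full envoy.yaml is provided in the original post as an embedded file. As a point of reference, a minimal configuration along the lines described above might look like the following sketch; the IP addresses are placeholders for your three Memorystore hosts, and you should verify the schema against the Envoy v3 API reference for your Envoy version:

admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 8001 }
static_resources:
  listeners:
  - name: redis_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 6379 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: memorystore
          settings:
            op_timeout: 5s
          prefix_routes:
            catch_all_route:
              cluster: memorystore
  clusters:
  - name: memorystore
    type: STATIC
    lb_policy: MAGLEV
    load_assignment:
      cluster_name: memorystore
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 10.0.0.2, port_value: 6379 }  # replace with instance 1 host IP
        - endpoint:
            address:
              socket_address: { address: 10.0.0.3, port_value: 6379 }  # replace with instance 2 host IP
        - endpoint:
            address:
              socket_address: { address: 10.0.0.4, port_value: 6379 }  # replace with instance 3 host IP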
With the proxy running and the benchmark utility ready, you can run a series of benchmark tests:

$ for i in {1..15}; do docker run --network="host" --rm -d redislabs/memtier_benchmark:1.3.0 -s 127.0.0.1 -p 6379 --test-time=300 --key-maximum=10000; done

Here are some configuration options of note:

- If you have configured Envoy to listen on another port, specify the appropriate port after the -p flag.
- We have chosen to run the benchmark for a set period of time (5 minutes, specified in seconds) by using the --test-time flag, rather than for a set number of requests, which is the default behavior.
- By default, the utility uses a uniform random pattern for getting and setting keys. You will not modify this, but it can be specified using the --key-pattern flag.
- The utility performs gets and sets based on the minimum and maximum values of the key range as well as the specified key pattern. We decrease the size of this key range by setting the --key-maximum parameter, which ensures a higher cache hit ratio that is more representative of most real-world applications.
- The --ratio flag allows you to modify the set-to-get ratio of commands issued by the utility. By default, the utility issues 10 get commands for every set command. You can easily modify this ratio to better match your workload’s characteristics.
- You can increase the load generated by the utility by increasing the number of threads with the --threads flag and/or the number of clients per thread with the --clients flag. The above command uses the default number of threads (4) and clients (50).

Observe the Redis Traffic

Once you have kicked off the load tests, you can confirm that traffic is being balanced across the individual Memorystore instances via Cloud Monitoring. You can easily set up a custom dashboard that shows the calls per minute for each of the Memorystore instances. Start by navigating to the Cloud Monitoring Dashboards page and clicking “Create Dashboard”. You will see many different types of widgets on the left side of the page which can be dragged onto the canvas on the right. Select a “Line” chart and drag it onto the canvas.

You then need to populate the line chart with data from the Memorystore instances. To do this, configure the chart via “MQL”, which can be selected at the top of the chart configuration pane. The original post provides a ready-made query to paste into your console; a plausible form of such a query is sketched at the end of this section. If you created your Memorystore instances with a different naming convention, or have other Memorystore instances within the same project, you may need to modify the resource.instance_id filter.

Once you’re finished, ensure that your chart is viewing the appropriate time range. You should see a nearly perfect distribution of the client workload across the Memorystore instances, effectively giving you horizontal scalability for demanding workloads. More details on creating and managing custom dashboards can be found here.

As you modify the parameters of your own testing, you’ll also want to keep the performance of the client and proxy in mind. As you vertically scale the number of operations sent by a client, you’ll eventually need to horizontally scale the number of clients and sidecar proxies to scale smoothly. You can view the Cloud Monitoring graphs for GCE instances as well; more details can be found here.
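For reference, an MQL query of the kind described above could take the following shape. This is our sketch, assuming Memorystore’s documented commands/calls metric and the redis_instance monitored-resource type, not the exact query from the original post:

fetch redis_instance
| metric 'redis.googleapis.com/commands/calls'
| filter resource.instance_id =~ 'memorystore.*'
| align rate(1m)
| every 1m
| group_by [resource.instance_id], [value_calls: sum(value.calls)]

Plotted as one line per instance_id, the three Memorystore instances should track each other closely if Maglev hashing is spreading the keyspace evenly.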
Clean Up

If you have followed along, you’ll want to spend a few minutes cleaning up resources to avoid accruing unwanted charges. You’ll need to delete the following:

- Any deployed Memorystore instances
- Any deployed GCE instances

Memorystore instances can be deleted like this:

$ gcloud redis instances delete <instance-name> --region=<region>

If you followed the tutorial, you can use a command like:

$ for i in {1..3}; do gcloud redis instances delete memorystore${i} --region=us-central1 --async; done

Note: You’ll need to manually acknowledge the deletion of each instance via the terminal.

The GCE Container-Optimized OS instance can be deleted like this:

$ gcloud compute instances delete <instance-name>

If you created additional instances, you can simply chain them in a single command separated by spaces.

Conclusion

Client-side sharding is one strategy to address high-scale use cases with Cloud Memorystore, and Envoy and its Redis filter make the implementation simple and extensible. The outline provided above is a great place to get started. These instructions can easily be extended to support other client deployment models, including GKE, and can be scaled out horizontally to reach even higher scale. As always, you can learn more about Cloud Memorystore through our documentation or request desired features via our public issue tracker.
Source: Google Cloud Platform

Cloud Spanner myths busted

Intro to Cloud Spanner

Cloud Spanner is an enterprise-grade, globally distributed, externally consistent database that offers unlimited scalability and industry-leading 99.999% availability. It requires no maintenance windows and offers a familiar PostgreSQL interface. It combines the benefits of relational databases with the unmatched scalability and availability of non-relational databases. As organizations modernize and simplify their tech stack, Spanner provides a unique opportunity to transform the way they think about and use databases as part of building new applications and customer experiences.

But choosing a database for your workload can be challenging; there are many options in the market, and each one has a different onboarding and operating experience. At Google Cloud we know it’s hard to navigate this choice and are here to help. In this blog post, I want to bust the seven most common misconceptions that I regularly hear about Spanner so that you can make your decision confidently.

Myth #1: Only use Spanner if you have a massive workload

The truth is that Spanner powers Google’s most popular, globally available products, like YouTube, Drive, and Gmail, and has enabled many large-scale transformations, including those of Uber, Niantic, and Sharechat. It is also true that Spanner processes more than 1 billion queries per second at peak.

At the same time, many customers also use Spanner for their smaller workloads (both in terms of transactions per second and storage size) for availability and scalability reasons. For example, Google Password Manager has small workloads that run on Spanner. These customers cannot tolerate downtime, require high availability to power their applications, and seek scale insurance for future growth. Limitless scalability with the highest availability is critical in many industry verticals such as gaming and retail, especially when a newly launched game goes viral and becomes an overnight success, or when a retailer has to handle a sudden surge in traffic during a Black Friday/Cyber Monday sale. Regardless of workload size, every customer on the journey to the cloud wants the benefits of scalability and availability while reducing the operational burden and the costs associated with patching, upgrades, and other maintenance.

Myth #2: Spanner is too expensive

The truth is, when looking at the cost of a database, it is better to consider Total Cost of Ownership (TCO) and the value it offers rather than the raw list price. We deliver significant value to our customers starting at the list price, including:

- Availability: Spanner provides high availability and reliability by synchronously replicating data. For disaster recovery, Spanner offers an RPO and RTO of zero for zonal failures on regional instances, and for regional failures on multi-regional instances. Less downtime, more revenue!
- Price-performance: Spanner offers one of the industry’s leading price-performance ratios, which makes it a great choice if you are running a demanding, performance-sensitive application. Great customer experiences require consistent, optimal latencies!
- Reduced operational cost: With Spanner, customers enjoy zero-downtime upgrades and schema changes, and no maintenance windows. Sharding is handled automatically, so the challenges associated with scaling up traditional databases don’t exist. Spend more time innovating, and less time administering!
- Security and compliance: By default, Spanner offers encryption for data in transit via its client libraries and for data at rest using Google-managed encryption keys. CMEK support for Spanner lets you take complete control of the encryption keys. Spanner also provides VPC Service Controls support and has the compliance certifications and approvals needed for workloads requiring ISO 27001, 27017, 27018, PCI DSS, SOC 1/2/3, HIPAA, and FedRAMP.

With Spanner, you have peace of mind knowing that your data’s security, availability, and reliability won’t be compromised. And best of all, with the introduction of granular instance sizing, you can now get started for as little as $65/month and unlock the tremendous value Spanner offers.

Pro tip: Use the autoscaler to right-size your Spanner instances, and take advantage of TTL to reduce the amount of data stored.

Myth #3: You have to make a trade-off between scale, consistency, and latency

The truth is, depending on the use case and instance configuration, you can use Spanner such that you don’t have to pick between consistency, latency, and scale.

To provide strong data consistency, Spanner uses a synchronous, Paxos-based replication scheme in which replicas acknowledge every write request. A write is committed when a majority of the replicas (e.g., 2 out of 3), called a quorum, agree to commit it. For regional instances, the replicas are within the region, so writes are faster than for multi-region instances, where the replicas are distributed across multiple regions. In the latter case, forming a quorum on writes can result in slightly higher latency. Nevertheless, Spanner multi-regions are carefully designed in geographical configurations that ensure the replicas can communicate quickly and write latencies stay acceptably low.

A read can be served strong (the default) or stale. A strong read is a read at the current timestamp, and the serving replica guarantees that you will see all data committed up until the start of the read. In some cases, this means the serving replica has to contact the leader to ensure it has the latest data; in a multi-region instance where the read is served from a non-leader replica, read latency can therefore be slightly higher than if it were served from the leader region. A stale read is executed at a timestamp in the past, and can therefore be served at very low latencies by the closest replica that is caught up to that timestamp. If your application is latency-sensitive, stale reads may be a good option, and we recommend a staleness value of 15 seconds.

Myth #4: Spanner does not have a familiar interface

The truth is that Spanner offers the flexibility to interact with the database via a SQL dialect based on the ANSI 2011 standard, as well as via REST or gRPC API interfaces that are optimized for performance and ease of use. In addition, we recently introduced a PostgreSQL interface for Spanner that leverages the ubiquity of PostgreSQL to meet development teams at an interface they are familiar with.
The PostgreSQL interface provides a rich subset of the open-source PostgreSQL SQL dialect, including common query syntax, functions, and operators. It supports a core collection of open-source PostgreSQL data types, DDL syntax, and information schema views. You get PostgreSQL familiarity and relational semantics at Spanner scale. Learn more about our PostgreSQL interface here.

Myth #5: The only way to get observability data is via the Spanner console

The truth is that the Spanner client libraries support OpenCensus tracing and metrics, which give insight into client internals and aid in debugging production issues. For instance, client-side traces and metrics include session- and transaction-related information. Spanner also supports the OpenTelemetry receiver, which provides an easy way to process and visualize metrics from Cloud Spanner system tables and export them to the application performance monitoring (APM) tool of your choice. This could be an open-source combination of a time-series database like Prometheus with a Grafana dashboard, or a commercial offering like Splunk, Datadog, Dynatrace, New Relic, or AppDynamics. We’ve also published reference Grafana dashboards so that you can debug the most common user journeys, such as “Why is my tail latency high?” or “Why do I see a CPU spike when my workload did not change?” Here is a sample Docker service showing how the Cloud Spanner receiver can work with the Prometheus exporter and Grafana dashboards.

We continue to embrace open standards and to integrate with our partner ecosystem, and we continue to evolve the observability experience offered by the Google console so that our customers get the best experience wherever they are.

Myth #6: Spanner is only for global workloads requiring copies in multiple regions

The truth is that, while Spanner offers a range of multi-region instance configurations, it also offers a regional configuration in every GCP region. Each regional node is replicated across 3 zones within the region, while a multi-regional node is replicated at least 5 times across multiple regions. A regional configuration offers four nines of availability and protection against zonal failures. Multi-regional instance configurations are typically indicated if your application runs workloads in multiple geographical locations or your business needs 99.999% availability and protection against regional failures. Learn more here.

Myth #7: Spanner schema changes require expensive locks

The truth is that Spanner never takes table-level locks. Spanner uses a multi-version concurrency control architecture to manage concurrent versions of schema and data, allowing ad hoc, online schema changes that do not require any downtime, additional tools, migration pipelines, or complex rollback/backup plans.
When issuing a schema update, you can continue writing to and reading from the database without interruption while Spanner backfills the update, whether your table has 10 rows or 10 billion (a minimal example of such an online schema change follows at the end of this post). The same multi-version mechanism powers point-in-time recovery (PITR): snapshot queries using stale reads can restore both the schema and the state of the data at a given query condition and timestamp, up to a maximum of seven days in the past.

Now that we’ve learned the truth about Cloud Spanner, I invite you to get started – visit our website.
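To illustrate Myth #7, here is what an online schema change can look like from the command line. The instance, database, table, and column names are illustrative; the point is that the statement is issued against a live database with no maintenance window:

# Add a column to a live table; Spanner backfills online, with no table locks
$ gcloud spanner databases ddl update my-database \
    --instance=my-instance \
    --ddl='ALTER TABLE Users ADD COLUMN LastSeen TIMESTAMP'

Reads and writes against the Users table continue uninterrupted while the change is applied.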
Source: Google Cloud Platform

Announcing Google Cloud 2022 Summits [frequently updated]

Register for our 2022 Google Cloud Summit series, and be among the first to learn about new solutions across data, machine learning, collaboration, security, sustainability, and more. You’ll hear from experts, explore customer perspectives, engage with interactive demos, and gain valuable insights to help you accelerate your business transformation. Bookmark the Google Cloud Summit series website to easily find updates as news develops.

Can’t join us for a live broadcast? You can still register to enjoy all summit content, which becomes available for on-demand viewing immediately following each event.

Upcoming events

Data Cloud Summit | April 6, 2022

Mark your calendars for the Google Data Cloud Summit on April 6, 2022. Join us to explore the latest innovations in AI, machine learning, analytics, databases, and more, and learn how organizations are using a simple, unified, open approach with Google Cloud to make smarter decisions and solve their most complex business challenges.

At the event, you will gain insights that can help move you and your organization forward. From our opening keynote to customer spotlights to sessions, you’ll have the chance to uncover up-to-the-minute insights on how to make the most of your data. Equip yourself with the technology, the confidence, and the experience to capitalize on the next wave of data solutions. Register today for the 2022 Google Data Cloud Summit.
Source: Google Cloud Platform

Strengthen protection for your GCE VMs with new FIDO security key support

With the release of OpenSSH 8.2 almost two years ago, native support for FIDO authentication became an option in SSH. This meant that you could have your SSH private key protected in a purpose-built security key, rather than storing it locally on a disk where it may be more susceptible to compromise. Building on this capability, today we are excited to announce in public preview that physical security keys can be used to authenticate to Google Compute Engine (GCE) virtual machine (VM) instances that use our OS Login service for SSH management.

These advances in OpenSSH made it easier to protect access to sensitive VMs by setting up FIDO authentication to these hosts and physically protecting the keys used to grant access. And while we’ve seen adoption of this technology, we also know that managing these keys can be challenging, particularly the manual process of generating and storing FIDO keys. Additionally, physical security key lifecycle issues could leave you without access to your SSH host: if you lose or misplace your security key, you could be locked out.

At Google Cloud we’ve been working hard on integrating our industry-first account-level support for FIDO security keys with SSH in a way that makes it simple to get all the benefits of using FIDO security keys for SSH login, without any of the drawbacks. Now, when you enable security key support through OS Login for your GCE VMs, one of your security keys will be required to complete the login process, and any of the security keys configured on your Google account will be accepted during login. If you ever lose a security key, you can simply update your security key configuration (delete the lost key and add a new one), and your VMs will automatically accept the new configuration on the next login.

If desired, OS Login’s FIDO security key support can be combined with 2-Step Verification to add an extra layer of security with two-factor authentication (2FA). When this is enabled, a user is required to both have their security key available and prove authorized access to their Google Account through additional factors at the time of logging in to their GCE instance.

If you’d like to learn more or try this capability out on your own instances, visit our documentation to get started.
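As a quick sketch of what enabling this looks like in practice: the instance name below is illustrative, and the metadata key names reflect the preview documentation as we understand it, so confirm them against the current OS Login docs before relying on them:

# Turn on OS Login with security-key-based SSH for a single instance
$ gcloud compute instances add-metadata my-vm \
    --metadata=enable-oslogin=TRUE,enable-oslogin-sk=TRUE

# Connect; gcloud negotiates SSH using the security keys on your Google account
$ gcloud compute ssh my-vm

The same keys can also be set in project-wide metadata to apply the policy to all VMs in a project.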
Source: Google Cloud Platform

Reduce your cloud carbon footprint with new Active Assist recommendations

Last year, we analyzed the aggregate data from all customers across Google Cloud, and found over 600,000 gross kgCO2e in seemingly idle projects that could be cleaned up or reclaimed, which would have a similar impact to planting almost 10,000 trees[1]. Today, we’re making it easy for you to identify whether any of those idle workloads are yours, with new Active Assist sustainability recommendations.

Active Assist is part of Google Cloud’s AIOps solution that uses data, intelligence, and machine learning to reduce cloud complexity and administrative toil. Under the Active Assist portfolio, we have products and tools like Policy Intelligence, Network Intelligence Center, Predictive Autoscaler, and a collection of recommenders for various Google Cloud services, all focused on helping you achieve your operational goals. Today, we are broadening the scope of Active Assist to help you achieve your sustainability targets and reduce the carbon footprint of your workloads.

The carbon emissions associated with your cloud infrastructure can be a big part of your overall environmental footprint. Choosing to run on Google Cloud is a great first step: we’ve matched the energy used by our data centers with 100% renewable energy since 2017, and are committed to running our operations on carbon-free energy 24/7 by 2030. But once you’re running on Google Cloud, if you want to reduce the gross carbon emissions of your workload, you can take action to optimize your usage.

Assessing the gross carbon impact of unattended projects

You can now estimate the gross carbon emissions you’ll save by removing idle projects with Active Assist’s Unattended Project Recommender, which provides rich utilization insights for all the projects in your organization and uses machine learning to identify ones that are idle and most likely unattended. The data points Active Assist surfaces as part of its utilization insights now include the carbonFootprintDailyKgCO2 field, which allows you to estimate the carbon emissions associated with any given project. Recommendations also estimate the impact of removing an idle project in terms of kilograms of CO2 reduced per month. The capability is available via the Recommender API, Recommendation Hub, the Carbon Footprint dashboard, and BigQuery export of recommendations, making it easy to integrate with your company’s existing tools and workflows.

Example unattended project in Recommendation Hub

Introducing the Carbon Sense suite

Increasing the sustainability of digital applications and infrastructure is a priority for 90% of global IT leaders[2], and we’ll continue to invest across a number of product areas in Google Cloud, including AIOps features like Active Assist’s recommendations, to help you make progress towards your sustainability goals. To make it easy for you to find and consume these new features, we’re bundling our existing and future product work into the Carbon Sense suite: a collection of features that makes it easy to accurately report your carbon emissions, and to reduce them. In the Carbon Sense suite, Active Assist joins products like Carbon Footprint, which lets you understand and measure the gross carbon emissions of your Google Cloud usage, and our low-carbon signals, which help users choose cleaner regions to run their workloads. Stay tuned for more updates on Carbon Sense in the coming months.
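If you prefer to pull these insights programmatically rather than through the console flow described next, the Recommender API mentioned above can be queried from the command line. Here is a hedged sketch, assuming the documented insight type ID for Unattended Project Recommender; verify the ID and required permissions against the Recommender documentation:

# List unattended-project insights, including the carbon footprint estimate
$ gcloud recommender insights list \
    --insight-type=google.resourcemanager.projectUtilization.Insight \
    --location=global \
    --project=my-project \
    --format=json

Each returned insight carries utilization details in its content, which is where the carbonFootprintDailyKgCO2 data point described above appears.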
Getting started with sustainability recommendations

To get started with Active Assist sustainability recommendations, check the Carbon Footprint dashboard and Recommendation Hub to review projects that may be idle and assess the carbon emissions associated with them, or see recommendations in the Google Cloud Console. To view the recommendations, you will need IAM permissions for Unattended Project Recommender itself and permissions to view resources in the given organization.

You can also automatically export the recommendations from your organization to BigQuery and then investigate any idle projects with Data Studio or Looker. Or you can use Connected Sheets to interact with the data stored in BigQuery from Google Workspace Sheets without having to write SQL queries.

As with any other recommender, you can choose to opt out of data processing for your organization or your projects at any time by disabling the appropriate data groups in the Transparency & Control tab under Privacy & Security settings.

We hope you use Unattended Project Recommender to reduce the carbon footprint associated with your idle cloud resources, and we can’t wait to hear your feedback and thoughts about this feature! Feel free to reach us at active-assist-feedback@google.com. We also invite you to sign up for our Active Assist Trusted Tester Group if you would like early access to new features as they are developed.

1. https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator
2. https://inthecloud.withgoogle.com/it-leaders-research-21/sustainability-dl-cd.html
Source: Google Cloud Platform

Supercharge your event-driven architecture with new Cloud Functions (2nd gen)

Today, we are introducing Cloud Functions (2nd gen), Google Cloud's next-generation Functions-as-a-Service product. This next generation of Cloud Functions comes with an advanced feature set: more powerful infrastructure, advanced control over performance and scalability, more control over the functions runtime, and triggers from more than 90 event sources. The infrastructure is powered by Google Cloud's cutting-edge serverless and eventing products, Cloud Run and Eventarc.

Infrastructure that meets your workloads' needs

Cloud Functions adds a range of new capabilities for 2nd gen functions:

Longer request processing – Run your 2nd gen cloud functions for up to 60 minutes for HTTP functions, making it easier to run longer-request workloads such as processing large streams of data from Cloud Storage or BigQuery.
Larger instances – Leverage up to 16 GB of RAM and 4 vCPUs on 2nd gen cloud functions, allowing larger in-memory, compute-intensive, and more parallel workloads.
Concurrency – Handle up to 1,000 concurrent requests with a single function instance, minimizing cold starts and improving latency and cost when scaling.
Minimum instances – Keep pre-warmed instances ready to cut cold starts and make sure the bootstrap time of your application does not impact its performance.
Traffic splitting – 2nd gen cloud functions support multiple revisions of your functions, splitting traffic between different revisions, and rolling your function back to a prior version.

Broader event coverage and CloudEvents support

2nd gen cloud functions now include native support for Eventarc, which brings more than 90 event sources from direct sources and Cloud Audit Logs (e.g., BigQuery, Cloud SQL, Cloud Storage, Firebase…). And of course, Cloud Functions still supports events from custom sources published directly to Pub/Sub. These event-driven functions adhere to the industry-standard CloudEvents specification, regardless of the source, to ensure a consistent developer experience.

New developer experience

Cloud Functions features an enhanced UI, a customizable dashboard, an improved developer experience, and accessibility updates. A new seamless onboarding experience makes it easy to quickly develop and deploy your 1st gen and 2nd gen functions in one place. A deployment progress tracker walks through each step of a 2nd gen function deployment and helps you spot the errors associated with each step. The UI also simplifies integration with Eventarc, using new menus and badges to help you find information about your function.

Portability based on OSS buildpacks and Functions Frameworks

2nd gen functions are built using open-source buildpacks and the Functions Frameworks, giving you the portability to run your functions anywhere (a minimal sketch follows at the end of this article).

Check out the new Cloud Functions

We are excited to see what you build with Cloud Functions. You can learn more about Cloud Functions here and get started using Quickstarts: Cloud Functions.
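To make the CloudEvents-based programming model concrete, here is a minimal sketch of an event-driven function written with the open-source Python Functions Framework. The payload fields assume a Cloud Storage trigger and are illustrative, not prescriptive.

    # pip install functions-framework
    import functions_framework

    @functions_framework.cloud_event
    def handle_event(cloud_event):
        # Standard CloudEvents attributes, regardless of the event source.
        print(f"id={cloud_event['id']} source={cloud_event['source']}")
        # The data payload is source-specific; these fields assume a
        # Cloud Storage trigger and are illustrative.
        data = cloud_event.data
        print(f"bucket={data.get('bucket')} object={data.get('name')}")

Deployed as a 2nd gen function (for example, with gcloud functions deploy and the --gen2 flag), the same handler can be wired to any supported Eventarc source.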
Source: Google Cloud Platform

Data modernization with Google Cloud and MongoDB Atlas

What does modernization mean?

As an IT leader or architect, you may notice that your software architecture is encountering performance issues. You may be considering moving your datastore from a mainframe or a traditional relational database (RDBMS) to a more modern database to take advantage of advanced analytics, faster scaling, and opportunities to cut costs. Such is the impetus for modernization.

An approach to modernization can be defined as "an open, cross-functional collaboration dedicated to building new design systems and patterns that support evolving computing capabilities, information formats, and user needs." In that same spirit, MongoDB works with Google Cloud technologies to provide joint solutions and reference architectures that help our customers leverage this partnership.

Principles of modern technology solutions

One point of view on modernization is expressed through four basic principles that focus on outcomes for our customers. These principles can be applied to envision what a modern solution should achieve, or to identify whether a given solution is modern:

Help users get more done. Bring quality information forward and make it actionable in context. Actions are the new blue links.
Feed curiosity. Open doorways to rich, endless discovery. Remove dead ends for users who want to engage more.
Reflect the world, in real time. Surface fresh, dynamic content. Help users be in the know.
Be personal, then personalize. Encourage the user's personal touch to surface personal content and personalized experiences. Be stateful and contextual.

Modern applications should be capable of presenting information in a way that enables users not only to make decisions, but also to transform those decisions into actions. This requires variable data formats and integration mechanisms that allow the end user to interact with various systems and produce real-time results, without needing to log in to each one of them.

MongoDB Atlas, a modern database management system

If we use the four principles of modernization as a reference for identifying modern solutions, MongoDB Atlas reflects them directly. Atlas helps database and infrastructure administrators get more done, faster and with less effort than managing MongoDB on premises. It is a fully managed database service that takes care of the most critical and time-consuming tasks involved in providing a continuous and reliable service, including security and compliance features out of the box, freeing administrators' and developers' time to focus on innovation.

The third principle talks about reflecting the world in real time. This is the most cumbersome and daunting task for anybody responsible for designing a modern technology system, since it requires an architecture capable of receiving, processing, storing, and producing results from data streams originated by different systems, at different velocities, and in different formats. Atlas frees the solution architect from this burden. As a managed service, it takes care of allocating networking, processing, and storage resources, so it scales as needed, when needed. And as a document database, it allows flexibility in the format and organization of incoming data: developers can focus on the actual process rather than spend their time modeling the information to make it fit, as so often happens with traditional relational database schemas (a small sketch of this flexibility follows below).
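As an illustration of that schema flexibility (names and the connection string are placeholders, not an official example), documents with different shapes can coexist in one collection with no migration step:

    # pip install pymongo
    from pymongo import MongoClient

    # Placeholder connection string, database, and collection names.
    client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")
    orders = client["shop"]["orders"]

    # Two documents with different shapes land in the same collection,
    # with no schema migration required.
    orders.insert_one({"order_id": 1, "total": 42.50})
    orders.insert_one({"order_id": 2, "total": 13.00,
                       "coupon": {"code": "SPRING", "discount": 0.10}})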
Atlas also provides real-time data processing features that allow for the execution of code or the consumption of external APIs residing in separate applications, or even in other clouds.

The combination of the first three principles leads to the fourth: personalizing the experience for the end user. Businesses must be able to solve specific user needs, rather than limit their processes to what their database or application is capable of. Putting the user first invariably leads to a better, modern experience, and that starts with choosing the best cloud provider and a database that aligns with these principles.

A reference architecture for data modernization

Let's dive into a general view of the migration reference architecture that enables the four aforementioned principles.

An Operational Data Layer (ODL) is an architectural pattern that centrally integrates and organizes siloed enterprise data, making it available to consuming applications. It enables a range of board-level strategic initiatives such as legacy modernization and Data as a Service, and use cases such as single view, real-time analytics, and mainframe offload. An ODL is an intermediary between existing data sources and the consumers that need to access that data. Deployed in front of legacy systems, an ODL can enable new business initiatives and meet requirements that the existing architecture can't handle, without the difficulty and risk of a full rip-and-replace of legacy systems.

For an initial migration that keeps the current architecture in place while replicating records produced by the production system, the following reference shows components that can provide a point-in-time backup and restore on MongoDB Atlas while also enabling real-time synchronization.

Figure 1. One-time data migration and real-time data sync

The above solution architecture shows general views of both one-time data migration and real-time data synchronization using Google Cloud technologies. A one-time data migration involves an initial bulk ETL of data from the source relational database to MongoDB. Google Cloud Data Fusion can be used along with Apache Sqoop or Spark SQL's JDBC connector powered by Dataproc to extract data from the source and stage it temporarily in Cloud Storage. Custom Spark jobs powered by Dataproc then transform the data and load it into MongoDB Atlas. MongoDB has a native Spark connector that allows a Spark DataFrame to be stored as a collection.

Figure 2. One-time data migration

In most migrations, the source database will not be retired for a few weeks to months. In such cases, MongoDB Atlas needs to stay up to date with the source database. We can use Change Data Capture (CDC) tools like Google Cloud Datastream, or Debezium on Dataflow, to capture the changes, which can then be pushed to a message queue like Pub/Sub. We can write custom transformation jobs using Apache Beam powered by Dataflow, Java, or Python, which consume the data from the message queue, transform it, and push it to MongoDB Atlas using native drivers (a minimal sketch follows below). Google Cloud Composer helps orchestrate all the workflows.

Figure 3. Real-time data synchronization
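As a rough sketch of the CDC consumer described above (the topic, connection URI, and transform step are illustrative placeholders, not a prescribed pipeline), an Apache Beam streaming job might look like this:

    # pip install 'apache-beam[gcp]'
    import json
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    from apache_beam.io.mongodbio import WriteToMongoDB
    from apache_beam.options.pipeline_options import PipelineOptions

    # Run locally for testing; pass Dataflow runner flags for production.
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | ReadFromPubSub(topic="projects/my-project/topics/cdc-events")
         | beam.Map(json.loads)               # decode the change event
         | beam.Map(lambda e: e["payload"])   # illustrative transform step
         | WriteToMongoDB(uri="mongodb+srv://user:pass@cluster0.example.net",
                          db="ods", coll="customers"))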
Common use cases for MongoDB

Below are some common patterns we have observed with MongoDB. (For a more general treatment of more patterns, please check out the MongoDB use case page.)

Monolith to microservices – With its flexible schema and capabilities for redundancy, automation, and scalability, MongoDB (and MongoDB Atlas, its managed service version) is very well suited to microservices architectures. Together, MongoDB Atlas and microservices on Google Cloud can help organizations better align teams, innovate faster, and meet today's demanding development and delivery requirements, with full sharding across regions and globally.
Legacy modernization – Relational databases impose a tax on a business: a Data and Innovation Recurring Tax (DIRT). By modernizing with MongoDB, you can build new business functionality 3-5x faster, scale to millions of users wherever they are on the planet, and cut costs by 70% or more, all by unshackling yourself from legacy systems while taking advantage of the Google Cloud ecosystem.
Mainframe offload – MongoDB can help offload key applications from the mainframe to a modern data platform without impacting your core systems, achieving agility while also reducing costs.
Real-time analytics – MongoDB makes it easy to scale to the needs of real-time analytics with Atlas on Google Cloud; coupled with Google Cloud analytics services such as BigQuery, the sky's the limit.
Mobile application development – MongoDB Realm helps companies build better apps faster, with edge-to-cloud sync and fully managed backend services including triggers, functions, and GraphQL.

Other reference architectures

Below are some reference architectures that can be applied to particular requirements. For more information, visit:

MongoDB Use Cases
Google Cloud Architecture Center

An operational data warehouse requires swift response times to keep data updated to the most recent state possible, with the final goal of producing near-real-time analytics. It also has to be scalable, robust, and secure enough to meet the highest standards and comply with various regulations.

Figure 4. Operationalized Data Warehouse (ODS + EDW)

The architecture above describes which Google Cloud components can be combined to ingest data from any source into an ODS supported by MongoDB Atlas, and how to integrate that ODS with an enterprise data warehouse (BigQuery) that provides structured data to analytical tools like Looker.

Shopping Cart Analysis

Figure 5 illustrates an implementation example of the Operationalized Data Warehouse reference architecture shown previously. In this scenario, several data sources (including shopping cart information) are replicated in real time to MongoDB through the Spark Connector. The information is then processed using Dataflow as a graphical interface to generate data processing jobs, which are executed on an ephemeral, managed Hadoop and Spark cluster (Dataproc). Finally, the processed data can be structured and stored for fast querying in BigQuery, supporting shopping cart, product browsing, and outreach applications.

Figure 5. Shopping cart analysis

Recommendation Engines

Figure 6 presents a continuation of the idea in the last example. Now the objective is to use MongoDB Atlas as an operational data warehouse that combines structured and semi-structured data (SQL and NoSQL data) in real time.
This works as a centralized repository that enables machine learning tools such as Spark MLlib running on Dataproc, Cloud Machine Learning (now Vertex AI), and the Prediction API to analyze data and produce personalized recommendations for customers visiting an online store in real time. Data from various systems can be ingested as-is and stored and indexed in JSON format in MongoDB. Dataproc can then use the MongoDB Spark Connector to perform the analysis (a rough sketch follows at the end of this article), with the resulting insights stored in BigQuery and distributed to downstream applications.

Learn more about MongoDB and Google Cloud at cloud.google.com/mongodb
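As a closing sketch of the Dataproc leg of Figures 5 and 6 (cluster setup omitted; the names, options, and exact connector format string are assumptions that depend on the installed connector versions), a PySpark job might read a collection, aggregate it, and land the results in BigQuery:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mongo-to-bq").getOrCreate()

    # "mongodb" is the source name in MongoDB Spark Connector v10+;
    # earlier versions use a different name. URI and names are placeholders.
    carts = (spark.read.format("mongodb")
             .option("connection.uri", "mongodb+srv://user:pass@cluster0.example.net")
             .option("database", "shop")
             .option("collection", "carts")
             .load())

    top_products = carts.groupBy("product_id").count()

    # Assumes the spark-bigquery connector is available on the cluster.
    (top_products.write.format("bigquery")
     .option("table", "my-project.analytics.top_products")
     .option("temporaryGcsBucket", "my-staging-bucket")
     .save())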
Source: Google Cloud Platform