Thank you, Partners, for three years of growth and winning together

Congratulations to our fast-growing ecosystem of global partners on three years of commitment to Partner Advantage, underscored by great collaboration, high energy, innovative ideas, and transformative impact. Together we've leveraged our program to drive growth and customer satisfaction. Year to date, there has been a more than 140% year-over-year increase in trained experts (devs, technical, certifications, solutions) at our partner organizations for 2022. This has translated into thousands of happy customers, many of whose stories are available to read in our Partner Directory. Each of you continues to inspire our shared customers and all of us at Google Cloud. And we are only getting started!

We are hard at work making sure every aspect of your business with Google Cloud is smooth-running, easy to navigate, and profitable. So what's in store for 2023? Here's a sneak peek: Expect to see more activity and focus around our Differentiation Journey as a vehicle for driving your growth and success. This includes encouraging partners to offer more in the area of high-value and repeatable services, where the opportunity is large and growing fast. You can learn more about the global economic impact our partners are having in this blog post. You'll also see Partner Advantage focusing more on solutions and customer transformation, all of which will include corresponding incentives, new benefits, and more features.

Thank you again for your commitment and hard work. It's been a fantastic three years of amazing opportunity and growth. Not a partner yet? Start your journey today! The best is yet to come!

-Nina Harding
Source: Google Cloud Platform

Google + Mandiant: Transforming Security Operations and Incident Response

Over the past two decades, Google has innovated to build some of the largest and most secure computing systems in the world. This scale requires us to deliver pioneering approaches to cloud security, which we pass on to our Google Cloud customers. We are committed to solving hard security problems like only Google can, as the tip of the spear of innovation and threat intelligence.

Today we're excited to share the next step in this journey with the completion of our acquisition of Mandiant, a leader in dynamic cyber defense, threat intelligence and incident response services. Mandiant shares our cybersecurity vision and will join Google Cloud to help organizations improve their threat, incident and exposure management.

Combining Google Cloud's existing security portfolio with Mandiant's leading cyber threat intelligence will allow us to deliver a security operations suite to help enterprises globally stay protected at every stage of the security lifecycle. With the scale of Google's data processing, novel analytics approaches with AI and machine learning, and a focus on eliminating entire classes of threats, Google Cloud and Mandiant will help organizations reinvent security to meet the requirements of our rapidly changing world. We will retain the Mandiant brand and continue Mandiant's mission to make every organization secure from cyber threats and confident in their readiness.

Context and threat intelligence from the frontlines

Our goal is to democratize security operations with access to the best threat intelligence and built-in threat detections and responses. Ultimately, we hope to shift the industry to a more proactive approach focused on modernizing Security Operations workflows, personnel, and underlying technologies to achieve an autonomic state of existence – where threat management functions can scale as customers' needs change and as threats evolve.

Today Google Cloud security customers use our cloud infrastructure to ingest, analyze and retain all their security telemetry across multicloud and on-premise environments. By leveraging our sub-second search across petabytes of information combined with security orchestration, automation and response capabilities, our customers can spend more time defending their organizations. The addition of Mandiant Threat Intelligence—which is compiled by their team of security and intelligence individuals spread across 22 countries, who serve customers located in 80 countries—will give security practitioners greater visibility and expertise from the frontlines. Mandiant's experience detecting and responding to sophisticated cyber threat actors will offer Google Cloud customers actionable insights into the threats that matter to their businesses right now. We will continue to share groundbreaking Mandiant threat research to help support organizations, even those that don't run on Google Cloud.

Advancing shared fate for security operations

Google Cloud operates in a shared fate model, taking an active stake in the security posture of our customers. For security operations, that means helping organizations find and validate potential security issues before they become an incident. Detecting, investigating and responding to threats is only part of better cyber risk management. It's also crucial to understand what an organization looks like from an attacker's perspective and whether an organization's cybersecurity controls are as effective as expected.
By adding Mandiant's attack surface management capabilities to Google Cloud's portfolio, organizations will be able to continually monitor assets for exposures, enabling intelligence and red teams to move security programs from reactive to proactive by understanding what's vulnerable, misconfigured, and exposed. Once an organization's attack surface is understood, validating existing security controls is critical. With Mandiant Security Validation, organizations will be able to continuously validate and measure the effectiveness of their cybersecurity controls across cloud and on-premise environments.

Transforming security operations and incident response

Security leaders and their teams often lack the resources and expertise required to keep pace with today's ever-changing threats. Organizations already harness Google's security tools, expert advice, and rich partner ecosystem to evolve their security programs. Google's Autonomic Security Operations also serves as a prescriptive solution to guide our customers through this modernization journey. With the addition of Mandiant to the Google Cloud family, we can now offer proven global expertise in comprehensive incident response, strategic readiness, and technical assurance to help organizations mitigate threats and reduce business risk before, during, and after an incident.

In addition, Google Cloud's security operations suite will continue to provide a central point of intelligence, analysis, and operations across on-premise environments, Google Cloud, and other cloud providers. Google Cloud is also deeply committed to supporting our technology and solution partners, and this acquisition will enable system integrators, resellers, and managed security service providers to offer broader solutions to customers.

Comments on the news

"The power of stronger partnerships across the cybersecurity ecosystem is critical to driving value for clients and protecting industries around the globe. The combination of Google Cloud and Mandiant and their commitment to multicloud will further support increased collaboration, driving innovation across the cybersecurity industry and augmenting threat research capabilities. We look forward to working with them on this mission." – Paolo Dal Cin, Global Lead, Accenture Security

"Google's acquisition of Mandiant, a leader in security advisory, consulting and incident response services, will allow Google Cloud to deliver an end-to-end security operations suite with even greater capabilities and services to support customers in their security transformation across cloud and on-premise environments." – Craig Robinson, Research VP, Security Services, IDC

"Bringing together Mandiant and Google Cloud, two long-time cybersecurity leaders, will advance how companies identify and defend against threats. We look forward to the impact of this acquisition, both for the security industry and the protection of our customers." – Andy Schworer, Director, Cyber Defense Engineering, Uber

We welcome Mandiant to the Google Cloud team, and together we look forward to helping security teams achieve so much more in defense of their organizations. You can read our release and Kevin Mandia's blog for more on this exciting news.
Source: Google Cloud Platform

How Google scales ad personalization with Bigtable

Cloud Bigtable is a popular and widely used key-value database available on Google Cloud. The service provides scale elasticity, cost efficiency, excellent performance characteristics, and a 99.999% availability SLA. This has led to massive adoption, with thousands of customers trusting Bigtable to run a variety of their mission-critical workloads.

Bigtable has been in continuous production use at Google for more than 15 years. It processes more than 5 billion requests per second at peak and has more than 10 exabytes of data under management, making it one of the largest semi-structured data storage services at Google. One of the key use cases for Bigtable at Google is ad personalization. This post describes the central role that Bigtable plays within ad personalization.

Ad personalization

Ad personalization aims to improve user experience by presenting topical and relevant ad content. For example, I often watch bread-making videos on YouTube. If ad personalization is enabled in my ad settings, my viewing history could indicate to YouTube that I'm interested in baking as a topic and would potentially be interested in ad content related to baking products.

Ad personalization requires large-scale data processing in near real-time, with strict controls for user data handling and retention. System availability needs to be high, and serving latencies need to be low due to the narrow window within which decisions need to be made on what ad content to retrieve and serve. Sub-optimal serving decisions (e.g., falling back to generic ad content) could potentially impact user experience, and ad economics requires infrastructure costs to be kept as low as possible.

Google's ad personalization platform provides frameworks to develop and deploy machine learning models for relevance and ranking of ad content. The platform supports both real-time and batch personalization. The platform is built using Bigtable, allowing Google products to access data sources for ad personalization in a secure manner that is both privacy and policy compliant, all while honoring users' decisions about what data they want to provide to Google. The output from personalization pipelines, such as advertising profiles, is stored back in Bigtable for further consumption. The ad serving stack retrieves these advertising profiles to drive the next set of ad serving decisions.

Some of the storage requirements of the personalization platform include:

Very high throughput access for batch and near real-time personalization
Low latency (<20 ms at p99) lookups for reads on the critical path for ad serving
Fast (on the order of seconds) incremental updates of advertising models in order to reduce personalization delay

Bigtable

Bigtable's versatility in supporting both low-cost, high-throughput access to data for offline personalization and consistent low-latency access for online data serving makes it an excellent fit for the ads workloads. Personalization at Google scale requires a very large storage footprint. Bigtable's scalability, performance consistency, and the low cost required to meet a given performance curve are key differentiators for these workloads.

Data model

The personalization platform stores objects in Bigtable as serialized protobufs keyed by Object IDs. Typical data sizes are less than 1 MB and serving latency is less than 20 ms at p99. Data is organized as corpora, which correspond to distinct categories of data.
A corpus maps to a replicated Bigtable. Within a corpus, data is organized as DataTypes, logical groupings of data. Features, embeddings, and different flavors of advertising profiles are stored as DataTypes, which map to Bigtable column families. DataTypes are defined in schemas which describe the proto structure of the data and additional metadata indicating ownership and provenance. SubTypes map to Bigtable columns and are free-form. Each row of data is uniquely identified by a RowID, which is based on the Object ID. The personalization API identifies individual values by RowID (row key), DataType (column family), SubType (column part), and Timestamp.

Consistency

The default consistency mode for operations is eventual. In this mode, data from the Bigtable replica nearest to the user is retrieved, providing the lowest median and tail latency. Reads and writes to a single Bigtable replica are consistent. If there are multiple replicas of Bigtable in a region, traffic spillover across regions is more likely. To improve the likelihood of read-after-write consistency, the personalization platform uses a notion of row affinity: if there are multiple replicas in a region, one replica is preferentially selected for any given row, based on a hash of the RowID. For lookups with stricter consistency requirements, the platform first attempts to read from the nearest replica and requests that Bigtable return the current low watermark (LWM) for each replica. If the nearest replica happens to be the replica where the writes originated, or if the LWMs indicate that replication has caught up to the necessary timestamp, then the service returns a consistent response. If replication has not caught up, then the service issues a second lookup—this one targeted at the Bigtable replica where writes originated. That replica could be distant and the request could be slow, so while waiting for a response, the platform may issue failover lookups to other replicas in case replication has caught up at those replicas.

Bigtable replication

The ads personalization workloads use a Bigtable replication topology with more than 20 replicas, spread across four continents. Replication helps address the high availability needs of ad serving. Bigtable's zonal monthly uptime percentage is in excess of 99.9%, and replication coupled with a multi-cluster routing policy allows for availability in excess of 99.999%.

A globe-spanning topology allows for data placement that is close to users, minimizing serving latencies. However, it also comes with challenges such as variability in network link costs and throughputs. Bigtable uses minimum-spanning-tree-based routing algorithms and bandwidth-conserving proxy replicas to help reduce network costs. For ads personalization, reducing Bigtable replication delay is key to lowering the personalization delay (the time between a user's action and when that action has been incorporated into advertising models to show more relevant ads to the user). Faster replication is preferred, but we also need to balance serving traffic against replication traffic and make sure low-latency user-data serving is not disrupted by incoming or outgoing replication traffic flows. Under the hood, Bigtable implements complex flow control and priority boost mechanisms to manage global traffic flows and to balance serving and replication traffic priorities.
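The platform described here is internal to Google, but the same row key / column family / column mapping carries over to Cloud Bigtable. As a rough, hedged sketch (the instance, table, app profile, and family names are illustrative assumptions), reading a single profile row with the Python client might look like this:

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Connect to an illustrative instance and table; names are assumptions.
client = bigtable.Client(project="my-project")
instance = client.instance("ads-profiles")

# An app profile with multi-cluster routing sends the read to the nearest
# replica, mirroring the "eventual consistency, lowest latency" mode above.
table = instance.table("user_profiles", app_profile_id="serving-nearest")

# RowID -> row key, DataType -> column family, SubType -> column qualifier.
row = table.read_row(
    b"object-id-123",
    filter_=row_filters.FamilyNameRegexFilter("embeddings"),
)

if row is not None:
    for qualifier, cells in row.cells["embeddings"].items():
        latest = cells[0]  # cells within a column are returned newest first
        print(qualifier.decode(), latest.timestamp, len(latest.value))
```

Writes follow the same shape: a mutation sets a cell at (row key, column family, column qualifier, timestamp), which is the tuple the personalization API exposes as (RowID, DataType, SubType, Timestamp).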
Workload isolation

Ad personalization batch workloads are isolated from serving workloads by pinning a given set of workloads onto certain replicas; some Bigtable replicas exclusively drive personalization pipelines while others drive user-data serving. This model allows for a continuous and near real-time feedback loop between serving systems and offline personalization pipelines, while protecting the two workloads from contending with each other. For Cloud Bigtable users, AppProfiles and cluster-routing policies provide a way to confine and pin workloads to specific replicas to achieve coarse-grained isolation.

Data residency

By default, data is replicated to every replica—often spread out globally—which is wasteful for data that is only accessed regionally. Regionalization saves on storage and replication costs by confining data to the region where it is most likely to be accessed. It is also vital for compliance with regulations mandating that data pertaining to certain subjects be physically stored within a given geographical area. The location of data can be either implicitly determined by the access location of requests or derived from location metadata and other product signals. Once the location for a user is determined, it is stored in a location metadata table that points to the Bigtable replicas that read requests should be routed to. Migration of data based on row-placement policies happens in the background, without downtime or serving performance regressions.

Conclusion

In this blog post, we looked at how Bigtable is used within Google to support an important use case—modeling user intent for ad personalization. Over the past decade, Bigtable has scaled as Google's personalization needs have grown by orders of magnitude. For large-scale personalization workloads, Bigtable offers low-cost storage with excellent performance characteristics. It seamlessly handles global traffic flows with simple user configurations. Its ease in handling both low-latency serving and high-throughput batch computations makes it an excellent option for lambda-style data processing pipelines. We continue to drive high levels of investment to further lower costs, improve performance, and bring new features to make Bigtable an even better choice for personalization workloads.

Learn more

To get started with Bigtable, try it out with a Qwiklab and learn more about the product here.

Acknowledgements

We'd like to thank Ashish Awasthi, Ashish Chopra, Jay Wylie, Phaneendhar Vemuru, Bora Beran, Elijah Lawal, Sean Rhee and other Googlers for their valuable feedback and suggestions.

Related article: Moloco handles 5 million+ ad requests per second with Cloud Bigtable. Moloco uses Cloud Bigtable to build their ad tech platform and process 5+ million ad requests per second.
Source: Google Cloud Platform

Cloud Wisdom Weekly: 6 tips to optimize data management and analytics

"Cloud Wisdom Weekly: for tech companies and startups" is a new blog series we're running this fall to answer common questions our tech and startup customers ask us about how to build apps faster, smarter, and cheaper. In this installment, Google Cloud Big Data & Analytics Consultant Julianne Cuneo explores how to get started using BigQuery effectively.

Working with large amounts of data – like those encountered with traditional data warehouses and data lakes – can be challenging, complex, expensive, and reliant on specialized skills that can be difficult to source. To compete in today's customer-centric and data-driven marketplaces, these challenges are critical to overcome. Analyzing data at scale is crucial to this effort, but so is managing costs and resources. Many businesses are thus looking to the cloud to find solutions and strike the right balance. In this article, we will explore how growing tech companies and startups leverage BigQuery for innovation, and we will share tips that will help you do more with Google's industry-leading enterprise cloud data warehouse.

Optimizing data management and analytics

Oftentimes, companies rush into loading data and running queries for the sake of seeing how a new technology will perform. This is reasonable for a quick proof of concept or evaluation, but it doesn't necessarily set you up for success in the long term, when you'll need a more sophisticated approach to business, security, and budgetary needs. The tips below will help you set up a strong, scalable foundation, including specific examples of how to optimize a data platform architecture with BigQuery.

1. Independently scale storage and compute

When it comes to handling massive amounts of data, having the right storage capabilities is one of the biggest challenges. Assuming you can afford the cost associated with maintaining large volumes of information, effectively analyzing and extracting value from it can be even more daunting. A serverless architecture can help you overcome these challenges in a couple of ways. First, serverless platforms such as BigQuery separate compute and storage, letting you pay independently for the resources you use and flexibly scale up or down as your data needs change. Whereas some services bundle resources such that you get (and pay for) more compute and storage than you need, this approach makes storing large amounts of data more cost-effective and therefore more feasible. Second, if you can afford to store more data, you create more potential for insights. To that end, BigQuery's scalable compute capacity allows you to query terabytes or even petabytes of data in a single request. Combined, these capabilities enable you to scale analytics efforts according to your needs, rather than a predefined amount of storage or compute resources.

2. Carefully organize storage and datasets

Providing secure and consistent data access to the necessary people at the right cost is another crucial aspect of data management and analytics. Appropriately planning for resource optimization can save time and circumvent security, billing, and workflow problems down the road. For instance, in BigQuery's resource organization, key design considerations include:

Datasets and their objects (e.g., tables, views, ML models, etc.) only belong to a single project. This is the project to which that dataset's storage costs will be billed.
Peruse this resource to consider whether you'd want to implement a centralized data warehouse approach, allocate data marts to individual projects, or mix both approaches.
Access to objects in BigQuery can be controlled at the dataset, table, row, and column level, which should also be factored into your storage design (e.g., grouping closely related objects in the same dataset to simplify access grants).

3. Optimize compute cost and performance across teams and use cases

Some use cases may require precise cost controls or resource planning to meet tight service-level agreements (SLAs). In BigQuery, for instance, data only belongs to a single project but can be queried from anywhere, with compute resources billed to the project that runs the query, regardless of data location. Therefore, to granularly track query usage, you can create individual projects for different teams (e.g., finance, sales) or use cases (e.g., BI, data science). In addition to segmenting your compute projects by team or use case for billing purposes, you should think about how you may want to control compute resources across projects for workload management. In BigQuery, you can use "slot commitments" to switch between an on-demand model and a flat-rate billing model, including mixing and matching approaches to balance on-demand efficiency with flat-rate predictability. "Slot commitments" are dedicated compute resources that can be further divided into smaller allocations (or "reservations"). These allocations can either be assigned to an individual project or shared by multiple projects, providing flexibility that allows you to reserve compute power for high-priority or compute-intensive workloads while enjoying cost savings over the on-demand query model.

For example, say your company has committed to 1,000 slots. You may choose to allocate 500 to your compute-intensive data science projects, 300 to ETL, and 200 to internal BI, which has a more flexible SLA. Best of all, your idle slots aren't isolated in a silo to be left unused. If your ETL projects aren't using all of their 300 slots, these idle resources can be seamlessly shared with your other data science or BI projects until they are needed again.

4. Load and optimize your data schemas

Once you understand how your data will be organized, you can start populating your data warehouse. BigQuery provides numerous ways to ingest data: flat files in Google Cloud Storage, pre-built connectors to apps and databases through the Data Transfer Service, streaming inserts, and compatibility with numerous third-party data migration and ETL tools. A few simple optimizations to your table schemas can help you achieve the best results. In most cases, this means applying partitioning and/or clustering based on your expected query patterns to significantly reduce the amount of data scanned by queries.

5. Unify your data investments

Your data and analysis needs might involve working with unstructured and semi-structured data alongside your more broadly understood, structured data. For this, it is helpful to think beyond just "enterprise data warehouse" and broaden your focus to include solutions that provide a true, centralized data lake. If you're using BigQuery, the platform's federation capabilities can seamlessly query data stored in Google services including Cloud Storage, Drive, Bigtable, Cloud SQL, and Cloud Spanner, as well as data in other clouds.
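As one illustration of this federation capability, here is a hedged sketch of querying CSV files sitting in Cloud Storage directly with the BigQuery Python client; the bucket path, schema, and table alias are assumptions for illustration:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe external data in Cloud Storage (illustrative bucket and schema).
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/raw/events/*.csv"]
external_config.schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]
external_config.options.skip_leading_rows = 1

# Register the external table definition for this query only.
job_config = bigquery.QueryJobConfig(
    table_definitions={"events_external": external_config}
)

sql = """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM events_external
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""

for row in client.query(sql, job_config=job_config).result():
    print(row.customer_id, row.total_spend)
```

For recurring use, the same definition can be saved as a permanent external table rather than passed with each query.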
BigQuery's Storage API also gives other services such as Dataproc, Dataflow, ML, and BI tools fast access to BigQuery storage at high volumes. Features such as these can help ensure that your data efforts are part of a unified, consistent approach, rather than being splintered across platforms and teams.

6. Run queries and have fun!

Once your data is available, it's time to start querying! To make sure you don't hit any snags, your platform should ideally provide an easy onramp that lets people get started right away. As an ANSI-compliant solution, BigQuery SQL lets the average SQL developer leverage their existing skills right from the start. There are also numerous third-party tools that provide native connectors to BigQuery or leverage BigQuery's JDBC/ODBC drivers to author queries on the user's behalf. If you have numerous SQL scripts from a previous data warehouse investment, BigQuery's Migration Service can help automate translation of jobs coming from Teradata, Redshift, and several other services. These features allow you to make data available, protected, and smartly budgeted, and help ensure it can easily plug into user-friendly interfaces for analysis.

And if you're making the move to BigQuery, be sure to take advantage of BigQuery's unique features, rather than just moving existing queries and continuing to operate in the status quo. Run those large analyses that wouldn't have been able to execute on another system. Try training a prototype machine learning model using SQL-based BigQuery ML. Query streaming data in real time. Perform geospatial analysis with built-in GIS functions. It's time to innovate.

Building a solid data foundation takes time and planning

The tips put forth in this article should help position your company for success in the near and long term, sparing you from the need to rearchitect your warehousing solution as your business matures. Deciding to put the time, effort, and monetary investment into any new technology requires careful evaluation, so we encourage you to get hands-on with BigQuery through quickstarts, and by visiting our Startups page or reaching out to Google Cloud experts.

Related article: Cloud Wisdom Weekly: 5 ways to reduce costs with containers. Understand the core features you should expect of container services, including specific advice for GKE and Cloud Run.
Source: Google Cloud Platform

Introducing Google Cloud Backup and DR

Backup is a fundamental aspect of application protection. As such, the need for a seamlessly integrated, centralized backup service is vital when seeking to ensure resilience and recoverability for data generated by Google Cloud services or on-premises infrastructure. Regardless of whether the need to restore data is triggered by a user error, malicious activity, or some other reason, the ability to execute reliable, fast recovery from backups is a critical aspect of a resilient infrastructure.

A comprehensive backup capability should have the following characteristics: 1) centralized backup management across workloads, 2) efficient use of storage to minimize costs, and 3) minimal recovery times. To effectively address these requirements, backup service providers must deliver efficiency at the workload level, while also supporting a diverse spectrum of customer environments, applications, and use cases. Consequently, the implementation of a truly effective, user-friendly backup experience is no small feat. And that's why, today, we're excited to announce the availability of Google Cloud Backup and DR, enabling centralized backup management directly from the Google Cloud console.

Helping you maximize backup value

At Google Cloud we have a unique opportunity to solve backup challenges in ways that fully maximize the value you achieve. By building a product with our customers firmly in mind, we've made sure that Google Cloud Backup and DR makes it easy to set up, manage, and restore backups. As an example, we placed a high priority on delivering an intuitive, centralized backup management experience. With Google Cloud Backup and DR, administrators can effectively manage backups spanning multiple workloads. Admins can generate application- and crash-consistent backups for VMs on Compute Engine, VMware Engine, or on-premises VMware, databases (such as SAP, MySQL, and SQL Server), and file systems. Having a holistic view of your backups across multiple workloads means you spend less time on management and can be sure you have consistency and completeness in your data protection coverage.

Google Cloud Backup and DR dashboard

Even better, Google Cloud Backup and DR stores backup data in its original, application-readable format. As a result, backup data for many workloads can be made available directly from long-term backup storage (e.g., leveraging cost-effective Cloud Storage), with no need for time-consuming data movement or translation. This accelerates recovery of critical files and supports rapid resumption of critical business operations.

Making sure you minimize backup TCO

Similarly, we also took care to help you minimize the total cost of ownership (TCO) of your backups. With this objective in mind, we designed Google Cloud Backup and DR to implement space-efficient "incremental forever" storage technology to ensure that you pay only for what you truly need. With "incremental forever" backup, after Google Cloud Backup and DR takes an initial backup, subsequent backups only store data associated with changes relative to the prior backup. This allows backups to be captured more quickly and reduces the network bandwidth required to transmit the associated data. It also minimizes the amount of storage consumed by the backups, which benefits you via reduced storage consumption costs. In addition, there is flexibility built in to allow you to strike your desired balance between storage cost and data retention time.
For example, when choosing to store backups on Google Cloud Storage, you can select an appropriate Cloud Storage class in alignment with your needs.

Start reaping the benefits

The introduction of Google Cloud Backup and DR is a reflection of our broader commitment to make cloud infrastructure easier to manage, faster, and less expensive, while also helping you build a more resilient business. By centralizing backup administration and applying cutting-edge storage and data management technologies, we've eliminated much of the complexity, time, and cost traditionally associated with enterprise data protection. But don't take our word for it. See for yourself in the Google Cloud Console. Take advantage of $300 in free Google Cloud credits, give Google Cloud Backup and DR a try starting in late September 2022, and enjoy the benefits of cloud-integrated backup and recovery.

Related article: New storage innovations to drive your next-gen applications. Learn about the latest products and features rolling out for customers using cloud-based block, file and object storage, as well as backu…
Source: Google Cloud Platform

Trust Update: September 2022

If you work in compliance, privacy, or risk, you know that regulatory developments have continued to accelerate this year. As part of our commitment to be the most trusted cloud, we continue to pursue global industry standards, frameworks, and codes of conduct that tackle our customers' foundational need for a documented baseline of addressable requirements. We have seen key updates across all regions and have worked to help organizations address these new and evolving requirements. Let's look at the significant updates from around the world, hot topics, and the requirements we've recently addressed.

Global developments: Residency, portability, and more

Google Cloud meets or surpasses the standards for a number of frameworks, including ISO/IEC 22301 for business continuity management and the Minimum Viable Secure Product (MVSP), developed with industry partners such as Salesforce, Okta, and Slack. Globally, we continue to address the areas of focus we know are most critical to organizations, including operational resiliency, DPIA support, and international data transfers.

Highlights from EMEA

Consistent with what we have observed historically, EMEA remains a region full of developments that expand the regulatory landscape.

Digital Operational Resilience Act (DORA) adopted for financial services organizations: One of our most recent critical announcements was our preparation for addressing DORA, which will harmonize how EU financial entities must report cybersecurity incidents, test their digital operational resilience, manage Information and Communications Technology (ICT) third-party risk, and allow financial regulators to directly oversee critical ICT providers.

Second annual declaration of adherence to SWIPO: As presented in our SWIPO Transparency Statement, Google Cloud continues to demonstrate our commitment to enabling data portability and interoperability. Our customers always fully control their own data – including when they need to view, delete, download, and transfer their content.

Supporting our EU education customers' privacy assessments: The recent Datatilsynet (the Danish Data Protection Authority) ruling on proper due diligence of cloud services is a helpful reminder for customers to conduct thorough risk assessments of third parties. Our latest blog reaffirms Google Cloud's commitment to helping education customers and the rest of our current and potential customer base conduct due diligence, including supporting privacy assessments and independent third-party attestations.

The introduction of new requirements in Asia Pacific

We continue to monitor the rapidly evolving regulatory landscape in Asia Pacific, which has been rich with new developments and the introduction of several laws so far this year.

Addressed compliance for Australia's DTA HCF: To help support Australian government customers with data residency and local customer support capabilities, Google Cloud is now 'certified strategic' under the Hosting Certification Framework (HCF) administered by Australia's Digital Transformation Agency.

Privacy requirements in Japan, New Zealand, and Taiwan: Meeting privacy obligations remains a top priority for many organizations. To help, we've built compliance support for Japan's Act on the Protection of Personal Information (APPI) along with New Zealand's Privacy Act and Taiwan's Personal Data Protection Act (PDPA).

Updated U.S. industry compliance

In the United States, we continue to seek effective and efficient mechanisms to help our customers address their privacy and security needs. As with every region, customers can view our compliance offerings and mappings in our filterable Compliance Resource Center.

Welcoming the Trans-Atlantic Data Privacy Framework: Following the framework implementation, Google Cloud reaffirmed our commitment to helping customers meet stringent data protection requirements. This includes making the protections offered by the E.U.-U.S. data transfer framework available to customers when available.

New U.S. industry compliance mappings: From public sector (DISA) to health care (MARS-E), energy (NERC), and criminal justice (CJIS), we have reviewed U.S. industry requirements and released new materials outlining how we can help customers address compliance.

A focus on financial services in Latin America

Latin America remains a focus this year, with Google's June announcement committing $1.2 billion USD over 5 years to projects in the region. In July, Google Cloud built on these initiatives by announcing that a new Google Cloud region is coming to Mexico. For those in one of the most heavily regulated industries, financial services, we remain focused on demonstrating our commitment to regulations in that sector.

Meeting outsourcing requirements in financial services: We have new and updated compliance mappings for banking requirements in Brazil, Peru, and Colombia. Each new mapping is designed to support risk and compliance leaders' need for compliance and reporting documentation.

Using our compliance developments

We know these developments are impactful not only for organizations that seek to meet requirements, but also for team members tasked with ensuring their service providers adapt their approaches in response to critical industry developments. Many Google Cloud customers are already using our trust and compliance resources to facilitate internal and external conversations with their key customers, business partners, and regulators. Visit our Compliance Resource Center or continue the conversation with our sales team by visiting our Sales Center today.

Related article: Google Cloud's preparations to address the Digital Operational Resilience Act. As the EU's proposed DORA regulation reaches a major milestone, Google Cloud details our approach to its new rules and rule changes.
Source: Google Cloud Platform

Optimizing terabyte-scale PostgreSQL migrations to Cloud SQL with Searce

Google Cloud allows you to move your PostgreSQL databases to Cloud SQL with Database Migration Service (DMS). DMS gives you the ability to replicate data continuously to the destination database while the source is live in production, enabling you to migrate with minimum downtime.

However, terabyte-scale migrations can be complex. For instance, if your PostgreSQL database has Large Objects, then you will require some downtime to migrate them manually, as that is a limitation of DMS. There are a few more such limitations – check out the known limitations of DMS. If not handled carefully, these steps can extend the downtime during cutover, lead to performance impact on the source instance, or even delay the project delivery date. All of this may mean significant business impact.

Searce is a technology consulting company specializing in modernizing application and database infrastructure by leveraging cloud, data, and AI. We empower our clients to accelerate towards the future of their business. In our journey, we have helped dozens of clients migrate to Cloud SQL, and have found terabyte-scale migrations to be the toughest for the reasons mentioned earlier. This blog centers on our work supporting an enterprise client whose objective was to migrate dozens of terabyte-scale, mission-critical PostgreSQL databases to Cloud SQL with minimum downtime. Their largest database was 20 TB in size, all the databases had tables with large objects, and some tables did not have primary keys. Note that DMS had a limitation of not supporting migration of tables without a primary key at the time of this project; in June 2022, DMS released an enhancement to support the migration of tables without a primary key.

In this blog, we share our learnings about how we simplified and optimized this migration, so that you can incorporate our best practices into your own migrations. We explore mechanisms to reduce the downtime required for operations not handled by DMS by ~98% with the use of automation scripts. We also explore database flags in PostgreSQL that optimize DMS performance and minimize the overall migration time by ~15%.

Optimize DMS performance with database flags

Once the customer made the decision to migrate PostgreSQL databases to Google Cloud SQL, we considered two key factors that would decide business impact: migration effort and migration time. To minimize effort, we leveraged Google Cloud's Database Migration Service (DMS), as it is very easy to use and does the heavy lifting by continuously replicating data from the source database to the destination Cloud SQL instance while the source database is live in production.

How about migration time? For a terabyte-scale database, depending on the database structure, migration time can be considerably longer. Historically, we observed that DMS took around 3 hours to migrate a 1 TB database. In other cases, where the customer database structure was more complex, migration took longer. Thankfully, DMS takes care of this replication while the source database is live in production, so no downtime is required during this time. Nevertheless, our client would have to bear the cost of both the source and destination databases, which for large databases might be substantial. Meanwhile, if the database size increased, then replication could take even longer, increasing the risk of missing the customer's maintenance window for the downtime incurred during cutover operations.
Since the customer's maintenance window was monthly, we would have to wait 30 more days for the next maintenance window, requiring the customer to bear the cost of both databases for another 30 days. Furthermore, from a risk management standpoint, the longer the migration timeframe, the greater the risk that something could go wrong. Hence, we started exploring options to reduce the migration time. Even the slightest reduction in migration time could significantly reduce the cost and risk.

We explored options around tuning PostgreSQL's database flags on the source database. While DMS has its own set of prerequisite flags for the source instance and database, we also found that flags like shared_buffers, wal_buffers, and maintenance_work_mem helped accelerate the replication process through DMS. These flags needed to be set to specific values to get the maximum benefit out of each of them. Once set, their cumulative impact was a reduction in the time for DMS to replicate a 1 TB database by 4 hours, that is, a reduction of 3.5 days for a 20 TB database. Let's dive into each of them.

Shared buffers

PostgreSQL uses two buffers – its own internal buffer and the kernel buffered IO. In other words, data is stored in memory twice. The internal buffer is called shared_buffers, and it determines the amount of memory the database dedicates to its own cache, in addition to the operating system cache. By default this value is set conservatively low. However, increasing this value on the source database to fit our use case helped increase the performance of read-heavy operations, which is exactly what DMS performs once a job has been initialized. After multiple iterations, we found that setting the value to 55% of the database instance's RAM boosted replication performance (a read-heavy operation) by a considerable amount and in turn reduced the time required to replicate the data.

WAL buffers

PostgreSQL relies on Write-Ahead Logging (WAL) to ensure data integrity. WAL records are written to buffers and then flushed to disk. The flag wal_buffers determines the amount of shared memory used for WAL data that has not yet been written to disk – records that are yet to be flushed. We found that increasing the value of wal_buffers from the default of 16 MB to about 3% of the database instance's RAM significantly improved write performance by writing fewer but larger files to disk at each transaction commit.

Maintenance work mem

PostgreSQL maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY, consume their own specific memory, referred to as maintenance_work_mem. Unlike other operations, PostgreSQL maintenance operations can only be performed sequentially by the database. Setting a value significantly higher than the default of 64 MB meant that no maintenance operation would block the DMS job. We found that maintenance_work_mem worked best at a value of 1 GB.

Resize the source instance to avoid performance impact

Each of these three flags tunes how PostgreSQL utilizes memory resources. Hence, it was imperative that before setting these flags, we upsize the source database instance to accommodate them.
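Putting the three settings together, the sizing logic can be sketched as a small helper that turns the instance's RAM into candidate flag values. This is only a sketch of the percentages that worked for this particular workload, not a general recommendation, and shared_buffers in particular requires a restart to take effect:

```python
def suggest_migration_flags(instance_ram_gb: float) -> dict:
    """Rough sizing used during this migration; values are workload-specific."""
    return {
        "shared_buffers": f"{int(instance_ram_gb * 0.55 * 1024)}MB",   # ~55% of RAM
        "wal_buffers": f"{int(instance_ram_gb * 0.03 * 1024)}MB",      # ~3% of RAM
        "maintenance_work_mem": "1GB",                                 # fixed 1 GB
    }

if __name__ == "__main__":
    # Example: a 100 GB source instance; ~58% of RAM ends up reserved by these flags.
    for flag, value in suggest_migration_flags(100).items():
        print(f"ALTER SYSTEM SET {flag} = '{value}';")
```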
Without upsizing the database instances, we could have caused application performance degradation, as more than half of the total database memory would be allocated to the processes managed by these flags. We calculated the memory required by the flags mentioned above, and found that each flag needed to be set to a specific percentage of the source instance's memory, irrespective of the existing values that might be set for the flags:

shared_buffers: 55% of the source instance's memory
wal_buffers: 3% of the source instance's memory
maintenance_work_mem: 1 GB

Adding up the individual memory requirements, we found that at least 58% of the RAM would be taken up by these memory flags. For example, if a source instance had 100 GB of memory, 58 GB would be taken up by shared_buffers and wal_buffers, and an additional 1 GB by maintenance_work_mem. As the original values of these flags were very low (~200 MB), we upsized the RAM of the source database instance by 60% in order to ensure that the migration did not impact source performance for the application live in production.

Avoid connection errors with the WAL sender timeout flag

While using Google Cloud's DMS, if the connection between DMS and the Cloud SQL instance is terminated during the 'Full Dump in Progress' phase of the DMS job, the DMS job fails and needs to be reinitiated. Encountering timeouts, especially while migrating a terabyte-scale database, would mean multiple days' worth of migration being lost and a delay in the cutover plan. For example, if the connection of the DMS job for a 20 TB database migration is lost after 10 days, the DMS job has to be restarted from the beginning, losing 10 days' worth of migration effort.

Adjusting the WAL sender timeout flag (wal_sender_timeout) helped us avoid terminating replication connections that were inactive for a long time during the full dump phase. The default value for this flag is 60 seconds. To keep these connections from terminating, and to avoid such high-impact failures, we set the value of this flag to 0 for the duration of the database migration. This kept connections from being terminated and allowed for smoother replication through the DMS jobs. Generally, for all the database flags we talked about here, we advised our customer to restore the default flag values once the migration completed.

Reduce downtime required for DMS limitations through automation

While DMS does the majority of the database migration through continuous replication while the source database instance is live in production, DMS has certain migration limitations that cannot be addressed when the database is live. For PostgreSQL, the known limitations of DMS include:

Any new tables created on the source PostgreSQL database after the DMS job has been initialized are not replicated to the destination PostgreSQL database.
Tables without primary keys on the source PostgreSQL database are not migrated; for those tables, DMS migrates only the schema. (This is no longer a limitation after the June 2022 product update.)
The large object (LOB) data type is not supported by DMS.
Only the schema for materialized views is migrated; the data is not.
All migrated data is created under the ownership of cloudsqlexternalsync.

We had to address these aspects of the database migration manually.
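The first step in addressing them is knowing exactly which objects are affected. As a small illustration (connection details and names are placeholders), the tables without primary keys can be listed from the source catalog with a query like this:

```python
import psycopg2

# Placeholder connection string for the source database.
SOURCE_DSN = "host=source-host dbname=appdb user=migration_user"

NO_PK_TABLES_SQL = """
    SELECT n.nspname AS schema_name, c.relname AS table_name
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
      AND NOT EXISTS (
          SELECT 1 FROM pg_constraint p
          WHERE p.conrelid = c.oid AND p.contype = 'p'
      )
    ORDER BY 1, 2;
"""

with psycopg2.connect(SOURCE_DSN) as conn, conn.cursor() as cur:
    cur.execute(NO_PK_TABLES_SQL)
    no_pk_tables = [f"{schema}.{table}" for schema, table in cur.fetchall()]

print(no_pk_tables)  # fed into the dump/restore automation described next
```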
Since our client's database had data with the large object data type, tables without primary keys, and frequently changing table structures that could not be migrated by DMS, we had to manually export and import that data after DMS did the rest of the data migration. This part of the database migration required downtime to avoid data loss. For a terabyte-scale database, this data can run to hundreds of GBs, which means higher migration time and hence higher downtime. Furthermore, when you have dozens of databases to migrate, it can be stressful and error-prone for a human to perform these operations while on the clock during the cutover window!

This is where automation helped save the day. Automating the migration operations during the downtime period not only reduced the manual effort and error risk, but also provided a scalable solution that could be leveraged for the migration of hundreds of PostgreSQL database instances to Cloud SQL. Furthermore, by leveraging multiprocessing and multithreading, we were able to reduce the total migration downtime for hundreds of GBs of data by 98%, thereby reducing the business impact for our client. How did we get there?

We laid out all the steps that need to be executed during the downtime – that is, after the DMS job has completed its replication from source to destination and before cutting over the application to the migrated database. You can see a chart mapping out the sequence of operations performed during the downtime period in Fig 1.

Fig 1: Downtime Migration – Sequential Approach

By automating all the downtime operations in this sequential approach, we observed that it took 13 hours for the entire downtime flow to execute for a 1 TB database. This included the migration of 250 MB in new tables, 60 GB in tables without primary keys, and 150 GB in large objects. One key observation we made was that, out of all the steps, only three took most of the time: migrating new tables, migrating tables without primary keys, and migrating large objects. These took the longest because they all required dump and restore operations for their respective tables. However, these three steps did not have a hard dependency on each other, as they individually targeted different tables. So we tried to run them in parallel, as you can see in Fig 2. The steps following them – 'Refresh Materialized View' and 'Recover Ownership' – still had to be performed sequentially, as they targeted the entire database.

Running these three steps in parallel required upsizing the Cloud SQL instances, as we wanted to have sufficient resources available for each step. This led us to increase the Cloud SQL instances' vCPU count by 50% and memory by 40%, since the export and import operations depended heavily on vCPU consumption as opposed to memory consumption.

Fig 2: Downtime Migrations – Hybrid Approach

Migrating the new tables (created after the DMS job was initiated) and the tables without primary keys was straightforward, as we were able to leverage the native utilities offered by PostgreSQL: pg_dump and pg_restore. Both utilities can process tables in parallel using multiple threads; the higher the table count, the more threads that can be executed in parallel, allowing faster migration. With this revised approach, for the same 1 TB database, it still took 12.5 hours for the entire downtime flow to execute. This improvement reduced the cutover downtime, but we still found that we needed a 12.5 hour window to complete all the steps.
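For reference, the per-table dump and restore that we parallelized with the native utilities can be sketched roughly as follows; paths, job counts, connection strings, and table names are illustrative, and the data-only restore assumes the schema already exists on the destination (as it did here, since DMS had migrated it):

```python
import subprocess

SOURCE = "postgresql://migration_user@source-host/appdb"
TARGET = "postgresql://migration_user@cloudsql-host/appdb"
DUMP_DIR = "/tmp/no_pk_dump"

# Tables identified earlier as having no primary key (illustrative names).
tables = ["public.events_no_pk", "public.audit_log"]

# Directory format (--format=directory) is what allows parallel jobs.
dump_cmd = [
    "pg_dump", "--format=directory", "--jobs=4",
    "--file", DUMP_DIR, "--dbname", SOURCE,
]
for t in tables:
    dump_cmd += ["--table", t]
subprocess.run(dump_cmd, check=True)

# Data-only restore, again with parallel jobs; DMS already migrated the schema.
subprocess.run(
    ["pg_restore", "--jobs=4", "--data-only", "--dbname", TARGET, DUMP_DIR],
    check=True,
)
```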
We then discovered that 99% of the downtime was taken up by just one step: exporting and importing 150 GB of large objects. It turned out that multiple threads could not be used to accelerate the dump and restore of large objects in PostgreSQL. Hence, migrating the large objects single-handedly extended the downtime for migration by hours. Fortunately, we were able to come up with a workaround.

Optimize migration of large objects from the PostgreSQL database

PostgreSQL contains a large objects facility that provides stream-style access to data stored in a special large-object structure. When large objects are stored, they are broken down into multiple chunks and stored in different rows of the database, but are connected under a single Object Identifier (OID). This OID can thus be used to access any stored large object. Although users can add large objects to any table in the database, under the hood PostgreSQL physically stores all large objects within a database in a single table called pg_largeobject.

While leveraging pg_dump and pg_restore for the export and import of large objects, this single table, pg_largeobject, becomes a bottleneck, as the PostgreSQL utilities cannot execute multiple threads for parallel processing against just one table. Typically, the order of operations for these utilities looks something like this:

1. pg_dump reads the data to be exported from the source database.
2. pg_dump writes that data into the memory of the client where pg_dump is being executed.
3. pg_dump writes from memory to the disk of the client (a second write operation).
4. pg_restore reads the data from the client's disk.
5. pg_restore writes the data to the destination database.

Normally, these utilities would need to be executed sequentially to avoid data loss or data corruption due to conflicting processes, which further increases migration time for large objects. Our workaround for this single-threaded process involved two elements. First, our solution eliminated the second write operation, the write from memory to disk (step 3). Instead, once the data was read and written into memory, our program would begin the import process and write the data directly to the destination database. Second, since pg_dump and pg_restore could not use multiple threads to process the large objects in the single pg_largeobject table, we developed a solution that could: the thread count was based on the number of OIDs in pg_largeobject, and that single table was broken into smaller chunks for parallel execution. A minimal sketch of this idea appears at the end of this post.

This approach brought the large object migration down from hours to minutes, bringing the downtime needed for all the operations that DMS cannot handle, for the same 1 TB database, from 13 hours to just 18 minutes: a reduction of ~98% in the required downtime.

Conclusion

After multiple optimizations and dry runs, we were able to develop a procedure for our client to migrate dozens of terabyte-scale PostgreSQL databases to Google Cloud SQL with minimal business impact. We developed practices to optimize DMS-based migration by 15% using database flags and to reduce downtime by 98% with the help of automation and innovation. These practices can be leveraged for any terabyte-scale migration of PostgreSQL databases to Google Cloud SQL to accelerate migration, minimize downtime, and avoid performance impact on mission-critical applications.
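As referenced above, here is a rough sketch of the parallel large-object copy using psycopg2's large-object support: list the OIDs from the catalog, split them into chunks, and stream each chunk from source to destination in its own thread without an intermediate file. Connection strings, chunk size, and thread count are assumptions, and a production version would need error handling and retries:

```python
from concurrent.futures import ThreadPoolExecutor
import psycopg2

SOURCE_DSN = "host=source-host dbname=appdb user=migration_user"
TARGET_DSN = "host=cloudsql-host dbname=appdb user=migration_user"
CHUNK_SIZE = 500   # OIDs per worker (illustrative)
WORKERS = 8

def copy_chunk(oids):
    # Each worker opens its own connections; psycopg2 connections are not shared across threads.
    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    try:
        for oid in oids:
            data = src.lobject(oid, "rb").read()   # read the large object into memory
            dst.lobject(0, "wb", oid).write(data)  # recreate it with the same OID on the target
        src.commit()
        dst.commit()
    finally:
        src.close()
        dst.close()

# Enumerate all large object OIDs from the source catalog.
with psycopg2.connect(SOURCE_DSN) as conn, conn.cursor() as cur:
    cur.execute("SELECT oid FROM pg_largeobject_metadata ORDER BY oid")
    oids = [row[0] for row in cur.fetchall()]

chunks = [oids[i:i + CHUNK_SIZE] for i in range(0, len(oids), CHUNK_SIZE)]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(copy_chunk, chunks))
```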
Source: Google Cloud Platform

Four non-traditional paths to a cloud career (and how to navigate them)

One thing I love about cloud is that it's possible to succeed as a cloud engineer from all kinds of different starting points. It's not necessarily easy; our industry remains biased toward hiring people who check a certain set of boxes, such as having a university computer science degree. But cloud in particular is new enough, and has such tremendous demand for qualified talent, that determined engineers can and do wind up in amazing cloud careers despite coming from all sorts of non-traditional backgrounds. But still – it's scary to look at all the experienced engineers ahead of you and wonder, "How will I ever get from where I am to where they are?"

A few months ago, I asked some experts at Google Cloud to help me answer common questions people ask as they consider making the career move to cloud. We recorded our answers in a video series called Cracking the Google Cloud Career that you can watch on the Google Cloud Tech YouTube channel. We tackled questions like the following.

How do I go from a traditional IT background to a cloud job?

You have a superpower if you want to move from an old-school IT job to the cloud: you already work in tech! That may give you access to colleagues and situations that can level up your cloud skills and network right in your current position. But even if that's not happening, you don't have to go back and start from square one. Your existing career will give you a solid foundation of professional experience that you can layer cloud skills on top of. Check out my video to see what skills I recommend polishing up before you make the jump to cloud interviews.

How do I move from a help desk job to a cloud job?

The help desk is the classic entry-level tech position, but moving up sometimes seems like an insurmountable challenge. Rishab Kumar graduated from a help desk role to a Technical Solutions Specialist position at Google Cloud. In his video, he shares his story and outlines some takeaways to help you plot your own path forward. Notably, Rishab calls out the importance of building a portfolio of cloud projects: cloud certifications helped him learn, but in the job interview he got more questions about the side projects he had implemented. Watch his full breakdown in the video.

How do I switch from a non-technical career to the cloud?

There's no law that says you have to start your tech career in your early twenties and do nothing else for the rest of your career. In fact, many of the strongest technologists I know came from previous backgrounds as disparate as plumbing, professional poker, and pest control. That's no accident: those fields hone operational and people skills that are just as valuable in cloud as anywhere else. But you'll still need a growth mindset and lots of learning to land a cloud job without traditional credentials or previous experience in the space. Google Cloud's Stephanie Wong came to tech from the pageant world and has some great advice about how to build a professional network that will help you make the switch to a cloud job.
In particular, she recommends joining the no-cost Google Cloud Innovators program, which gives you inside access to the latest updates on Google Cloud services alongside a community of fellow technologists from around the globe. Stephanie also points out that you don’t have to be a software engineer to work in the cloud; there are many other roles, like developer relations, sales engineering, and solutions architecture, that stay technical and hands-on without building software every day. You can check out her full suggestions for transitioning to a tech career in this video.

How do I get a job in the cloud without a computer-related college degree?

No matter your age or technical skill level, it can be frustrating and intimidating to see role after role that requires a bachelor’s degree in a field such as IT or computer science. I’m going to let you in on a little secret: once you get that first job and add some experience to your skills, hardly anybody cares about your educational background anymore. But some recruiters and hiring managers still use degrees as a shortcut when evaluating people for entry-level jobs.

Without a degree, you’ll have to get a bit creative in assembling credentials. First, consider getting certified. Cloud certifications like the Google Cloud Associate Cloud Engineer can help you bypass degree filters and get you an interview. Not to mention, they’re a great way to get familiar with the workings of your cloud platform. Google Cloud’s Priyanka Vergadia suggests working toward skill badges on Google Cloud Skills Boost; each skill badge represents a curated grouping of hands-on labs within a particular technology that can help you build momentum and confidence toward certification.

Second, make sure you are bringing hands-on skills to the interview. College students do all sorts of projects to bolster their education. You can do this too – but at a fraction of the cost of a traditional degree. As Priyanka points out in this video, make sure you are up to speed on Linux, networking, and programming essentials before you apply.

No matter your background, I’m confident you can have a fulfilling and rewarding career in cloud as long as you get serious about these two things:

Own your credibility through certification and hands-on practice, and
Build strong connections with other members of the global cloud community.

In the meantime, you can watch the full Cracking the Google Cloud Career playlist on the Google Cloud Tech YouTube channel. And feel free to start your networking journey by reaching out to me anytime on Twitter if you have cloud career questions – I’m happy to help however I can.
Source: Google Cloud Platform

Pro tools for Pros: Industry-leading observability capabilities for Dataflow

Dataflow is the industry-leading unified platform for batch and stream processing. It is a fully managed service that comes with flexible development options (from Flex Templates and notebooks to the Apache Beam SDKs for Java, Python, and Go) and a rich set of built-in management tools. It integrates seamlessly with Google Cloud products such as Pub/Sub, BigQuery, Vertex AI, Cloud Storage, Spanner, and Bigtable, as well as with third-party services and products such as Kafka and AWS S3, to best meet your data movement use cases.

While our customers value these capabilities, they continue to push us to innovate and provide more value as the best batch and streaming data processing service for their ever-changing business needs. Observability is a key area where the Dataflow team continues to invest based on customer feedback: adequate visibility into the state and performance of Dataflow jobs is essential for business-critical production pipelines. In this post, we will review Dataflow’s key observability capabilities:

Job visualizers – job graphs and execution details
New metrics and logs
New troubleshooting tools – error reporting, profiling, insights
New Datadog dashboards and monitors

Dataflow observability at a glance

There is no need to configure or manually set up anything; Dataflow offers observability out of the box within the Google Cloud Console, from the time you deploy your job. Observability capabilities are seamlessly integrated with Google Cloud Monitoring and Logging along with other Google Cloud products. This integration gives you a one-stop shop for observability across multiple Google Cloud products, which you can use to meet your technical challenges and business goals.

Understanding your job’s execution: job visualizers

Questions: What does my pipeline look like? What’s happening in each step? Where is the time spent?

Solution: Dataflow’s Job graph and Execution details tabs answer these questions and help you understand the performance of the stages and steps within your job.

Job graph illustrates the steps involved in the execution of your job, in the default Graph view. The graph shows you how Dataflow has optimized your pipeline’s code for execution by fusing (optimizing) steps into stages. The Table view adds details about each step: its associated fused stages, the time spent in each step, and its status as the pipeline executes. Each step in the graph also displays information such as the input and output collections and output data freshness, which help you analyze the amount of work done at that step (elements processed) and its throughput.

Fig 1. Job graph tab showing the DAG for a job and the key metrics for each stage on the right.

Execution details has all the information you need to understand and debug the progress of each stage within your job. For streaming jobs, you can view the data freshness of each stage. The Data freshness by stages chart includes anomaly detection: it highlights “potential slowness” and “potential stuckness” to help you narrow your investigation down to a particular stage. Learn more about using the Execution details tab for batch and streaming here.

Fig 2. The Execution details tab showing data freshness by stage over time, with anomaly warnings in data freshness.

Monitor your job with metrics and logs

Questions: What is the state and performance of my jobs? Are they healthy? Are there any errors?

Solution: Dataflow offers several metrics to help you monitor your jobs.
A full list of Dataflow job metrics can be found in our metrics reference documentation. In addition to the Dataflow service metrics, you can view worker metrics, such as CPU utilization and memory usage. Lastly, you can generate Apache Beam custom metrics from your own code (a short example appears after the Error Reporting discussion below).

Job metrics is the one-stop shop for the most important metrics when reviewing the performance of a job or troubleshooting it. Alternatively, you can access this data from the Metrics Explorer to build your own Cloud Monitoring dashboards and alerts.

Job and worker logs are among the first things to look at when you deploy a pipeline. You can access both log types in the Logs panel on the Job details page. Job logs include information about startup tasks, fusion operations, autoscaling events, worker allocation, and more. Worker logs include information about the work processed by each worker within each step of your pipeline. You can configure and modify the logging level and route the logs using the guidance provided in our pipeline log documentation. Logs are seamlessly integrated into Cloud Logging, so you can write Cloud Logging queries, create log-based metrics, and create alerts on those metrics.

New: Metrics for streaming jobs

Questions: Is my pipeline slowing down or getting stuck? How is my code affecting the job’s performance? How are my sources and sinks performing with respect to my job?

Solution: We have introduced several new metrics for Streaming Engine jobs that help answer these questions, and all of them are now instantly accessible from the Job metrics tab.

The engineering teams at the Renault Group have been using Dataflow for their streaming pipelines as a core part of their digital transformation journey. “Deeper observability of our data pipelines is critical to track our application SLOs,” said Elvio Borrelli, Tech Lead – Big Data at the Renault Digital Transformation & Data team. “The new metrics, such as backlog seconds and data freshness by stage, now provide much better visibility about our end-to-end pipeline latencies and areas of bottlenecks. We can now focus more on tuning our pipeline code and data sources for the necessary throughput and lower latency.”

To learn more about using these metrics in the Cloud console, see the Dataflow monitoring interface documentation.

Fig 3. The Job metrics tab showing the autoscaling chart and the various metrics categories for streaming jobs.

To learn how to use these metrics to troubleshoot common symptoms in your jobs, watch this webinar: Dataflow Observability, Monitoring, and Troubleshooting.

Debug job health using Cloud Error Reporting

Problem: There are a couple of errors in my Dataflow job. Is it my code, my data, or something else? How frequently are they happening?

Solution: Dataflow offers native integration with Google Cloud Error Reporting to help you identify and manage errors that affect your job’s performance. In the Logs panel on the Job details page, the Diagnostics tab tracks the most frequently occurring errors. This integration with Error Reporting lets you manage errors by creating bugs or work items or by setting up notifications, and for certain types of Dataflow errors, Error Reporting links to troubleshooting guides and solutions.

Fig 4. The Diagnostics tab in the Logs panel displaying top errors and their frequency.
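As a quick illustration of the custom-metrics hook mentioned above, here is a minimal Apache Beam Python sketch that increments a user-defined counter inside a DoFn. The namespace, metric name, and record format are invented for the example; the Metrics API itself is the standard Beam one, and the counter is reported alongside the job’s other metrics when the pipeline runs on Dataflow.

```python
import apache_beam as beam
from apache_beam.metrics import Metrics


class ParseEvent(beam.DoFn):
    """Parses raw records and counts the ones that fail (hypothetical format)."""

    def __init__(self):
        # Custom counter: namespace and name are arbitrary labels for this example.
        self.malformed = Metrics.counter("pipeline_quality", "malformed_records")

    def process(self, record):
        try:
            user_id, value = record.split(",")  # assumed "user,value" record format
            yield {"user_id": user_id, "value": int(value)}
        except ValueError:
            # Increment the custom metric; it surfaces in Dataflow job metrics.
            self.malformed.inc()


with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(["alice,10", "bob,oops", "carol,7"])
     | "Parse" >> beam.ParDo(ParseEvent())
     | "Print" >> beam.Map(print))
```

Run locally this uses the DirectRunner; submitted with the usual DataflowRunner options, the malformed_records counter appears with the job’s metrics in the console.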
New: Troubleshoot performance bottlenecks using Cloud Profiler

Problem: What part of my code is taking the most time to process the data? Which operations are consuming the most CPU cycles or memory?

Solution: Dataflow offers native integration with Google Cloud Profiler, which lets you profile your jobs to understand performance bottlenecks, with support for CPU, memory, and I/O operation profiling. Is my pipeline’s latency high? Is it CPU intensive, or is it spending its time waiting for I/O operations? Or is it memory intensive, and if so, which operations are driving that up? The flame graph helps you answer these questions. You can enable profiling for your Dataflow jobs by specifying a flag during job creation or while updating your job; a short launch example appears after the Datadog section below. To learn more, see the Monitor pipeline performance documentation.

Fig 5. The CPU time profiler showing the flame graph for a Dataflow job.

New: Optimize your jobs using Dataflow insights

Problem: What can Dataflow tell me about improving my job’s performance or reducing its costs?

Solution: You can review Dataflow insights to improve performance or reduce costs. Insights are enabled by default on your batch and streaming jobs; they are generated by automatically analyzing your jobs’ executions. Dataflow insights is powered by Google Active Assist’s Recommender service. It is automatically enabled for all jobs and is available free of charge. Insights include recommendations such as enabling autoscaling, increasing the maximum number of workers, and increasing parallelism. Learn more in the Dataflow Insights documentation.

Fig 6. Dataflow insights show up on the Jobs overview page next to the active jobs.

New: Datadog Dashboards & Recommended Monitors

Problem: I would like to monitor Dataflow in my existing monitoring tools, such as Datadog.

Solution: Dataflow’s metrics and logs are accessible in the observability tools of your choice via the Google Cloud Monitoring and Logging APIs. Customers using Datadog can now leverage the out-of-the-box Dataflow dashboards and recommended monitors to monitor their Dataflow jobs alongside other applications within the Datadog console. Learn more in Datadog’s blog post on how to monitor your Dataflow pipelines with Datadog.

Fig 7. Datadog dashboard monitoring Dataflow jobs across projects.

ZoomInfo, a global leader in modern go-to-market software, data, and intelligence, is partnering with Google Cloud to enable customers to easily integrate their business-to-business data into Google BigQuery. Dataflow is a critical piece of this data movement journey. “We manage several hundred concurrent Dataflow jobs,” said Hasmik Sarkezians, ZoomInfo Engineering Fellow. “Datadog’s dashboards and monitors allow us to easily monitor all the jobs at scale in one place. And when we need to dig deeper into a particular job, we leverage the detailed troubleshooting tools in Dataflow, such as Execution details, worker logs, and job metrics, to investigate and resolve the issues.”
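For readers who want to try the Cloud Profiler integration described above, here is a minimal sketch of launching a Beam Python pipeline with profiling enabled. The project, region, bucket, and job name are placeholders, and the enable_google_cloud_profiler service option is the flag described in the Monitor pipeline performance documentation; check that documentation for the exact option name for your SDK and language before relying on this.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, bucket, and job name values.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    job_name="profiled-wordcount-example",
    # Turns on Cloud Profiler for this job (see the Dataflow docs for details).
    dataflow_service_options=["enable_google_cloud_profiler"],
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.Create(["to be or not to be"])
     | "Split" >> beam.FlatMap(str.split)
     | "Pair" >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum))
```

Once the job is running, the CPU flame graph discussed earlier becomes available for it in Cloud Profiler.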
What’s Next

Dataflow is leading the batch and streaming data processing industry with best-in-class observability experiences, but we are just getting started. Over the next several months, we plan to introduce more capabilities, such as:

Memory observability to detect and prevent potential out-of-memory errors.
Metrics for sources and sinks, end-to-end latency, bytes processed by a PTransform, and more.
More insights – quota, memory usage, worker configurations and sizes.
Pipeline validation before job submission.
Debugging user-code and data issues using data sampling.
Autoscaling observability improvements.
Project-level monitoring, sample dashboards, and recommended alerts.

Got feedback or ideas? Shoot them over, or take this short survey.

Getting Started

To get started with Dataflow, see the Cloud Dataflow quickstarts. To learn more about Dataflow observability, review these articles:

Using the Dataflow monitoring interface
Building production-ready data pipelines using Dataflow: Monitoring data pipelines
Beam College: Dataflow Monitoring
Beam College: Dataflow Logging
Beam College: Troubleshooting and debugging Apache Beam and GCP Dataflow
Source: Google Cloud Platform

Spanner on a modern columnar storage engine

Google was born in the cloud. At Google, we have been running massive infrastructure that powers critical internal and external facing services for more than two decades. Our investment in this infrastructure is constant, ranging from user-visible features to invisible internals that make the infrastructure more efficient, reliable, and secure, and updates and improvements are made to it continuously. With billions of users served around the globe, availability and reliability are at the core of how we operate and update our infrastructure.

Spanner is Google’s massively scalable, replicated, and strongly consistent database management service. With hundreds of thousands of databases running in production, Spanner serves over 2 billion requests per second at peak and has over 6 exabytes of data under management that is the “source of truth” for many mission-critical services, including AdWords, Search, and Cloud Spanner customers. Customer workloads are diverse and stretch the system in different ways. Although Spanner receives constant binary releases, a fundamental change such as swapping out the underlying storage engine is a challenging undertaking. In this post, we talk about our journey migrating Spanner to a new columnar storage engine. We discuss the challenges such a massive-scale migration faced and how we accomplished the effort in roughly two to three years while all the critical services running on top continued uninterrupted.

The Storage Engine

The storage engine is where a database turns its data into actual bytes and stores them in the underlying file system. In a Spanner deployment, a database is hosted in one or more instance configurations, which are physical collections of resources. The instance configurations and databases comprise one or more zones, or replicas, that are served by a number of Spanner servers. The storage engine in the server encodes the data and stores it in the underlying large-scale distributed file system, Colossus.

Spanner originally used a Bigtable-like storage engine based on the SSTable (Sorted String Table) format stack. This format has proven incredibly robust through years of large-scale deployment in Bigtable and in Spanner itself. The SSTable format is optimized for schemaless NoSQL data consisting primarily of large strings. While it is a perfect match for Bigtable, it is not the best fit for Spanner; in particular, traversing individual columns is inefficient.

Ressi is the new low-level, column-oriented storage format for Spanner. It is designed from the ground up for handling SQL queries over large-scale, distributed databases with both OLTP and OLAP workloads, while maintaining and improving the performance of read and write queries on key-value data. Ressi includes optimizations ranging from block-level data layout and file-level organization of active and inactive data to existence filters for storage I/O savings. The data organization improves storage usage and helps with large scan queries. Deployments of Ressi with very large-scale services on Spanner, such as Gmail, have shown performance improvements along multiple dimensions, such as CPU and storage I/O.

The Challenges of Storage Engine Migration

Improvements and updates to Spanner are constant, and we are adept at safely operating and evolving our system in a dynamic environment.
However, a storage engine migration changes the foundation of a database system and presents distinct challenges, especially at a massive deployment scale. In general, in a production OLTP database system, a storage engine migration needs to be done without interruption to the hosted databases, without degradation of latency and throughput, and without compromising data integrity. There have been past attempts and success stories of live database storage engine migrations, but successful attempts at the scale of Spanner, with multiple exabytes of data, are rare. The mission-critical nature of the services and the massive scale place very high requirements on how the migration is handled.

Reliability, Availability & Data Integrity

The topmost requirement of the migration was maintaining service reliability, availability, and data integrity throughout. The challenges were paramount and unique at Spanner’s deployment scale:

Spanner database workloads are diverse and interact with the underlying Spanner system in different ways. Successful migration of one database does not guarantee successful migration of another.
Massive data migration inherently creates unusual churn in the underlying system. This may trigger latent and unanticipated behavior, causing production outages.
We operate in a dynamic environment with constant ambient changes from customers and from new Spanner feature development, so migration risk did not decrease monotonically over time.

Performance & Cost

Another challenge of migrating to a new storage engine is achieving good performance and reducing cost. Performance regressions can arise during the migration from underlying churn, and after the migration when certain aspects of a workload interact poorly with the new storage engine. This can cause issues such as increased latency and rejected requests. A performance regression may also manifest as increased storage usage in some databases due to variance in database compressibility. This increases internal resource consumption and cost; what’s more, if additional storage is not available, it may lead to production outages. Although the new columnar storage engine improves both performance and data compression in general, at Spanner’s massive deployment scale we had to watch out for the outliers.

Complexity and Supportability

The existence of dual formats not only requires more engineering effort to support, but also increases system complexity and performance variance across zones. An obvious way to mitigate this risk is to achieve high migration velocity and, in particular, to shorten the period during which dual formats co-exist in the same databases. However, databases on Spanner have different sizes, spanning several orders of magnitude, so the time required to migrate each database can vary widely. Scheduling databases for migration cannot be done in a one-size-fits-all manner. The migration effort had to take into account the transition period in which dual formats exist while trying to achieve the highest velocity safely and reliably.

A Systematic, Principled Approach toward Migration Reliability

We introduced a systematic approach based on a set of reliability principles we defined. Using these principles, our automation framework automatically evaluated migration candidates (i.e., instance configurations and/or databases), selecting conforming candidates for migration and flagging violations.
The flagged migration candidates were reviewed separately, and their violations were resolved before they became eligible for migration. This effectively reduced toil and increased velocity without sacrificing production safety.

The Reliability Principles & Automation Architecture

The reliability principles were the cornerstones of how we conducted the migration. They covered multiple aspects: evaluating the health and suitability of migration candidates, managing customer exposure to production changes, handling performance regressions and data integrity, and mitigating risks in a dynamic environment with constant changes, such as new releases and feature launches within and outside of Spanner.

Based on the reliability principles, we built an automation framework. Various stats and metrics were collected; together they formed a modeled view of the state of the Spanner universe, continuously updated to accurately reflect its current state. In this architectural design, the reliability principles became filters: a migration candidate could only pass through and be selected by the migration scheduler if it satisfied the requirements. Migration scheduling was done in weekly waves to enable gradual rollout. As previously mentioned, migration candidates that did not satisfy the reliability principles were not ignored – they were flagged for attention and resolved in one of two ways: override and migrate with caution, or resolve the underlying blocking issue and then migrate.

Migration Scheduling & Weekly Rollout

Migration scheduling was the core component in managing migration risk, preventing performance regressions, and ensuring data integrity. Because of the diverse customer workloads and the wide spectrum of deployment sizes, we adopted fine-grained migration scheduling. The scheduling algorithm treated customer deployments as failure domains and properly staged and spaced the migration of customer instance configurations. Together with the rollout automation, this enabled an efficient migration while keeping risk under control. Under this framework, the migration proceeded progressively along the following dimensions:

among multiple instance configurations of the same customer deployment;
among the multiple zones of the same instance configuration; and
among the migration candidates in the weekly rollout wave.

Customer Deployment-aware Scheduling

Progressive rollout within a customer’s deployment required us to recognize the customer deployment as a set of failure domains. We used a heuristic that indicates deployment ownership and usage. In Spanner’s case, this is also a close approximation of workload categorization, because the multiple instances are typically regional instances of the same service. The categorization produced equivalence classes of deployment instances, where each class is a collection of instance configurations from the same customer with the same workload. The weekly wave scheduler selected migration candidates (i.e., replicas/zones in an instance configuration) from each equivalence class. Candidates from multiple equivalence classes could be chosen independently, since their workloads were isolated.
Blocking issues in one equivalence class would not prevent progress in the other classes.

Progressive Rollout of Weekly Waves

To mitigate new issues arising from new releases and changes by both customers and Spanner, the weekly waves themselves were rolled out progressively, allowing issues to surface without causing widespread impact while still accelerating to increase migration velocity.

Managing Reliability, Availability & Performance

Using the mechanisms described above, customer deployments were carefully moved through a series of state changes, preventing performance degradation and loss of availability and data integrity. At the start, an instance configuration of a customer was chosen and an initial zone/replica (henceforth referred to as the “first zone”) was migrated. This avoided potential global production impact on the customer while revealing issues if the workload interacted poorly with the new storage engine. Following the first-zone migration, data integrity was verified by comparing the migrated zone with other zones using Spanner’s built-in integrity check. If this check failed or a performance regression occurred after the migration, the instance was restored to its previous state.

We pre-estimated the post-migration storage size, and the reliability principles blocked instances with excessive projected storage increase from migrating. As a result, we saw few unexpected storage compression regressions after migrations. Regardless, resource usage and system health were closely monitored by our monitoring infrastructure; if an unexpected regression occurred, the instance was restored to the desired state by migrating the zone back to the SSTable format. Only when everything looked healthy did the migration of the customer deployment proceed, progressively migrating more instances and/or zones and accelerating as risk was further reduced.

Project Management & Driving Metrics

A massive migration effort requires effective project management and the identification of key metrics to drive progress. We drove a few key metrics, including (but not limited to):

The coverage metric. This metric tracked the number and percentage of Spanner instances running the new storage engine. This was the highest-priority metric: as its name indicates, it covered the interaction of different workloads with the new storage engine, allowing early discovery of underlying issues.
The majority metric. This metric tracked the number and percentage of Spanner instances with the majority of their zones running the new storage engine. It allowed us to catch anomalies at tipping points in a quorum-based system like Spanner.
The completion metric. This metric tracked the number and percentage of Spanner instances running entirely on the new storage engine. Achieving 100% on this metric was our ultimate goal.

The metrics were maintained as time series, allowing us to examine trends and shift gears as we approached the later stages of the effort.

Summary

Performing a massive-scale migration is an effort that encompasses strategic design, building automation, designing processes, and shifting execution gears as the effort progresses.
With a systematic and principled approach, we completed a massive-scale migration in Spanner, involving over 6 exabytes of data under management and 2 billion QPS at peak, within a short amount of time and without compromising service availability, reliability, or integrity. Many of Google’s critical services depend on Spanner and have already seen significant improvements from this migration. Furthermore, the new storage engine provides a platform for many future innovations. The best is yet to come.
Source: Google Cloud Platform