Docker Desktop & WSL 2 – Backport Update

While we have continued to make improvements to our Windows experience on Docker Desktop for users of Hyper-V, we are excited to see that Microsoft has announced the backport of WSL 2 to Windows versions 1903 and 1909. This means that as of today, Docker Desktop Edge users can use Docker Desktop with WSL 2 rather than our legacy Hyper-V based backend. This is available not only for Windows Pro and Windows Enterprise, but also for Windows Home users. This is the first time that Docker has been available on Windows Home versions 1903 and 1909!

This means that developers can take advantage of the Docker and WSL 2 integration, storing their code within their WSL 2 distro and running the Docker CLI from within that distro. This removes the need to access files stored on the Windows host and provides significant performance improvements.
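For example, a day-to-day workflow inside the distro might look like the following. This is a minimal sketch; the repository and image names are placeholders:

```bash
# Inside the WSL 2 distro (e.g., Ubuntu), with WSL integration enabled
# for that distro in Docker Desktop's settings.

# Keep source code on the Linux filesystem rather than under /mnt/c.
git clone https://github.com/example/my-app.git ~/my-app  # placeholder repository
cd ~/my-app

# The Docker CLI inside the distro talks to Docker Desktop's WSL 2 backend.
docker build -t my-app .
docker run --rm -p 8080:8080 my-app
```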

To find out more about using Docker Desktop with WSL 2, check out Simon's full tips and tricks article. If you want to learn more about how Docker developed the WSL 2 backend, you can have a look through our history of the integration, or for the implementation details of the backend, check out one of Simon's other blog posts.

If you are enjoying Docker Desktop but have ideas for how we could make it better, please give us feedback. You can let us know which features you want to see next via our roadmap, including voting up GPU support for WSL 2. Or, if you are new to Docker, download Docker Desktop and get started with WSL 2 today.
Source: https://blog.docker.com/feed/

Dataflow Under the Hood: the origin story

Editor's note: This is the first blog in a three-part series examining the internal Google history that led to Dataflow, how Dataflow works as a Google Cloud service, and how it compares and contrasts with other products in the marketplace.

Google Cloud's Dataflow, part of our smart analytics platform, is a streaming analytics service that unifies stream and batch data processing. To get a better understanding of Dataflow, it helps to also understand its history, which starts with MillWheel.

A history of Dataflow

Like many projects at Google, MillWheel started in 2008 with a tiny team and a bold idea. When the project started, our team (led by Paul Nordstrom) wanted to create a system that did for streaming data processing what MapReduce had done for batch data processing: provide robust abstractions and scale to massive size. In those early days, we had a handful of key internal Google customers (from Search and Ads) who were driving requirements for the system and pressure-testing the latest versions.

One of MillWheel's first applications was building pipelines that operated on click logs to compute real-time session information, in order to better understand how to improve systems like Search for our customers. Up until this point, session information was computed on a daily basis, spinning up a colossal number of machines in the wee hours of the morning to produce results in time for when engineers logged on that morning. MillWheel aimed to change that by spreading the load over the entire day, resulting in more predictable resource usage as well as vastly improved data freshness. Since a session can be an arbitrary length of time, this Search use case helped provide early motivation for key MillWheel concepts like watermarks and timers.

Alongside the sessions use case, we started working with the Google Zeitgeist team (now Google Trends) to look at an early version of trending queries from search traffic. To do this, we needed to compare current traffic for a given keyword to historical traffic, so that we could determine fluctuations compared to the baseline. This drove a lot of the early work that we did around state aggregation and management, as well as efficiency improvements to the system, to handle cases like first-time queries or one-and-done queries that we'd never see again.

In building MillWheel, we encountered a number of challenges that will sound familiar to any developer working on streaming data processing. For one thing, it's much harder to test and verify correctness for a streaming system, since you can't just rerun a batch pipeline to see if it produces the same "golden" outputs for a given input. For our streaming tests, one of the early frameworks that we developed was called the "numbers" pipeline, which staggered inputs from 1 to 1e6 over different time delivery intervals, aggregated them, and verified the outputs at the end. Though it was a bit arduous to build, it more than paid for itself in the number of bugs it caught.
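That style of test is easy to express today with the Apache Beam Python SDK's TestStream, which can stagger inputs across event time and assert on the final aggregate. Here is a toy analogue of the "numbers" pipeline; this is a sketch for illustration, not the original internal framework, and it uses a much smaller N:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.testing.util import assert_that, equal_to

N = 1000  # the internal pipeline used 1e6; kept small here

# Deliver the numbers 1..N in batches staggered across event time.
stream = TestStream()
for start in range(1, N + 1, 100):
    batch = list(range(start, min(start + 100, N + 1)))
    stream = stream.advance_watermark_to(start).add_elements(batch)
stream = stream.advance_watermark_to_infinity()

with TestPipeline(options=PipelineOptions(streaming=True)) as p:
    total = p | stream | beam.CombineGlobally(sum)
    # Verify the aggregate against the closed form: 1 + 2 + ... + N.
    assert_that(total, equal_to([N * (N + 1) // 2]))
```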
Dataflow represents the latest innovation in a long line of precursors at Google. The engineers who built Dataflow (an effort co-led with Frances Perry) first experimented with streaming systems by building MillWheel, which defined some of the core semantics around timers, state management, and watermarks, but proved to be challenging to use in a number of ways. A lot of these challenges were similar to the issues that led us to build Flume for users who wanted to run multiple logical MapReduce (actually map-shuffle-combine-reduce) operations together.

So, to meet those challenges, we experimented with a higher-level model for programming pipelines called Streaming Flume (no relation to Apache Flume). This model allowed users to reason in terms of datasets and transformations, rather than physical details like computation nodes and the streams between them.

When it came time to build something for Google Cloud, we knew that we wanted to build a system that combined the best of what we'd learned with ambitious goals for the future. Our big bet with Dataflow was to take the semantics of (batch) Flume and Streaming Flume and combine them into a single system with unified streaming and batch semantics. Under the hood, we had a number of technologies that we could build the system on top of, which we've successfully decoupled from the semantic model of Dataflow. That has let us continue to improve this implementation over time without requiring major rewrites to user pipelines.

Along the way, we've created a number of publications about our work in data processing, particularly around streaming systems. Check those out here:

- MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
- FlumeJava: Easy, Efficient Data-Parallel Pipelines

How Dataflow works

Let's take a moment to quickly review some key concepts in Dataflow. When we say that Dataflow is a streaming system, we mean that it processes (and can emit) records as they arrive, rather than according to some fixed threshold (e.g., record count or time window). While users can impose these fixed semantics in defining what outputs they want to see, the underlying system supports streaming inputs and outputs.

Within Dataflow, a key concept is the idea of event time: a timestamp that corresponds to the time when an event occurred, rather than the time at which it is processed. Supporting event time is critical for a wide range of applications, because it lets users ask questions like "How many people logged on between 1am and 2am?"
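As a minimal sketch of what programming against event time looks like, here is an Apache Beam Python pipeline (Beam implements the Dataflow model) that counts logins per event-time hour; the topic and field names are hypothetical:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def with_event_time(event):
    # Stamp each record with the time the event occurred (event time),
    # not the wall-clock time at which the pipeline processes it.
    return window.TimestampedValue(event, event["event_time_epoch_s"])

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    logins_per_hour = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/logins")
        | "Parse" >> beam.Map(json.loads)
        | "Stamp" >> beam.Map(with_event_time)
        | "HourlyWindows" >> beam.WindowInto(window.FixedWindows(60 * 60))
        # Counts are grouped by event-time hour, so "how many people logged
        # on between 1am and 2am?" is answered by the 1am window's output.
        | "Count" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
    )
```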
One of the architectures that Dataflow is often compared to is the Lambda Architecture, where users run parallel copies of a pipeline (one streaming, one batch) in order to have a "fast" copy of (often partial) results as well as a correct one. There are a number of drawbacks to this approach, including the obvious computational, operational, and development costs of running two systems instead of one. It's also important to note that Lambda Architectures often use systems with very different software ecosystems, making it challenging to replicate complex application logic across both. Finally, it's non-trivial to reconcile the outputs of the two pipelines at the end. This is a key problem that we've solved with Dataflow: users write their application logic once, and can choose whether they would like fast (but potentially incomplete) results, slow (but correct) results, or both.

To help demonstrate Dataflow's advantage over Lambda Architectures, consider the use case of a large retailer with online and in-store sales. Such a retailer would benefit from in-store BI dashboards, used by in-store employees, that show regional and global inventory, both to help shoppers find what they're looking for and to let the retailer know what's been popular with customers. The dashboards could also be used to drive inventory distribution decisions from a central or regional team. In a Lambda Architecture, these systems would likely have delays in updates that are corrected later by batch processes; before those corrections are made, they could misrepresent availability for low-inventory items, particularly during high-volume times like the holidays. Poor results in retail can lead to bad customer experiences, but in other fields, like cybersecurity, they can lead to complacency and ignored intrusion alerts. With Dataflow, this data would always be up to date, ensuring a better experience for customers by avoiding promises of inventory that's not available, or, in cybersecurity, providing an alerting system that can be trusted.

That covers much of Dataflow's origin story, but there are more interesting concepts to discuss. Be sure to check out the other blogs in our Dataflow "Under the Hood" series to learn more.
Source: Google Cloud Platform

Synthetic data generation with Dataflow data generator flex template

Generating synthetic data at a very high rate of queries per second (QPS) is a challenging task that has forced developers to build and launch multiple instances of a complex multi-threaded application. This is a common need: it helps IT teams validate system resilience during evaluations and migrations to new platforms. So we decided to build a pipeline that eliminates the heavy lifting and makes synthetic data generation easier.

We are excited to announce the launch of a new Dataflow Flex Template called Streaming Data Generator, which is capable of publishing unlimited high-volume JSON messages to a Google Cloud Pub/Sub topic. In this blog post, we will briefly discuss the use cases and how to use the template.

Flex Templates

Before diving into the details of the Streaming Data Generator template's functionality, let's explore Dataflow templates at a very high level. The primary goal of Dataflow templates is to package Dataflow pipelines in the form of reusable artifacts that can be run through various channels (UI / CLI / REST API) and be used by different teams. In the initial version of templates (called traditional templates), pipelines were staged on Google Cloud Storage and could be launched from the Google Cloud Console, the gcloud command-line tool, or other cloud-native Google Cloud services such as Cloud Scheduler or Cloud Functions.

However, traditional templates have certain limitations:

- Lack of support for dynamic DAGs
- Many I/Os don't implement the ValueProvider interface, which is essential to supporting runtime parameters

Flex Templates overcome these limitations. Flex Templates package Dataflow pipeline code, including application dependencies, as Docker images and stage the images in Google Container Registry (GCR). A metadata specification file referencing the GCR image path and parameter details is created and stored in Google Cloud Storage. Users can invoke a pipeline through a variety of channels (UI, gcloud, REST) by referring to the spec file. Behind the scenes, the Flex Template launcher service runs Docker containers with the parameters supplied by the user.

Streaming Data Generator Overview

The Streaming Data Generator template publishes fake JSON messages, based on a user-provided schema and at a specified rate (measured in messages per second), to a Google Cloud Pub/Sub topic. The JSON Data Generator library used by the pipeline supports various faker functions that can be associated with a schema field (an example schema appears after the launch examples below). The pipeline supports configuration parameters to specify the message schema, specify the number of messages published per second (i.e., QPS), enable autoscaling, and more.

The primary use case of the pipeline is to benchmark the consumption rate of streaming pipelines and evaluate the resources (number of workers and machine types) required to meet the desired performance.

Launching the Pipeline

The pipeline can be launched from the Cloud Console, the gcloud command-line tool, or the REST API.

To launch from the Cloud Console:

1. Go to the Dataflow page in the Cloud Console.
2. Click "Create Job From Template."
3. Select "Streaming Data Generator" from the Dataflow template drop-down menu.
4. Enter the job name.
5. Enter the required parameters, such as the schema location, the target QPS, and the output Pub/Sub topic.
6. Enter optional parameters such as autoscalingAlgorithm and maxNumWorkers, if required.
7. Click "Run Job."

To launch using the gcloud command-line tool or the REST API, you can use invocations along the lines of the sketches below.
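For the gcloud path, an invocation would look roughly like the following. This is a sketch: the template's Cloud Storage path and the parameter names (schemaLocation, qps, topic) reflect the public template's documentation, but verify them against the current docs, and note that depending on your gcloud version the command may live under the beta track (gcloud beta dataflow flex-template run):

```bash
JOB_NAME="streaming-data-generator-$(date +%Y%m%d-%H%M%S)"

gcloud dataflow flex-template run "$JOB_NAME" \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates/latest/flex/Streaming_Data_Generator \
    --parameters schemaLocation=gs://my-bucket/schemas/events.json,qps=50000,topic=projects/my-project/topics/synthetic-events
```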
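The REST path goes through the Dataflow projects.locations.flexTemplates:launch method. Here is a minimal sketch in Python using the google-auth library; the project, bucket, and topic names are placeholders:

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Uses Application Default Credentials for authentication.
credentials, project = google.auth.default()
session = AuthorizedSession(credentials)

region = "us-central1"
body = {
    "launchParameter": {
        "jobName": "streaming-data-generator",
        "containerSpecGcsPath": (
            "gs://dataflow-templates/latest/flex/Streaming_Data_Generator"
        ),
        "parameters": {
            "schemaLocation": "gs://my-bucket/schemas/events.json",  # placeholder
            "qps": "50000",
            "topic": f"projects/{project}/topics/synthetic-events",  # placeholder
        },
    }
}

resp = session.post(
    f"https://dataflow.googleapis.com/v1b3/projects/{project}"
    f"/locations/{region}/flexTemplates:launch",
    json=body,
)
resp.raise_for_status()
print(resp.json())  # contains the launched job's metadata
```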
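Both launch examples point schemaLocation at a schema file in Cloud Storage. As a purely hypothetical illustration of the faker-function style the JSON Data Generator library supports (consult that library's documentation for the actual set of functions and their signatures):

```
{
  "eventId": "{{uuid()}}",
  "username": "{{firstName()}}",
  "score": {{integer(0, 100)}}
}
```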
Next Steps

We hope that this template, combined with Dataflow's serverless nature, will enhance your productivity and make synthetic data generation much simpler. To learn more, you can read the documentation, check out the code, or get started by running the template on Google Cloud. In addition to utility templates like this one, the Dataflow team provides a wide variety of batch and streaming templates for point-to-point data transfers covering popular data sources and destinations.
Source: Google Cloud Platform

3 reasons to consider Cloud Spanner for your next project

A database is a key architectural component of almost every application. When you design an application, you'll invariably need to durably store application data. Without persisting data to a shared database, there are no options for application scalability or for upgrades to the underlying hardware; worse, any data will be immediately lost in the case of an infrastructure failure. With a reliable database, though, you enable application scalability and ensure data durability and consistency, service availability, and improved system supportability.

Google Cloud's Spanner database was built to fulfill needs around storing structured data for products here at Google and at our many cloud customers. Spanner is part of Google's core infrastructure, trusted to safeguard our business, so you can trust it too, regardless of your industry or use case.

Before Spanner, our products predominantly used sharded MySQL for database use cases where transactions were needed. The goal of the development effort, as described in the Spanner paper, was to create a data storage service for applications that have complex, evolving schemas, or that want strong consistency in the presence of wide-area replication.

One of the first concepts that comes up when considering Spanner is its ability to scale to arbitrarily large database sizes. Spanner does indeed support Google applications (such as Gmail and YouTube) that provide features for billions of our users, so scalability must be a first-class feature. In this post, we'll explore how Spanner is designed for applications that operate at any scale, big or small, across a variety of use cases; how it provides a low barrier to entry for developers; and how it lowers total cost of ownership (TCO). Here's what you need to know.

Start anywhere and scale as you grow

Spanner can handle data volumes at a massive scale, which makes it useful for applications of many sizes, not just the large ones. Further, your organization can benefit from standardizing on a single database engine for all workloads that require an RDBMS. Spanner provides a solid foundation for all kinds of applications with its combination of familiar relational database management system (RDBMS) features, such as ANSI 2011 SQL, DML, and foreign keys, and unique features such as strong external consistency via TrueTime and high availability via native synchronous replication.

We'd like to take a moment to challenge perceptions of "smaller scale": that smaller applications are not important, or that they do not have lofty availability goals or the need for transactional fortitude. This categorization does not indicate that an application is any less business-critical than a massive-scale application. Nor does it imply that a given application will not eventually require higher scale than at its initial rollout. While your application might have a small user base or transaction volume to start, this Spanner scalability advantage should not be overlooked: an application designed with a Spanner back end will not require a rewrite or any sort of database migration if success results in future data volume or transaction growth.
For example, if you are a gaming company developing the next cool, groundbreaking game, you want to be prepared to meet user growth if the game is a runaway success on launch day.

No matter the scale of your application, there are strong benefits when you choose Spanner, including transaction support, high-availability guarantees, read-only replicas, and effortless scalability.

Transaction support and strong external consistency

Spanner provides external consistency guarantees via TrueTime. Spanner uses this fully redundant system of atomic clocks to obtain timestamps from what amounts to a virtual, distributed global clock. Since Spanner can apply a timestamp from a globally agreed-upon source to every transaction upon commit, the transaction commit sequence is unequivocal. External consistency requires that all transactions appear to execute sequentially, in an order consistent with real time, and Spanner satisfies this strong consistency guarantee.

Strong consistency is required by many application types, especially those where quantities of goods or currency are maintained, and for which eventual consistency would not be at all suitable. That includes, but is not limited to, supply chain management, retail pricing and inventory management, and banking, trading, and ledger applications.

If your database does not have strong consistency, transactions must be split into separate operations, and if a transaction is not atomic, it can partially fail. Imagine that you use a digital wallet to divide expenses, such as the cost of dinner, with friends. If a money transfer from your wallet to theirs were not handled within a strongly consistent, atomic transaction, you could find yourself in the position where half of the transaction has failed: the funds are in neither your wallet nor your friend's.

The undesirable characteristic of eventual consistency is in the name: immediately after a database operation, the overall database state is inconsistent; only eventually will the changes be served back to all requesters. In the interim, disparate client requests may return different results. If you use a social media service, for example, you have likely experienced a lag between pressing the button to post a picture and the moment the image shows up on your timeline. Niantic, the creators of Pokemon GO, chose Spanner specifically to avoid this type of inconsistency in their social application.

You can find more detail in this blog post on strong consistency. Essentially, what we've learned at Google is that application code is simpler and development schedules are shorter when developers can rely on underlying data stores to handle complex transaction processing and keep data ordered. To quote the original Spanner paper, "we believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."
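To make the wallet example concrete, here is a sketch using the Spanner Python client library; the table and column names are hypothetical, but run_in_transaction is the client's standard way to get atomic, strongly consistent commits:

```python
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("wallets")  # placeholder IDs

def transfer(transaction, payer, payee, amount):
    # The reads and writes below commit atomically, or not at all: there is
    # no intermediate state where the funds exist in neither wallet.
    rows = transaction.execute_sql(
        "SELECT UserId, Balance FROM Wallets WHERE UserId IN (@payer, @payee)",
        params={"payer": payer, "payee": payee},
        param_types={"payer": spanner.param_types.STRING,
                     "payee": spanner.param_types.STRING},
    )
    balances = {user_id: balance for user_id, balance in rows}
    if balances[payer] < amount:
        raise ValueError("insufficient funds")
    transaction.update("Wallets", ["UserId", "Balance"],
                       [(payer, balances[payer] - amount),
                        (payee, balances[payee] + amount)])

database.run_in_transaction(transfer, "alice", "bob", 25)
```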
High-availability guarantees

Spanner offers up to 99.999% availability, with zero downtime for planned maintenance and schema changes. Spanner is a fully managed service, which means you don't need to do any maintenance: automatic software updates and instance optimizations happen in the background, without any maintenance windows. Moreover, in case of a hardware failure, your database will seamlessly recover without downtime.

A Spanner instance provides this high availability via synchronous replication: between three replicas in independent zones within a single cloud region for regional instances, and between at least four replicas in independent zones across two cloud regions for multi-region instances. Spanner regional instances are available in various regions across our Asia Pacific, Americas, and Europe, Middle East and Africa geographies; multi-region instances are offered in various combinations of regions across the globe. This protects your application from both infrastructure and zone failure for regional instance configurations, and from region failure for multi-region instance configurations.

Read-only replicas

If you're working with read requests that can tolerate a minor amount of data staleness, you can take better advantage of the computing power made available by these replicas and receive results with lower average read latency. This reduction in latency can be significant if you are using a multi-region instance configuration with replicas in geographic proximity to your application clients. For queries that can accept this constraint, replicas are able to respond directly to stale read queries without consulting the read-write replica (the split leader). In multi-region instance configurations, the replicas may be much closer geographically to the application client, which can markedly improve read performance. This capability is comparable to the horizontal scaling achieved when traditional RDBMS topologies are deployed with asynchronous read replicas. However, unlike a typical relational database, Spanner delivers this feature without incurring additional operational or management overhead.
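In the Python client, a stale read is a one-line option on a snapshot. This sketch (table and instance names are hypothetical) reads data as of 15 seconds in the past, which lets a nearby read-only replica answer without a round trip to the leader:

```python
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("inventory")  # placeholder IDs

# exact_staleness reads at a timestamp 15 seconds in the past, so any
# replica that is caught up to that point can serve the query directly.
with database.snapshot(
        exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
    results = snapshot.execute_sql("SELECT Sku, Quantity FROM Inventory")
    for sku, quantity in results:
        print(sku, quantity)
```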
(To the point made earlier, it is worth mentioning that in many cases, these origin stories go on to discuss the extreme management effort associated with relational databases once data volumes exceed an unmanageable level.)Of course, with Spanner, there are more abstract concepts involved. Spanner is a distributed database, and its strong external consistency is provided by a robust system featuring redundant local and remote atomic clocks located on the server racks and available via GPS signal, respectively. Yet, it still presents the familiar ANSI SQL compliant interface of a relational database. As a result, application developers can quickly achieve proficiency. The database technology has proven its worth for countless applications at Google—internal and external, big and small. Spanner is firmly seated as a foundational technology that enables a low barrier of entry for developers, and thus the freedom to try new ideas. While our user bases can be extremely large and transaction volumes can be exceptionally high for some product applications, there are other less frequently used applications that serve smaller cohorts. Spanner serves as the back-end data storage service for both application categories.And Google Cloud customers across various verticals have used Spanner successfully for numerous core business use cases: gaming (Lucille Games), fintech (Vodeno), healthcare (Maxwell Plus), retail (L.L.Bean), technology (Optiva) and media and entertainment (Whisper). Here are examples of how those in various industries use Spanner:Spanner lowers TCO with a simpler experience When considering the total cost of ownership (TCO), Spanner costs less to operate. Moreover, when you consider opportunity cost, the return on investment (ROI) can be even higher. Before you solely evaluate the operating expense of Spanner using the per-hour price, compare it to other database options by contrasting holistically the various costs of an alternate choice with the value provided by Spanner.First, consider the cost of running a production-grade database. There are three cost categories: resource, operational, and opportunity. Resource cost is relatively straightforward to calculate as it is based on published list prices. Operational costs are somewhat more difficult to calculate, as this cost is equivalent to the number of team members required to complete various tasks. Opportunity cost calculation is less tangible, but should not be ignored. When you choose to expend organizational budget, in currency or in hours, toward one effort category, there will be less budget available for other opportunities.For this exercise, we’ll first discuss resource cost by comparing the list price of Spanner compared with that of a self-managed open source database running on virtual machines. Then, we’ll compare the operational burden and cost of the same environments. Finally, we’ll address some opportunity value provided by Spanner.To start, when you consider a single database engine running on a small virtual machine, Spanner may appear costly. However, it is not recommended to run a production database on a single compute node. More likely, you will be running on a medium-sized virtual machine with sufficient memory and attached persistent disk provisioned with sufficient headroom for short- to medium-term growth.Also likely is that you will have provisioned a high-availability database topology, which includes an online database replica with the same specifications as your production virtual machine. 
Further, you may maintain an additional replica database specifically for read-only workloads. If this is the case, you have the compute and storage topology equivalent as provided by Spanner. You have three copies of the data, and three running virtual machines: one virtual machine to manage writes, a second as a high-availability replica, and a third to serve read-only workloads. This reflects the core philosophy behind Spanner: that you should operate with at least three replicas to ensure high availability.Now, let’s consider the relative list price of Spanner to that of a database running on Compute Engine. The list price for Spanner database storage is approximately twice that of zonal persistent disk. However, since you have three copies of data stored in persistent disk, the total cost will be higher.In this topology, for the same amount of application data, Spanner database storage costs approximately one-third less than the price of traditional database storage. Additionally, with Spanner, you only pay for what you use, which saves cost since you will not need to pre-provision initially unused space. And if your data decreases in size, unlike a traditional database, no migration will be required to materialize reduced storage costs.Compute resource price comparison is a bit more complex, as performance is dependent upon your workload. You can compare the price of your three-way replicated traditional RDBMS on production size virtual machines to an equivalent count of Spanner nodes to get a sense of the relative price.However, the scenario does not end here. As you know, the operational cost of managing your own databases is not insignificant. Also, every operational task introduces an additional amount of risk to system uptime. Spanner was designed to provide a high level of service with a low level of operational overhead.In most cases, the operational cost for Spanner approaches zero. To start, Spanner reduces the operational effort required to obtain and retain database backups. Spanner requires no maintenance windows or planned downtime. There is never a need for manual corruption remediation or index rebuilding with Spanner. Nor is any effort required to increase the available storage size for your database. (Unless you deem “effort” the button click to increase the instance node count.) Most important: There is no effort required (again, unless you count the button click) to achieve horizontal or vertical scaling, since Spanner automatically provides dynamic data resharding and data replication.The Enterprise Strategy Group quantified the total cost of ownership (TCO) savings of Spanner in their report Analyzing the Economic Benefits of Google Cloud Spanner Relational Database Service. What they found was that due to the TCO savings and the benefits provided by improved flexibility and innovation, every customer they interviewed preferred Spanner over other database options. Spanner’s total cost of ownership is 78% lower than on-premises databases and 37% lower than other cloud options. With this reduction in operational effort, you can focus on other things that can make your business more successful. This is the opportunity value provided by Spanner. Getting startedSpanner is incredibly powerful, but is also incredibly simple to operate. Spanner has been battle-tested at Google, and we’re proud to provide this technology to customers. There are strong (pun intended) reasons why Spanner is a great choice for your next project, regardless of the workload scope or size. 
Further, you may maintain an additional replica database specifically for read-only workloads. If this is the case, you have the compute and storage topology equivalent to that provided by Spanner: three copies of the data and three running virtual machines, one to manage writes, a second as a high-availability replica, and a third to serve read-only workloads. This reflects the core philosophy behind Spanner: that you should operate with at least three replicas to ensure high availability.

Now, let's consider the relative list price of Spanner and a database running on Compute Engine. The list price for Spanner database storage is approximately twice that of zonal persistent disk. However, since the self-managed topology stores three copies of the data on persistent disk, its total storage cost is higher: for the same amount of application data, Spanner database storage costs approximately one-third less than the price of traditional database storage. Additionally, with Spanner you only pay for what you use, which saves cost because you don't need to pre-provision initially unused space. And if your data decreases in size, unlike with a traditional database, no migration is required to realize the reduced storage cost.

Compute resource price comparison is a bit more complex, as performance depends on your workload. You can compare the price of your three-way replicated traditional RDBMS on production-size virtual machines to an equivalent count of Spanner nodes to get a sense of the relative price.

However, the scenario does not end here. As you know, the operational cost of managing your own databases is not insignificant, and every operational task introduces an additional amount of risk to system uptime. Spanner was designed to provide a high level of service with a low level of operational overhead. In most cases, the operational cost for Spanner approaches zero. To start, Spanner reduces the operational effort required to obtain and retain database backups. Spanner requires no maintenance windows or planned downtime. There is never a need for manual corruption remediation or index rebuilding with Spanner, nor is any effort required to increase the available storage for your database (unless you deem the button click to increase the instance node count "effort"). Most important: no effort is required (again, unless you count the button click) to achieve horizontal or vertical scaling, since Spanner automatically provides dynamic data resharding and replication.

The Enterprise Strategy Group quantified the TCO savings of Spanner in their report, Analyzing the Economic Benefits of Google Cloud Spanner Relational Database Service. They found that, due to the TCO savings and the benefits of improved flexibility and innovation, every customer they interviewed preferred Spanner over other database options. Spanner's total cost of ownership is 78% lower than on-premises databases and 37% lower than other cloud options. With this reduction in operational effort, you can focus on other things that make your business more successful. This is the opportunity value provided by Spanner.

Getting started

Spanner is incredibly powerful, yet also simple to operate. It has been battle-tested at Google, and we're proud to provide this technology to customers. There are strong (pun intended) reasons why Spanner is a great choice for your next project, regardless of the workload's scope or size. We use Spanner internally at Google Cloud to guarantee object listing in Cloud Storage, and customers make the same choice: Colopl, for example, chose Spanner to help bring you Dragon Quest Walk. Spanner provides familiar relational semantics and query language, and shares the powerful flexibility that has made relational databases the top choice for data storage. No matter the size of your application or your business goals, there is a good chance that Spanner would make a great choice for you as well.

Learn more

To get started with Spanner, create an instance or try it out with a Spanner Qwiklab.
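As a minimal sketch of those first steps using the Python client library (the instance configuration, IDs, and schema below are placeholders):

```python
from google.cloud import spanner

client = spanner.Client()

# Create a small regional instance; one node is plenty to start, and the
# node count can be changed later without touching the stored data.
instance = client.instance(
    "demo-instance",
    configuration_name=(
        f"projects/{client.project}/instanceConfigs/regional-us-central1"
    ),
    display_name="Demo instance",
    node_count=1,
)
instance.create().result()  # block until the instance is ready

# Create a database with an initial schema.
database = instance.database(
    "demo-db",
    ddl_statements=[
        "CREATE TABLE Greetings ("
        "  Id INT64 NOT NULL,"
        "  Message STRING(MAX)"
        ") PRIMARY KEY (Id)"
    ],
)
database.create().result()
```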
Source: Google Cloud Platform